AI POC Requirements: What to Define Before Building a Proof of Concept

Why do most AI POCs fail to convert to production?

The most common reason AI POCs fail to become production systems is not technical — it is definitional. The POC was built to demonstrate that “AI can help” rather than to answer a specific, measurable question. Without a clear question, a successful POC produces results that stakeholders interpret three different ways: the data science team sees a working model, the business team sees uncertain ROI, and the engineering team sees an undeployable prototype.

Defining requirements before building prevents this misalignment. The requirements document forces agreement on what question the POC answers, what data it uses, what success looks like, and what happens if it succeeds. That alignment is the actual deliverable of the requirements phase — the document is just the artifact that records it.

In our experience, the teams that consistently convert POCs to production are not the ones with the strongest data scientists. They are the ones that treat the requirements phase as a contract negotiation between stakeholders rather than a writing exercise.

What requirements must be defined?

Requirement	What to specify	Common failure if missing
Business question	Specific, measurable hypothesis	POC answers the wrong question
Success metric	Quantified threshold (accuracy, latency, cost)	No way to evaluate success
Data access	Specific tables, APIs, access credentials	Multi-week delay getting data access
Scope boundary	What is in scope and explicitly out of scope	Scope creep delays delivery
Timeline	Fixed end date with checkpoints	POC runs indefinitely
Decision framework	What actions follow each possible outcome	Results sit in a report, unused

The decision framework is the most frequently omitted requirement and the most important. It answers a pair of questions: if the POC succeeds, what decision does the organisation commit to making? If it fails, what does the organisation do instead? Without a pre-committed answer to both, a successful POC generates enthusiasm but no action. The organisation has not pre-committed to the investment required to productionise the result, and the model — however good — quietly becomes a slide deck.

Quantified thresholds matter for the same structural reason. An observed pattern across our POC engagements: when success criteria are vague, evaluation drifts toward whichever interpretation suits the loudest stakeholder. When they are numeric and pre-registered, the result speaks for itself.

How do you scope a POC correctly?

A well-scoped AI POC has three boundaries.

Data boundary — which data sources, time periods, and customer or product segments are included, and which are deferred. Production datasets often span years and dozens of source systems; a POC needs a representative slice, not a complete one.

Model boundary — which approaches will be evaluated (for example, a gradient-boosted baseline plus a transformer-based contender) and which are explicitly excluded. Without this boundary, the POC team chases novel architectures instead of validating the underlying hypothesis.

Evaluation boundary — which metrics, test datasets, and comparison baselines define success. This is where most evaluation disputes originate, so it is worth fixing in writing before any model code is written.

In our POC design practice, we typically spend 20–30% of the total POC timeline on requirements definition and data exploration before any model development begins. This is an observed range across our engagements rather than a benchmarked rate — your number will depend on data accessibility and stakeholder count. The investment feels slow but prevents the most expensive failure mode: building a technically successful model that the organisation cannot or will not deploy.

For the broader context of how POC design fits into the rest of an AI engagement, our companion piece on what an AI POC should actually prove covers the evaluation methodology in detail, and the POC success-criteria walkthrough sits alongside this one as the criteria-design counterpart.

What does a good POC requirements document look like?

A minimum POC requirements document is 2–3 pages and contains:

Problem statement — one paragraph describing the business problem in concrete terms, not in AI vocabulary.
Success criteria — 2–3 quantified metrics with pass/fail thresholds, evaluated on a pre-specified dataset.
Data specification — exact data sources, access method (database credentials, API tokens, file drops), and known quality issues already documented from the data audit.
Scope — explicit inclusions and exclusions, with the exclusions written as positively as the inclusions.
Timeline — start date, checkpoint dates, end date. Typical total: 4–8 weeks.
Decision matrix — what action follows each outcome (exceed threshold, meet threshold, fall below threshold).
Resource requirements — team members, compute, data access, stakeholder availability for reviews.

We share this document with the full stakeholder group — business sponsor, data science team, engineering team, data owners — and require sign-off before development begins. The sign-off process surfaces disagreements and misaligned expectations before they become expensive mid-POC discoveries. A two-hour requirements review is the cheapest place in the entire engagement to discover that the business sponsor expected real-time inference while the engineering team scoped a nightly batch.

This document is also the first packageable artifact of the engagement. Even if the POC stops early, the data specification and decision matrix remain useful to the organisation: they document what was investigated, what data exists, and under what conditions a future attempt should proceed.

What are the most common POC requirement mistakes?

Across the AI POC specifications we have reviewed, four mistakes appear repeatedly. Each one is cheap to fix in the requirements document and expensive to fix later.

Defining success as “high accuracy” without a number. “The model should be accurate” is not a success criterion. “The model should achieve ≥85% precision and ≥80% recall on the held-out test set, evaluated on 500+ samples” is. Without quantified thresholds, POC evaluation becomes subjective — and subjective evaluation is influenced by organisational politics rather than technical merit. We pay close attention to this when reviewing draft requirements, because a single missing number tends to cascade into weeks of post-POC argument.

Using production data volumes in the POC. A POC should demonstrate feasibility, not production readiness. Using the full production dataset — millions of records, terabytes of data — extends the POC timeline and obscures the core question: does this approach work on representative data? We typically scope POC datasets to 10,000–50,000 representative samples (an observed-pattern range, not a fixed rule) — enough to evaluate model performance statistically, small enough to iterate quickly on a single workstation or a modest cloud instance.

Omitting the negative case. What happens if the POC fails to meet the success criteria? Without a pre-defined “fail” response, organisations tend to extend the POC indefinitely. Maybe we just need more data. Let’s try a different model. Resources are consumed without converging on a decision. Our POC requirements include a time-box: if the success criteria are not met by the end date, the POC is considered unsuccessful and the decision matrix specifies the next action — abandon, pivot, or extend with specific scope changes and a new end date.

Not specifying the evaluation dataset in advance. If the evaluation dataset is selected after the model is built, there is a risk — conscious or unconscious — of choosing data that makes the model look good. We specify the evaluation split (or the splitting methodology, for time-series problems where a stratified random split would leak) in the requirements document before any model development begins. This prevents evaluation bias and ensures that the POC results are credible to stakeholders who did not participate in the development. For engineering teams already using frameworks like PyTorch or scikit-learn, the mechanics are trivial; the discipline is what matters.

In our POC practice, we conduct a requirements review meeting before development starts. This meeting walks through each requirement with the full stakeholder group, identifies ambiguities, and resolves disagreements. It typically takes two hours and saves two to four weeks of mid-POC rework. It is, by a wide margin, the highest-leverage meeting in the entire engagement.

FAQ

What should an AI proof of concept actually prove before an organisation commits to a full build?

A POC should prove that the highest-risk assumptions hold under conditions resembling production: that the data is accessible and clean enough to support the model, that the model meets pre-registered numeric thresholds on a pre-specified evaluation set, that integration with the surrounding systems is feasible, and that the business value is measurable. Proving “the model works on curated data” is not sufficient.

What is the difference between a demo, a prototype, and a POC — and why does each fail at a different stage?

A demo shows that an AI capability exists in principle — it fails when stakeholders mistake it for evidence of fit. A prototype shows that the capability can be built in your environment — it fails when integration realities surface late. A POC is designed to answer a specific go/no-go question against pre-registered criteria — it fails when the question, the criteria, or the decision framework are not defined up front. The three are not stages of the same artifact; they answer different questions.

Which evaluation evidence must come out of a POC to be useful downstream?

At minimum: a documented data lineage (where the training and evaluation data came from, how it was split, what was excluded), a performance envelope on the pre-specified evaluation set (not just headline accuracy but latency, throughput, and failure-mode characterisation), and an integration-risk assessment naming the systems the production version would touch. These three artifacts remain useful even if the project does not proceed.

What is the realistic failure rate of AI POCs, and which scoping choices drive it?

Industry surveys from Gartner and similar analyst groups have repeatedly placed AI POC-to-production conversion failure in the majority — these are directional industry-scale figures rather than operational benchmarks for any given organisation. The scoping choices that drive failure are consistent across studies: absence of a quantified success criterion, absence of a decision framework, evaluation datasets chosen after the fact, and a missing time-box.

When does a POC need a clean kill criterion, and how should that be defined up front?

Every POC needs one. The kill criterion is the negative half of the decision matrix: a specific numeric threshold below which the POC is considered unsuccessful, plus a specific action (abandon, pivot, or scoped extension) that follows. Defined up front, in writing, signed off by the business sponsor. Defined after the fact, it stops being a kill criterion and becomes a debate.

How does an AI POC connect to the downstream production engineering covered in TK3-CCU-08?

The POC produces the inputs that the production-engineering phase consumes: a frozen model specification, a documented data pipeline, an integration-risk register, and a measured performance envelope. Without those inputs, the GenAI prototype-to-production work described in our prototype-to-production guidance starts from scratch. With them, it starts from a known baseline.

— A failure to define a kill criterion is a scoping failure, not a modelling failure. It belongs in the requirements document, not in the post-mortem.