AI POC Design: What Success Criteria to Define Before You Start

Most AI POCs answer the wrong question

An AI proof of concept that ends with “the model achieved 87% accuracy on our test set” has not proven anything useful. Accuracy on a held-out test set does not tell you whether the AI will improve the business outcome it was built for, at the cost and latency the business requires, with the reliability the system needs in production.

A well-designed POC ends with a clear go/no-go decision based on criteria that were defined before the project started. The criteria — not the model — are the deliverable.

The four criteria that determine POC design

Before starting any AI POC, these four questions must have specific answers. We treat them as gating questions: if a stakeholder cannot answer all four in writing, the POC is not yet ready to start.

1. What is the baseline?

What is the current performance of the process the AI will replace or augment? Without a baseline, there is no way to measure improvement. “We want to automate X” is not a baseline. “The current process takes 4 minutes per transaction with a 12% error rate” is a baseline. The baseline should be measured on the same data and operating window the AI will face — not a sanitised reference set.

2. What does success look like in business terms?

Not model accuracy. Business outcome: cost per unit, time per transaction, error rate, conversion rate, false positive rate. Define the minimum threshold at which the AI creates enough value to justify deployment costs. A useful framing: state the threshold at which the project moves to production, and the threshold at which it would not, even if the model is technically interesting.

3. What does failure look like?

Define the failure condition: if the AI achieves below X, we do not deploy. This is as important as the success condition. POCs without failure conditions almost always find a way to declare success — the team writes around the gaps, the slide deck reframes the metric, and the organisation ends up with a “successful” pilot it cannot actually use.

4. What is the path from POC to production?

If the POC succeeds, what are the next steps? Who owns deployment? What infrastructure is required? What is the budget? If there is no clear path to production at the start, a successful POC often leads to nothing. This outcome is common enough that survey reports from analyst firms such as Gartner have repeatedly placed POC-to-production conversion rates well below half of started pilots (a directional industry-scale figure, not an operational benchmark).
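The four gating questions can be captured as a small written record before week one. The sketch below is one hypothetical way to do that in Python: the class name PocCharter, its fields, and the 6% and 10% thresholds are illustrative assumptions, and only the 12% baseline error rate comes from the example above. The "unresolved" outcome for results that land between the two thresholds is also an assumption; setting both thresholds to the same value removes it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PocCharter:
    """Hypothetical charter record; every field name here is illustrative."""
    metric_name: str          # business metric, e.g. error rate per transaction
    lower_is_better: bool     # True for error rate, cost, or time
    baseline: float           # measured on the current process
    success_threshold: float  # at this level or better, move to production
    failure_threshold: float  # worse than this level, do not deploy
    deployment_owner: str = ""  # who owns the path to production

    def _at_least_as_good(self, value: float, reference: float) -> bool:
        return value <= reference if self.lower_is_better else value >= reference

    def gaps(self) -> list[str]:
        """List reasons the POC is not yet ready to start (empty means ready)."""
        problems = []
        improves = self._at_least_as_good(self.success_threshold, self.baseline)
        if not improves or self.success_threshold == self.baseline:
            problems.append("success threshold does not improve on the baseline")
        if not self._at_least_as_good(self.success_threshold, self.failure_threshold):
            problems.append("failure threshold is stricter than the success threshold")
        if not self.deployment_owner.strip():
            problems.append("no owner named for the path to production")
        return problems

    def decide(self, observed: float) -> str:
        """Compare the observed business metric with the pre-agreed thresholds."""
        if self._at_least_as_good(observed, self.success_threshold):
            return "go"
        if not self._at_least_as_good(observed, self.failure_threshold):
            return "no-go"
        return "unresolved"


# Baseline of 12% error rate from the text; the thresholds are made-up examples.
charter = PocCharter(
    metric_name="error_rate",
    lower_is_better=True,
    baseline=0.12,
    success_threshold=0.06,
    failure_threshold=0.10,
    deployment_owner="operations lead",
)
print(charter.gaps())
print(charter.decide(0.05), charter.decide(0.08), charter.decide(0.11))
```

Keeping the decision rule in the same record as the thresholds makes it harder to reframe the metric after the results are in.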

Typical six-week AI POC structure

Week	Activities	Deliverable
1	Data audit, baseline measurement, success criteria finalisation	Confirmed data availability, success/failure thresholds
2–3	Data preparation, baseline (heuristic) model, initial ML model development	Baseline performance, first ML candidate
4	Model iteration, validation on representative slices	Performance on evaluation set, calibration check
5	Integration prototype, latency and cost-per-call measurement	End-to-end performance metrics under realistic load
6	Decision review against pre-defined criteria	Go/no-go recommendation with evidence packet

The shape matters more than the exact dates. The first week is for criteria and data, not modelling. The fifth week is for integration reality, not last-minute accuracy gains. The sixth week is for the decision, not a sales pitch.
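The week-five measurement can start from something very small. The following is a minimal sketch, assuming the integration prototype exposes a single callable per request and that pricing is a flat amount per call; measure_calls, fake_model, and the 0.002 price are hypothetical names and values, not part of any specific stack. It issues calls one at a time, so it does not reproduce concurrent load.

```python
import statistics
import time
from typing import Callable, Dict, Sequence


def measure_calls(
    call: Callable[[str], str],
    inputs: Sequence[str],
    cost_per_call: float,
) -> Dict[str, float]:
    """Time each call and summarise latency percentiles and total cost."""
    latencies_ms = []
    for item in inputs:
        start = time.perf_counter()
        call(item)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "calls": len(latencies_ms),
        "p50_ms": statistics.median(latencies_ms),
        "p95_ms": cuts[94],
        "total_cost": cost_per_call * len(latencies_ms),
    }


def fake_model(text: str) -> str:
    """Stand-in for the real prototype endpoint (assumption for this sketch)."""
    return text.upper()


report = measure_calls(
    fake_model,
    [f"transaction {i}" for i in range(50)],
    cost_per_call=0.002,
)
print(report)
```

Reporting the 95th percentile alongside the median matters because a production latency requirement is usually about the slow calls, not the typical one.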

What scope to avoid in a POC

A POC should not attempt to solve the full production problem. The scope should be the minimum that can demonstrate whether the core technical assumption is valid.

Common scope mistakes:

Trying to handle all edge cases during the POC.
Building full production infrastructure before validating the model.
Attempting to integrate with all downstream systems.
Optimising model performance before validating business impact.

Each of these turns a six-week experiment into a three-month build with no clearer answer at the end. The discipline is to compress the question, not the answer.

For the broader framework of what a proof of concept should prove and how to structure it, see "What an AI POC Should Actually Prove", which covers the parent methodology in depth.

What separates a useful POC from a misleading one?

A useful AI POC answers a specific question: “Can an ML model achieve X performance on Y data under Z constraints?” A misleading POC answers a vague question: “Can AI help with our business problem?” The specificity of the question determines whether the POC’s results are actionable.

Defining success criteria before starting the POC is the most important step. Without predefined criteria, the POC becomes a demo rather than an experiment — results are interpreted favourably regardless of actual performance because there is no objective standard for comparison. We define success criteria collaboratively with business stakeholders, translating business requirements (e.g., “reduce manual review time by 50%”) into measurable model performance targets (e.g., “achieve 90% precision at 80% recall on the operating dataset”).
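A target such as "90% precision at 80% recall" can be checked mechanically once the model produces scores. The sketch below uses only the standard library and made-up scores and labels; precision_at_recall is a hypothetical helper, and the rule it applies (take the highest-precision cut-off among those that reach the recall target) is one reasonable reading of such a target, not the only one.

```python
from typing import Optional, Sequence, Tuple


def precision_at_recall(
    scores: Sequence[float],
    labels: Sequence[int],
    min_recall: float,
) -> Optional[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) for the highest-precision
    cut-off whose recall is at least min_recall, or None without positives."""
    total_pos = sum(labels)
    if total_pos == 0:
        return None
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best = None
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Only cut between distinct scores so tied items stay on the same side.
        if i + 1 < len(ranked) and ranked[i + 1][0] == score:
            continue
        recall = tp / total_pos
        precision = tp / (tp + fp)
        if recall >= min_recall and (best is None or precision > best[1]):
            best = (score, precision, recall)
    return best


# Illustrative data only: eight scored items, five of them true positives.
scores = [0.95, 0.90, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 1, 0, 1, 0]

result = precision_at_recall(scores, labels, min_recall=0.80)
meets_target = result is not None and result[1] >= 0.90
print(result, meets_target)
```

On this toy data the best cut-off that reaches 80% recall has 80% precision, so the 90% target is not met and the pre-agreed criterion fails regardless of how the aggregate accuracy looks.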

Data representativeness is the second critical factor. A POC trained on curated, clean data demonstrates what the model can do under ideal conditions. A POC trained on representative production data demonstrates what the model will do under real conditions. In our engagements, the gap between the two is often in the range of 10–20 percentage points on the headline metric. That is an observed pattern, not a benchmarked rate, but it is enough to invalidate a go decision built on curated data alone.

We structure POCs in two phases. Phase 1 (two weeks): establish a baseline using the simplest viable approach — often a non-ML heuristic, sometimes a regex pipeline, sometimes a small classical model in scikit-learn — to confirm that the problem is well-defined and the data supports the task. Phase 2 (two to four weeks): develop and evaluate the ML approach against the Phase 1 baseline, typically in PyTorch or via a fine-tuned transformer when the task warrants it. If the ML approach does not meaningfully outperform the baseline, the POC has produced a valuable negative result — the problem does not benefit from ML given the available data, and the organisation should not invest in a production ML deployment.
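Whether the Phase 2 model "meaningfully" outperforms the Phase 1 baseline needs an operational definition. One possible definition, shown here as a sketch under stated assumptions, is a paired bootstrap over per-example correctness on the same evaluation set: if the interval for the accuracy gain includes zero, the gain is not distinguishable from noise. The function name, the 95% interval, and the toy data are assumptions for illustration; the same idea applies to a business metric in place of accuracy.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_gain(
    baseline_correct: Sequence[int],
    ml_correct: Sequence[int],
    n_resamples: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed gain, lower bound, upper bound) for ML accuracy minus
    baseline accuracy, using a 95% percentile bootstrap over paired examples."""
    if len(baseline_correct) != len(ml_correct) or not baseline_correct:
        raise ValueError("need two non-empty, equal-length result lists")
    n = len(baseline_correct)
    # Per-example difference: +1 ML fixed an error, -1 ML introduced one.
    diffs = [m - b for b, m in zip(baseline_correct, ml_correct)]
    observed = sum(diffs) / n

    rng = random.Random(seed)
    gains = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    low = gains[(25 * n_resamples) // 1000]
    high = gains[(975 * n_resamples) // 1000 - 1]
    return observed, low, high


# Toy evaluation set of 100 items: 1 means the approach got the item right.
baseline_correct = [1, 0] * 50            # heuristic: 50 of 100 correct
ml_correct = [1] * 80 + [0] * 20          # ML candidate: 80 of 100 correct

observed, low, high = paired_bootstrap_gain(baseline_correct, ml_correct)
ml_beats_baseline = low > 0.0
print(observed, low, high, ml_beats_baseline)
```

Pairing the two approaches on the same examples is the design choice that matters: it removes variation caused by which items happened to be in the evaluation set.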

The two-phase structure prevents the most common POC failure: spending six weeks developing a complex model only to discover that the data does not support the task, or that a simple rule-based approach performs equivalently.

Choosing the right evaluation metrics is the final critical design decision. Classification accuracy is rarely sufficient — precision and recall at specific operating points, calibration quality, and performance on minority classes matter more than aggregate accuracy for most production applications. We define metrics that align with the business cost function: if false positives are ten times more costly than false negatives, the evaluation metric should reflect that asymmetry, not hide it behind a single accuracy number.
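The asymmetry argument is easy to see with a few lines of arithmetic. The sketch below assumes a false positive costs 10 units and a false negative 1 unit, matching the ratio in the example above; the scores, labels, candidate thresholds, and the cost_report helper are all illustrative.

```python
from typing import Dict, Sequence


def cost_report(
    scores: Sequence[float],
    labels: Sequence[int],
    threshold: float,
    fp_cost: float = 10.0,
    fn_cost: float = 1.0,
) -> Dict[str, float]:
    """Accuracy and business cost when items scoring >= threshold are flagged."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    n = len(labels)
    return {
        "threshold": threshold,
        "accuracy": (n - fp - fn) / n,
        "fp": fp,
        "fn": fn,
        "cost": fp * fp_cost + fn * fn_cost,
    }


# Toy data: three positives and three negatives with overlapping scores.
scores = [0.9, 0.8, 0.6, 0.7, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0]

reports = [cost_report(scores, labels, t) for t in (0.5, 0.65, 0.75)]
cheapest = min(reports, key=lambda r: r["cost"])
for r in reports:
    print(r)
```

On this data the thresholds 0.5 and 0.75 have identical accuracy (five of six correct), yet the first costs 10 units and the second 1 unit, which is exactly the difference a single accuracy number hides.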

Packageable value if the POC stops early

Even when the answer is no-go, a well-scoped POC leaves the organisation with usable artifacts: the documented data audit, the baseline-vs-ML comparison, the latency and cost-per-call figures from the integration prototype, and a written record of which assumptions held and which did not. These are reusable inputs for the next attempt — whether that is a different model class, a different problem framing, or a decision to fix the data pipeline before any ML work resumes.

That is the test we use to judge POC design before week one starts: if the project were cancelled at week four, would the organisation still own something it can act on? If the honest answer is no, the POC is structured as a demo, not an experiment.

FAQ

What should an AI proof of concept actually prove before an organisation commits to a full build?

It should prove that the model meets pre-defined business-outcome thresholds on representative production data, under the latency and cost constraints production requires, and that a clear path to deployment exists. Model accuracy alone is not sufficient.

What is the difference between a demo, a prototype, and a POC — and why does each fail at a different stage?

A demo shows the idea on curated data and fails when stakeholders ask for production behaviour. A prototype shows the system end-to-end on a narrow slice and fails when integration or scale is added. A POC tests pre-defined success and failure criteria against representative data and fails most often when those criteria were never written down.

Which evaluation evidence must come out of a POC to be useful downstream?

At minimum: a documented baseline, performance on representative data with calibration and minority-class behaviour, latency and cost measurements from an integration prototype, and a data-lineage record showing what was used for training and evaluation.

What is the realistic failure rate of AI POCs, and which scoping choices drive it?

Industry surveys from analyst firms have repeatedly placed POC-to-production conversion well below half — a directional industry-scale figure, not an operational benchmark. The dominant scoping drivers we see are: no written failure condition, curated rather than production-representative data, and no defined owner for the path to deployment.

When does a POC need a clean kill criterion, and how should that be defined up front?

Always. The kill criterion should be a specific business-outcome threshold (not a model metric) below which deployment is not justified, agreed with the stakeholder who owns the deployment budget, and recorded in the POC charter before week one.

How does an AI POC connect to the downstream production engineering covered in TK3-CCU-08?

The POC produces the evidence packet — baseline, performance envelope, integration risk, cost-per-call — that the production engineering phase consumes. Without that packet, the production team is re-running the POC inside the production project, which is where most of the time and budget overruns originate.
