The demo worked perfectly
The object detection model scored 94% mAP on the evaluation dataset. The integration test passed. The stakeholder demo was clean — bounding boxes appeared where they should, confidence scores were high, and the engineering team felt ready to deploy. Four weeks into production, the false-positive rate was three times higher than testing predicted, the model missed an entire class of defect it had never encountered in training, and the operations team was spending more time managing the model’s errors than they had spent on the manual process it replaced.
This is not an unusual outcome. It is the expected outcome when an off-the-shelf model — YOLO, Faster R-CNN, EfficientDet, or any pre-trained detection architecture — is deployed into a production environment that differs from its training conditions in ways that benchmark evaluation does not measure. The failure is not in the model architecture. The failure is in the assumption that benchmark accuracy transfers to production reliability.
Where does the accuracy gap come from?
The gap between benchmark performance and production performance has specific, identifiable causes. Understanding these causes is the difference between diagnosing a deployment failure retroactively and preventing it structurally.
Lighting and environmental variation. Benchmark datasets are typically captured under controlled conditions — consistent lighting, stable backgrounds, uniform image quality. Production environments are not controlled in the same way. A warehouse camera operates under fluorescent lighting that shifts colour temperature across the day. An outdoor surveillance system contends with shadows, glare, weather, and seasonal lighting changes. A manufacturing inspection station has lighting that degrades as bulbs age. Each of these variations introduces a distribution shift between the training data and the production data — and the model’s accuracy degrades as that shift grows, often without any visible error signal until someone audits the results.
Class distribution mismatch. Benchmark datasets are typically class-balanced: roughly equal numbers of examples per category, or at least a distribution that is representative of the evaluation task. Production environments are rarely class-balanced. In manufacturing quality control, 97–99% of units are defect-free — the positive class (defect present) is extremely rare. A model trained on a balanced dataset will show substantially lower precision in production than it did during evaluation, because precision depends on the base rate of the positive class, and that base rate has dropped by more than an order of magnitude. The practical consequence: a false-positive rate that was acceptable at 1% in evaluation becomes operationally problematic when it is applied to millions of units per month.
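The base-rate effect can be made concrete with a few lines of arithmetic. The error rates below are illustrative, not drawn from any real deployment: a detector with fixed recall and false-positive rate, evaluated first on a balanced set and then on a line where 98% of units are defect-free.

```python
def precision(tpr: float, fpr: float, base_rate: float) -> float:
    """Precision = TP / (TP + FP) for a classifier with the given
    true-positive rate (recall), false-positive rate, and positive-class
    base rate. The per-item error rates stay fixed; only the mix changes."""
    tp = tpr * base_rate
    fp = fpr * (1.0 - base_rate)
    return tp / (tp + fp)

# Same hypothetical model in both settings: 95% recall, 1% false-positive rate.
balanced = precision(tpr=0.95, fpr=0.01, base_rate=0.50)    # balanced eval set
production = precision(tpr=0.95, fpr=0.01, base_rate=0.02)  # 98% defect-free line

print(f"precision on balanced eval set: {balanced:.3f}")
print(f"precision in production:        {production:.3f}")
```

Nothing about the model changed between the two lines; only the base rate did, and precision falls from roughly 0.99 to roughly 0.66. At millions of units per month, that difference is the operations team's workload.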
Domain-specific failure modes. Every deployment domain has failure classes that are specific to its operational context — and that off-the-shelf models have never seen. A retail shelf monitoring system encounters products that partially occlude each other, promotional displays that change the visual context weekly, and product packaging redesigns that change the appearance of items the model was trained to recognise. A medical imaging system encounters imaging artifacts, patient positioning variations, and pathology presentations that differ from the training distribution. These are not edge cases — they are the normal operating conditions of the specific domain, and they are invisible to a model that was trained on a generic or cross-domain dataset.
Why testing on a held-out set does not catch these failures
The standard ML evaluation methodology — train on one portion of the dataset, evaluate on a held-out portion — measures the model’s ability to generalise within the training distribution. It does not measure the model’s ability to generalise to a different distribution, which is exactly what production deployment requires.
A held-out test set drawn from the same dataset as the training data shares the same lighting conditions, the same class distribution, the same domain characteristics, and the same failure modes. Evaluating on this set tells you how well the model has learned the dataset. It does not tell you how the model will behave when the camera angle changes, the lighting shifts, the product mix evolves, or a defect type appears that was not represented in the training data.
We encounter this pattern regularly: a team evaluates a model on a held-out set, reports strong metrics, deploys to production, and discovers that the production accuracy is 10–20 percentage points below the evaluation accuracy. The team’s first instinct is usually to retrain with more data or try a different architecture. In our experience, the more productive first step is to characterise the distribution gap between training data and production data — because the gap, once identified, often reveals specific correctable causes (lighting normalisation, class rebalancing, domain-specific augmentation) rather than requiring a wholesale model replacement.
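Characterising the gap does not require anything exotic; even a simple statistic extracted from both datasets can flag a shift. The sketch below compares per-image mean brightness between training and production samples using a standardised mean difference — the data is synthetic and the function name is our own, not a standard API.

```python
import math
import random
from statistics import mean, stdev

def distribution_gap(train_vals, prod_vals):
    """Standardised difference in means between two samples of a per-image
    statistic (here: mean pixel brightness). Values well above ~0.5 suggest
    a shift worth investigating before retraining anything."""
    pooled = math.sqrt((stdev(train_vals) ** 2 + stdev(prod_vals) ** 2) / 2)
    return abs(mean(train_vals) - mean(prod_vals)) / pooled

# Synthetic brightness values: training data captured under bright,
# controlled lighting; production frames from a darker environment.
random.seed(0)
train_brightness = [random.gauss(128, 10) for _ in range(500)]
prod_brightness = [random.gauss(95, 12) for _ in range(500)]

print(f"gap: {distribution_gap(train_brightness, prod_brightness):.2f}")
```

The same comparison can be run per condition (shift, season, camera) and per statistic (brightness, contrast, object scale, class frequency), which is usually enough to point at a specific correctable cause.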
What production-grade evaluation actually requires
Moving from benchmark evaluation to production evaluation requires testing against the actual conditions of deployment, not against a subset of the training distribution.
Environment-representative test data. The evaluation dataset must be captured from the production environment — same cameras, same lighting, same operating conditions, same class distribution. If the production environment changes across shifts, seasons, or product cycles, the evaluation dataset must include samples from each variant. This is more expensive to construct than a curated benchmark dataset, but it is the only evaluation approach that predicts production performance.
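Coverage across variants can be checked mechanically before the evaluation set is frozen. A minimal sketch, with hypothetical condition labels and a minimum-sample threshold chosen for illustration:

```python
from collections import defaultdict

def coverage_gaps(samples, required_conditions, min_per_condition=50):
    """Report production variants that are under-represented in the
    evaluation set. `samples` is a list of (condition, item) pairs;
    condition labels like "day_shift" or "night_shift" are hypothetical.
    Returns {condition: count} for every condition below the threshold."""
    counts = defaultdict(int)
    for condition, _ in samples:
        counts[condition] += 1
    return {c: counts[c] for c in required_conditions
            if counts[c] < min_per_condition}

# Example: plenty of day-shift frames, too few from the night shift.
samples = [("day_shift", i) for i in range(60)] + \
          [("night_shift", i) for i in range(12)]
print(coverage_gaps(samples, ["day_shift", "night_shift"]))
```

Running this kind of check per shift, season, and product cycle turns "the evaluation dataset must include samples from each variant" from a policy statement into a gate the pipeline can enforce.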
Domain-specific metrics. Overall accuracy and mAP are useful for architecture comparison but insufficient for production decision-making. Production evaluation requires metrics that map to operational impact: false-positive rate at the operating threshold (how many good items will be incorrectly flagged?), false-negative rate per defect class (which defect types will be missed?), performance across data subsets (does the model degrade for specific product variants, lighting conditions, or time periods?), and latency under production load (can the model maintain throughput at line speed?). These metrics are not exotic — they are the questions that the operations team will ask after deployment, and answering them before deployment prevents the discovery phase from happening in production.
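Two of these metrics — false-positive rate at the operating threshold and false-negative rate per defect class — can be computed directly from logged predictions. The record format below is an assumption for illustration, not a standard schema:

```python
def operational_metrics(records, threshold):
    """records: list of (true_label, predicted_label, confidence), where
    true_label is None for defect-free items. A detection counts only if
    confidence >= threshold. Returns (fpr, {defect_class: fnr})."""
    good_total = good_flagged = 0
    missed, totals = {}, {}
    for true, pred, conf in records:
        fired = pred is not None and conf >= threshold
        if true is None:
            good_total += 1
            if fired:
                good_flagged += 1  # good item incorrectly flagged
        else:
            totals[true] = totals.get(true, 0) + 1
            if not (fired and pred == true):
                missed[true] = missed.get(true, 0) + 1  # defect not caught
    fpr = good_flagged / good_total if good_total else 0.0
    fnr_per_class = {c: missed.get(c, 0) / totals[c] for c in totals}
    return fpr, fnr_per_class

records = [
    (None, "scratch", 0.9),       # good item, confidently flagged
    (None, None, 0.0),            # good item, passed
    (None, "dent", 0.3),          # good item, flag below threshold
    ("scratch", "scratch", 0.8),  # defect caught
    ("dent", None, 0.0),          # defect missed entirely
    ("dent", "dent", 0.7),        # defect caught
]
fpr, fnr = operational_metrics(records, threshold=0.5)
```

Sweeping `threshold` over the logged confidences then gives the full operating curve, which is what the deployment decision should actually be made on.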
Out-of-distribution behaviour characterisation. What happens when the model encounters an input it was not trained on? Does it assign a low confidence score (desirable — the system can flag uncertain cases for human review) or a high confidence score on an incorrect class (dangerous — the system fails silently)? Characterising this behaviour before deployment requires deliberately testing with inputs that fall outside the training distribution — novel objects, adversarial lighting, corrupted images. The model’s behaviour on these inputs determines whether it fails safely or fails silently, which is the difference between a production system that degrades gracefully and one that produces undetected errors.
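The fail-safe/fail-silent distinction can be operationalised with a review queue and measured with an out-of-distribution probe set. A minimal sketch; the threshold and the probe outputs are illustrative assumptions:

```python
def route(prediction, confidence, review_threshold=0.6):
    """Fail-safe routing: act automatically only on confident outputs,
    send everything else to human review. Threshold is illustrative and
    should be set from the OOD characterisation, not guessed."""
    queue = "auto" if confidence >= review_threshold else "review"
    return queue, prediction

def silent_failure_rate(ood_outputs, review_threshold=0.6):
    """Fraction of out-of-distribution inputs the model labels confidently.
    These bypass the review queue, so they are the silent failures;
    ood_outputs is a list of (predicted_label, confidence) pairs."""
    silent = sum(1 for _, conf in ood_outputs if conf >= review_threshold)
    return silent / len(ood_outputs)

# Hypothetical model outputs on a probe set of novel objects the model
# was never trained on; every prediction here is by definition wrong.
probe = [("scratch", 0.95), ("dent", 0.40), ("scratch", 0.30), ("dent", 0.85)]
rate = silent_failure_rate(probe)
```

A model with a high silent-failure rate on the probe set needs either recalibrated confidences or an explicit OOD detector before it is safe to automate.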
The quality control workflows that integrate AI and computer vision depend entirely on this production-grade evaluation. A model that has not been evaluated against production conditions is a model whose production failure rate is unknown — not zero, unknown.
The production readiness question
The decision to deploy a computer vision model is not a binary pass/fail on a benchmark. It is an assessment of whether the model, the data pipeline, the deployment infrastructure, and the monitoring systems are collectively ready to operate reliably under production conditions — with known and documented performance characteristics, not aspirational ones.
Off-the-shelf models are useful starting points. Transfer learning from pre-trained architectures (ResNet, EfficientNet, Vision Transformers) reduces training time and data requirements. The failure is not in using these architectures — it is in deploying them without production-representative evaluation, without domain-specific fine-tuning, and without monitoring infrastructure that detects when production conditions drift away from training conditions.
If your team has a computer vision system that performs well in testing but has not been validated against production conditions, a Production CV Readiness Assessment identifies the specific gaps — data distribution, environmental factors, class balance, and latency — before deployment, so the false-positive cost is known rather than discovered. Learn more about our computer vision practice.