Object detection is the load-bearing capability inside most production computer vision systems. It tells you not only that a pedestrian, a defect, or a tumour candidate is present, but where in the frame it sits and, increasingly, how it moves between frames. The published benchmarks suggest the problem is largely solved. The production reality is more uncomfortable: off-the-shelf detectors hit benchmark accuracy on benchmark distributions, then degrade systematically when deployed against lighting, occlusion, and class distributions that no public dataset rehearses.

This piece walks through what object detection does well, where it is genuinely used, and the failure modes that practitioners learn to budget for. The pattern that matters most is not the architecture. It is the gap between a model that scores well on COCO and a model that holds its false-positive rate when the camera moves outdoors.

What is object detection, and how does it differ from classification?

Image classification tells you what is in a frame. Object detection tells you what is where, usually by emitting a bounding box and a class label with an associated confidence score. Modern detectors (YOLO variants, Faster R-CNN, RetinaNet, the DETR family) are built on convolutional neural networks or transformer backbones trained on labelled image corpora.

Three operational distinctions matter more than the architecture choice:

Capability | Output | Best for
Classification | One label per image | Sorting, tagging
Object detection | Bounding boxes + labels | Counting, tracking, localisation
Semantic / instance segmentation | Per-pixel masks | Medical imaging, robotics, defect outlines

A bounding box is cheap to compute and easy to consume downstream. Segmentation is more expensive but necessary when the exact shape is the answer: outlining a tumour in a CT slice, separating two overlapping parts on a conveyor, or measuring crop coverage from a drone.
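To make the box-versus-mask trade-off concrete, here is a minimal sketch in plain Python. It assumes a binary mask stored as nested lists and boxes given as inclusive pixel coordinates; `Detection`, `mask_to_box`, and `fill_ratio` are hypothetical names chosen for illustration, not part of any detection library. It shows how little of a tight bounding box a thin, diagonal object actually occupies, which is the situation where a per-pixel mask carries information the box cannot.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive pixel indices


@dataclass
class Detection:
    """Hypothetical container for one detector output (illustrative field names)."""
    label: str
    confidence: float
    box: Box


def mask_to_box(mask: List[List[int]]) -> Optional[Box]:
    """Return the tight bounding box of the non-zero pixels, or None for an empty mask."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [x for row in mask for x, value in enumerate(row) if value]
    return (min(cols), min(rows), max(cols), max(rows))


def fill_ratio(mask: List[List[int]], box: Box) -> float:
    """Fraction of the box area that is covered by mask pixels."""
    x0, y0, x1, y1 = box
    box_area = (x1 - x0 + 1) * (y1 - y0 + 1)
    covered = sum(
        1
        for y in range(y0, y1 + 1)
        for x in range(x0, x1 + 1)
        if mask[y][x]
    )
    return covered / box_area


# Made-up example: a thin diagonal scratch on a 5 x 6 pixel patch.
scratch_mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
]

scratch_box = mask_to_box(scratch_mask)
# The confidence value here is invented purely for the example.
scratch = Detection(label="scratch", confidence=0.9, box=scratch_box)
scratch_fill = fill_ratio(scratch_mask, scratch.box)
```

In this toy case the box covers 16 pixels while the scratch covers 4, so three quarters of the box is background. For counting or tracking that loss is usually acceptable; for measuring a defect outline it is not.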
Where object detection actually earns its place

The applications cited in vendor decks are real, but they earn the work for different reasons. We group them by the kind of failure cost the system has to absorb.

Autonomous driving and driver assistance. Detection runs against pedestrians, cyclists, traffic signs, lane markings, and other vehicles, often at 30 frames per second across multiple cameras. The asymmetric cost (a missed pedestrian is not equivalent to a false positive on a plastic bag) drives most of the engineering decisions.

Medical imaging. Detection narrows the field for a radiologist; segmentation refines the candidate. The system rarely makes the call, but it changes which images get reviewed first.

Security and surveillance. Multi-camera tracking depends on stable detection across viewpoints. Re-identification across cameras is a separate, harder problem layered on top.

Retail and inventory. Shelf monitoring, planogram compliance, and queue analytics. The economics work when detection latency is measured in seconds, not milliseconds.

Manufacturing inspection. Defect detection on production lines: scratches, missing components, label alignment. Lighting and fixture geometry are controllable, which is the main reason these projects succeed.

For a denser comparison of when machine-vision conventions beat general-purpose CV in manufacturing, see Machine Vision vs Computer Vision: Choosing the Right Inspection Approach for Manufacturing.

The model classes you will actually encounter

CNN-based two-stage detectors (Faster R-CNN) propose regions first, then classify them. They tend to be more accurate on small or cluttered objects and slower per frame.

Single-stage detectors (YOLOv5 through YOLOv8, RetinaNet, SSD) collapse proposal and classification into one pass and run faster, at the cost of small-object recall.
Transformer-based detectors (DETR, Deformable DETR, RT-DETR) remove the hand-tuned post-processing of NMS and increasingly match the throughput of YOLO on modern GPUs.

In practice, the choice is usually constrained before it begins. If the deployment target is an NVIDIA Jetson, a Hailo accelerator, or a smartphone NPU, the runtime (TensorRT, ONNX Runtime, OpenVINO, Core ML) narrows the model family. Nearly every team we work with ends up running a quantised YOLO variant or a distilled transformer on edge hardware, not the original training-time model.

For a fuller catalogue of which lightweight models hold up under real load, see Best Lightweight Vision Models for Real-World Use.

Why off-the-shelf detectors fail in production

This is where the benchmark story breaks. A model that scores 50 mAP on COCO does not score 50 mAP on your camera at 5 a.m. against a wet road. The structural reasons are repeatable, and they show up in roughly this order across the projects we audit.

Distribution shift. Training data is biased toward daylight, head-on framing, and balanced class counts. Production sees backlight, oblique angles, and class distributions where the rare class is the one that matters.

Occlusion. Detectors trained on COCO-style imagery degrade sharply when the target is partially hidden. Tracking-by-detection breaks first.

Small-object recall. A pedestrian 80 metres away occupies under 20 pixels of height. Recall drops well before the human eye agrees the object is hard.

Throughput at deployed resolution. Benchmarks report mAP at 640 × 640. Production cameras often deliver 4K. Downsampling kills small objects; full-resolution inference kills the latency budget.

Confidence calibration. A model trained on a balanced dataset emits poorly calibrated probabilities on an imbalanced production stream. Thresholds set in the lab no longer behave the same way in the field.

This is an observed pattern across our computer vision engagements, not a benchmarked rate.
The specific magnitude varies by deployment, which is precisely the point: the magnitude is unknown until you measure it against the actual stream. The architectural correction is to budget for production validation before deployment. We unpack the structural mechanism in Why Off-the-Shelf Computer Vision Models Fail in Production.

What real-time actually means in this context

“Real-time” is a slippery requirement. For autonomous driving, the system must produce a decision within the time budget of the next frame, typically 33 ms at 30 fps, sometimes lower. For retail analytics, real-time means “before the customer leaves the aisle,” which is several seconds. The hardware envelope, the model size, and the post-processing pipeline all shift with that definition.

A few patterns we see consistently:

Edge inference on Jetson-class hardware with TensorRT or DeepStream is the default for low-latency deployments. Cloud inference is rarely fast enough once network jitter is counted.

Multi-camera deployments amortise the GPU cost by batching frames across cameras. Throughput per dollar improves; per-frame latency does not.

Power and thermal limits constrain the model long before accuracy does. A model that works on a desk GPU may thermally throttle on a fanless edge device within minutes.

The point is that “the model is fast enough” is not a property of the model alone. It is a property of the model, the runtime, the resolution, the post-processing stack, and the hardware, measured together.

Training data and the cost of getting it wrong

Detector quality tracks data quality more closely than architecture quality. The teams that ship reliable systems are the teams that invested in representative labelling, not the teams that chose the latest backbone.

The failure modes worth budgeting for:

Label noise. Inconsistent bounding-box conventions (tight vs loose, including shadow vs not) propagate directly into the model’s behaviour.

Class imbalance.
Rare classes need oversampling, focal loss, or targeted hard-negative mining. None of these is free.

Environmental coverage. If the training set lacks rainy nights, the model will underperform on rainy nights. Synthetic augmentation closes some of this gap; deployed-data fine-tuning closes more of it.

Drift. Cameras get bumped. Lighting installations change. Class definitions evolve. A detector deployed once and never re-validated decays, not because the model changes but because the world does.

Data drift is the failure mode most teams underestimate. We pull this thread harder in Data Quality Problems That Cause Computer Vision Systems to Degrade After Deployment.

Tracking, segmentation, and what comes after detection

Detection alone is rarely the finished product. The downstream stages matter:

Object tracking. Detections per frame become object identities across frames. SORT, DeepSORT, ByteTrack, and BoT-SORT are the workhorses. The quality of tracking depends entirely on the quality of detection underneath it.

Segmentation. When the answer is a shape, not a box. Mask R-CNN, YOLACT, and SAM-based pipelines dominate, with very different cost profiles.

Action recognition. When the answer is a temporal pattern across frames. Layered on top of tracked detections.

Each downstream stage compounds the error of the stage above it. A detector with a 5% miss rate feeds a tracker whose identity-switch rate is worse than that, and an action recogniser that is worse still. Engineering the pipeline as a single system, not as detection followed by separate downstream steps, is what separates demos from deployments. The modular CV pipeline architecture pattern covers this end-to-end.

What changes next

Two trends are reshaping object detection beyond incremental accuracy gains. The first is the spread of transformer-based detectors that close the latency gap with YOLO while removing the hand-tuned NMS step.
The second is the integration of detection with language models: open-vocabulary detectors like GroundingDINO and OWL-ViT let teams query for novel classes without retraining, which changes the operational cost of a new use case.

The architectures will keep moving. The deployment discipline (production validation, drift monitoring, expected-performance contracts) does not. That is where most projects either succeed or quietly fail.

How TechnoLynx works on this

We build production object-detection systems, not demos. Most engagements start with a Production CV Readiness Assessment: representative data collection, failure-mode characterisation under real environmental conditions, and an expected-performance contract that names false-positive and miss rates by environment. The model architecture is a downstream decision, not the starting point. Contact us if you have a detector that worked in the lab and is misbehaving in the field. That is the conversation we are set up to have.

FAQ

Why do off-the-shelf computer vision models fail in production?

They are trained on distributions that do not match deployment conditions: lighting, occlusion, viewpoint, class balance, and resolution all shift, and confidence thresholds calibrated on benchmark data no longer hold. The failure is structural, not edge-case.

What kinds of edge cases break public detection / classification models in real deployments?

Small objects at deployed resolution, occlusion of the target class, lighting outside the training envelope (night, backlight, wet surfaces), and rare classes that were underrepresented during training. Drift over time compounds all of these.

How do I test a CV model against production data before shipping it?

Collect a representative validation set from the actual deployment cameras and conditions (not the training-time dataset), label it to the same convention as the production task, and measure mAP, miss rate, and false-positive rate per environment.
Treat the result as an expected-performance contract, not a marketing number.

What does it cost to discover an off-the-shelf model is wrong only after deployment?

The direct cost is the false-positive or miss rate that the operational system absorbs. The indirect cost is larger: rework of the data pipeline, re-labelling, re-training, and the operational disruption of running a degraded system while the next version is built.

When is fine-tuning enough versus replacing the model entirely?

Fine-tuning is enough when the deployment distribution differs from the training distribution mainly in surface conditions (lighting, viewpoint, sensor) and the class taxonomy is preserved. Replace the model when the class taxonomy is different, the resolution regime is fundamentally different, or the latency budget rules out the original architecture.

Which object-detection problems are inherent to the model class versus solvable with more data?

Small-object recall and latency-resolution trade-offs are largely architectural. Class imbalance, environmental coverage, and calibration are data problems. Occlusion sits in between: better data helps, but architectural choices (temporal models, tracking-aware detectors) help more.

Image credits: Freepik.