Object detection sits in a strange spot in 2025. The benchmark numbers look excellent: single-stage detectors clear 50+ mAP on COCO, DETR-family models edge higher, and open-vocabulary systems can label things they were never explicitly trained on. And the production failure modes have barely moved. Small objects still degrade sharply. Occlusion still confuses confident models. Domain shift between the cameras a model was trained on and the cameras it gets deployed on still erodes accuracy in ways teams discover after launch, not before.

This guide walks through what object detection actually is in 2025: the model families worth knowing, where each one earns its place, how much labelled data you really need, and, most importantly, where the engineering work lives between a strong benchmark score and a system that holds up under real conditions. For the deeper failure-class argument, see why off-the-shelf computer vision models fail in production; this piece is the practitioner-grade map of the landscape.

What object detection actually does

Object detection identifies and locates instances of object classes in an image or video. Unlike classification, which assigns a single label to an entire image, detection predicts a bounding box (or a segmentation mask, in the instance-segmentation variant) around each object plus a class label and a confidence score.

That sounds simple. The hard parts sit underneath:

- Localisation accuracy: how tightly the predicted box hugs the object.
- Class accuracy: how reliably the class is correctly named, including the long tail.
- Recall: whether the model finds the object at all under occlusion, scale, and lighting variation.
- Latency: how many frames per second the system sustains on the target hardware.

A model that scores 55 mAP on COCO can still miss half the small parts in a high-resolution manufacturing inspection feed. That gap is where most of the production work happens.

How does an object detector work?
Modern detectors share a common skeleton: a backbone (typically a CNN like ConvNeXt-V2 or a vision transformer like DINOv2 or EVA-02) extracts hierarchical feature maps; a neck (FPN, PANet, or transformer cross-attention) fuses features across scales; and a head predicts boxes and classes. The differences between detector families are in how that head works.

- One-stage detectors predict boxes and classes in a single forward pass directly from the feature maps. YOLO11, YOLOv12, YOLO-NAS, SSD, and RetinaNet sit here. They are fast and architecturally simple.
- Two-stage detectors first propose candidate regions and then classify and refine each one. Faster R-CNN, Mask R-CNN, and Cascade R-CNN are the canonical examples. Slower, historically more accurate on small objects and crowded scenes.
- DETR-family detectors treat detection as a set-prediction problem solved end-to-end with attention, with no anchor boxes and no proposal stage. DINO-DETR and Co-DETR sit at or near the top of closed-set benchmarks, and RT-DETR brings the approach to real-time latency. ViTDet is usually grouped with them as a transformer-based detector, although it pairs a plain ViT backbone with Mask R-CNN-style heads rather than set prediction.
- Open-vocabulary detectors accept a text prompt and detect categories they were never explicitly trained to recognise. Grounding DINO, OWL-ViT, OWLv2, and YOLO-World are the working examples in 2025.
- Promptable segmentation models like SAM2 and EfficientSAM pair with a detector to produce pixel-level masks from box or point prompts, useful for the long tail where labelled detection data is thin.

A common production pattern pairs a real-time closed-set detector with an open-vocabulary fallback for rare classes: the closed-set detector handles the volume, and the open-vocabulary model handles the long tail without retraining.

Comparing the main detector families

The choice between families is rarely about peak mAP. It is about latency budget, dataset size, and how open the class set needs to be.
| Family | Examples | Best for | Trade-off |
| --- | --- | --- | --- |
| One-stage CNN | YOLO11, YOLOv12, YOLO-NAS, SSD | Real-time edge / GPU, closed-set | Lower small-object recall historically |
| Two-stage CNN | Faster R-CNN, Mask R-CNN | Small objects, crowded scenes, masks | 2–5× slower than one-stage |
| DETR-family | DINO-DETR, Co-DETR, RT-DETR, ViTDet | Best closed-set accuracy, end-to-end training | Heavier, longer training schedules |
| Open-vocabulary | Grounding DINO, OWL-ViT, OWLv2, YOLO-World | Long-tail classes, no label set | Slower; accuracy depends on prompt quality |
| Promptable segmentation | SAM2, EfficientSAM | Mask generation from box / point prompts | Needs an upstream detector or prompt source |

The real research progress in recent years has come from DETR-family and open-vocabulary work. The real progress in deployment has come from being honest about which family fits the latency budget at the chosen resolution.

How much training data do you need?

This is the question every CV team asks too late. The answer in 2025 is roughly the same shape as it was in 2022, just with stronger pretrained backbones:

- Fine-tune from a pretrained backbone (the usual approach): typically 200 to 2,000 labelled instances per class for a useful initial model. The lower end works for visually distinctive classes (a forklift on a warehouse floor). The higher end is needed for fine-grained or low-contrast targets (a hairline crack on a casting, two species of pest that differ only in wing pattern).
- Pure from-scratch training is rarely justified. Strong pretrained backbones (DINOv2, EVA-02, ConvNeXt-V2) exist for almost every imaging modality, including some medical and satellite domains.
- Synthetic data can fill specific gaps (rare failure cases, unsafe-to-photograph scenarios, or class-balanced augmentation), but it does not substitute for representative real data from the deployment environment.

The label count is the headline number, but the harder question is which 1,000 examples.
A balanced set spanning the lighting, scale, occlusion, and camera variations of the deployment environment will outperform a 10,000-image set drawn from one corner of the operating range.

Where object detection still fails in production

This is the part most guides skip. Four failure modes are persistent enough that we plan for them on every engagement:

- Small objects in high-resolution scenes. Most detectors degrade sharply below roughly 32 pixels on a side. On a 4K feed with objects of interest that occupy 20 px, the standard pipeline misses them silently. Tiling the input, training on tiled crops, and explicit small-object augmentation (Mosaic, MixUp variants) are the engineering responses.
- Class confusion in fine-grained domains. Similar species, similar machine parts, similar product SKUs. The backbone’s features were not trained to separate them. Hard-negative mining and embedding-space contrastive losses are where the work goes.
- Heavy occlusion and crowding. Downtown traffic at rush hour, retail shelves, a busy assembly line. NMS-based detectors merge or drop crowded boxes. DETR-family models handle this better in principle; in practice, the gain is real but bounded.
- Domain shift from training distribution to deployment cameras. Different sensor noise profile, different lens, different white balance, different weather, different lighting. The model that scored 0.92 mAP on the curated validation set scores 0.71 on the first month of production frames. This is the failure mode that surprises teams most reliably.

Each is addressable with engineering work: tiling, hard-negative mining, domain-aware augmentation, fine-tuning on deployment data, expected-performance contracts rather than benchmark claims. None of it is free, and none of it shows up in a model card.

A decision rubric for picking a detector

Before downloading the latest YOLO release, work through these in order:

- Latency budget. Frames per second on the target hardware at the target resolution.
If you need 60 FPS on a Jetson at 1080p, you are in YOLO11 / RT-DETR territory, not DINO-DETR.
- Class openness. Is the class set fixed, or does it grow over time? Fixed → closed-set detector. Growing → open-vocabulary detector or hybrid.
- Small-object density. What is the smallest object you must detect, in pixels at deployment resolution? Below ~32 px, plan for tiling.
- Mask requirement. Do downstream consumers need pixel masks or just boxes? Masks → Mask R-CNN, Cascade Mask R-CNN, or a detector + SAM2 pipeline.
- Available labelled data. Less than ~200 per class → open-vocabulary or few-shot route. 200 to 2,000 → fine-tune. More → consider end-to-end training of a DETR-family model.
- Production validation plan. What does the test set look like? Does it span the lighting, weather, occlusion, and camera variations you will actually see? If not, the headline mAP is decoration.

Improving accuracy: what actually moves the needle

The list of techniques to improve object detection is long. The list that reliably moves accuracy in production is short:

- Domain-matched data. A few thousand labelled frames from the actual deployment cameras outperform any architecture change.
- Targeted augmentation. Match the augmentation distribution to the failure mode: rotation and scale for orientation variance, colour jitter and CLAHE for lighting, Mosaic for small objects, occlusion masks for crowded scenes.
- Hard-negative mining. Identify the false positives the model makes most confidently and resample or upweight them. This routinely yields the biggest single accuracy bump after the first round of training.
- Transfer learning from a domain-relevant backbone. ImageNet-pretrained weights are not always the best starting point; a backbone pretrained on satellite, medical, or industrial imagery (where available) transfers better.
- Test-time augmentation. Multi-scale or multi-crop inference for the cases where latency is not the binding constraint.
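To make the hard-negative mining step above concrete, here is a minimal sketch in plain Python. It assumes predictions arrive as `(box, label, score)` tuples and ground truth as `(box, label)` tuples for a single image; the function names, the 0.5 IoU threshold, the 0.7 score threshold, and the sample boxes are all illustrative assumptions, not part of any particular detection library.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def confident_false_positives(
    predictions: List[Tuple[Box, str, float]],
    ground_truth: List[Tuple[Box, str]],
    iou_thr: float = 0.5,    # assumed matching threshold
    score_thr: float = 0.7,  # assumed cut-off for "confident"
) -> List[Tuple[Box, str, float]]:
    """Predictions above score_thr that overlap no same-class ground-truth box.

    Simplification: duplicate detections of one object all count as matched,
    unlike a full COCO-style evaluation.
    """
    hard = []
    for box, label, score in predictions:
        if score < score_thr:
            continue
        matched = any(
            gt_label == label and iou(box, gt_box) >= iou_thr
            for gt_box, gt_label in ground_truth
        )
        if not matched:
            hard.append((box, label, score))
    # Most confident mistakes first: these are the resampling candidates.
    return sorted(hard, key=lambda p: p[2], reverse=True)


# Hypothetical single-image example.
ground_truth = [((10, 10, 50, 50), "forklift")]
predictions = [
    ((12, 12, 48, 48), "forklift", 0.95),      # overlaps the ground truth
    ((200, 200, 240, 240), "forklift", 0.90),  # nothing there
    ((10, 10, 50, 50), "pallet", 0.80),        # right place, wrong class
    ((300, 300, 320, 320), "forklift", 0.30),  # below the score cut-off
]
hard = confident_false_positives(predictions, ground_truth)
for box, label, score in hard:
    print(label, score, box)
```

The returned list is what you would resample or upweight in the next training round; in practice it would be accumulated over the whole validation or production sample rather than one image.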
Architectural upgrades (moving from YOLOv8 to YOLO11, or from Faster R-CNN to DINO-DETR) usually give a smaller delta than fixing the data pipeline.

Real-time applications and where the constraints bite

Object detection in 2025 runs in a wide range of settings, and the binding constraint differs in each:

- Traffic monitoring and smart cities. Latency and small-object recall at a distance. RT-DETR or YOLO11 with input tiling, deployed on roadside GPUs or edge boxes.
- Industrial quality inspection. Fine-grained class accuracy on small defects. Two-stage detectors or DETR-family models, with heavy domain-specific augmentation. See our note on computer vision for quality control in manufacturing.
- Autonomous driving and robotics. Multi-class detection at high frame rate under variable lighting and weather. One-stage detectors plus sensor fusion; the detector alone is never the whole system.
- Medical imaging. High accuracy on small, low-contrast targets, with strict false-positive budgets. Mask R-CNN family or DETR-family with segmentation heads, validated on patient-distribution-matched data.
- Retail and inventory. Crowded shelves, fine-grained SKUs, open-vocabulary requirements when product catalogues change weekly. Hybrid stacks with YOLO-World or Grounding DINO for the long tail.

In every one of these, the model is the easy part. The production validation, monitoring, and update loop is the hard part.

Ethical considerations worth naming

Detection systems trained and deployed at scale touch privacy, bias, and accountability concerns that should be part of the engineering plan, not a footnote.

- Privacy. Camera-based detection in public or semi-public spaces requires explicit data-handling policy: what is retained, what is anonymised, who has access. Treat this as a system requirement, not a compliance afterthought.
- Bias. Training distributions that under-represent demographic groups, equipment variants, or environmental conditions produce models that fail unevenly.
Stratified validation, with accuracy reported per subgroup rather than as a single number, is the operational fix.
- Transparency. When a detector drives a downstream action (a safety stop, an alert, a routing decision), the operating envelope should be documented: what the model handles well, where it degrades, and what triggers human review.

These are engineering concerns. They live in the same backlog as latency and accuracy.

Where the field is heading

Three threads worth tracking in 2025 and into 2026:

- Open-vocabulary detection maturing into the default. Grounding DINO, OWLv2, and YOLO-World are good enough for production long-tail work today. The next step is open-vocabulary models with the latency profile of YOLO11.
- End-to-end DETR-family models pushing into edge. RT-DETR is the bridge: DETR accuracy at real-time latency. Expect more variants targeting edge hardware.
- Promptable segmentation as a building block. SAM2 and EfficientSAM make pixel-level annotation cheap. The downstream effect is more domain-specific detectors fine-tuned on smaller, cleaner datasets.

The 3D detection, multi-modal, and self-supervised threads continue to advance, but the practical impact on most production CV systems in 2025 is still incremental.

For the broader failure-class argument and what a production-readiness assessment actually checks, see why off-the-shelf computer vision models fail in production. For the practice-area context across our engagements, our Computer Vision R&D page is the entry point.

FAQ

Why do off-the-shelf computer vision models fail in production?

Benchmarks measure accuracy on curated, balanced data under controlled conditions. Production runs against the cameras, lighting, occlusion, and class distribution of a specific environment, none of which match the benchmark.
The failure is structural: small-object degradation below ~32 px, class confusion in fine-grained domains, occlusion handling that NMS-based detectors struggle with, and domain shift between the training distribution and the deployment cameras. A model that scores 0.92 mAP on its validation set commonly scores 0.65 to 0.75 on the first weeks of production frames.

What kinds of edge cases break public detection / classification models in real deployments?

Four recur on almost every engagement: small objects at high resolution, fine-grained class confusion, heavy occlusion and crowding, and sensor or environmental domain shift. Less common but still material: motion blur on fast-moving subjects, severe lens distortion at wide field-of-view, and reflective or transparent surfaces that violate the backbone’s implicit assumptions about object opacity.

How do I test a CV model against production data before shipping it?

Build a stratified validation set drawn from the actual deployment environment: same cameras, same time-of-day distribution, same weather, same occlusion patterns. Report accuracy per stratum (lighting condition, object scale band, occlusion level), not as a single mAP number. Measure latency and throughput on the target hardware, not a lab GPU. The goal is to know the failure envelope before launch, not to discover it from incident reports.

What is the difference between one-stage and two-stage detectors?

One-stage detectors (YOLO family, SSD, RetinaNet) predict boxes and classes in a single forward pass: fast, simpler, historically slightly lower accuracy on small objects or crowded scenes. Two-stage detectors (Faster R-CNN, Mask R-CNN, Cascade R-CNN) first propose candidate regions and then classify and refine them: slower, historically higher accuracy on small objects. Modern DETR-family models blur the distinction: end-to-end with attention, no anchor or proposal stage, and competitive on both speed and accuracy.
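The crowding failure attributed to NMS-based detectors earlier can be reproduced in a few lines. The sketch below is a plain-Python greedy non-maximum suppression, written for illustration rather than taken from any library; the `(box, score)` tuple layout, the thresholds, and the box coordinates are assumptions chosen so that two genuinely distinct but overlapping objects collide.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
Detection = Tuple[Box, float]            # (box, confidence score)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(detections: List[Detection], iou_thr: float = 0.5) -> List[Detection]:
    """Keep the top-scoring box, drop every box whose IoU with it exceeds iou_thr, repeat."""
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [d for d in remaining if iou(best[0], d[0]) <= iou_thr]
    return kept


# Hypothetical scene: two people standing shoulder to shoulder, one further away.
crowd = [
    ((0, 0, 100, 200), 0.90),    # person A
    ((30, 0, 130, 200), 0.85),   # person B, overlapping A (IoU about 0.54)
    ((300, 0, 400, 200), 0.80),  # person C, well separated
]

kept_default = greedy_nms(crowd, iou_thr=0.5)
kept_relaxed = greedy_nms(crowd, iou_thr=0.6)
print(len(kept_default), "boxes kept at IoU threshold 0.5")
print(len(kept_relaxed), "boxes kept at IoU threshold 0.6")
```

At the 0.5 threshold person B is suppressed even though the detection was correct; raising the threshold recovers B but lets more true duplicates through elsewhere. That trade-off is the reason crowded scenes are hard for NMS-based pipelines regardless of how good the underlying predictions are.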
When is fine-tuning enough versus replacing the model entirely?

Fine-tuning is enough when the failure is data-distribution-driven and the model class can represent the target task. If the deployment cameras differ from training, or the class set has shifted modestly, fine-tune on representative data. Replace the model when the failure is structural: the architecture cannot resolve objects at the required scale, the class set has grown unbounded (open-vocabulary territory), or the latency profile is wrong for the target hardware. The honest test is whether a labelled set from the deployment environment, used for fine-tuning, closes the gap to the required accuracy band.

Which object-detection problems are inherent to the model class versus solvable with more data?

More data solves: class confusion in well-represented domains, domain shift to known camera variants, accuracy on object scales seen during training. More data does not solve: small-object detection below the backbone’s effective resolution (an architectural limit, requires tiling or a different backbone), open-vocabulary requirements (requires a model trained for it), and latency at a given resolution on given hardware (requires a smaller or faster architecture). The split matters because the engineering response is different: labelling effort versus architecture change.

Image credits: Freepik.