# Production image processing is not benchmark image processing

The gap between research benchmarks and production performance is wider in image processing than in most machine learning domains. ImageNet top-1 accuracy tells you how a model performs on a well-curated, well-balanced, well-labelled dataset. It tells you very little about how it performs on your specific imaging hardware, under your lighting conditions, on your subject population, after six months of production operation.

This article covers the practical engineering decisions for deep learning image processing systems that need to run reliably in production: model architecture selection, training data requirements, augmentation strategy, deployment optimisation, and managing the distribution shift that happens over time. For the broader context of handling unknown inputs in production CV systems, see the unknown object loop in retail CV.

## CNNs vs Vision Transformers: a practical comparison

The two dominant architecture families for image processing tasks are Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). The choice between them is not obvious and depends on training data availability, latency requirements, and task structure.

| Property | CNN | Vision Transformer (ViT) |
| --- | --- | --- |
| Inductive biases | Strong (locality, translation equivariance) | Weak; relies on data to learn structure |
| Training data requirement | Lower; inductive biases help with less data | Higher; needs large datasets to learn spatial relationships |
| Performance at scale | Saturates earlier with data scale | Continues to improve with more data |
| Inference latency | Lower; highly optimised CUDA kernels | Higher; attention is compute-intensive |
| Hardware efficiency | Excellent on GPU and CPU | Excellent on GPU; less efficient on CPU and embedded hardware |
| Transfer learning | Excellent | Excellent when pretrained at scale (DINOv2, SAM) |
| Interpretability | Moderate (CAM, Grad-CAM) | Moderate (attention maps) |
| Small image sizes | Handled well | Patch size must be tuned; poor on very small images |

In our experience, CNNs remain the practical default for production image processing where:

- Training data is limited (under ~100k labelled samples)
- Inference must run on CPU or embedded hardware
- Latency is a hard constraint (under 20ms per image)
- The task is well-defined classification or detection

ViTs are worth evaluating when:

- Large-scale pretraining is available for the domain (medical imaging, satellite imagery, etc.)
- Training data is abundant
- GPU inference is acceptable
- The task requires global context understanding (e.g., anomaly detection across the full image)

Modern efficiency-oriented architectures (EfficientNet, ConvNeXt, MobileNetV3) offer competitive accuracy with deployment-friendly characteristics, and are often the best practical choice when neither a vanilla CNN nor a ViT clearly fits the requirements.

## Training data requirements

Data requirements scale with task complexity and the degree of visual variation in the deployment environment.
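The minimums in the table below assume transfer learning from a pretrained backbone rather than training from scratch. As a minimal sketch of that setup (assuming a recent PyTorch/torchvision stack; the MobileNetV3 backbone, five-class task, and augmentation values are illustrative placeholders, not recommendations):

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

NUM_CLASSES = 5  # hypothetical task; substitute your own class count

# Pretrained CNN backbone (MobileNetV3 chosen here for CPU/embedded-friendly latency).
model = models.mobilenet_v3_large(weights=models.MobileNet_V3_Large_Weights.DEFAULT)

# Replace only the final classifier layer for the new task.
in_features = model.classifier[-1].in_features
model.classifier[-1] = nn.Linear(in_features, NUM_CLASSES)

# Training-time transforms: "generally safe" augmentations only
# (crops/resizing, flips, moderate brightness/contrast jitter);
# see the augmentation section below before adding anything stronger.
train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),
    transforms.RandomHorizontalFlip(),   # drop if orientation is a class cue
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Optionally freeze the backbone for the first epochs when data is scarce.
for param in model.features.parameters():
    param.requires_grad = False

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
```

Freezing the backbone initially, as shown, is a common choice when sample counts sit near the minimums below; unfreezing for end-to-end fine-tuning once the new head has converged is a typical follow-on step.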
Rough minimums for common tasks:

| Task | Minimum training samples | Notes |
| --- | --- | --- |
| Binary classification (two well-separated classes) | 500–2,000 per class | With a pretrained backbone; more needed for complex appearance variation |
| Multi-class classification (5–20 classes) | 1,000–5,000 per class | More classes require more samples for inter-class discrimination |
| Object detection (single object class) | 1,000–3,000 annotated images | With anchor-based detection; more for multi-scale variation |
| Segmentation | 500–2,000 annotated images | Pixel-level annotation is expensive; consider weak supervision |
| Anomaly detection (good-only training) | 200–500 good samples | More robust with 1,000+; scale with visual complexity |

These are minimums with appropriate pretrained backbones and augmentation. Training from scratch requires 5–10× more data. In our experience, most production projects underestimate the data requirement for edge cases and rare classes: performance on common cases looks acceptable early, and edge-case failures only emerge under operational exposure.

## Data augmentation strategy

Augmentation artificially expands training diversity and is one of the highest-leverage investments in training pipeline quality. But augmentation must be domain-appropriate: applying the wrong augmentations degrades rather than improves model performance.

Generally safe augmentations (almost always beneficial):

- Horizontal and vertical flips (unless orientation is semantically meaningful)
- Random crops and resizing
- Brightness and contrast jitter (moderate range)
- Gaussian noise and blur

Domain-specific augmentations (verify they match real variation):

- Rotation: beneficial if the deployment shows rotated objects; harmful if orientation is a class cue
- Colour jitter: appropriate for scenes with variable lighting; inappropriate if colour is a discriminating feature
- Cutout/random erasing: good for detecting partially occluded objects; may hurt if full visibility is required

Augmentations to use carefully:

- Aggressive geometric distortion: can break texture-based features that matter
- Colour inversion or channel shuffle: rarely matches real variation; often hurts
- Synthetic data mixing (CutMix, MixUp): effective for classification; can confuse detection and segmentation models

In production, track the augmentation strategy separately from the model architecture in experiment logs. In most production image processing scenarios, augmentation choices explain more of the performance differences across experiments than architecture choices do.

## Deployment optimisation

A model that runs at 2 seconds per image in a research environment must be optimised for production latency. Standard optimisation steps:

**Quantisation:** converting model weights from FP32 to INT8 reduces model size by 4× and typically increases inference throughput by 2–4× on compatible hardware, with an accuracy loss of 0.5–2% for well-calibrated quantisation. INT8 quantisation requires calibration data (representative input samples) for activation quantisation.

**Model pruning:** removing low-importance weights or channels. Structured pruning (removing entire channels) is hardware-efficient; unstructured pruning requires sparse hardware support. In practice, quantisation before pruning is usually the better path: quantisation gives most of the speed improvement with less risk.

**TensorRT / ONNX Runtime:** converting PyTorch or TensorFlow models to optimised inference runtimes. TensorRT on NVIDIA hardware typically gives a 3–5× throughput improvement over native PyTorch inference for batch sizes of 1–16. A minimal export and parity-check sketch follows below.
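This sketch covers the export-and-parity step that also appears in the checklist further down (assuming PyTorch, torchvision, and the onnxruntime package; the ResNet-18 stand-in, file name, and tolerances are placeholders rather than part of any specific pipeline):

```python
import numpy as np
import torch
import onnxruntime as ort
from torchvision import models

# Stand-in for your trained classifier; substitute the real model and weights.
model = models.resnet18(weights=None)
model.eval()

dummy_input = torch.randn(1, 3, 224, 224)

# Export to ONNX with a dynamic batch dimension.
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",                      # placeholder path
    input_names=["input"],
    output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}, "logits": {0: "batch"}},
    opset_version=13,
)

# Parity check: ONNX Runtime outputs should match PyTorch within tolerance.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

with torch.no_grad():
    torch_out = model(dummy_input).numpy()
ort_out = session.run(None, {"input": dummy_input.numpy()})[0]

# Small numerical differences are expected; large ones indicate an export problem.
np.testing.assert_allclose(torch_out, ort_out, rtol=1e-3, atol=1e-4)
print("ONNX export parity check passed")
```

The exported file is what ONNX Runtime or TensorRT then consumes; as the checklist below notes, throughput should be benchmarked on the target hardware rather than on the development machine.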
**Model distillation:** training a smaller student model to match a larger teacher model's output distribution. Produces smaller models that approach the accuracy of larger ones. Useful when the production hardware cannot run the full model at the required throughput.

## Deployment optimisation decision checklist

- Latency requirement defined (milliseconds per image or images per second)
- Target hardware specified (GPU model, embedded accelerator, CPU)
- Baseline inference time measured on target hardware before optimisation
- INT8 quantisation accuracy validated on a held-out test set
- ONNX export tested and validated (outputs match PyTorch)
- TensorRT/ONNX Runtime throughput benchmarked on target hardware
- Model size fits within the memory budget of the target device

## Handling distribution shift in production

Distribution shift is the most insidious production failure mode: model accuracy degrades gradually as the input distribution drifts away from the training distribution, but the degradation is not obvious without active monitoring.

Common sources of distribution shift:

- Camera hardware changes: a different camera model, lens, or positioning changes image statistics
- Lighting changes: seasonal variation in natural light, replacement of lighting fixtures, changes in scene illumination
- Subject population changes: new product variants, new demographics, new defect types not seen in training
- Process changes: changes in the manufacturing process, retail layout, or operational workflow that change what the camera sees

Detection and response:

- Monitor confidence score distributions over time: a drop in average confidence without a corresponding change in labelled accuracy is an early warning sign (a minimal monitoring sketch follows this list)
- Monitor prediction class distributions: a shift toward edge classes or unusual class imbalance may indicate input distribution change
- Implement periodic validation against a fixed held-out test set, not just production performance metrics
- When drift is detected, collect and label new samples from the current input distribution before retraining

In our experience, teams that build monitoring into the deployment from day one detect drift early and respond with targeted retraining. Teams that deploy without monitoring discover drift only after users report degraded system performance, typically months after the drift began.
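A minimal sketch of the confidence-distribution check described above, assuming per-inference confidences are already being logged; the two-sample Kolmogorov–Smirnov test, window sizes, and thresholds are illustrative choices, not a prescribed method:

```python
import numpy as np
from scipy.stats import ks_2samp

def check_confidence_drift(reference_conf, production_conf,
                           p_threshold=0.01, mean_drop_threshold=0.05):
    """Compare recent production confidence scores against a reference window.

    reference_conf: confidences logged during a known-good period after deployment
    production_conf: confidences from the most recent monitoring window
    Returns drift signals; the alerting and retraining policy is up to the caller.
    """
    reference_conf = np.asarray(reference_conf, dtype=float)
    production_conf = np.asarray(production_conf, dtype=float)

    # Two-sample KS test: has the shape of the confidence distribution changed?
    statistic, p_value = ks_2samp(reference_conf, production_conf)

    # Simple early-warning signal: average confidence dropping noticeably.
    mean_drop = reference_conf.mean() - production_conf.mean()

    return {
        "ks_statistic": float(statistic),
        "ks_p_value": float(p_value),
        "mean_confidence_drop": float(mean_drop),
        "distribution_shift_suspected": p_value < p_threshold,
        "confidence_drop_suspected": mean_drop > mean_drop_threshold,
    }

# Example usage with synthetic numbers (illustrative only):
rng = np.random.default_rng(0)
reference = rng.beta(8, 2, size=5000)    # stand-in for post-deployment confidences
production = rng.beta(6, 3, size=5000)   # stand-in for a recent production window
print(check_confidence_drift(reference, production))
```

The same pattern extends to the prediction class distributions mentioned above, with a goodness-of-fit test over class counts playing the role of the KS test.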