Deep Learning for Computer Vision: Architectures, Training, and What Still Matters from Classical CV

Deep learning is the reason computer vision became practical at industrial scale. Before 2012, every new vision task meant a new feature pipeline. After AlexNet, the dominant pattern flipped: collect data, pick an architecture, train, deploy. More than a decade later the recipe has matured, but the trade-offs are sharper than the marketing suggests. This article covers what actually works in production, what to learn first, and where classical computer vision still beats a neural network.

The framing question matters because it sits one level below the build-vs-buy decision: even teams that have decided to build still have to choose between a deep-learning approach and a classical pipeline. Our companion piece on when to build a custom computer vision model versus use an off-the-shelf solution covers the upstream decision; this one assumes you have already decided the task warrants custom development and asks what the model should actually look like.

Why Deep Learning Took Over

Three things lined up at once:

Convolutional neural networks could learn visual features end-to-end instead of relying on hand-designed ones.
GPUs made training those networks economically viable. Training at anything close to modern scale would be impractically slow on CPUs alone.
Large labelled datasets like ImageNet gave the field a common benchmark, which let progress compound.

The result was a step-change in accuracy on classification, detection, and segmentation tasks. Within a few years, the question stopped being “can a network learn this?” and became “can we collect enough data to train one cheaply?” That second question is the one that still drives most project economics today, and it is the question off-the-shelf vendors answer well right up until the production environment diverges from their training distribution.

The Architectures That Earn Their Cost

There are hundreds of published architectures. A working practitioner needs to know maybe ten. The ones that show up most in deployed systems:

Convolutional Networks

CNNs are still the default for many tasks. The families worth knowing:

ResNet. The skip-connection trick that unlocked very deep networks. Still a strong baseline.
EfficientNet. Optimised for the accuracy-per-FLOP curve. Common on edge hardware.
ConvNeXt. A modern CNN that competes with transformers on accuracy while keeping convolutional efficiency.

Vision Transformers

ViTs treat an image as a sequence of patches and apply self-attention. They scale better on very large datasets and have become the backbone for foundation models such as CLIP, DINO, and SAM. They typically need more training data and compute than a comparable CNN, but they unlock capabilities that CNNs do not.
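
As a rough illustration of the "sequence of patches" idea, the sketch below splits an image into non-overlapping square patches and flattens each one into a token vector. It uses plain nested lists so it runs without dependencies. The function name patchify and the 16-pixel patch size are illustrative assumptions, not part of any particular library, and the learned linear projection and position embeddings that a real ViT applies next are omitted.

```python
def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patches.

    Returns a list of tokens in row-major patch order. Each token is a flat
    list of patch * patch * C values. H and W must be divisible by `patch`.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            token = []
            for row in range(top, top + patch):
                for col in range(left, left + patch):
                    token.extend(image[row][col])  # append all channels
            tokens.append(token)
    return tokens


# A dummy 224 x 224 RGB image filled with zeros.
image = [[[0, 0, 0] for _ in range(224)] for _ in range(224)]
tokens = patchify(image, patch=16)
print(len(tokens), len(tokens[0]))  # 196 tokens of 768 values each
```

Self-attention compares every token with every other token, so its cost grows quadratically with the token count. That is one reason patch size and input resolution matter so much for ViT inference budgets.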

Object Detectors

YOLO (v5, v8, v11), DETR, and RT-DETR are the practical choices for “find and locate.” YOLO dominates real-time edge deployments. DETR-style models are catching up and are easier to extend with additional output heads.

Segmentation Models

U-Net for medical and scientific imaging, DeepLab for general semantic segmentation, Mask R-CNN for instance segmentation, SAM for zero-shot prompt-driven segmentation. Each has a clear sweet spot.

Foundation Models

CLIP, DINO, SAM, and their successors changed the workflow. Instead of training a model from scratch, the pattern now is: take a pre-trained foundation model, freeze most of it, and fine-tune a small head for your task. In our experience across deployed CV engagements, this typically reduces required labelled data by roughly an order of magnitude or two compared with from-scratch training — an observed-pattern range, not a benchmarked rate, since the actual factor depends on how close the foundation model’s pretraining distribution is to your domain.
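
The "freeze most of it, fine-tune a small head" pattern can be sketched without any deep-learning framework. In the toy example below, frozen_backbone is a hypothetical stand-in for a pre-trained encoder (in practice a CLIP or DINO image encoder), and the head is a logistic-regression classifier trained with plain SGD. The data, function names, and hyperparameters are all assumptions chosen so the example runs on the standard library alone.

```python
import math
import random


def frozen_backbone(x):
    """Hypothetical stand-in for a pre-trained encoder. It is never updated.

    A real backbone would return an image embedding. Here it is a fixed
    function mapping 2 inputs to 3 features.
    """
    a, b = x
    return [a + b, a - b, a * b]


def train_head(features, labels, epochs=200, lr=0.5):
    """Fit a logistic-regression head on precomputed features with plain SGD."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for feat, label in zip(features, labels):
            logit = sum(w * f for w, f in zip(weights, feat)) + bias
            prob = 1.0 / (1.0 + math.exp(-logit))
            error = prob - label  # gradient of the log loss w.r.t. the logit
            weights = [w - lr * error * f for w, f in zip(weights, feat)]
            bias -= lr * error
    return weights, bias


def predict(weights, bias, feat):
    """Predict class 1 when the head's logit is positive."""
    return int(sum(w * f for w, f in zip(weights, feat)) + bias > 0)


# Toy data: the class is 1 when a + b > 1, which the backbone exposes directly.
rng = random.Random(0)
inputs = [(rng.random(), rng.random()) for _ in range(40)]
labels = [int(a + b > 1.0) for a, b in inputs]

# Embeddings are computed once; only the head's four parameters are trained.
features = [frozen_backbone(x) for x in inputs]
weights, bias = train_head(features, labels)
accuracy = sum(predict(weights, bias, f) == y for f, y in zip(features, labels)) / len(labels)
print(f"train accuracy: {accuracy:.2f}")
```

The point of the design is that the expensive part (the backbone) is run once per image and cached, so each retraining cycle only touches a handful of parameters.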

How Does Training Actually Work on Real Data?

Tutorials show clean datasets and steady loss curves. Real projects do not. The training loop in production looks more like this:

1. Collect raw data from the target environment. Cameras, lighting, distance, and angles must match deployment.
2. Label a first batch carefully. Have two annotators label the same sample of frames and measure their agreement. Rewrite the label spec until agreement is above 90%.
3. Fine-tune a foundation model as a starting point. Resist the urge to train from scratch.
4. Look at the failures. Run inference on a held-out set and visually inspect the worst predictions. Most insight comes from this step.
5. Targeted data collection. The errors tell you what data is missing. Collect or synthesise more of that.
6. Repeat. Three or four cycles usually beat any clever architecture change.
7. Calibrate the threshold for your task. The default 0.5 confidence cutoff is almost never right.
8. Lock the model and write the eval harness before deployment, not after.
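
The annotator-agreement check in the labelling step can be measured in more than one way, and the list above does not say which. The sketch below computes raw percent agreement alongside Cohen's kappa, which corrects for agreement expected by chance. Treating kappa as a second check is an assumption on our part, and the labels are hypothetical.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which the two annotators gave the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: agreement corrected for what chance alone would give."""
    n = len(a)
    observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from two annotators on the same ten frames.
annotator_1 = ["ok", "ok", "ok", "ok", "ok", "ok", "ok", "ok", "defect", "defect"]
annotator_2 = ["ok", "ok", "ok", "ok", "ok", "ok", "ok", "defect", "defect", "ok"]

print(percent_agreement(annotator_1, annotator_2))
print(cohens_kappa(annotator_1, annotator_2))
```

In this example the annotators agree on 80% of frames, yet kappa is only about 0.375, because most frames are "ok" and agreeing on the majority class is easy. On imbalanced label sets, raw agreement alone can make a weak label spec look healthy.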

Most of the engineering work sits in steps 4–7, not in the model definition. This is also the part that determines whether off-the-shelf survives contact with the deployment environment: a vendor model that was excellent on its own benchmark can fall apart on your held-out set in step 4, and the only honest response is to start the loop over with data you control.
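
Threshold calibration is one of the cheaper steps to make concrete. The sketch below picks the lowest confidence threshold that still meets a target precision on held-out data. The selection rule, the function names, and the scores are illustrative assumptions; a real project might optimise F1 or a cost-weighted metric instead.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = sum(s >= threshold and y == 1 for s, y in zip(scores, labels))
    fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
    fn = sum(s < threshold and y == 1 for s, y in zip(scores, labels))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def lowest_threshold_for_precision(scores, labels, target_precision):
    """Return the lowest candidate threshold meeting the precision target.

    Lower thresholds keep more detections, so the lowest one that still meets
    the target is a simple way to protect recall. Returns None if no
    candidate qualifies.
    """
    for threshold in sorted(set(scores)):
        precision, _ = precision_recall(scores, labels, threshold)
        if precision >= target_precision:
            return threshold
    return None


# Hypothetical held-out confidences and ground truth (1 = real detection).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

chosen = lowest_threshold_for_precision(scores, labels, target_precision=0.75)
print(chosen, precision_recall(scores, labels, chosen))
print(precision_recall(scores, labels, 0.5))
```

On this toy data the chosen threshold is 0.6, giving precision 0.8 at full recall, while the default 0.5 cutoff lets in an extra false positive and drops precision to about 0.67.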

Where Classical Computer Vision Still Wins

Deep learning is not always the right tool. Classical methods — edge detection, template matching, contour analysis, geometric transforms — beat neural networks when:

The task is geometric, not perceptual. Measuring the angle of a known part on a fixture does not need a CNN.
The dataset is tiny. With twenty examples, a Hough transform or SIFT-based matcher will usually outperform a poorly trained network.
Latency or power is the binding constraint. A few OpenCV operations run faster than even a quantised network on the smallest devices.
Explainability matters. A classical pipeline can be inspected step by step. A neural network is a black box even when it works.
The conditions are tightly controlled. Fixed lighting, fixed camera, fixed background — exactly the conditions where classical methods were always strongest.

A good practitioner knows when to skip the network entirely, and a good vendor evaluation respects this. If your problem genuinely lives in one of these regimes, paying for a deep-learning vendor stack is paying for capability you will not use — and inheriting maintenance burden you do not need.
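
To make the "geometric, not perceptual" case concrete, here is a minimal sketch of measuring a part's orientation from a binary mask using second-order central image moments, a standard classical technique. It assumes the part has already been segmented by simple thresholding, which is realistic only under the controlled conditions described above. The function name and the toy masks are illustrative.

```python
import math


def orientation_degrees(mask):
    """Principal-axis angle of a binary mask from second-order central moments.

    `mask` is a list of rows of 0/1 values. The angle is measured from the
    image x-axis towards the y-axis (rows increase downwards).
    """
    points = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not points:
        raise ValueError("mask contains no foreground pixels")
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    mu20 = sum((x - mean_x) ** 2 for x, _ in points)
    mu02 = sum((y - mean_y) ** 2 for _, y in points)
    mu11 = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return math.degrees(0.5 * math.atan2(2 * mu11, mu20 - mu02))


# A thin diagonal bar (pixel (i, i) set for i = 0..9) and a horizontal bar.
diagonal = [[1 if x == y else 0 for x in range(10)] for y in range(10)]
horizontal = [[1] * 10 if y == 4 else [0] * 10 for y in range(10)]

print(orientation_degrees(diagonal))    # about 45 degrees
print(orientation_degrees(horizontal))  # about 0 degrees
```

There is nothing to train and every intermediate value can be inspected, which is exactly the explainability and small-data advantage listed above.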

Decision: Custom Deep Learning vs Off-the-Shelf vs Classical

The three options have different cost shapes, different time-to-value, and different failure modes. The table below summarises how we usually frame the trade-off when scoping a new CV project. Every row is an observed pattern: these are planning heuristics from deployed engagements, not vendor-published benchmarks.

Criterion	Off-the-shelf vendor model	Custom deep learning	Classical CV pipeline
Time to first working result	Days	2–4 months	1–4 weeks
Sensitivity to domain shift	High — fails when production diverges from vendor training distribution	Low to medium — you control the training data	Low — behaviour is geometrically specified
Labelled data required	None up front; degrades silently when domain shifts	Hundreds to thousands of task-specific labels for fine-tuning a foundation model	Effectively none; tuned on a handful of examples
Explainability	Low — black box behind an API	Low — black box you own	High — every step is inspectable
Maintenance ownership	Vendor controls model updates and deprecations	You own the loop end-to-end	You own a small, stable codebase
Best fit	Generic perceptual tasks in benign conditions	Domain-specific perception with available data	Geometric tasks in controlled conditions

The decision is rarely “which is best in general.” It is “which matches the gap between training conditions and production conditions for this deployment.” When the gap is small and the task is generic, off-the-shelf wins on time-to-value. When the gap is large or the data is proprietary, custom deep learning earns its cost. When the task is geometric and the scene is controlled, classical methods quietly outlast both.

Hardware and Deployment Realities

Training and inference have different hardware profiles. Training is throughput-bound and lives in the cloud on big GPUs. Inference is latency-bound and increasingly lives at the edge. The practical knobs:

Quantisation. FP16 or INT8 quantisation typically cuts inference cost 2–4× with minor accuracy loss — an observed pattern across deployed edge engagements, not a guaranteed result for every model. Worth the engineering investment for any high-volume deployment.
Pruning and distillation. Train a big model, distil it into a small one. Common pattern for shipping a 100MB model derived from a multi-GB teacher.
Hardware-aware training. Models trained with the target hardware in mind (Jetson, Coral, Hailo, mobile NPUs) consistently outperform generic models retargeted late.

Our GPU page goes into the training-side hardware in more depth.
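
For readers who have not seen INT8 quantisation spelled out, the sketch below shows one common scheme, asymmetric (affine) quantisation to unsigned 8-bit integers with a scale and a zero point. It is a simplified illustration under stated assumptions: the weights are hypothetical, the range is taken from the values themselves, and production toolchains add calibration data, per-channel scales, and quantisation-aware training on top.

```python
def quantisation_params(values, num_bits=8):
    """Scale and zero point for asymmetric (affine) unsigned quantisation."""
    qmax = 2 ** num_bits - 1
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)  # range must include 0
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = round(-lo / scale)
    return scale, zero_point


def quantise(values, scale, zero_point, num_bits=8):
    """Map floats to integers in [0, 2**num_bits - 1], clipping at the ends."""
    qmax = 2 ** num_bits - 1
    return [min(qmax, max(0, round(v / scale) + zero_point)) for v in values]


def dequantise(q_values, scale, zero_point):
    """Map quantised integers back to approximate float values."""
    return [(q - zero_point) * scale for q in q_values]


# Hypothetical FP32 weights from one layer.
weights = [-1.0, -0.25, 0.0, 0.1, 0.5, 1.55]
scale, zero_point = quantisation_params(weights)
q = quantise(weights, scale, zero_point)
restored = dequantise(q, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q)
print(f"scale={scale:.4f} zero_point={zero_point} max_error={max_error:.6f}")
```

Each weight is stored in one byte instead of the four bytes FP32 needs, and the round-trip error in this example stays within roughly half a quantisation step. Whether that error is acceptable for a given model is an empirical question that the eval harness has to answer.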

What to Learn First

If you are new to deep learning for vision and want a path that compounds:

1. Train a CNN classifier on CIFAR-10 from scratch. Understand every line.
2. Fine-tune a pre-trained ResNet on a custom dataset of your own.
3. Train a YOLO detector on a small custom set. Learn how the label format and box assignment work (YOLOv5 is anchor-based; YOLOv8 and YOLO11 are anchor-free).
4. Use SAM or CLIP for a zero-shot task without training. Understand what foundation models give you.
5. Deploy something to a Jetson or Coral. Latency, memory, and packaging will teach more than another paper.

Steps 4 and 5 are where most curricula stop short, and they are the ones that matter for shipping work.

What TechnoLynx Does in This Space

We build deep-learning vision systems for real products — defect detection on production lines, surveillance analytics, autonomous-vehicle perception. We also know when not to use deep learning, which saves clients more money than the model itself. If you are evaluating the approach for a project, our computer vision practice page describes how we usually scope this kind of work.

FAQ

When should I build a custom computer vision model versus use an off-the-shelf solution?

Build custom when production conditions diverge meaningfully from any vendor’s training distribution — non-standard cameras, unusual lighting, proprietary object classes, or accuracy targets the vendor’s published numbers do not reach. Use off-the-shelf when the task is generic (face, text, common objects), conditions are benign, and the vendor’s eval set is honestly representative of your scene. The decisive signal is the held-out test you run on a vendor model with your data, not the vendor’s marketing benchmarks.

What does “off-the-shelf CV” actually cover, and where does it run out?

Off-the-shelf today covers cloud vision APIs (Google Vision, AWS Rekognition, Azure Computer Vision), packaged SDKs for OCR, face, and barcode, and pre-trained model zoos (YOLO weights, Hugging Face checkpoints, NVIDIA TAO). It runs out when your input distribution drifts from common web imagery, when your label set is proprietary, when you need on-device inference with strict latency or power budgets, or when the vendor’s update cadence breaks your integration on its own schedule.

How do I estimate the engineering cost of a custom CV model before committing to it?

Estimate four cost lines separately: data acquisition (cameras, capture rigs, scene access), labelling (sample size × per-frame minutes × annotator rate, with a multiplier for QA passes), model engineering (typically 2–4 person-months for a fine-tuned foundation model on a well-scoped task), and deployment plus monitoring (often underestimated; budget at least as much as model engineering). The biggest variance is usually labelling — a 10× error there sinks the project.
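
The labelling line is simple enough to put in a few lines of code, which also makes the sensitivity to a wrong per-frame estimate easy to see. All figures below are hypothetical planning inputs, not benchmarks, and the function name is illustrative.

```python
def labelling_cost(frames, minutes_per_frame, hourly_rate, qa_multiplier=1.0):
    """Labelling cost = frames x minutes per frame x annotator rate x QA multiplier."""
    hours = frames * minutes_per_frame / 60.0
    return hours * hourly_rate * qa_multiplier


# Hypothetical inputs: 5,000 frames, 1.5 minutes each, 20 per hour, 30% QA overhead.
baseline = labelling_cost(frames=5_000, minutes_per_frame=1.5, hourly_rate=20.0, qa_multiplier=1.3)

# The same project if each frame actually takes ten times longer to label.
pessimistic = labelling_cost(frames=5_000, minutes_per_frame=15.0, hourly_rate=20.0, qa_multiplier=1.3)

print(f"baseline: {baseline:,.0f}")
print(f"10x slower per frame: {pessimistic:,.0f}")
```

With these inputs the baseline comes to 3,250 currency units and the pessimistic case to 32,500. Timing a small pilot batch before committing is the cheapest way to pin down the per-frame figure.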

Which signals tell me a vendor’s pre-trained model will fail on my data?

Five reliable signals: (1) your input resolution, aspect ratio, or colour space differs from the vendor’s training data; (2) your scene contains object classes outside the vendor’s published label set; (3) your lighting, weather, or occlusion conditions are systematically different from public datasets; (4) your accuracy requirement is above the vendor’s headline number — vendor numbers are usually best-case; (5) the vendor’s API does not expose confidence scores or per-class metrics on your test set. Any two of these together strongly predict failure.

What is the realistic time-to-value for a custom CV model versus a vendor solution?

Vendor solution: days to a working integration, weeks to discover its accuracy ceiling on your data. Custom deep-learning model fine-tuned from a foundation backbone: 2–4 months from data access to a production-ready model and eval harness, assuming labelled data is available or can be produced inside that window. From-scratch training on a novel task: 6–12 months and rarely worth it given the maturity of foundation models. These ranges are observed across our deployed engagements, not benchmarks.

Can I start with off-the-shelf and migrate to custom later without throwing the integration away?

Yes, if you design the integration around a stable internal interface from the start: define your own input and output schema, wrap the vendor API behind it, and treat the vendor as one implementation of that interface. When you swap in a custom model later, only the implementation changes — calling code, evaluation harness, and downstream consumers are untouched. Teams that skip this and call the vendor SDK directly end up rewriting more than the model when they migrate.
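
A minimal sketch of that internal interface is shown below. The Detection schema, the Detector protocol, and both adapters are hypothetical names invented for illustration, and client.analyse stands in for whatever call a real vendor SDK exposes; it does not correspond to any actual vendor API.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass(frozen=True)
class Detection:
    """Internal output schema owned by your team, not by any vendor."""
    label: str
    confidence: float
    box: tuple  # (x_min, y_min, x_max, y_max) in pixels


class Detector(Protocol):
    def detect(self, image_bytes: bytes) -> List[Detection]: ...


class VendorDetector:
    """Adapter around a vendor SDK; `client.analyse` is a hypothetical call."""

    def __init__(self, client):
        self._client = client

    def detect(self, image_bytes: bytes) -> List[Detection]:
        raw = self._client.analyse(image_bytes)
        # Translate the vendor's response format into the internal schema.
        return [Detection(r["name"], r["score"], tuple(r["bbox"])) for r in raw]


class CustomDetector:
    """Adapter around an in-house model; `model` is any callable (hypothetical)."""

    def __init__(self, model):
        self._model = model

    def detect(self, image_bytes: bytes) -> List[Detection]:
        return [Detection(label, conf, box) for label, conf, box in self._model(image_bytes)]


def count_confident(detector: Detector, image_bytes: bytes, threshold: float) -> int:
    """Example calling code: it only ever sees the internal interface."""
    return sum(d.confidence >= threshold for d in detector.detect(image_bytes))


# Stand-ins so the sketch runs without a real SDK or model.
class FakeVendorClient:
    def analyse(self, image_bytes):
        return [{"name": "person", "score": 0.9, "bbox": [0, 0, 10, 10]}]


def fake_model(image_bytes):
    return [("person", 0.8, (1, 1, 9, 9)), ("dog", 0.3, (2, 2, 5, 5))]


for detector in (VendorDetector(FakeVendorClient()), CustomDetector(fake_model)):
    print(type(detector).__name__, count_confident(detector, b"", threshold=0.5))
```

Because count_confident depends only on the Detector protocol, the same evaluation harness can score the vendor model and the custom model side by side, which is also how the held-out comparison described earlier is easiest to run.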