Facial Recognition in Computer Vision: How the Pipeline Actually Works

Facial recognition is the most discussed subdomain of computer vision and probably the most misunderstood. Vendor demos show a face appearing in a frame, a green box snapping around it, and a name floating beside it, and the impression left behind is that this is one thing the AI does. It isn’t. A modern facial-recognition system is a four-stage pipeline: face detection, face alignment, face embedding, and identity matching against a gallery. Each stage relies on a different component, fails for different reasons, and creates a different category of legal exposure. Treating it as one opaque box is how procurement decisions go wrong and how engineering teams underestimate the build.

How the four-stage pipeline actually decomposes

The work of recognising a face is not done by a single model. It is done by a chain.

Stage 1 — Face detection. A detector scans the image and returns bounding boxes for every face it finds, usually with a handful of landmark coordinates (eyes, nose tip, mouth corners). In production deployments built since 2020, RetinaFace, SCRFD, and YOLO-face variants are the typical choices. Earlier Haar-cascade and HOG-based detectors still appear in legacy systems but struggle with off-axis poses, occlusions, and low light. We see this difference matter most on CCTV-grade inputs, where the detector’s robustness sets the ceiling for everything downstream.

Stage 2 — Alignment. Using the landmarks from stage 1, the face is rotated, scaled, and cropped to a canonical pose — typically a 112×112 patch with the eyes on a fixed horizontal line. Alignment exists because the embedding model in stage 3 was trained on aligned faces; feeding it un-aligned crops collapses accuracy. This is the quiet stage that most pipeline diagrams skip and most failure post-mortems eventually return to.
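
The rotate-and-scale step can be sketched with nothing more than the two eye landmarks. This is a minimal illustration, not a reference implementation: the canonical eye positions below are hypothetical values chosen to sit on one horizontal line inside a 112×112 crop, and the function names are invented for this example. Real pipelines typically fit the transform to all five landmarks by least squares and then warp the pixels with an image library.

```python
import math

# Hypothetical canonical eye positions inside a 112x112 crop (illustrative
# values only; use the template your embedding model was trained with).
CANON_LEFT_EYE = complex(38.0, 52.0)
CANON_RIGHT_EYE = complex(74.0, 52.0)


def eye_alignment_transform(left_eye, right_eye):
    """Return (a, b) such that z -> a*z + b maps the detected eyes onto the
    canonical eyes. Points are (x, y) tuples treated as complex numbers, so
    multiplying by `a` applies rotation and uniform scale in one step."""
    src_l, src_r = complex(*left_eye), complex(*right_eye)
    if src_l == src_r:
        raise ValueError("eye landmarks coincide; cannot estimate a transform")
    a = (CANON_RIGHT_EYE - CANON_LEFT_EYE) / (src_r - src_l)
    b = CANON_LEFT_EYE - a * src_l
    return a, b


def apply_transform(transform, point):
    """Map an (x, y) point from image coordinates into crop coordinates."""
    a, b = transform
    z = a * complex(*point) + b
    return (z.real, z.imag)


def describe(transform):
    """Report the scale factor and rotation angle encoded in the transform."""
    a, _ = transform
    return {
        "scale": abs(a),
        "rotation_deg": math.degrees(math.atan2(a.imag, a.real)),
    }


# A tilted face: the right eye sits 20 px lower than the left eye.
transform = eye_alignment_transform((100, 120), (160, 140))
print(apply_transform(transform, (100, 120)))  # lands on the canonical left eye
print(describe(transform))
```

Two points are enough to fix rotation, scale, and translation exactly, which is why the sketch needs no fitting step; using more landmarks makes the estimate less sensitive to a single badly localised point.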

Stage 3 — Embedding. A deep network — ArcFace, MagFace, AdaFace, and the various WebFace-trained variants are the public reference points — converts the aligned face into a fixed-size feature vector, usually 512 dimensions. The vector is the face’s identity in the system’s geometry: two images of the same person should land close together in that space, two images of different people should land far apart. The embedding model is the single most consequential choice in the whole pipeline.

Stage 4 — Matching. The embedding is compared against a gallery of stored embeddings using cosine similarity, and a threshold decides “this is the same person” or “this is not.” Below the threshold is a non-match; above it, a match. The threshold is not a constant set at training time — it is a deployment-time decision that trades false matches against false non-matches, and it is the most under-documented number in most procurement conversations.
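
As a rough sketch of the matching stage, the snippet below scores a probe embedding against every gallery entry with cosine similarity and applies a threshold to the best score. The toy 4-dimensional vectors, the identity names, and the threshold of 0.5 are all made up for illustration; real embeddings are usually 512-dimensional and the threshold has to be chosen from measured error rates. Treating a score exactly at the threshold as a match is an assumption of this sketch.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("zero-length embedding")
    return dot / (norm_u * norm_v)


def match_against_gallery(probe, gallery, threshold):
    """Return (identity, score) for the best-scoring gallery entry, or
    (None, score) when even the best score falls below the threshold."""
    if not gallery:
        return None, float("-inf")
    best_id, best_score = max(
        ((identity, cosine_similarity(probe, emb)) for identity, emb in gallery.items()),
        key=lambda pair: pair[1],
    )
    return (best_id if best_score >= threshold else None), best_score


# Toy gallery with hypothetical identities and 4-dimensional embeddings.
gallery = {"alice": [1.0, 0.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0, 0.0]}
probe = [0.9, 0.1, 0.0, 0.0]
identity, score = match_against_gallery(probe, gallery, threshold=0.5)
print(identity, round(score, 4))
```

The linear scan is fine for small galleries; large galleries usually swap it for an approximate nearest-neighbour index, which changes the search cost but not the threshold logic.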

With hardware acceleration and an appropriately sized model, the pipeline can run end-to-end in well under 50 ms per face on modern edge hardware, which is why phone unlock feels instant and airport e-gates feel like a slight pause.

Why MTCNN was preferred over Haar cascades — and where that trade-off now flips

In the late 2010s, the canonical worked example for face detection in a textbook was MTCNN, a three-stage cascade of small convolutional networks published in 2016, held up against the older Haar-cascade detector that ships with OpenCV. MTCNN was preferred because it handled in-plane rotation, partial occlusion, and varying scale more gracefully, and because it returned landmarks suitable for alignment. Haar cascades were faster on CPU but brittle in the conditions that production cameras actually produce.

That comparison is no longer the operative one. Both MTCNN and Haar are now legacy choices. Single-shot detectors — RetinaFace, SCRFD, and the YOLO-face family — outperform MTCNN on the standard benchmarks (WIDER FACE) at comparable or lower latency, and they run cleanly through TensorRT, ONNX Runtime, or CoreML for edge deployment. The honest framing for a 2026 build is that Haar cascades belong in classrooms and embedded prototypes; MTCNN is a sensible mid-grade fallback when you need pure PyTorch with no heavy dependencies; and a modern single-shot detector is the production default. The trade-off only flips back toward simpler detectors when compute is genuinely scarce — microcontroller-class hardware or strict thermal budgets — and even then, the smaller variants of SCRFD usually fit.

Where facial recognition sits inside the broader CV stack

Facial recognition is not its own discipline parallel to computer vision; it is a vertical slice through the same stack. The detection stage uses the same object-detection machinery that drives general CV — anchors, feature pyramids, non-maximum suppression. The embedding stage is a specialised application of representation learning, the same family of techniques that powers image retrieval and re-identification in retail and surveillance contexts. Pattern recognition shows up in the matching stage, where the question reduces to a nearest-neighbour search in embedding space.
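
Of the shared detection machinery named above, non-maximum suppression is the easiest to show in a few lines. The sketch below is the plain greedy variant on (x1, y1, x2, y2) boxes; the box coordinates, scores, and the 0.5 IoU threshold are illustrative assumptions, and production detectors generally use a vectorised or GPU implementation rather than a Python loop.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_maximum_suppression(detections, iou_threshold=0.5):
    """Greedy NMS. `detections` is a list of (box, score) pairs. Boxes are
    visited from highest to lowest score, and a box is kept only if it does
    not overlap an already kept box by more than `iou_threshold`."""
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept


# Two near-duplicate boxes on one face plus a separate face elsewhere.
detections = [
    ((10, 10, 110, 110), 0.95),
    ((12, 14, 112, 114), 0.80),
    ((200, 50, 280, 130), 0.90),
]
for box, score in non_maximum_suppression(detections):
    print(box, score)
```

The same routine serves a general object detector and a face detector unchanged, which is the sense in which the detection stage reuses the broader stack.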

The reason this matters is that improvements in one layer tend to propagate. Better general-purpose backbones (transformers, ConvNeXt-style architectures) trickle into face embeddings within a year or two. Detection architectures that started in object detection — DETR, YOLO, RetinaNet — keep getting adapted into face-specific variants. A team that already runs a serious CV pipeline for a different problem usually has most of the engineering substrate it needs to add face recognition, with the embedding model and the gallery system as the new pieces. We touch on this layering more broadly in our overview of core computer-vision algorithms and where each one fits.

What accuracy and bias actually look like in 2026 deployments

Published face-recognition benchmarks routinely report verification accuracy above 99.5% on standard test sets, and vendor materials lean on those numbers heavily. The operationally relevant measure is different. What matters in deployment is the false match rate (FMR) and the false non-match rate (FNMR) at the specific operating threshold you have set, on data that resembles your cameras and your population. Those two numbers move in opposite directions as the threshold moves; the only honest accuracy claim is a paired one.
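
A small sketch makes the paired nature of the two rates concrete. The scores below are invented for illustration, not measurements from any system; "genuine" means a comparison between two images of the same person and "impostor" a comparison between different people. Counting a score exactly at the threshold as a match is an assumption of this sketch.

```python
def error_rates(genuine_scores, impostor_scores, threshold):
    """Return (FMR, FNMR) at one threshold.

    FMR  = share of impostor comparisons scoring at or above the threshold.
    FNMR = share of genuine comparisons scoring below the threshold.
    """
    if not genuine_scores or not impostor_scores:
        raise ValueError("need at least one genuine and one impostor score")
    fmr = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    fnmr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
    return fmr, fnmr


# Made-up similarity scores, for illustration only.
genuine = [0.82, 0.75, 0.64, 0.58, 0.41]
impostor = [0.05, 0.12, 0.22, 0.31, 0.47]

for threshold in (0.30, 0.45, 0.60):
    fmr, fnmr = error_rates(genuine, impostor, threshold)
    print(f"threshold={threshold:.2f}  FMR={fmr:.2f}  FNMR={fnmr:.2f}")
```

On this toy data, raising the threshold from 0.30 to 0.60 takes FMR from 0.40 down to 0.00 while FNMR climbs from 0.00 to 0.40, so quoting either number alone says little about the system.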

The NIST Face Recognition Vendor Test (FRVT), which NIST has run since 2023 as the Face Recognition Technology Evaluation (FRTE) alongside the Face Analysis Technology Evaluation (FATE), is the closest thing to a public benchmark that tracks this properly. Two findings have stayed remarkably stable across its iterations. First, accuracy disparities across demographic groups (by skin tone, by age, by gender presentation) are smaller in 2026 than in 2018, but they are not zero, and they show up in different ways for different algorithms. Second, the gap between a strong vendor and a weak one is much larger than the gap between the strongest vendor and the theoretical ceiling. Buying decisions made on demo footage instead of FRVT data are buying decisions made on the wrong signal.

The bias work by Buolamwini and Gebru (Gender Shades, 2018), which audited commercial gender classifiers across skin type and gender, and the subsequent NIST demographic effects studies remain the canonical references here. A 2026 deployment that does not include a demographic-stratified accuracy audit, with results documented for each group its cameras will actually see, is a deployment carrying unmeasured risk.
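
One way such an audit could be tabulated is sketched below: the same FMR and FNMR calculation, but grouped by a label attached to each comparison. The group names, scores, and the 0.5 threshold are placeholders, and tagging each comparison with a single group label is a simplifying assumption (impostor pairs can span two groups, and a real audit has to decide how to count those).

```python
from collections import defaultdict


def stratified_error_rates(comparisons, threshold):
    """Per-group FMR and FNMR at one threshold.

    `comparisons` is an iterable of (group, same_person, score) tuples.
    A rate is reported as None when a group has no comparisons of that kind.
    """
    counts = defaultdict(
        lambda: {"genuine": 0, "false_non_match": 0, "impostor": 0, "false_match": 0}
    )
    for group, same_person, score in comparisons:
        c = counts[group]
        if same_person:
            c["genuine"] += 1
            c["false_non_match"] += int(score < threshold)
        else:
            c["impostor"] += 1
            c["false_match"] += int(score >= threshold)

    report = {}
    for group, c in counts.items():
        report[group] = {
            "fnmr": c["false_non_match"] / c["genuine"] if c["genuine"] else None,
            "fmr": c["false_match"] / c["impostor"] if c["impostor"] else None,
            "n_genuine": c["genuine"],
            "n_impostor": c["impostor"],
        }
    return report


# Placeholder records: (group label, same person?, similarity score).
records = [
    ("group_a", True, 0.80), ("group_a", True, 0.40),
    ("group_a", False, 0.10), ("group_a", False, 0.20),
    ("group_b", True, 0.70), ("group_b", True, 0.90),
    ("group_b", False, 0.55), ("group_b", False, 0.10),
]
for group, row in sorted(stratified_error_rates(records, threshold=0.5).items()):
    print(group, row)
```

Reporting the sample counts next to each rate matters: a group with a handful of comparisons can show a rate of zero simply because it was barely tested.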

Which CV algorithms still matter for face recognition, and which are obsolete

A clean ranking by current relevance:

Algorithm family	Era	Relevance in 2026
Eigenfaces / Fisherfaces (PCA-based)	1991–2000s	Obsolete in production; useful only as a teaching baseline.
Local Binary Patterns (LBPH)	2000s	Obsolete; still ships in OpenCV samples.
Haar-cascade detection	2001 onward	Legacy only; embedded prototypes and tutorials.
HOG + linear SVM (e.g. dlib)	2010s	Niche — works without a GPU, fine for low-traffic access control.
MTCNN detection	2016	Mid-grade fallback; superseded by single-shot detectors.
Triplet-loss embeddings (FaceNet)	2015	Historically important; superseded by margin-based losses.
ArcFace / CosFace / SphereFace	2017 onward	Current production default for embeddings.
MagFace, AdaFace, WebFace variants	2021 onward	State of the art on open benchmarks.
Transformer-based face encoders	2022 onward	Catching up to ArcFace-family on accuracy, ahead on robustness to occlusion.

The headline is that the algorithms taught as “facial recognition” in older textbooks — eigenfaces and LBPH — have not been part of a credible production system for more than a decade. Anyone evaluating a vendor whose pitch leans on those names is looking at a stack that is two architectural generations behind.

How deployment shape changes between cloud, on-device, and edge

The same four-stage pipeline behaves very differently in different deployment environments, and the trade-offs are mostly about who holds the embedding gallery.

Cloud deployment. Frames or detected faces travel to a server, where alignment, embedding, and matching happen, along with detection when whole frames are sent. The gallery is centralised, refresh and audit are straightforward, and the model can be the heaviest variant available. The price is latency (50–300 ms round trip), bandwidth, and the regulatory weight of transmitting biometric data off-device.

On-device deployment. The full pipeline runs on the user’s phone or laptop. Face ID, Windows Hello, and Android face unlock are the canonical examples. The gallery is one person — the device owner — which sidesteps most of the gallery-management questions. Privacy posture is the strongest of the three because biometric data never leaves the hardware enclave.

Edge deployment. Cameras or small appliances run detection and embedding locally, then send embeddings (not images) to a local matching service. This is the common pattern for enterprise access control and modern airport e-gates. The engineering effort is largest here — model quantisation, hardware acceleration via TensorRT or CoreML, thermal budgets, OTA update mechanics — but the operational properties (low latency, no continuous cloud dependency, narrow data egress) are what make the pattern attractive for sensitive deployments. The production failure modes that show up at this layer are the subject of our deeper write-up on CCTV face-recognition production challenges.
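
To give a feel for how narrow the egress can be when a device sends embeddings instead of images, here is a minimal serialisation sketch. The wire layout (a 4-byte camera id, a 2-byte dimension, then little-endian float32 values) and the function names are invented for this example and do not describe any particular product’s protocol; a real deployment would also need authentication and encryption, since an embedding is still biometric data.

```python
import struct

EMBEDDING_DIM = 512  # the typical embedding size mentioned earlier
HEADER_FORMAT = "<IH"  # hypothetical header: uint32 camera id, uint16 dimension


def pack_embedding(camera_id, embedding):
    """Serialise one embedding as header + little-endian float32 values."""
    return struct.pack(
        f"{HEADER_FORMAT}{len(embedding)}f", camera_id, len(embedding), *embedding
    )


def unpack_embedding(payload):
    """Inverse of pack_embedding: return (camera_id, list of floats)."""
    camera_id, dim = struct.unpack_from(HEADER_FORMAT, payload, 0)
    values = struct.unpack_from(f"<{dim}f", payload, struct.calcsize(HEADER_FORMAT))
    return camera_id, list(values)


# Placeholder embedding; a real one would come from the embedding model.
embedding = [0.5] * EMBEDDING_DIM
payload = pack_embedding(camera_id=7, embedding=embedding)
print(len(payload), "bytes per face")

camera_id, restored = unpack_embedding(payload)
print(camera_id, len(restored))
```

At 512 float32 values plus the 6-byte header, each face costs 2,054 bytes on the wire under this layout. Quantising the vector to int8 would cut that further, at the cost of having to re-validate matching accuracy at the chosen threshold.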

The deployment choice is not a pure engineering decision. EU AI Act provisions on real-time remote biometric identification, US state laws (Illinois BIPA, Texas CUBI, Washington’s biometric privacy law), and GDPR Article 9 treatment of biometric data as special-category personal data all push toward on-device or edge over cloud for any deployment that touches public space. A responsible 2026 build has a documented lawful basis, a data-protection impact assessment, a demographic-stratified accuracy audit, and human oversight on any identification that drives a consequential decision.

FAQ

How does the facial recognition pipeline decompose — detection, alignment, embedding, matching?

A modern pipeline has four stages: (1) face detection locates faces in the image, typically with RetinaFace, SCRFD, or a YOLO-face variant; (2) face alignment uses detected landmarks to normalise pose and scale to a canonical crop; (3) face embedding runs a deep network such as ArcFace, MagFace, or AdaFace to produce a fixed-size feature vector; (4) identification or verification compares embeddings to a gallery using cosine similarity against a deployment-chosen threshold. End-to-end latency can be well under 50 ms per face on modern edge hardware with an accelerated, appropriately sized model.

Why is MTCNN typically preferred over Haar cascades in modern face detection, and where does that trade-off flip?

MTCNN handles rotation, partial occlusion, and varying scale better than Haar cascades and returns landmarks suitable for alignment, which is why it displaced Haar through the late 2010s. In 2026, both are legacy: single-shot detectors (RetinaFace, SCRFD, YOLO-face) outperform MTCNN at comparable or lower latency. The trade-off only flips back toward simpler detectors when compute is genuinely scarce — microcontroller-class hardware or strict thermal budgets — and even then, small SCRFD variants usually fit.

Where does facial recognition sit in the broader CV pipeline (image recognition, pattern recognition, deep learning)?

It is a vertical slice through the standard CV stack rather than a separate discipline. Detection reuses general object-detection machinery; embedding is a specialised application of representation learning shared with image retrieval and re-identification; matching is nearest-neighbour pattern recognition in embedding space. Improvements in general CV architectures (transformers, ConvNeXt) propagate into face-specific models within a release cycle or two.

What are the realistic accuracy and bias limits of production facial recognition in 2026 deployments?

Published verification accuracy above 99.5% is common on standard benchmarks but is the wrong number for procurement. The operationally relevant measure is the paired false match rate and false non-match rate at the deployment threshold, on data resembling the actual cameras and population. NIST FRVT iterations show demographic accuracy gaps have narrowed since 2018 but are not zero and vary across algorithms. A deployment without a demographic-stratified audit is carrying unmeasured risk.

Which CV algorithms (eigenfaces, deep embeddings, transformers) are still relevant for face recognition, and which are obsolete?

Obsolete in production: eigenfaces, Fisherfaces, LBPH, Haar-only pipelines. Legacy mid-grade: MTCNN, FaceNet-style triplet-loss embeddings. Current default: ArcFace-family margin-based embedding losses. State of the art: MagFace, AdaFace, WebFace-trained variants, and increasingly transformer-based face encoders for robustness to occlusion. Vendor pitches built on eigenfaces or LBPH are two architectural generations behind.

How does facial recognition deployment differ across cloud, on-device, and edge inference settings?

Cloud centralises the gallery and allows the heaviest models but adds latency, bandwidth, and regulatory weight. On-device (Face ID, Windows Hello) has a gallery of one and the strongest privacy posture because biometric data never leaves the hardware enclave. Edge runs detection and embedding locally and sends only embeddings to a local matcher — the dominant pattern for enterprise access control and modern e-gates, with the highest engineering effort but the operational properties (low latency, narrow data egress) that regulators increasingly expect for sensitive deployments.

The failure class worth naming at the end is the one that gets least attention in procurement: gallery drift. Embedding models change, thresholds get re-tuned, enrolled photos age, and the population the cameras see is not the population the model was trained on. Facial recognition is not a system you install and walk away from; it is a system you audit. Our work on production CV systems — including face detection in camera systems and video-surveillance accuracy under realistic load — is built around that operational reality rather than the demo one.

Image credits: Freepik

