Face detection is the first stage of the facial recognition pipeline, and it is the one most teams underestimate. Everything downstream (alignment, embedding, gallery match) depends on whether the detector returned a usable crop in time. When real-time constraints are added (live video, security cameras, edge devices), the engineering problem stops being “can the model find a face” and becomes “can it find every face, in every frame, within the latency budget, on the hardware you actually have.”

This article focuses on that stage. We cover how modern CNN-based detectors work, why they replaced earlier methods, what GPU acceleration buys you in practice, and where edge inference shifts the deployment story. For the full four-stage pipeline (detection → alignment → embedding → matching), see our companion explainer on facial recognition in computer vision.

What “real-time” actually means for face detection

Real-time is not a single number. For a smartphone unlock, the budget is roughly 100 ms end-to-end and one face matters. For a retail entrance camera at 30 fps, the budget is ~33 ms per frame and the detector may need to handle five to fifteen faces at varying scales. For a stadium-scale surveillance feed, the budget is still per-frame, but the throughput requirement is measured in faces per second across dozens of streams.

These three regimes drive different architectural choices. The 100 ms phone budget tolerates a heavier single-shot detector because it runs once per unlock attempt. The 33 ms multi-face budget rewards detectors that scale well with face count. The multi-stream surveillance regime is a throughput problem more than a latency problem and is almost always solved with batched GPU inference.

Why CNNs replaced Haar cascades, and where the trade-off still flips

The pre-CNN era of face detection ran on Haar cascades and HOG features. These methods are still in OpenCV, still fast on CPU, and still occasionally the right answer.
They fail in the obvious ways: sensitivity to lighting, brittleness on rotated or partially occluded faces, and a high false-positive rate in cluttered scenes.

CNN-based detectors (MTCNN, RetinaFace, YOLO-Face, SCRFD, and the newer transformer-hybrid detectors) handle those conditions much better. They learn features end-to-end from labelled face datasets rather than relying on hand-crafted edge templates. In practice, an MTCNN-class detector typically delivers a recall improvement on hard inputs (poor lighting, profile angles, small faces at distance) that a Haar cascade simply cannot match.

The trade-off flips in two situations. First, when the deployment target has no GPU and no NPU (a low-power embedded board running on battery, for example), a tuned Haar cascade can hit the latency budget where a CNN cannot. Second, when the input is already constrained (mugshot-style camera, fixed distance, controlled lighting), the marginal accuracy of a CNN may not justify the extra compute. Most production deployments are not in either of those situations, which is why CNN detectors dominate.

Quick comparison: face detector families

| Detector family | Typical latency (1080p) | Strengths | Where it fails |
| --- | --- | --- | --- |
| Haar cascade (OpenCV) | <10 ms CPU | Cheap, no GPU needed | Poor lighting, rotated faces, small scales |
| MTCNN | 30–60 ms GPU | Strong recall, returns landmarks | Slower than single-shot on multi-face |
| RetinaFace / SCRFD | 10–25 ms GPU | High recall + landmarks in one pass | Heavier model footprint |
| YOLO-Face | 5–15 ms GPU | Throughput, batched inference | Landmark quality varies |
| Transformer hybrids | 20–40 ms GPU | Better on occlusion and crowd scenes | Less mature tooling, higher VRAM |

These numbers are ranges we have observed in typical PyTorch / TensorRT deployments, not vendor benchmarks. The right detector is the one that hits your latency budget with your recall floor on your input distribution.
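As a rough illustration of how the frame-rate budget and the comparison table interact, the sketch below shortlists detector families whose upper latency figure fits inside a per-frame budget. The latency figures are copied from the table; the names `DetectorFamily`, `frame_budget_ms`, `shortlist` and the `detector_share` parameter are hypothetical and exist only for this example. It deliberately ignores recall, which has to be measured on your own footage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectorFamily:
    name: str
    low_ms: float    # lower end of the latency range in the table (1080p)
    high_ms: float   # upper end of the latency range in the table (1080p)
    needs_gpu: bool


# Ranges copied from the comparison table; "<10 ms" is modelled as 0 to 10 ms.
FAMILIES = [
    DetectorFamily("Haar cascade (OpenCV)", 0.0, 10.0, needs_gpu=False),
    DetectorFamily("MTCNN", 30.0, 60.0, needs_gpu=True),
    DetectorFamily("RetinaFace / SCRFD", 10.0, 25.0, needs_gpu=True),
    DetectorFamily("YOLO-Face", 5.0, 15.0, needs_gpu=True),
    DetectorFamily("Transformer hybrids", 20.0, 40.0, needs_gpu=True),
]


def frame_budget_ms(fps: float) -> float:
    """Time available per frame at a given frame rate, in milliseconds."""
    return 1000.0 / fps


def shortlist(fps: float, has_gpu: bool, detector_share: float = 1.0) -> list:
    """Families whose worst-case table latency fits the detector's share of the frame.

    detector_share is an assumption of this sketch: the fraction of the frame
    budget left for detection after decode, copies and downstream stages.
    """
    budget = frame_budget_ms(fps) * detector_share
    return [
        f.name
        for f in FAMILIES
        if f.high_ms <= budget and (has_gpu or not f.needs_gpu)
    ]


if __name__ == "__main__":
    print(shortlist(fps=30, has_gpu=True))
    print(shortlist(fps=30, has_gpu=True, detector_share=0.5))
    print(shortlist(fps=30, has_gpu=False))
```

Filtering on the upper end of each range is a conservative choice: a family that only fits the budget on its best day is not a safe candidate. The shortlist is a starting point for benchmarking on the target hardware, not a substitute for it.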
What GPU acceleration is actually doing

The marketing description of GPU acceleration is “thousands of cores running in parallel.” That is true and almost completely unhelpful for understanding what changes when you move a face detector from CPU to GPU. The relevant facts are these.

A CNN forward pass is dominated by matrix multiplications inside convolutional layers. GPUs (and modern NPUs) execute those matmuls at one to two orders of magnitude higher throughput than a CPU because the work is massively data-parallel and the hardware pairs many arithmetic units with high memory bandwidth. When the detector is compiled with TensorRT or a similar graph optimiser, that gap widens further because constant folding, kernel fusion, and FP16/INT8 quantisation reduce the per-frame work.

In our experience with computer-vision pipelines, the practical consequence is that the detector often stops being the bottleneck once it is on a GPU. The bottleneck moves to image decode (especially for H.264 / H.265 streams), to memory copies between host and device, or to the embedding stage downstream. Teams that benchmark only the detector miss this. We cover the systemic version of this problem in our piece on face detection camera systems accuracy, where the throughput ceiling is rarely where people expect it to be.

Edge inference, cloud inference, and the latency-privacy trade-off

Where the detector runs is a design decision, not a default. Three deployment shapes are common.

On-device: the detector runs on the phone, camera, or kiosk itself. Apple’s Face ID is the canonical example: detection and matching both happen inside the Secure Enclave, and no facial data leaves the device. Latency is whatever the on-device NPU can deliver. Privacy posture is strong by construction.

Edge: the detector runs on a local box near the cameras (an NVIDIA Jetson, an industrial PC with a discrete GPU, a small server in a closet).
Cameras stream to the edge box, the box decodes and detects, and only metadata or matched events go to the cloud. This is the most common shape for serious commercial deployments because it preserves low latency while keeping raw video off the public internet.

Cloud: the detector runs in a hyperscaler region. Cameras stream over the WAN to a service like Amazon Rekognition. This works for asynchronous use cases (forensic search, batch analytics) but is rarely the right shape for real-time alerting, because round-trip latency to the cloud and back is often comparable to the entire detection budget.

The choice is not just about latency. GDPR, BIPA, and the EU AI Act’s risk-tier framing for biometric identification all push real-time face detection toward edge or on-device deployment, because the data-minimisation argument is structurally easier to make when frames never leave the local network. We discuss this governance angle further in facial recognition cameras commercial deployment.

Where real-time face detection breaks in production

A list of the failure modes we see most often:

- Decode bottleneck. The detector is fast enough; the H.264 decoder cannot keep up with the number of streams. Fix: hardware-accelerated decode (NVDEC) or fewer streams per box.
- Small-face recall collapse. The detector was trained on faces filling 10–30% of the frame; production faces are 2–5% of the frame at distance. Fix: multi-scale inference, or a detector explicitly tuned for small faces.
- False positives in cluttered scenes. Posters, advertising, and clothing patterns trigger detections. Fix: confidence threshold tuning per camera location; never a single global threshold.
- Motion blur on moving subjects. A 30 fps camera with a 1/30 s shutter blurs walking subjects. The detector still fires, but the downstream embedding is unusable. Fix: shorter shutter (requires more light), or rejection logic before embedding.
- Domain shift over time.
Camera angles drift, lighting fixtures change, masks become more common. The detector that worked at deployment slowly degrades. Fix: scheduled re-evaluation against a held-out site-specific test set, not just the public benchmark.

None of these are model problems in the strict sense. They are integration problems that only show up once the detector is running against real video, which is why we treat the full system as the unit of evaluation rather than the model in isolation.

What to ask before committing to a real-time face detection design

A short decision rubric we use on engagements:

- What is the latency budget per frame, and how was it derived?
- How many simultaneous streams must the system handle, and at what resolution and frame rate?
- What is the smallest face (in pixels) that must be detected?
- Where must the data live (on-device, on-prem edge, or cloud), and what regulation drives that choice?
- What is the acceptable false-positive rate at the operating threshold, and who pays the cost of a false positive?
- How will the system be re-evaluated as conditions drift?

If any of these questions does not have a defensible answer before model selection starts, the project is choosing a detector against a budget that does not yet exist. That is the most reliable predictor of a face detection deployment that misses its goals.

How TechnoLynx approaches real-time face detection

We treat real-time face detection as a systems engineering problem rather than a model selection problem. That means scoping the latency budget, the stream count, the regulatory perimeter, and the failure cost before we recommend a detector. We work across PyTorch, TensorRT, ONNX Runtime, OpenCV, and the NVIDIA DeepStream toolchain, and we run engagements that own the outcome end-to-end, from camera placement and decode pipeline through detector selection, embedding, and gallery design. If the right answer is a Haar cascade on a CPU, we will say so.
If it is a TensorRT-compiled SCRFD on a Jetson at the edge, we will scope and build that. Contact us to discuss a specific deployment.

Most face detection projects that fail in production do so for one of two reasons: the latency budget was never grounded in the actual stream count and frame size, or the operating threshold was tuned on a benchmark distribution that did not match the deployed cameras. Both are detectable before model training begins. Neither is a model problem.
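
The second of those causes, a threshold tuned on the wrong distribution, can be checked with very little code once you have labelled detections from each deployed camera. The sketch below picks a per-camera operating threshold from site-specific scores rather than from a benchmark. The function name `pick_threshold`, the camera names, and the sample scores are all made up for illustration; the selection rule (lowest threshold that stays within a false-positive allowance) is one reasonable choice among several, not a description of any particular toolchain.

```python
def pick_threshold(scored, max_false_positives=0):
    """Lowest confidence threshold that admits at most max_false_positives.

    scored: list of (confidence, is_true_face) pairs taken from labelled
    footage of ONE camera. Returns None if even the highest-scoring
    detections already exceed the false-positive allowance.
    """
    best = None
    # Walk candidate thresholds from strictest to most permissive.
    for threshold in sorted({score for score, _ in scored}, reverse=True):
        false_positives = sum(
            1 for score, is_face in scored
            if score >= threshold and not is_face
        )
        if false_positives > max_false_positives:
            break  # lowering the threshold further can only add false positives
        best = threshold
    return best


# Hypothetical labelled detections from two camera locations.
SITE_SCORES = {
    "lobby": [(0.95, True), (0.90, True), (0.80, False), (0.70, True), (0.60, False)],
    "street": [(0.92, True), (0.85, True), (0.75, True), (0.55, False)],
}

if __name__ == "__main__":
    for camera, scored in SITE_SCORES.items():
        print(camera, pick_threshold(scored, max_false_positives=0))
```

With these invented scores the two cameras end up with different thresholds, which is the point: a single global value would either let the lobby's poster-style false positive through or discard valid street detections.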