## The comparison misframes the problem

"CPU vs GPU for AI" is framed as a competition, but modern AI workloads are not decided by choosing one or the other. They run on both. The question is which operations belong on each, and whether the boundary between them is causing performance problems. Understanding the actual decision helps avoid two common mistakes: trying to run everything on GPU when CPU operations bottleneck the pipeline, and benchmarking CPUs for workloads that are architecturally GPU-bound.

## What each processor is for in AI systems

| Operation | CPU | GPU |
|---|---|---|
| Data loading and I/O | Primary (I/O is not GPU-bound) | No |
| Tokenization and text preprocessing | Primary | Sometimes (GPU tokenizers exist) |
| Data augmentation | Primary (can be parallelized) | Sometimes |
| Matrix multiplication (model forward pass) | Insufficient for large models | Primary |
| Attention computation | Too slow at scale | Primary (FlashAttention) |
| Postprocessing / sampling | Viable for small batches | Preferred for high throughput |
| Control flow and orchestration | Primary | No |

The GPU handles the compute-intensive inner loop. The CPU handles everything else. If either side produces work more slowly than the other can consume it, you have a bottleneck.

## Performance comparison

For the same tensor operations:

- A modern CPU (e.g., AMD EPYC or Intel Xeon) achieves roughly 1–4 TFLOPS FP32 for optimized matrix operations
- A modern data center GPU achieves 100–1000 TFLOPS FP16/BF16

For AI model inference at batch sizes > 1, GPU dominates by 20–100× for compute-intensive models. At batch size 1 with small models and strict latency constraints, the gap narrows: CPU inference is viable for some use cases.

## When CPU inference is practical

CPU-only inference is appropriate when:

- Model is small (< 1B parameters)
- Batch size is 1 (single request, synchronous)
- Latency requirement is > 100ms
- Hardware cost or power constraints preclude GPU

ONNX Runtime and OpenVINO optimize CPU inference for these cases and can close some of the gap using AVX-512 and AMX instructions.

## CPU-side bottlenecks in AI benchmarking

A benchmark that reports only GPU metrics misses half the picture. CPU-side preprocessing (tokenisation for LLMs, image decoding and augmentation for vision models, feature extraction for tabular data) can bottleneck the GPU if the data pipeline cannot feed tensors fast enough. When we see GPU utilisation below 80% during training, the cause is more often a CPU-side data loading bottleneck than a GPU scheduling problem.

Diagnosing this requires measuring both CPU utilisation per core and GPU utilisation simultaneously. Tools like htop alongside nvidia-smi dmon reveal whether the GPU is idle waiting for data. If CPU utilisation on data-loading cores is at 100% while GPU utilisation dips periodically, increasing the number of data loader workers (in PyTorch: num_workers in DataLoader) or moving preprocessing to GPU (using DALI or torchvision's GPU transforms) typically resolves the bottleneck; a minimal sketch appears at the end of this section.

For inference serving, the CPU-side overhead includes request parsing, tokenisation, batching logic, and response serialisation. On high-throughput inference servers, these operations can consume significant CPU cycles; we typically allocate at least 4 CPU cores per GPU for inference serving workloads to avoid this becoming the limiting factor.
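To make the data-loading fix above concrete, here is a minimal PyTorch sketch of a CPU-fed training input pipeline. The dataset path, transform choices, and worker count are placeholder assumptions; the right num_workers value depends on core count and per-sample preprocessing cost, so treat these values as starting points to tune, not recommendations.

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# CPU-side preprocessing: image decode + augmentation runs on data-loader workers.
transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Hypothetical local dataset path; any torch Dataset works the same way.
dataset = datasets.ImageFolder("data/train", transform=transform)

loader = DataLoader(
    dataset,
    batch_size=256,
    shuffle=True,
    num_workers=8,            # parallel CPU workers feeding the GPU; tune per machine
    pin_memory=True,          # page-locked host memory speeds up host-to-device copies
    prefetch_factor=4,        # batches prefetched per worker, hides preprocessing latency
    persistent_workers=True,  # avoid re-forking workers every epoch
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for images, labels in loader:
    # non_blocking=True overlaps the copy with compute when pin_memory is set
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    # ... forward/backward pass here ...
```

If GPU utilisation still dips with idle CPU cores to spare, the per-sample preprocessing itself is usually the limit, which is when moving transforms to the GPU (DALI or GPU-capable torchvision transforms) pays off.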
## GPU performance is not one number

GPU performance varies by workload type, batch size, and precision. *Why GPU performance is not a single number* explains why a single benchmark score does not characterize hardware for AI, which is the same reason CPU vs GPU comparisons using single scores produce misleading conclusions.

## How do you decide between CPU and GPU for a specific AI workload?

The decision framework we use has three inputs: model size, latency requirement, and deployment scale. Small models (< 100M parameters) with relaxed latency requirements (> 200ms) at moderate scale (< 100 requests/second) are viable on CPU. Everything else benefits from GPU acceleration.

CPU inference becomes compelling in specific edge deployment scenarios: embedded devices without GPU, cost-sensitive deployments where adding a GPU doubles hardware cost for marginal throughput improvement, and regulatory environments where GPU driver complexity adds validation burden.

For models in the 100M–1B parameter range, ONNX Runtime on CPU with INT8 quantisation and AVX-512 optimisation delivers 10–50ms inference latency for classification and embedding tasks. This is competitive with GPU latency when accounting for host-to-device transfer overhead, which adds 0.5–2ms per inference on PCIe-connected GPUs. For latency-sensitive applications processing one request at a time (no batching), CPU inference can actually be faster than GPU inference for small models; a minimal CPU inference sketch appears at the end of this answer.

Above 1B parameters, GPU acceleration is necessary for practical inference speeds. The memory bandwidth advantage of HBM (2–3 TB/s on data centre GPUs) over DDR5 (50–100 GB/s on CPUs) creates a throughput gap that grows with model size. For a 7B parameter LLM, CPU inference generates 1–3 tokens/second while GPU inference generates 30–100 tokens/second. The 10–30× throughput difference makes CPU deployment impractical for interactive applications.
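To see where those token rates come from, here is a back-of-the-envelope estimate assuming autoregressive decoding is memory-bandwidth-bound, so every weight is read once per generated token. The weight precisions and the 50% efficiency discount are illustrative assumptions, not measurements.

```python
def tokens_per_second(params_billion, bytes_per_param, bandwidth_gbps, efficiency=0.5):
    """Rough upper bound on dense-LLM decode speed when memory-bandwidth-bound.

    Assumes every parameter is read from memory once per generated token;
    `efficiency` discounts peak bandwidth for real access patterns (assumption).
    """
    bytes_per_token = params_billion * 1e9 * bytes_per_param
    return bandwidth_gbps * 1e9 * efficiency / bytes_per_token

# 7B model, INT8 weights (1 byte/param), CPU with ~100 GB/s DDR5
print(f"CPU ceiling: ~{tokens_per_second(7, 1, 100):.1f} tokens/s")    # ~7 tokens/s

# Same model, FP16 weights (2 bytes/param), GPU with ~2000 GB/s HBM
print(f"GPU ceiling: ~{tokens_per_second(7, 2, 2000):.1f} tokens/s")   # ~70 tokens/s
```

Real systems land below these ceilings (the 1–3 and 30–100 tokens/second figures above), but the ratio is set by the bandwidth gap, which is why it grows with model size rather than shrinking with better CPU kernels.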
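For the 100M–1B parameter case discussed earlier in this answer, the CPU inference path through ONNX Runtime looks roughly like the sketch below. The model filename, input tensor names, and thread count are placeholders that depend on how the model was exported and on the machine; this is a sketch of the API shape, not a tuned configuration.

```python
import numpy as np
import onnxruntime as ort

# Pin ONNX Runtime to the cores reserved for inference (8 is an assumption).
opts = ort.SessionOptions()
opts.intra_op_num_threads = 8
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# "model.int8.onnx" is a placeholder for a model already quantised to INT8;
# the CPU execution provider uses AVX-512/AMX kernels where the hardware has them.
session = ort.InferenceSession(
    "model.int8.onnx",
    sess_options=opts,
    providers=["CPUExecutionProvider"],
)

# Hypothetical input: one tokenised sequence for an embedding/classification model.
inputs = {
    "input_ids": np.array([[101, 2023, 2003, 1037, 3231, 102]], dtype=np.int64),
    "attention_mask": np.ones((1, 6), dtype=np.int64),
}
outputs = session.run(None, inputs)
print(outputs[0].shape)
```

Because the tensors never leave host memory, there is no PCIe transfer in this path, which is the overhead that narrows the CPU-vs-GPU latency gap for single-request workloads.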