Synthetic CPU tests miss the AI pipeline bottleneck

Standard Linux CPU performance tests (sysbench, stress-ng, Geekbench) measure CPU compute throughput on synthetic workloads. When debugging an AI training pipeline that appears GPU-underutilized, these tests tell you almost nothing useful, because the bottleneck is rarely raw CPU compute. It is specific pipeline operations: data loading from disk, image decode, augmentation, tokenisation, and Python control overhead in the training loop. Finding a CPU bottleneck in an AI pipeline requires profiling the actual pipeline, not running a synthetic test that happens to use the CPU.

The profiling sequence below is what we run when a customer’s GPU is sitting at 40% utilization and nobody knows why.

Step 1: Check the GPU utilization pattern

The first signal is shape, not magnitude. A flat 60% utilization curve and a sawtooth curve that swings between 0% and 95% have completely different causes, and you cannot tell them apart from an averaged dashboard number.

    # Watch GPU utilization over time at 1-second intervals
    watch -n 1 nvidia-smi

    # Or record to file for analysis
    nvidia-smi dmon -s u -d 1 > gpu_util.log &

If GPU utilization cycles between high and low at the cadence of batch processing, the GPU is waiting for the CPU to supply the next batch. That is the most common CPU bottleneck pattern in deep-learning training, and it is the one synthetic CPU benchmarks are least equipped to predict.

For the broader argument about why average utilization is the wrong number to optimise against, see our piece on why GPU utilization is not performance.

Step 2: Profile the data loading pipeline in isolation

Before blaming the CPU, measure how fast the data pipeline can produce batches with the GPU completely out of the picture. In our experience across customer engagements, more than half of “CPU bottleneck” reports turn out to be storage or single-threaded decode, not CPU compute saturation.
    import time
    from itertools import islice

    from torch.utils.data import DataLoader

    # Test DataLoader throughput in isolation (no model, no GPU work).
    # `dataset` is your existing torch Dataset; match batch_size and
    # num_workers to the real training configuration.
    BATCH_SIZE = 32
    N_BATCHES = 100

    loader = DataLoader(dataset, batch_size=BATCH_SIZE, num_workers=4, pin_memory=True)

    start = time.perf_counter()
    # Count the batches actually produced so a short dataset does not skew the result.
    batches = sum(1 for _ in islice(loader, N_BATCHES))
    elapsed = time.perf_counter() - start

    print(f'DataLoader throughput: {batches * BATCH_SIZE / elapsed:.0f} samples/sec')

If DataLoader throughput is less than roughly 2× training throughput, treat data loading as a current or near-future bottleneck; below 1×, it is already the binding constraint. That headroom matters because GPU step time has its own variance; a DataLoader sitting at 1.1× will starve the GPU on the unlucky steps even if the average looks fine.

Step 3: Profile specific CPU operations

    # Profile Python process CPU usage
    # (add --subprocesses to include DataLoader worker processes)
    py-spy record -o profile.svg --pid <training_pid>

    # Or run the script under cProfile
    python -m cProfile -o profile.stats training_script.py
    python -c "import pstats; p = pstats.Stats('profile.stats'); p.sort_stats('cumtime'); p.print_stats(20)"

py-spy is non-invasive: it attaches to a running training process without modifying the code, and produces a flame graph that highlights where wall-clock time is being spent. cProfile is heavier but gives deterministic call-graph numbers when the workload is reproducible.

What are the common CPU bottleneck sources in AI pipelines?

| Source | Symptom | Fix |
| --- | --- | --- |
| Insufficient DataLoader workers | GPU idles between batches | Increase num_workers |
| Single-threaded preprocessing | High single-core CPU, GPU idles | Vectorize or parallelize preprocessing |
| Python overhead in training loop | Per-step Python operations dominate | Use torch.compile or minimize Python in the loop |
| Tokenization not batched | CPU 100% single-core | Batch tokenize before the DataLoader |
| Image decode in main thread | DataLoader throughput low | Move decode to worker processes |

This table is the diagnostic surface we work through in order. Each row corresponds to a specific measurement: num_workers shows up immediately in DataLoader throughput; single-threaded preprocessing shows up as one core pegged at 100% while the others idle; Python loop overhead appears as a wide flat plateau in py-spy.

CPU performance tests that are AI-relevant

If you do want a synthetic comparison, benchmark the operations the pipeline actually performs (memory bandwidth, numpy-shaped matrix workloads, image decode) rather than abstract CPU compute.

    # Memory bandwidth (relevant for data loading)
    apt install mbw && mbw 1024

    # Multi-core floating point (relevant for numpy preprocessing)
    python -c "
    import numpy as np
    import time
    a = np.random.randn(10000, 10000)
    start = time.perf_counter()
    for _ in range(100):
        # (10000 x 10000) @ (10000 x 100): shapes must agree on the inner dimension
        result = a @ a[:, :100]
    print(f'{100 / (time.perf_counter() - start):.0f} matmuls/sec')
    "

Memory bandwidth is the one that catches teams off guard most often. A CPU with strong single-thread performance but constrained memory channels can decode and augment images at half the rate a benchmark like Geekbench would predict, because real preprocessing is bandwidth-bound rather than compute-bound. This is an observed pattern across the engagements where we have measured both numbers on the same host.

How do you isolate CPU bottlenecks in an AI pipeline?

CPU bottleneck isolation requires measuring CPU-stage throughput independently of GPU throughput. The technique is to disable GPU processing by replacing the model forward pass with a no-op that returns immediately, then measure how many batches per second the CPU stages produce. This CPU-only throughput is the maximum rate at which the pipeline can feed the GPU.

If CPU-only throughput is less than 2× the GPU’s batch processing rate, the CPU is a current or near-future bottleneck. If CPU-only throughput is less than 1× the GPU’s rate, the CPU is already the binding constraint and GPU hardware is being wasted on a starvation pattern that no GPU upgrade will fix.
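The isolation technique and the 2×/1× thresholds above can be sketched in a few lines. This is a minimal illustration, not a definitive harness: the function names (`noop_step`, `cpu_only_batches_per_sec`, `classify_headroom`), the warm-up count, and the fake batch source are all assumptions made for the example. In practice you would pass your real DataLoader as `batches` and supply a GPU batch rate measured separately.

```python
import time
from itertools import islice
from typing import Callable, Iterable


def noop_step(batch) -> None:
    """Stand-in for the model forward/backward pass: returns immediately."""
    return None


def cpu_only_batches_per_sec(batches: Iterable, n_batches: int = 100,
                             warmup: int = 5,
                             step: Callable = noop_step) -> float:
    """Measure how fast the CPU stages alone produce batches."""
    it = iter(batches)
    # Consume a few warm-up batches so worker start-up is not timed.
    for _ in islice(it, warmup):
        pass
    start = time.perf_counter()
    produced = 0
    for batch in islice(it, n_batches):
        step(batch)  # no GPU work happens here
        produced += 1
    elapsed = time.perf_counter() - start
    if produced == 0:
        raise ValueError("not enough batches to take a measurement")
    return produced / max(elapsed, 1e-9)


def classify_headroom(cpu_rate: float, gpu_rate: float) -> str:
    """Apply the 1x / 2x thresholds described in the text."""
    if gpu_rate <= 0:
        raise ValueError("gpu_rate must be positive")
    ratio = cpu_rate / gpu_rate
    if ratio < 1.0:
        return "cpu-bound"   # CPU is already the binding constraint
    if ratio < 2.0:
        return "at-risk"     # current or near-future bottleneck
    return "headroom"


if __name__ == "__main__":
    def fake_batches(n: int = 120):
        # Hypothetical batch source: about 1 ms of "preprocessing" per batch.
        for _ in range(n):
            time.sleep(0.001)
            yield [0] * 32

    cpu_rate = cpu_only_batches_per_sec(fake_batches(), n_batches=100)
    gpu_rate = 400.0  # assumed GPU batches/sec, measured separately
    print(f"CPU-only: {cpu_rate:.0f} batches/sec -> {classify_headroom(cpu_rate, gpu_rate)}")
```

Accepting any iterable keeps the measurement independent of the framework: the same function times a PyTorch DataLoader, a generator, or a custom prefetch queue without modification.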
Common CPU bottlenecks in AI pipelines on Linux: single-threaded data loading (fix: increase num_workers in DataLoader), GIL contention in preprocessing (fix: use multiprocessing, not threading), unoptimised image decode (fix: use libjpeg-turbo or NVIDIA DALI), and inefficient tokenisation (fix: use Rust-based tokenisers like HuggingFace’s tokenizers library).

We profile CPU bottlenecks using py-spy for Python-level profiling and perf record for system-level profiling. py-spy shows which Python functions consume wall-clock time; perf shows which CPU instructions, and thus which libraries, consume cycles. The combination identifies both the Python-level bottleneck (for example, image augmentation takes 60% of preprocessing time) and the system-level cause (for example, PIL’s nearest-neighbour resize instead of OpenCV’s SIMD-accelerated resize).

Quantifying the CPU contribution to end-to-end AI performance

The CPU’s contribution to AI pipeline performance is often underestimated because GPU metrics dominate monitoring dashboards. To quantify the CPU’s impact, we measure two ratios: the preprocessing ratio (time in CPU preprocessing / time in GPU inference) and the data loading ratio (time loading data from storage / time in GPU inference).

A preprocessing ratio above 0.5 means CPU preprocessing takes more than half as long as the GPU work it feeds. For computer vision pipelines with heavy augmentation (random crop, colour jitter, geometric transforms), preprocessing ratios of roughly 0.8–1.5 are common on single-threaded implementations (observed pattern across our engagements, not a benchmarked rate). Moving augmentation to GPU via NVIDIA DALI or Kornia, or parallelising across CPU cores with a multiprocessing DataLoader, typically reduces this ratio to the 0.1–0.3 range.

A data loading ratio above 0.3 indicates a storage bottleneck rather than a CPU one. NVMe SSDs provide roughly 3–7 GB/s sequential read throughput per published vendor specifications, which is sufficient for most training pipelines. SATA SSDs at around 500 MB/s, or network storage at variable throughput, can create loading bottlenecks that no amount of CPU or GPU optimisation can resolve. We verify storage throughput independently using fio before investigating other pipeline stages; it takes ten minutes and rules out an entire class of false leads.

LynxBench AI treats the Linux configuration (kernel version, scheduler, NUMA policy, IRQ pinning, storage subsystem) as part of the AI Executor specification, because pipeline throughput on the same CPU silicon shifts measurably with these settings, and benchmarks that omit them produce results that do not transfer between hosts. The question to put to any Linux-CPU AI pipeline benchmark is whether the kernel and tuning state are disclosed and held constant across the comparison. Stripped of that context, a published CPU-pipeline number may describe a Linux configuration the production host will never rebuild, rather than the binding constraint on your deployment’s sustained throughput.
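One way to make that disclosure routine is to record the tuning state next to every measurement. The sketch below is an illustration under stated assumptions, not LynxBench AI's method. It reads a handful of standard Linux procfs and sysfs entries using only the standard library, treats "scheduler" as the per-device I/O scheduler, and covers only a subset of the settings listed above (IRQ pinning, for instance, is not captured). The function names `read_text` and `capture_host_config` are hypothetical; missing files are reported as None so the script still runs on hosts that lack them.

```python
import json
import platform
from pathlib import Path
from typing import Optional


def read_text(path: str) -> Optional[str]:
    """Return stripped file contents, or None if absent or unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def capture_host_config() -> dict:
    """Snapshot a subset of the kernel and tuning state for a benchmark record."""
    io_schedulers = {}
    block = Path("/sys/block")
    if block.is_dir():
        for dev in sorted(block.iterdir()):
            # The active I/O scheduler is shown in [brackets] in this file.
            value = read_text(str(dev / "queue" / "scheduler"))
            if value is not None:
                io_schedulers[dev.name] = value
    return {
        "kernel": platform.release(),
        "machine": platform.machine(),
        "numa_balancing": read_text("/proc/sys/kernel/numa_balancing"),
        "cpu_governor": read_text(
            "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor"),
        "transparent_hugepage": read_text(
            "/sys/kernel/mm/transparent_hugepage/enabled"),
        "io_schedulers": io_schedulers,
    }


if __name__ == "__main__":
    # Store this JSON alongside the throughput numbers it describes.
    print(json.dumps(capture_host_config(), indent=2))
```

Writing the snapshot as JSON next to each result makes it straightforward to diff two hosts before comparing their throughput figures.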