Linux CPU benchmarks for AI: synthetic scores rarely predict pipeline throughput

A Geekbench score tells you almost nothing useful about whether a Linux box will keep a GPU fed during training. Phoronix runs, SPEC CPU numbers, and sysbench loops measure standardised compute kernels under conditions that look nothing like an AI preprocessing pipeline. For systems where the CPU’s job is to decode images, tokenise text, run NumPy transforms, and shovel tensors across PCIe, those benchmarks are at best a sanity check that the silicon isn’t broken.

The CPU’s role in an AI system is best understood as the upstream supplier to a downstream consumer that costs roughly an order of magnitude more per hour. When the supplier can’t keep up, the GPU sits idle and the system-limited pattern we cover in “GPUs are part of a larger system” takes over. The useful question is not “how fast is this CPU on a generic compute test?” but “how many samples per second can this CPU push through the actual preprocessing graph this workload needs, under the kernel and library configuration that will run in production?”

What the CPU actually does in an AI pipeline

| Operation | CPU-bound | Parallelisable | Benchmark approach |
|---|---|---|---|
| Image decoding (JPEG/PNG) | Yes | Per-image | Decode throughput test |
| Text tokenisation | Yes | Per-document | Tokeniser throughput test |
| NumPy / SciPy preprocessing | Yes | Per-batch | Benchmark actual operations |
| Data augmentation | Yes | Per-sample | Augmentation throughput test |
| PCIe data transfer to GPU | No (I/O-bound) | Limited | Bandwidth test |
| Python control overhead | Yes | No | Profiling (cProfile, py-spy) |

Read this table as a decomposition of the preprocessing stage rather than a sequence. In a PyTorch DataLoader with num_workers > 0, several of these run concurrently in worker processes while the main process orchestrates batching and host-to-device transfer.
The CPU benchmark question becomes: which of these rows is the binding constraint on the specific workload, and which Linux measurement actually exposes that row’s throughput?

Synthetic CPU benchmarks: useful for what they are, misleading for AI

Synthetic Linux CPU benchmarks have a legitimate role: they detect throttling, validate that a CPU is not obviously underpowered or thermally compromised, and give a portable number for procurement conversations. They become misleading the moment that number is treated as a predictor of AI pipeline throughput.

Geekbench reports single-core and multi-core scores built from a mix of compression, image filtering, and machine-learning-flavoured kernels. It’s a reasonable proxy for “is this CPU broadly competitive in 2026?” and a poor proxy for “how fast will my JPEG decode plus augmentation pipeline run?”

sysbench is more useful for diagnostics than ranking. Its CPU mode hammers integer and floating-point operations and reliably surfaces thermal throttling under sustained load. Its memory mode gives a first-cut bandwidth number that, while not as rigorous as STREAM, is fast to run and good for spotting misconfigured DIMM populations.

    # CPU throughput under sustained load: useful for throttling detection
    sysbench cpu --cpu-max-prime=20000 --threads=$(nproc) run

    # First-cut memory bandwidth
    sysbench memory --memory-block-size=1M --memory-total-size=100G run

STREAM is the one synthetic that earns its place in an AI benchmark plan, because memory bandwidth genuinely binds the operations that dominate preprocessing: large tensor moves, contiguous array transforms, and copy-heavy augmentation. Its copy/scale/add/triad pattern approximates the kind of streaming access an image pipeline produces.
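
Where compiling STREAM is inconvenient, the copy kernel alone can be roughly approximated from Python's standard library. The sketch below is a stand-in under stated assumptions, not a replacement for STREAM: the function name `copy_bandwidth_gbs` and its default sizes are hypothetical, it runs on a single thread, and it covers only the copy pattern (no scale, add, or triad). The buffer should be larger than the last-level cache on the machine under test, otherwise the figure reflects cache rather than DRAM.

```python
import time


def copy_bandwidth_gbs(size_bytes=64 * 1024 * 1024, repeats=5):
    """Estimate single-thread copy bandwidth in GB/s (hypothetical helper).

    Counts one read plus one write per byte, as STREAM's copy kernel does.
    """
    # Non-zero fill so every source page is really backed by memory.
    src = bytearray(b"\xa5") * size_bytes
    dst = bytearray(size_bytes)
    dst_view = memoryview(dst)

    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst_view[:] = src  # bulk buffer-to-buffer copy, no per-byte Python loop
        best = min(best, time.perf_counter() - start)

    # Keep the fastest pass: earlier passes include page-fault warm-up.
    return 2 * size_bytes / best / 1e9


if __name__ == "__main__":
    print(f"copy bandwidth (single thread): {copy_bandwidth_gbs():.1f} GB/s")
```

Because it is single-threaded, the result will usually sit well below a multi-threaded STREAM figure; it is most useful for before/after comparisons on the same host, for example with and without `numactl` binding.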

    # STREAM memory bandwidth: directly relevant to preprocessing throughput
    # The default array size is 10M elements; raise it with -DSTREAM_ARRAY_SIZE=<n>
    # if the arrays would otherwise fit in the last-level cache.
    wget https://www.cs.virginia.edu/stream/FTP/Code/stream.c
    gcc -O3 -fopenmp stream.c -o stream
    ./stream

mbw is a lighter alternative that’s easier to run repeatedly during NUMA tuning. In our experience, mbw results correlate more closely with DataLoader throughput on memory-bound preprocessing than any general-purpose CPU score does. This is an observed pattern across several engagements rather than a benchmarked rate published anywhere.

Workload-specific benchmarks: time the pipeline you actually run

The most honest Linux CPU benchmark for AI is a timed run of the real preprocessing pipeline, measured independently of the GPU. A few lines of Python expose more than any synthetic ever will:

    import time

    import numpy as np
    from PIL import Image

    # Image decode throughput, isolated from the GPU.
    # Reusing one file keeps it in the page cache, so this times decode, not disk I/O.
    images = ['test.jpg'] * 1000

    start = time.perf_counter()
    for img_path in images:
        with Image.open(img_path) as img:
            arr = np.array(img)  # forces the full decode; Image.open alone is lazy
    elapsed = time.perf_counter() - start

    print(f'{len(images) / elapsed:.0f} images/sec')

Wrap the real pipeline with cProfile or py-spy and the breakdown becomes concrete: how much time is in JPEG decode versus resize versus the random crop versus the host-to-device copy. That breakdown is the benchmark. Everything else is an indirect estimate.

The decision rule we apply in practice: if the CPU produces batches faster than the GPU consumes them by at least 2×, CPU performance is not the binding constraint, regardless of what any synthetic score reports. If the margin is below 1.2×, the CPU is the bottleneck and any GPU upgrade will hit a wall. Between those two thresholds, the system is sensitive to load variance and worth tuning.

Why NUMA topology dominates multi-socket Linux servers

NUMA matters more than most benchmark suites admit.
On a dual-socket server, a PyTorch DataLoader worker running on socket 0 that reads from memory attached to socket 1 pays an inter-socket bandwidth penalty in the 30–60% range. That range is an observed pattern across the multi-socket deployments we’ve measured, not a single-source benchmark. The fix is straightforward once the topology is mapped:

    # Map the NUMA layout first
    numactl --hardware

    # Bind workers to a single NUMA node
    numactl --cpunodebind=0 --membind=0 python train.py

In one engagement on a dual-socket EPYC 7763 system, local-node memory bandwidth measured roughly 190 GB/s against approximately 85 GB/s remote, a 2.2× gap. Binding DataLoader workers to the local NUMA node reduced per-batch image preprocessing time from around 4.2 ms to 2.1 ms on ImageNet-scale inputs. Those figures are operational measurements from that specific configuration, not a transferable benchmark: a different CPU generation, BIOS revision, or kernel version will produce different absolute numbers, though the structural pattern is consistent.

The connection to the broader system-balance story is direct: PCIe topology and interconnect choice sit downstream of NUMA. A worker bound to the wrong node not only loses memory bandwidth, it also takes a longer route to the GPU’s PCIe root complex. Both penalties compound, and both register as “GPU is slow” in the dashboards without naming the actual cause.

Compiler, library, and kernel choices: the configuration that benchmarks hide

A Linux CPU number quoted without the build and tuning configuration is a number without context. Three sources of variation matter enough to swamp small CPU model differences.

Numerical library linkage. PyTorch built against Intel MKL typically runs batch matrix multiplications and convolutions roughly 15–30% faster than the same PyTorch built against generic OpenBLAS on Intel CPUs. That is a range we’ve seen across multiple builds, framed here as an observed pattern rather than a single benchmark.
On AMD CPUs, BLIS or OpenBLAS with AMD-specific tuning is competitive with MKL. The diagnostic is a single call:

    import torch
    print(torch.__config__.show())

If MKL is listed in the build but not installed at runtime, PyTorch falls back silently to a slower internal path. We check this before investigating anything more exotic, because it explains a surprising fraction of “this server is slow” tickets.

SIMD-accelerated preprocessing. Pillow-SIMD, a drop-in fork of Pillow, runs image decode and resize roughly 4–6× faster than stock Pillow on AVX2 and AVX-512 CPUs (figures consistent with the project’s published benchmarks and with our observed runs). For preprocessing-heavy training, this single substitution often moves the CPU/GPU ratio from “starved GPU” to “comfortable margin” without changing hardware.

Kernel and BIOS tuning. Transparent Huge Pages, CPU governor, I/O scheduler, IRQ affinity, and BIOS power-cap settings each contribute measurable differences that no synthetic benchmark captures. The most reliable single change is the governor:

    # Pin the governor to performance to prevent clock fluctuation (run as root)
    cpupower frequency-set -g performance

In our experience, the performance governor adds roughly 3–8% on sustained preprocessing throughput compared with ondemand or powersave, simply by preventing the clock from sagging during steady-state load. The effect is larger on laptops and small-form-factor machines than on tuned server platforms.

A practical Linux CPU benchmark plan for AI

The plan we apply when scoping or auditing a Linux AI box:

1. Run STREAM and mbw to characterise sustained memory bandwidth, both single-node and inter-node where multi-socket.
2. Run sysbench cpu under sustained load to confirm the CPU does not throttle.
3. Map NUMA topology with numactl --hardware and confirm workers are bound to the correct node.
4. Time the actual preprocessing pipeline end-to-end with time plus a cProfile or py-spy breakdown.
5. Verify library linkage with torch.__config__.show() and confirm MKL / BLIS / Pillow-SIMD are present where expected.
6. Compare CPU samples-per-second to GPU samples-per-second and apply the 2× margin rule.
7. Record kernel, governor, THP setting, library versions, and BIOS tuning state alongside every number.

Step 7 is the part most published Linux CPU AI benchmarks skip, and it’s the part that determines whether the number transfers.

LynxBench AI treats Linux CPU evaluation for AI as workload-and-tuning-bound (kernel, scheduler, NUMA, IRQ affinity, BIOS settings) because the same CPU under different Linux configurations produces materially different pipeline throughput.

FAQ

Why is GPU utilisation frequently capped by the system around the GPU rather than the GPU itself?

GPU utilisation as reported by nvidia-smi measures whether any kernel is active, not whether the GPU is doing useful work at full throughput. In practice, the GPU is often waiting on the next batch from the CPU, on a PCIe transfer to complete, or on a collective operation to finish across nodes. These waits register as “utilisation” if a kernel is technically scheduled but show up as low samples-per-second at the workload level. The structural pattern is covered in more depth in our writing on why GPUs are part of a larger system.

Before accepting any published Linux-CPU AI benchmark as evidence: are the kernel version, tuning configuration, and userspace stack disclosed and reproducible on the deployment host, or does the published number reflect a harness production will never replicate?
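
One way to make that disclosure routine is to capture the configuration at the moment each number is taken. The sketch below covers steps 6 and 7 of the plan under stated assumptions: the function names (`classify_margin`, `capture_config`, `read_sysfs`) and the label strings are hypothetical, the thresholds are the 2× and 1.2× margins from the decision rule above, and the sysfs paths are the usual Linux locations for the cpufreq governor and the THP mode, which may be absent on some hosts (in which case `None` is recorded). Library versions and BIOS state are left out and would need to be added per site.

```python
import json
import platform
from pathlib import Path


def classify_margin(cpu_samples_per_sec, gpu_samples_per_sec):
    """Apply the 2x / 1.2x margin rule to measured throughputs."""
    if gpu_samples_per_sec <= 0:
        raise ValueError("gpu_samples_per_sec must be positive")
    margin = cpu_samples_per_sec / gpu_samples_per_sec
    if margin >= 2.0:
        return "cpu-not-binding"
    if margin < 1.2:
        return "cpu-bottleneck"
    return "sensitive-to-load-variance"


def read_sysfs(path):
    """Return the stripped contents of a sysfs file, or None if unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def capture_config():
    """Collect a small, illustrative subset of the tuning state."""
    return {
        "kernel": platform.release(),
        "machine": platform.machine(),
        "python": platform.python_version(),
        # Governor of cpu0 only; cores can differ on some systems.
        "governor": read_sysfs(
            "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor"
        ),
        "thp": read_sysfs("/sys/kernel/mm/transparent_hugepage/enabled"),
    }


if __name__ == "__main__":
    # Placeholder throughputs for illustration only, not measurements.
    record = {
        "verdict": classify_margin(1500.0, 1000.0),
        "config": capture_config(),
    }
    print(json.dumps(record, indent=2))
```

Storing the resulting JSON next to every samples-per-second figure makes it possible to tell later whether two numbers were taken under comparable conditions.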