Generic benchmarks answer the wrong question for AI

Running Geekbench or 3DMark on a PC you intend to use for AI workloads answers a question nobody asked. These benchmarks measure compute throughput on standardized tasks that have very little in common with what AI inference and training actually do: move large tensors across memory hierarchies, sustain matrix-multiplication kernels for minutes at a time, and exercise the software stack (driver, CUDA, cuDNN, framework) in ways a synthetic 30-second test never touches.

A benchmark for AI has to measure execution, not specification. That is the practical consequence of treating GPU performance as an execution property of a running system rather than a static property of the hardware: you have to run representative workloads, on the actual stack you intend to deploy, for long enough that the real bottleneck reveals itself. Everything below is the protocol we run when a team asks whether a particular box is fit for purpose.

The three dimensions that actually matter

AI workloads are constrained by the interaction between compute, memory, and the ability to hold both under load. Any benchmark protocol that omits one of these dimensions produces misleading numbers, usually flattering ones, because synthetic burst tests light up compute peaks before memory bandwidth or thermal limits come into play.

1. Peak compute throughput (TFLOPS)

Test whether the hardware can sustain the matrix operations AI models require, at the precision your workload uses: FP32 for training without mixed precision, FP16 or BF16 for most modern training, and INT8 for production inference.
PyTorch on top of CUDA and cuDNN is enough to drive this directly:

```python
import time

import torch

N = 8192      # square matrix dimension
ITERS = 100   # timed iterations

a = torch.randn(N, N, device='cuda', dtype=torch.float16)
b = torch.randn(N, N, device='cuda', dtype=torch.float16)

# Warm up so kernel selection and clock ramp-up stay out of the timed region.
for _ in range(5):
    torch.matmul(a, b)
torch.cuda.synchronize()

start = time.perf_counter()
for _ in range(ITERS):
    torch.matmul(a, b)
torch.cuda.synchronize()  # wait for all queued kernels before stopping the clock
elapsed = time.perf_counter() - start

flops = 2 * N**3 * ITERS  # one N x N GEMM costs about 2 * N^3 floating-point operations
print(f'{flops / elapsed / 1e12:.1f} TFLOPS FP16')
```

What you are looking for is the achieved TFLOPS as a fraction of the GPU’s published peak. In configurations we’ve tested, a well-configured RTX-class workstation will land between 60% and 80% of the spec-sheet number on FP16 GEMM; anything below 60% usually means kernels, drivers, or thermals are leaving compute on the table, not that the silicon is broken.

2. Memory bandwidth at realistic tensor sizes

Most AI inference is memory-bandwidth-bound, not compute-bound. Bandwidth on the tensor shapes your model actually uses is the relevant number; vendor specs quote streaming bandwidth on idealised access patterns that no real model produces.

```python
import time

import torch

ITERS = 1000
# 32 MiB of FP16 data. On GPUs with a large L2 cache this may be partly
# cache-resident, so substitute the tensor shapes your model actually uses.
weights = torch.randn(4096, 4096, device='cuda', dtype=torch.float16)

weights.sum()  # warm-up call, kept out of the timed region
torch.cuda.synchronize()

start = time.perf_counter()
for _ in range(ITERS):
    result = weights.sum()  # a full reduction reads every element once
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

bytes_read = weights.numel() * weights.element_size() * ITERS  # 2 bytes per FP16 value
print(f'{bytes_read / elapsed / 1e9:.0f} GB/s effective bandwidth')
```

A useful threshold (an observed pattern, not a benchmarked rate): effective bandwidth under 50% of the published HBM or GDDR figure typically points at overhead outside the memory subsystem, such as kernel-launch or host-to-device transfer cost, rather than at the memory itself.

3. Sustained performance under load

Burst performance tells you what the GPU can do in the first 60 seconds. Sustained performance tells you what it can do during the next 8 hours of production traffic. The two diverge for thermal, power-budget, and driver-management reasons that no spec sheet exposes.
Run inference for 15 minutes continuously and compare throughput at minutes 1, 5, 10, and 15. Once the sustained-to-peak ratio drops below 0.85, thermal throttling is usually the dominant bottleneck; at that point cooling, not the GPU, is the constraint.

How should I read the numbers?

The three measurements above are only useful against thresholds. The table below is the reference card we use internally when interpreting a fresh run:

| Metric | Good | Acceptable | Investigate |
|---|---|---|---|
| FP16 TFLOPS vs spec-sheet peak | > 80% | 60–80% | < 60% |
| Effective memory bandwidth vs peak | > 70% | 50–70% | < 50% |
| Sustained / burst throughput ratio | > 0.90 | 0.85–0.90 | < 0.85 |

Two GPUs with nearly identical spec sheets routinely land in different columns of this table on the same model, for reasons that are not visible from the spec sheet: driver version, cooling solution, PCIe topology, power limit, and the framework’s kernel choices. That is the empirical version of the broader argument that GPU performance for AI is an execution property, not a static one.

What does a complete benchmark protocol include?

Three measurements are necessary but not sufficient. A complete protocol has four phases (hardware validation, software stack verification, burst performance, and sustained performance), and skipping any of them produces incomplete data.

Hardware validation. Confirm that the GPU reports the expected model, memory capacity, and driver version via nvidia-smi. Check that the PCIe link speed and width match the card’s capability (Gen4 x16 for many current GPUs); a Gen3 link halves host-to-device transfer bandwidth relative to Gen4 and silently caps every transfer-bound benchmark downstream. For multi-GPU systems, verify NVLink is active with nvidia-smi nvlink --status. We have seen entire procurement cycles invalidated because a chassis seated the card in a Gen3 slot.

Software stack verification. Run a known-good model (ResNet-50 or BERT-base) and compare throughput against published baselines for the same hardware.
If throughput is more than 10% below baseline, the stack is probably misconfigured. The common causes are a wrong CUDA version, missing cuDNN, a PyTorch build without GPU support, or aggressive power management capping the GPU clock. None of these will be obvious from the GPU’s own diagnostics.

Burst performance. Run the target workload for two minutes and record peak throughput. This is the ceiling the hardware can reach before thermals take effect. For desktop and workstation GPUs, burst performance usually matches the boost-clock specification; that is what spec sheets are actually describing.

Sustained performance. Run the same workload for 30 minutes and record throughput every 60 seconds. The steady-state value, typically reached after 5 to 10 minutes, is what determines production capacity. We report the ratio of sustained to burst, our “sustain ratio”, as the primary number. Values below 0.85, the “Investigate” threshold in the table above, indicate insufficient cooling for the workload’s power consumption, and no amount of kernel tuning will recover what thermals take away. This is the practical face of the gap between peak and steady-state behaviour.

Documenting results so they remain useful

Benchmark numbers without documentation are one-time observations. Documented results become reusable engineering data. The template we use captures four blocks: hardware specification (GPU model, VRAM, CPU, RAM, storage type), software stack (OS, driver, CUDA version, framework version), test configuration (model, batch size, sequence length, precision), and the results themselves (burst throughput, sustained throughput, P50/P95/P99 latency, power consumption).

Store the results in a version-controlled file (CSV is enough) alongside the benchmark scripts that produced them. This builds a searchable history of hardware performance across configurations and time.
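
As a rough sketch of the template above, the four blocks can be flattened into one CSV row per run. The `BenchmarkRecord` class, its field names and units, and the `append_record` helper are all illustrative assumptions made for this example, not part of any existing tool; adapt the columns to whatever your team actually records.

```python
import csv
import os
from dataclasses import asdict, dataclass, fields


@dataclass
class BenchmarkRecord:
    """One benchmark run. Field names are hypothetical and mirror the four blocks."""
    # Hardware specification
    gpu_model: str
    vram_gb: float
    cpu: str
    ram_gb: float
    storage_type: str
    # Software stack
    os_version: str
    driver_version: str
    cuda_version: str
    framework_version: str
    # Test configuration
    model: str
    batch_size: int
    sequence_length: int
    precision: str
    # Results
    burst_throughput: float
    sustained_throughput: float
    latency_p50_ms: float
    latency_p95_ms: float
    latency_p99_ms: float
    power_watts: float

    @property
    def sustain_ratio(self) -> float:
        """Sustained throughput divided by burst throughput."""
        return self.sustained_throughput / self.burst_throughput


def append_record(path: str, record: BenchmarkRecord) -> None:
    """Append one run to a CSV file, writing the header only if the file is new or empty."""
    columns = [f.name for f in fields(BenchmarkRecord)] + ["sustain_ratio"]
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    row = asdict(record)  # properties are not dataclass fields, so add the ratio explicitly
    row["sustain_ratio"] = round(record.sustain_ratio, 3)
    with open(path, "a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns)
        if write_header:
            writer.writeheader()
        writer.writerow(row)
```

Calling `append_record` once per run and committing the resulting file next to the benchmark script is one way to get the version-controlled history described above; a plain append-only CSV keeps diffs readable in code review.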
When evaluating new hardware, the honest comparison anchor is the team’s own record on the same workload, not a vendor number measured under undisclosed conditions.

The documentation overhead is roughly five minutes per run. That is a negligible cost to convert an ephemeral test into durable engineering knowledge. When the next “how does the 4090 compare to an A100 for our workload?” question lands, the answer comes from data rather than memory.

Why benchmarking is a protocol, not an event

The most common mistake in AI benchmarking is treating it as a one-time activity. Hardware drifts. Thermal paste degrades, dust accumulates in cooling stacks, drivers and frameworks change kernel selection, and the OS scheduler reshapes contention with every update.

We re-benchmark production hardware quarterly against the original baseline. A 5% quarterly decline in sustained throughput, invisible against daily monitoring noise, compounds to roughly a 19% annual reduction in effective capacity (1 - 0.95^4 ≈ 0.185). Periodic benchmarking catches that drift while it is still cheap to fix.

This is the design intent behind LynxBench AI: treat the benchmark as a continuously executed protocol, re-run across driver, framework, and OS updates against a versioned baseline, because hardware-as-measured drifts with software and a one-time baseline becomes stale inside a single release cycle.

The question to put to any AI-PC benchmark protocol: does the methodology specify how often and against what baseline the protocol is re-run as the model, driver, and operating point drift, or is the result being treated as a static property of the box?
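
The compounding figure in the drift discussion above is easy to verify, and the same comparison against a stored baseline can be scripted. The sketch below is a minimal illustration only: the function names and the 5% default tolerance are assumptions made for this example and do not describe LynxBench AI or any other tool.

```python
def compounded_decline(per_period_decline: float, periods: int) -> float:
    """Total fractional capacity loss after `periods` equal fractional declines."""
    return 1.0 - (1.0 - per_period_decline) ** periods


def drift_from_baseline(baseline: float, current: float) -> float:
    """Fractional drop in sustained throughput versus the baseline (positive means slower)."""
    if baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    return (baseline - current) / baseline


def needs_investigation(baseline: float, current: float, tolerance: float = 0.05) -> bool:
    """Flag a re-run whose drop versus baseline reaches the tolerance (hypothetical default)."""
    return drift_from_baseline(baseline, current) >= tolerance


if __name__ == "__main__":
    # A 5% decline repeated over four quarters.
    print(f"{compounded_decline(0.05, 4):.1%}")  # 18.5%
```

Comparing each quarterly run against the original versioned baseline, rather than against the previous quarter, is what keeps slow drift visible: four consecutive 5% drops each look small in isolation but add up to the annual figure computed here.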