Phoronix provides reproducible AI-relevant GPU benchmarks

Unlike Geekbench or 3DMark, the Phoronix Test Suite (PTS) ships test profiles that exercise actual AI framework code: TensorFlow training benchmarks, PyTorch inference tests, ONNX Runtime profiles, and quantised LLM runs through llama.cpp. For comparing GPU hardware in a documented, reproducible way, PTS is more relevant to AI than most consumer benchmark alternatives, provided you read the numbers correctly.

The trap is treating a PTS score as a forecast of production throughput. It is not. It is a controlled snapshot of one fixed workload running on a particular stack, and the gap between that snapshot and what your inference service will actually see can be larger than the gap between two competing GPUs.

We run PTS regularly in our engagements, mostly as a driver and stack validation tool. That framing (scaffolding, not verdict) is the one most teams miss.

Setting up Phoronix for GPU AI testing

    # Install Phoronix Test Suite
    wget https://phoronix-test-suite.com/releases/phoronix-test-suite-10.8.4.tar.gz
    tar xzf phoronix-test-suite-10.8.4.tar.gz
    cd phoronix-test-suite
    sudo ./install-sh

    # Run TensorFlow benchmark
    phoronix-test-suite benchmark tensorflow

    # Run PyTorch benchmark
    phoronix-test-suite benchmark pytorch

    # Run ONNX Runtime benchmark
    phoronix-test-suite benchmark onnxruntime

Each profile pins a model, a batch shape, and a precision. That pinning is what makes the run reproducible, and also what makes it narrow. The profile that runs on your test node is the same profile that ran on the published comparison node, which is exactly the property production workloads do not have.

What are the key AI-relevant Phoronix test profiles?

| Profile | Model tested | Metric | What it measures |
|---|---|---|---|
| tensorflow-benchmark | ResNet-50 | Images/second | Training throughput, fixed batch |
| pytorch-benchmark | ResNet-50, BERT | Items/second | Training/inference, fixed batch |
| onnxruntime | ResNet-50 | Latency/throughput | Inference framework path |
| llama.cpp | Quantised LLM | Tokens/second | CPU+GPU LLM inference |

Note the pattern: every profile is single-stream, fixed batch, fixed model. That is the right shape for a reproducible test. It is the wrong shape for predicting an inference service that sees concurrent requests, variable sequence lengths, and queuing.

Interpreting Phoronix GPU benchmark results

A few interpretive rules we apply on every PTS result we read.

ResNet-50 training throughput is a useful relative comparison across training infrastructure but does not transfer cleanly to modern architectures. ViTs, diffusion U-Nets, and decoder-only transformers stress different parts of the GPU (attention kernels, memory bandwidth into HBM, tensor-core utilisation under FlashAttention) in different proportions than ResNet’s convolutional stack. A GPU that wins on ResNet-50 by a comfortable margin can lose on long-context attention because the bottleneck has moved.

ONNX Runtime inference tests use smaller models and batch sizes than production. GPU efficiency at batch=1 versus batch=32 is non-linear, and the curve is model-dependent. Scaling a PTS latency number linearly to production batch size is one of the most common misreadings we see.

Cross-submission comparisons require matching the software environment. PTS publishes community benchmark results, and the temptation is to compare your number to the leaderboard. Driver version, CUDA version, cuDNN version, and framework version each move the result. In our experience, differences in the software stack alone produce roughly 20–40% variation on identical hardware (observed pattern across the engagements where we have pinned hardware and varied stacks; not a published benchmark).
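
The environment-matching rule can be made mechanical before any two numbers are put side by side. The following is a minimal sketch, assuming each result is accompanied by a small record of the four stack components named above. The dictionary keys and the version strings are hypothetical and do not reflect any PTS result-file format.

```python
# Hypothetical stack records: keys and version strings are illustrative only,
# not a format that PTS itself emits.
STACK_KEYS = ("driver", "cuda", "cudnn", "framework")


def stack_mismatches(ours: dict, theirs: dict) -> list[str]:
    """Return the stack components that differ, or that our record does not list."""
    return [
        key for key in STACK_KEYS
        if ours.get(key) is None or ours.get(key) != theirs.get(key)
    ]


def comparable(ours: dict, theirs: dict) -> bool:
    """Two results are only worth comparing when every stack component matches."""
    return not stack_mismatches(ours, theirs)


ours = {"driver": "550.54", "cuda": "12.4", "cudnn": "9.0", "framework": "torch 2.3"}
theirs = {"driver": "535.129", "cuda": "12.4", "cudnn": "9.0", "framework": "torch 2.3"}

print(stack_mismatches(ours, theirs))
print(comparable(ours, theirs))
```

A missing component is treated as a mismatch on purpose: a submission that does not state its cuDNN version cannot be assumed to match yours.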

Quick-answer block: what PTS does and does not tell you

| Question | PTS answer quality |
|---|---|
| Is my GPU driver stack functional end-to-end? | High: a failing PTS run reliably indicates a problem |
| Will this GPU outperform that GPU on my workload? | Low: only on PTS’s fixed workload, not yours |
| How will the GPU scale to my batch and concurrency? | None: PTS is single-stream, fixed batch |
| Did a driver update regress AI throughput? | High: controlled before/after on identical hardware |
| What is my absolute production throughput? | None: PTS measures a proxy workload |

This is the discipline that separates useful PTS use from misleading PTS use. The suite is a controlled-environment instrument. It is not a production predictor.

What does a Phoronix GPU test tell you about AI readiness?

Phoronix’s GPU benchmarks fall into three families: OpenGL rendering (Unigine, GpuTest), Vulkan compute (vkpeak), and framework-specific AI tests (PyTorch, TensorFlow, ONNX Runtime, llama.cpp). The AI-specific tests are the only ones that track AI workload performance with any accuracy, and even those measure a narrow slice of the surface a production system actually traverses.

The PyTorch benchmark in PTS runs ResNet-50 inference at a fixed batch size. That tells you whether the GPU, driver, and CUDA/cuDNN stack are correctly installed and functioning. It does not tell you how the GPU will behave on your specific model architecture, sequence length, or batch configuration. A Stable Diffusion run, an LLM inference run, and a ResNet-50 inference run stress different subsystems (compute units, memory bandwidth, tensor cores, the attention kernel path) in different proportions. The PTS profile gives you one point in a high-dimensional space.

We use PTS primarily as a driver validation tool. After installing or updating NVIDIA drivers on Linux, running the PTS PyTorch test confirms that the full chain (driver, CUDA runtime, cuDNN, PyTorch, model execution) is functional.
A passing PTS result does not guarantee production readiness. A failing PTS result reliably indicates a stack problem. That asymmetry is the useful property.

For cross-vendor comparison (NVIDIA versus AMD), PTS provides a controlled environment where both vendors run the same test code. Holding the test code constant removes a variable that confounds most ad-hoc cross-vendor comparisons. The catch: PTS framework tests typically do not use vendor-specific optimisations (no FlashAttention on the NVIDIA side, no MIOpen tuning on the AMD side), so the results reflect unoptimised baseline performance rather than what a production-tuned deployment would reach.

In rough terms, the gap we see between PTS results and production-tuned performance runs 20–40% on NVIDIA (where framework optimisations are mature) and 40–60% on AMD (where additional tuning effort is required). These are observed patterns from our deployments, not published benchmarks. That is enough margin to make PTS useful for sanity-check ballpark comparison and unreliable for procurement decisions where the difference between two configurations is being weighed in single-digit percent.

Comparing PTS results across driver versions

One under-used application of PTS is tracking AI performance across driver updates. Running the same profile before and after a driver update on identical hardware produces a controlled comparison that isolates the driver’s performance impact, which is exactly the kind of variable that disappears into noise in production traffic.

We maintain a PTS result database for our production GPU configurations. When evaluating a driver update (for example, moving from a 535.x line to a 550.x line) we run the PTS PyTorch and TensorFlow profiles on a test node before updating, then again after.
A throughput change of more than 3% triggers investigation: either the new driver has introduced a regression (which we then report upstream) or it has enabled an optimisation worth understanding before rolling forward.

This approach has caught three significant driver regressions before they reached production over an 18-month window. In each case the PTS test showed a 5–12% throughput drop that had been invisible in manual production observation because it fell inside the normal variation of live traffic. The controlled, identical-workload comparison made the regression visible (observed in our own engagement history; not a benchmark you can run against our environment).

Why the PTS number is not the production number

This is the crux. PTS pins the workload. Production does not.

A real inference service runs multiple concurrent requests with variable arrival times. It queues. It batches dynamically. Sequence lengths vary across requests; KV-cache sizes vary across the lifetime of a session. Memory pressure interacts with concurrency in ways that single-stream throughput cannot reveal. The PTS number describes the GPU’s behaviour on one fixed point in that space. The production number is an integral over a distribution of points, weighted by the actual traffic pattern.

That structural mismatch is why otherwise-honest published benchmark numbers, PTS or otherwise, routinely fail to predict what a team sees in production. The benchmark is not wrong. The benchmark is answering a different question than the one procurement is asking. Realism is not a binary property the benchmark either has or lacks; it is a question of how close the benchmark’s workload shape sits to the production workload shape on the axes that matter for the architecture in question.

For an extended treatment of that mismatch, “why benchmarks fail to match real AI workloads” covers the structural gap in more detail.
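
The "one point versus a weighted distribution" argument can be illustrated with a small calculation. This is a sketch under strong simplifying assumptions: requests are served one after another with no queuing or dynamic batching, the traffic mix is expressed as a share of requests per operating point, and every throughput figure is invented for illustration. Nothing here is a measured result.

```python
import math


def blended_throughput(per_point: dict[str, float], request_share: dict[str, float]) -> float:
    """Overall requests/second for a serialised mix of operating points.

    Mean time per request is sum(share / throughput), so the blend is a
    weighted harmonic mean rather than a plain average.
    """
    if not math.isclose(sum(request_share.values()), 1.0):
        raise ValueError("request shares must sum to 1")
    mean_seconds_per_request = sum(
        share / per_point[point] for point, share in request_share.items()
    )
    return 1.0 / mean_seconds_per_request


# Invented numbers: GPU A is 30% faster on the pinned benchmark point
# ("short"), while GPU B is faster on long-context requests.
gpu_a = {"short": 1300.0, "long": 300.0}
gpu_b = {"short": 1000.0, "long": 400.0}

# Invented traffic mix: most production requests are long-context.
traffic = {"short": 0.3, "long": 0.7}

blend_a = blended_throughput(gpu_a, traffic)
blend_b = blended_throughput(gpu_b, traffic)
print(f"benchmark point: A={gpu_a['short']:.0f}/s  B={gpu_b['short']:.0f}/s")
print(f"traffic-weighted: A={blend_a:.0f}/s  B={blend_b:.0f}/s")
```

With this hypothetical mix the ranking flips: the GPU that leads on the pinned point trails once the traffic distribution is applied. Real services add queuing and batching effects on top, which this sketch deliberately leaves out.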

When PTS is the right tool

PTS earns its place when the question matches its shape:

- Driver and stack validation. After any change to the driver, CUDA, or cuDNN, a fixed PTS profile is the cheapest end-to-end confirmation that the chain still functions.
- Regression detection across software updates. Identical hardware, identical profile, before-and-after: the controlled comparison PTS makes possible is hard to get any other way.
- Baseline documentation for a new deployment. Recording PTS results for a freshly provisioned node creates a reference point for the next time something looks off.
- Cross-vendor sanity check. Same test code on NVIDIA and AMD, unoptimised path: useful for ballpark, not for procurement margin.

PTS is the wrong tool when the question is “how will this GPU perform on my model under my traffic?” That question requires running your model under your traffic, on the candidate hardware, with the precision and executor configuration that production will actually use.

FAQ

How do concurrency, queuing, and request variability change observed performance versus a single-stream benchmark?

They dominate it. Single-stream throughput sets an upper bound that production rarely reaches, because production must trade latency against batching, absorb arrival-rate variance, and manage KV-cache memory across concurrent sessions. A GPU that looks 30% faster on PTS can be a few percent faster, equal, or even slower in production once the queue depth and batching policy are in play.

LynxBench AI treats the Phoronix GPU test profile as a reproducibility scaffold for an AI workload, not as the workload itself. The AI-relevant property is what actually executes on the GPU under a specified precision regime and stack; PTS is a vehicle for recording that pinning so a third party can rebuild the run.

When evaluating any Phoronix GPU AI result, ask: are the sustained-load workload, precision regime, and AI Executor configuration pinned in the test profile clearly enough for a third party to rebuild the run under the same thermal envelope — or was the suite used to publish a number whose AI definition is opaque?
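
That closing question can be applied as a simple completeness check on a published result. The sketch below assumes the result's description has been transcribed into a plain dictionary. The field names are hypothetical labels for the four items the question lists and are not part of any PTS or LynxBench AI schema.

```python
# Hypothetical field names standing in for the four pinned items in the
# question above; not a PTS or LynxBench AI schema.
REQUIRED_FIELDS = ("workload", "precision", "executor_config", "thermal_envelope")


def opaque_fields(record: dict) -> list[str]:
    """Return the pinned items a published result leaves unstated or blank."""
    return [
        field for field in REQUIRED_FIELDS
        if not str(record.get(field) or "").strip()
    ]


# Illustrative record only; the values are placeholders, not a real submission.
published = {
    "workload": "ResNet-50 inference, fixed batch",
    "precision": "FP16",
    "executor_config": "",
}

missing = opaque_fields(published)
print("rebuildable" if not missing else f"opaque on: {missing}")
```

A result that returns an empty list is at least a candidate for a third-party rebuild; anything else is a number whose AI definition is still partly opaque.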