“Same GPU” is not the equivalence class people think it is

Two physical GPUs of the same model run the same benchmark. The numbers come back different. The instinct is to look for a fault: a defective unit, bad thermal paste, suspicious silicon. Usually there’s no fault.

The model number on the box is a hardware identity; it is not a performance contract. The performance the workload achieves is a property of the AI Executor (accelerator plus driver plus runtime plus framework plus precision plus host plus thermal envelope), and “same model number” holds constant only the first item in that list.

Treating the model number as a performance contract produces two predictable failures: chasing phantom hardware faults, and reading benchmark differences as more meaningful than they are. We see both regularly when teams ask us to look at a “GPU problem” that turns out to live two or three layers up the stack.

What changes when the “same GPU” sits in two different hosts?

The hardware identity holds. Almost everything else can shift.
The table below lists the axes that, in our experience, account for nearly all of the observed variance between two nominally identical accelerators:

| Axis | Why it changes per host |
| --- | --- |
| Driver version | Different install dates, different distro update cadence |
| CUDA / runtime version | Framework wheels vendor different toolkits; system installs differ |
| Framework version + build | Different wheel sources (PyPI, conda-forge, NGC), different dependency resolutions |
| Kernel libraries (cuDNN, cuBLAS, NCCL) | Vendored per framework wheel; a system install can shadow the vendored copy |
| OS kernel version | Different distros, different update windows |
| PCIe topology | Slot generation, lane width, switch chip presence on motherboard |
| CPU and host memory | Affects host-side preprocessing, dataloader throughput |
| Cooling configuration | Server form factor, fan curves, ambient temperature |
| Power-cap policy | Vendor power caps configurable per host (nvidia-smi -pl) |
| Co-tenant load | Other workloads competing for memory bandwidth, network, storage |
| Workload shape / batch / precision | Operator-controlled, not always held constant in casual comparisons |

Any of these can shift observed performance. Several typically do, and the effects compose. A benchmark difference between two hosts running the same GPU model is the natural consequence of holding only the silicon constant while letting the rest of the executor vary.

The silicon-side variance from manufacturing tolerances is small for modern AI accelerators, typically well below what executor-level differences contribute. That is an observed pattern across the hosts we’ve profiled; it is not a benchmarked rate, and the exact ratio depends on the workload’s sensitivity to memory bandwidth, kernel selection, and precision. The point is directional: when two same-model accelerators disagree, the silicon is almost never where the disagreement lives.

When is variance a system difference rather than a hardware fault?
This is the diagnostic question that matters most, because the answer determines which layer of the stack the investigation should touch. The short version: variance is a hardware fault only after the executor configuration has been held constant and the variance persists. Until then, variance is evidence about the executor, not about the silicon.

A workable narrowing sequence:

1. Lock the workload. Same model, same batch, same precision, same input distribution, same warm-up policy, same measurement window. Casual comparisons almost always vary at least one of these.
2. Lock the framework and kernel libraries. Install the same framework wheel on both hosts. PyTorch, TensorFlow, or JAX builds typically vendor their own CUDA runtime libraries, cuDNN, and (often) NCCL; a difference of a single minor version in the framework’s vendored cuDNN can move attention-kernel throughput noticeably.
3. Lock the driver and runtime. Match NVIDIA driver versions and confirm the CUDA runtime the framework actually loads (which is usually the vendored one, not the system install).
4. Lock the power and thermal envelope. Check nvidia-smi -q for power caps, persistence mode, and clock-throttle reasons. Two GPUs at different ambient temperatures will clock differently long before any thermal alarm fires.
5. Lock the host-side contributors. PCIe topology, NUMA placement, dataloader worker count, and co-tenant load. If the workload is bandwidth-bound on host-to-device transfer, two different motherboards will produce two different numbers even with identical GPUs.
6. Only now consider silicon. Swap the two GPUs between the two hosts. If the slower number follows the GPU, the silicon is genuinely different. If the slower number stays with the host, the host is the cause.

Most of the “is this a defective unit?” investigations we’ve reviewed close out at step 2 or step 3. The unit was fine; the executor was different.

Three patterns that recur

This is not an edge case.
The same three shapes show up across teams comparing accelerators, fleets upgrading drivers, and buyers reproducing vendor benchmarks:

- A team buys two of the same accelerator. Benchmark scores differ. The team investigates the silicon. They find no fault, and the difference persists. The actual cause is that the two hosts have slightly different driver versions, or were thermally pre-conditioned differently before the test. The investigation is in the wrong layer.
- A team upgrades a driver across a fleet. Benchmark scores shift. The team attributes the shift to “the new driver.” The actual cause is the new driver’s interaction with the framework’s vendored libraries, which is a property of the executor configuration, not of the driver alone. The attribution is incomplete.
- A vendor publishes a benchmark on a specific stack. A buyer reproduces the test on their own stack and gets a different number. The buyer suspects vendor inflation. The actual cause is that the buyer’s executor configuration differs from the vendor’s, and the benchmark is internally consistent within each configuration. The interpretation is misframed.

In each case, the “same GPU” equivalence class hid the variable that actually mattered.

The methodological consequence

If “same GPU” is not a useful equivalence class for performance comparison, then a benchmark report must record the equivalence class that actually is useful (the AI Executor), and any comparison must hold that broader class constant.

The minimum disclosure surface for an AI accelerator benchmark to be comparable to another report on the same hardware:

- Accelerator model and unit ID (where unit-to-unit variance is being investigated).
- Driver version.
- CUDA / runtime version, plus its source (system install vs framework-vendored).
- Framework version and wheel source.
- Kernel library versions (cuDNN, cuBLAS, NCCL).
- OS and kernel version.
- Host platform (CPU, memory, PCIe topology relevant to data movement).
- Cooling configuration and ambient conditions.
- Power-cap setting.
- Co-tenant load policy during measurement.
- Workload, precision regime, batch size, and concurrency configuration.
- Whether warm-up was excluded; the measurement window length.

A report that names these can be compared meaningfully to another report that names them. A report that names only the GPU model and a throughput number is reporting on an unspecified executor, and “same GPU” between that report and any other is not a comparison the reader can perform.

The framing that helps

- The model number is a hardware identity, not a performance contract.
- Performance is a property of the AI Executor (silicon plus driver plus runtime plus framework plus precision plus host plus thermal envelope), and “same model number” holds only the first item constant.
- Benchmark differences between two same-model GPUs are the expected consequence of executor variance, not a sign of hardware fault.
- Comparing benchmarks across hosts requires the executor configuration to be disclosed and held constant, which is a stricter requirement than matching model numbers.

The operational expression is that identical hardware is a necessary but not sufficient condition for identical performance; the executor configuration is the sufficient condition the benchmark methodology has to enforce.

Our work on why identical GPUs perform differently extends this into the diagnostic sequence; LynxBench AI treats the AI Executor as the unit of measurement for exactly this reason: the model number is an identity property of one component, and benchmark comparability requires the full executor configuration to be the unit of equivalence.

The model number tells you what you bought. The executor tells you what it will do. Which AI Executor (kernel coverage, runtime, memory hierarchy, scheduler, driver) is the benchmark score in front of you actually measuring, and would your deployment reproduce it?
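
For readers who want to make the disclosure surface and the swap test concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a reference implementation: the names ExecutorManifest, differing_axes, and swap_attribution are hypothetical, the field values are placeholders rather than real version strings or measurements, and the 2% tolerance is an arbitrary choice that a real methodology would derive from run-to-run noise.

```python
from dataclasses import dataclass, asdict, replace


@dataclass(frozen=True)
class ExecutorManifest:
    """Hypothetical record mirroring the disclosure list above."""
    accelerator_model: str
    unit_id: str
    driver_version: str
    cuda_runtime: str
    cuda_runtime_source: str      # "system" or "framework-vendored"
    framework: str
    wheel_source: str
    kernel_libraries: str
    os_kernel: str
    host_platform: str
    cooling: str
    power_cap_watts: int
    co_tenant_policy: str
    workload: str
    precision: str
    batch_size: int
    warmup_excluded: bool
    measurement_window_s: float


def differing_axes(a, b, ignore=("unit_id",)):
    """Return {field: (value_a, value_b)} for each field that differs."""
    da, db = asdict(a), asdict(b)
    return {k: (da[k], db[k]) for k in da if k not in ignore and da[k] != db[k]}


def swap_attribution(throughput, tolerance=0.02):
    """Classify a GPU swap test as 'gpu', 'host', 'both', or 'neither'.

    throughput maps (host, gpu) -> measured throughput and must contain
    all four pairings of two hosts and two GPUs.
    """
    hosts = sorted({h for h, _ in throughput})
    gpus = sorted({g for _, g in throughput})
    if len(hosts) != 2 or len(gpus) != 2 or len(throughput) != 4:
        raise ValueError("expected all four pairings of two hosts and two GPUs")

    def rel_gap(x, y):
        return abs(x - y) / max(x, y)

    # Average each host over both GPUs, and each GPU over both hosts.
    host_means = [sum(throughput[(h, g)] for g in gpus) / 2 for h in hosts]
    gpu_means = [sum(throughput[(h, g)] for h in hosts) / 2 for g in gpus]
    host_moves = rel_gap(*host_means) > tolerance
    gpu_moves = rel_gap(*gpu_means) > tolerance

    if gpu_moves and host_moves:
        return "both"
    if gpu_moves:
        return "gpu"
    if host_moves:
        return "host"
    return "neither"


# Placeholder values only; none of these are real versions or results.
host_a = ExecutorManifest(
    accelerator_model="example-accelerator", unit_id="unit-1",
    driver_version="driver-A", cuda_runtime="runtime-A",
    cuda_runtime_source="framework-vendored", framework="framework-A",
    wheel_source="PyPI", kernel_libraries="libs-A", os_kernel="kernel-A",
    host_platform="platform-A", cooling="air", power_cap_watts=350,
    co_tenant_policy="exclusive", workload="example-model",
    precision="bf16", batch_size=32, warmup_excluded=True,
    measurement_window_s=60.0,
)
host_b = replace(host_a, unit_id="unit-2", driver_version="driver-B",
                 power_cap_watts=300)

diff = differing_axes(host_a, host_b)
verdict = swap_attribution({
    ("h1", "g1"): 1000.0, ("h1", "g2"): 995.0,
    ("h2", "g1"): 900.0, ("h2", "g2"): 905.0,
})
print(sorted(diff), verdict)
```

In this made-up data the manifests differ on driver version and power cap, and the slower numbers stay with host h2 regardless of which GPU is installed, so the sketch attributes the gap to the host. An empty differing_axes result is the precondition for treating a swap test as evidence about silicon; a non-empty one names the layers to align first.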