The spec sheet describes a moment, not a steady state
A GPU data sheet lists a boost clock frequency — per NVIDIA’s published specifications, 1,980 MHz for an H100 SXM. That frequency is real. The chip does reach it. What the spec sheet omits is how long it stays there under a production AI workload, and the answer, for sustained dense compute, is usually “not very long.”
Boost clocks are transient by design. They represent the maximum frequency the chip will sustain when thermal headroom exists and power budget allows. Under the sustained full-load conditions characteristic of neural network training or large-batch inference, both headroom and budget are consumed within minutes. The clock settles to a lower, sustainable frequency — and that settled frequency is what determines your actual throughput over hours and days.
This isn’t a defect. It’s thermal physics, and it governs performance more directly than any software optimization.
Power limits as performance governors
Modern data center GPUs operate within a power envelope managed by onboard firmware. Per published specifications, the NVIDIA A100 SXM has a default TDP of 400W; the H100 SXM is rated at 700W. These are not average power draws — they are limits. When the chip’s instantaneous power consumption approaches the limit, the firmware reduces clock frequency to keep power within bounds.
For AI workloads that fully exercise tensor cores, the power limit is typically the first constraint that activates. Dense matrix multiplications — the dominant operation in both training and inference — drive nearly every functional unit on the die simultaneously. This is the highest-power operating regime the GPU encounters, and it means the power governor engages earlier and more aggressively than in workloads that leave portions of the die idle.
The implication for performance is direct: your training throughput is often a function of the power budget, not the theoretical peak FLOPS. Two identical GPUs at different power limits (configurable via nvidia-smi -pl) will produce measurably different throughput. One running at a 300W limit will sustain a lower clock than one at 400W, and the throughput difference is roughly proportional to the clock difference for compute-bound workloads.
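That proportionality can be sketched with a toy model. All clock and TFLOPS figures below are assumed example values for illustration, not measurements of any specific GPU; real sustained clocks can be observed with `nvidia-smi --query-gpu=clocks.sm,power.draw --format=csv -l 1`.

```python
# Illustrative model (assumed numbers): for a compute-bound kernel,
# sustained throughput scales roughly with the sustained SM clock.

def estimated_throughput(peak_tflops, boost_mhz, sustained_mhz):
    """Scale peak throughput by the sustained-to-boost clock ratio."""
    return peak_tflops * (sustained_mhz / boost_mhz)

# Hypothetical GPU with a 2100 MHz boost clock: assume it settles near
# 1900 MHz at its default power limit, near 1700 MHz when capped lower.
default_limit = estimated_throughput(100.0, 2100.0, 1900.0)
capped = estimated_throughput(100.0, 2100.0, 1700.0)
print(f"default limit: {default_limit:.1f} TFLOPS, capped: {capped:.1f} TFLOPS")
```

The model ignores memory-bound phases, which scale less than linearly with clock, so it is an upper bound on the sensitivity.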
Thermal throttling: gradual, not catastrophic
The word “throttling” implies an emergency — something overheating and desperately pulling back. In data center GPU operation, thermal management is more mundane and more continuous than that.
As the GPU die heats under sustained load, the firmware progressively reduces clock frequency to maintain junction temperature below the rated maximum (typically around 83°C for recent NVIDIA data center GPUs, per published thermal specifications). This is a smooth, continuous process, not a cliff edge. The clock doesn’t drop from its boost frequency to the base frequency in one step; it decreases gradually over minutes, stabilizing wherever the power dissipation matches the cooling capacity.
Cooling capacity itself is a system-level property. It depends on the server chassis design, fan speed profiles, ambient temperature, and most critically, the thermal load from neighboring components. In an 8-GPU DGX node, the interior GPUs see higher ambient temperatures than the edge cards. The same chip, running the same workload, settles at different sustained clocks depending on its position in the chassis. We’ve observed steady-state frequency differences of roughly 60-90 MHz between the hottest and coolest GPU positions in the same node — enough to produce visible throughput variation across cards.
This bears directly on why AI performance changes over time: the thermal trajectory of the first 15 minutes of a workload is characteristically different from that of the next eight hours. Early measurements capture a GPU at above-steady-state frequencies and below-steady-state temperatures. The performance they report is real but temporary.
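The warm-up bias can be illustrated with a toy settling curve. The exponential shape, time constant, and clock values below are all assumptions chosen for illustration, not measured behavior.

```python
import math

# Synthetic sketch (assumed numbers): the SM clock decays exponentially
# from boost toward its steady-state value over a few minutes.
def clock_at(t_min, boost=2100.0, steady=1950.0, tau_min=5.0):
    return steady + (boost - steady) * math.exp(-t_min / tau_min)

def mean_clock(start_min, end_min, step_min=0.5):
    """Average the modeled clock over a measurement window."""
    n = int((end_min - start_min) / step_min)
    samples = [clock_at(start_min + i * step_min) for i in range(n)]
    return sum(samples) / len(samples)

warmup_mean = mean_clock(0, 15)    # the flattering first 15 minutes
steady_mean = mean_clock(60, 480)  # the next several hours
print(f"warm-up mean: {warmup_mean:.0f} MHz, steady mean: {steady_mean:.0f} MHz")
```

Under these assumptions the first 15 minutes average tens of MHz above what the workload will actually sustain for the rest of the run.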
Boost clocks as a marketing footnote
GPU spec sheets prominently feature boost clock frequencies. Marketing materials build performance claims around them. Benchmark results that happen to be measured during the boost-clock phase inherit this flattering number.
The problem isn’t that boost clocks are fictitious — the chip does reach them. The problem is that they describe a capability that exists under specific thermal and power conditions, not a guarantee that holds under production load. For workloads that run for hours or days, the boost clock is a brief initial state that the system moves through on its way to steady-state operation.
The steady-state frequency — sometimes called the sustained or operating frequency — is what actually determines sustained throughput. In practice, it’s typically on the order of 100-300 MHz below the advertised boost frequency for data center GPUs under heavy load, which translates to roughly a 5-15% throughput gap between the boost-phase number and the steady-state number.
This gap is well-known to hardware engineers and largely invisible to the software engineers, data scientists, and procurement teams who consume benchmark results and spec sheets. Surfacing it is one of the basic requirements for honest performance reporting.
Dense GPU environments amplify the problem
Single-GPU testing in an open bench or lightly loaded chassis produces the most optimistic thermal behavior. The GPU has ample cooling airflow, minimal thermal interference, and stays close to boost frequencies longer.
Production deployments are dense. Eight GPUs per node. Multiple nodes per rack. The thermal load per unit volume is substantial, and the cooling infrastructure must handle the aggregate heat output under sustained operation. In practice, this means:
The air reaching each successive GPU in the airflow path has already been warmed by the GPUs upstream of it. Interior positions run hotter. Sustained clocks are lower in the middle of the chassis than at the edges. The aggregate throughput of an 8-GPU node is not 8× the single-GPU throughput measured on an open bench.
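A toy calculation of the aggregate effect, using hypothetical per-position clocks and the same clock-proportional throughput assumption as before:

```python
# Hypothetical sustained clocks (MHz) by chassis position in an 8-GPU
# node; interior slots settle lower than edge slots. The open-bench
# reference clock is likewise an assumed value.
open_bench_mhz = 1980.0
node_mhz = [1960, 1945, 1910, 1895, 1890, 1905, 1940, 1955]

# For compute-bound work, each card contributes roughly in proportion
# to its sustained clock relative to the open-bench measurement.
effective_gpus = sum(c / open_bench_mhz for c in node_mhz)
print(f"node delivers ~{effective_gpus:.2f}x single-GPU throughput, not 8x")
```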
Rack-level effects add another layer. Hot aisle temperature rises as more nodes in the rack reach full load. If the data center’s cooling capacity is marginal or unevenly distributed, pods of nodes can experience sustained above-target ambient temperatures, pushing GPU steady-state clocks lower across the board.
These are operational realities that no single-GPU benchmark captures. They’re also realities that the mythology around sustained GPU utilization often obscures — a GPU can report 100% utilization while operating at a thermally reduced clock that delivers substantially less throughput than the spec sheet implies.
Living with the physics
None of this is fixable by software optimization or clever engineering. Power limits and thermal physics are hard constraints. The practical response is not to fight them but to account for them:
Measure performance under sustained, thermally settled conditions. Don’t report results from the first five minutes of a cold start. Let the system reach thermal equilibrium — which in our experience can take roughly 15 to 30 minutes under full load — and then begin measurement.
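One way to decide when measurement can begin is a simple drift check over a logged clock (or throughput) trace: declare the system settled once consecutive window means stop moving. The window size, tolerance, and synthetic trace below are arbitrary illustrative choices, not a standard method.

```python
def settled_index(samples, window=10, tol=0.005):
    """Return the first index where the means of two consecutive
    windows differ by less than a relative tolerance."""
    for i in range(window, len(samples) - window + 1):
        prev = sum(samples[i - window:i]) / window
        cur = sum(samples[i:i + window]) / window
        if abs(cur - prev) / prev < tol:
            return i  # begin measurement from this sample onward
    return None  # never settled within the trace

# Synthetic trace: one sample per minute, decaying from 2100 MHz
# toward 1950 MHz (assumed geometric decay for illustration).
trace = [1950 + 150 * (0.8 ** t) for t in range(120)]
idx = settled_index(trace)
print(f"settled after ~{idx} minutes; measure from there")
```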
Report power draw alongside throughput. Performance-per-watt is a more stable and more informative metric than raw throughput for workloads that are power-limited. Comparing GPUs at equal power budgets often reveals different performance rankings than comparing at default settings.
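A minimal sketch of an equal-power comparison, with hypothetical throughput and power figures:

```python
# Hypothetical sustained measurements for the same GPU model at two
# power limits: (throughput in samples/s, average power draw in W).
configs = {
    "default_400w": (2400.0, 395.0),
    "capped_300w": (2150.0, 298.0),
}
for name, (tput, watts) in configs.items():
    print(f"{name}: {tput:.0f} samples/s at {watts:.0f} W "
          f"-> {tput / watts:.2f} samples/s/W")
# In this example the capped card loses raw throughput but wins on
# samples per watt, which is why equal-power comparisons can reorder
# rankings relative to default-settings comparisons.
```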
Design around steady-state, not peak. Capacity planning that assumes boost-clock throughput will typically overestimate deliverable throughput by roughly 5-15%. Infrastructure sizing based on sustained, thermally settled performance produces accurate predictions. As discussed in how peak and steady-state performance diverge, the gap between what the hardware can do briefly and what it does continuously is the gap between optimistic planning and realistic planning.
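A back-of-envelope sizing comparison, again with assumed numbers (the target throughput, per-GPU rates, and 10% settling gap are all hypothetical):

```python
import math

# Size a cluster for a target aggregate throughput using steady-state
# per-GPU numbers rather than boost-phase numbers.
target = 1_000_000.0      # samples/s the service must sustain (assumed)
boost_per_gpu = 2400.0    # samples/s measured during the boost phase
steady_per_gpu = 2160.0   # samples/s after thermal settling (~10% lower)

naive = math.ceil(target / boost_per_gpu)
realistic = math.ceil(target / steady_per_gpu)
print(f"boost-based plan: {naive} GPUs; steady-state plan: {realistic} GPUs")
```

Under these assumptions the boost-based plan comes up dozens of GPUs short, a shortfall that only appears hours into sustained operation.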
The physics always wins. The only question is whether your measurements and capacity models acknowledge that before deployment or discover it afterward.