AI Data Center Power: Why Nameplate TDP Is Not a Capacity Plan

AI data center power is workload-conditional. Why nameplate TDP misses, and how to reason about power as a capacity-planning input.

Written by TechnoLynx Published on 13 May 2026

Introduction

A capacity plan built on accelerator nameplate TDP is a fiction. The number on the spec sheet is nominally the maximum the silicon draws under sustained worst-case load; the number the deployed workload actually pulls is workload-conditional, often off by 30–60% in either direction, and varies across the same hardware depending on which model is loaded and how the serving stack is configured. This article makes the case that AI data center power is a function of the workload first and the hardware second, walks through the failure modes that a nameplate-multiplied plan produces, and frames power as a capacity input that needs the same profiling discipline as GPU FLOPs. See the GPU engineering practice for the broader audit framework.

The naive read is “multiply count by TDP, add cooling overhead, done.” The expert read is that the same fleet runs at radically different power footprints across inference-bound and training-bound workloads, and that the cost of overbuilding capacity is paid forever in capex and stranded power-purchase agreements.

What this means in practice

  • Nameplate TDP overstates power for memory-bound inference; understates it for sustained training under aggressive utilisation tuning.
  • Cooling and PUE assumptions depend on the workload mix, not on the count of accelerators.
  • Profile the actual workload power draw before signing the next power-purchase agreement.
  • Track power per useful FLOP, not power per purchased FLOP: the useful-FLOP figure is the one the procurement decision should rest on.

How do I calculate the true cost of an underutilised GPU fleet?

For the power axis specifically: cost is hardware amortisation plus power (kWh × tariff) plus cooling (kWh × tariff × (PUE−1)) plus operational overhead, divided by useful FLOPs (the model FLOPs actually consumed by the workload, not the GPU’s theoretical peak). The composite picture is regularly 3–10× the back-of-envelope estimate teams use when they pitch the next procurement cycle. This range is an observed pattern across our GPU audit engagements, not a benchmarked rate — the multiplier in any specific environment depends on tariff, PUE, and workload mix.
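The arithmetic above can be sketched as a small function. This is a minimal illustration, not an audit tool: the class name, field names, and every number in the usage lines are hypothetical placeholders, and it assumes all cost terms cover the same period as the useful-FLOP count.

```python
from dataclasses import dataclass


@dataclass
class FleetCosts:
    """Cost inputs for one accounting period (all names are illustrative)."""
    hardware_amortisation: float  # currency for the period
    it_energy_kwh: float          # energy drawn by the accelerators, in kWh
    tariff_per_kwh: float         # currency per kWh
    pue: float                    # facility PUE, expected to be >= 1.0
    operational_overhead: float   # currency for the period


def cost_per_useful_flop(costs: FleetCosts, useful_flops: float) -> float:
    """Hardware + power + cooling + overhead, divided by useful FLOPs."""
    if useful_flops <= 0:
        raise ValueError("useful_flops must be positive")
    if costs.pue < 1.0:
        raise ValueError("PUE below 1.0 is not physically meaningful")
    power_cost = costs.it_energy_kwh * costs.tariff_per_kwh
    # Cooling term from the text: kWh x tariff x (PUE - 1).
    cooling_cost = power_cost * (costs.pue - 1.0)
    total = (costs.hardware_amortisation + power_cost
             + cooling_cost + costs.operational_overhead)
    return total / useful_flops


if __name__ == "__main__":
    # Placeholder figures only; substitute measured values from a profile.
    example = FleetCosts(
        hardware_amortisation=1000.0,
        it_energy_kwh=100.0,
        tariff_per_kwh=0.5,
        pue=1.5,
        operational_overhead=25.0,
    )
    print(cost_per_useful_flop(example, useful_flops=1.0e3))
```

Swapping the useful-FLOP denominator for the spec-sheet peak reproduces the back-of-envelope figure, which makes the gap between the two directly visible.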

The discipline that closes the gap is profiling. A representative two-week capture of the workload’s actual power draw (DCGM exports power-per-GPU at second-granularity) plus the useful FLOPs the workload consumed (via Nsight Compute or framework-level profilers) produces the per-useful-FLOP power number. Procurement decisions made against this number rather than against nameplate TDP avoid the 30–60% overbuild that becomes stranded power and capex.
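As a rough sketch of how a capture becomes a per-useful-FLOP number: the code below assumes the power samples have already been exported into (timestamp in seconds, watts) pairs and that the useful-FLOP total for the same window is known from a profiler. The export step itself is not shown, and the function names are illustrative.

```python
def energy_kwh(samples):
    """Integrate (timestamp_s, watts) samples into kWh with the trapezoid rule.

    `samples` must be sorted by timestamp and contain at least two points.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to integrate")
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        if t1 < t0:
            raise ValueError("samples must be sorted by timestamp")
        joules += 0.5 * (p0 + p1) * (t1 - t0)
    return joules / 3.6e6  # 1 kWh = 3.6e6 J


def joules_per_useful_flop(samples, useful_flops):
    """Energy spent per useful FLOP over the captured window."""
    if useful_flops <= 0:
        raise ValueError("useful_flops must be positive")
    return energy_kwh(samples) * 3.6e6 / useful_flops


if __name__ == "__main__":
    # Two placeholder samples one hour apart at a constant draw.
    capture = [(0.0, 1000.0), (3600.0, 1000.0)]
    print(energy_kwh(capture))
    print(joules_per_useful_flop(capture, useful_flops=3.6e6))
```

Trapezoidal integration tolerates uneven sample spacing, which matters if the capture has gaps; a long gap should still be flagged rather than silently interpolated.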

What does GPU utilisation actually measure — and why is the GPU-busy percentage misleading?

nvidia-smi’s GPU-utilisation percentage reports the share of the sampling window during which at least one kernel was executing on the GPU. It is silent on how many Streaming Multiprocessors (SMs) were active, what fraction of memory bandwidth was used, and whether the active kernel did useful work. A kernel that occupies a single SM at 5% occupancy for the whole window still shows up as “100% utilised.” Power draw correlates with the composite picture (SM occupancy, memory bandwidth, frequency state), not with the nvidia-smi headline.
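A toy model makes the gap concrete. The sketch below is a deliberate simplification, not a description of how nvidia-smi is implemented: it treats the sampling window as a list of ticks, each recording how many SMs were active, and compares a "busy" headline with an SM-weighted activity figure. The SM count is a hypothetical placeholder.

```python
def busy_percent(active_sms_per_tick):
    """Share of ticks in which at least one SM was active (headline style)."""
    ticks = len(active_sms_per_tick)
    if ticks == 0:
        raise ValueError("need at least one tick")
    busy = sum(1 for n in active_sms_per_tick if n > 0)
    return 100.0 * busy / ticks


def sm_activity_percent(active_sms_per_tick, total_sms):
    """Average fraction of SMs active across the window."""
    ticks = len(active_sms_per_tick)
    if ticks == 0 or total_sms <= 0:
        raise ValueError("need at least one tick and a positive SM count")
    return 100.0 * sum(active_sms_per_tick) / (ticks * total_sms)


if __name__ == "__main__":
    TOTAL_SMS = 100          # hypothetical device size
    timeline = [1] * 10      # one SM active in every tick
    print(busy_percent(timeline))                    # headline view
    print(sm_activity_percent(timeline, TOTAL_SMS))  # SM-weighted view
```

With one SM active in every tick, the headline reads fully busy while the SM-weighted figure stays at a small fraction of the device, which is the situation the paragraph above describes.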

For power-planning purposes, the useful triple is GPU power draw (DCGM), SM occupancy plus memory bandwidth utilisation (Nsight Compute), and arithmetic intensity vs roofline. These together explain the power footprint and predict how the footprint will change when the workload mix or model size shifts. Power-planning that uses only the GPU-busy headline systematically misestimates both the average and the peak draw — we see this pattern regularly in audits of fleets sized against nvidia-smi dashboards.
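The roofline part of that triple reduces to one line of arithmetic: attainable throughput is the smaller of peak compute and arithmetic intensity times peak memory bandwidth. A minimal sketch follows; the peak figures in the usage lines are hypothetical and should be replaced with the values for the actual device.

```python
def ridge_point(peak_flops, peak_bandwidth):
    """Arithmetic intensity (FLOP/byte) at which compute and bandwidth balance."""
    return peak_flops / peak_bandwidth


def attainable_flops(arithmetic_intensity, peak_flops, peak_bandwidth):
    """Roofline bound: min(peak compute, intensity x peak bandwidth)."""
    return min(peak_flops, arithmetic_intensity * peak_bandwidth)


def is_memory_bound(arithmetic_intensity, peak_flops, peak_bandwidth):
    """True when the kernel sits left of the ridge point."""
    return arithmetic_intensity < ridge_point(peak_flops, peak_bandwidth)


if __name__ == "__main__":
    PEAK_FLOPS = 100e12      # hypothetical: 100 TFLOP/s
    PEAK_BANDWIDTH = 2e12    # hypothetical: 2 TB/s
    for intensity in (10.0, 100.0):
        print(intensity,
              attainable_flops(intensity, PEAK_FLOPS, PEAK_BANDWIDTH),
              is_memory_bound(intensity, PEAK_FLOPS, PEAK_BANDWIDTH))
```

A kernel left of the ridge point is the case where measured power sits below nameplate, because bandwidth saturates before the compute units do.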

How do I compute total cost of ownership per useful FLOP rather than per purchased FLOP?

Useful FLOPs are the FLOPs the workload’s actual computation requires (matmul, conv, attention) — measured at kernel level, not the spec sheet peak. For the power-conditional TCO: include power and cooling alongside hardware amortisation, divide by annualised useful FLOPs. Teams that adopt this metric typically discover their existing fleet has 2–5× the useful capacity the procurement plan assumed, which directly translates to deferred power-purchase commitments. The 2–5× is an observed range across our audit engagements; the upper bound is most common in fleets where mixed precision has not yet been adopted.

The operational version of the metric: instrument the workload to log per-kernel FLOPs and per-period GPU power draw; aggregate into useful-FLOPs-per-kWh as the headline efficiency number. This number, tracked over time, exposes whether optimisations actually move the cost-per-output needle or just shift work between bottlenecks.
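One way to track the headline number over time is sketched below. It assumes each reporting period has already been reduced to a useful-FLOP total and an energy total; the record type, labels, and values are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class PeriodLog:
    """Aggregated measurements for one reporting period (illustrative)."""
    label: str
    useful_flops: float  # summed per-kernel FLOPs for the period
    energy_kwh: float    # integrated GPU energy for the period


def flops_per_kwh(log: PeriodLog) -> float:
    if log.energy_kwh <= 0:
        raise ValueError("energy_kwh must be positive")
    return log.useful_flops / log.energy_kwh


def efficiency_trend(logs):
    """Return (label, useful FLOPs per kWh, ratio vs previous period).

    The ratio is None for the first period.
    """
    trend = []
    previous = None
    for log in logs:
        current = flops_per_kwh(log)
        ratio = None if previous is None else current / previous
        trend.append((log.label, current, ratio))
        previous = current
    return trend


if __name__ == "__main__":
    # Placeholder periods; real values come from the instrumentation.
    history = [
        PeriodLog("before", useful_flops=100.0, energy_kwh=10.0),
        PeriodLog("after", useful_flops=300.0, energy_kwh=15.0),
    ]
    for row in efficiency_trend(history):
        print(row)
```

A ratio near 1.0 after an optimisation is the signal that work has moved between bottlenecks rather than become cheaper.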

Which workload patterns most often leave GPU capacity on the table?

Four patterns recur and each has a distinct power signature.

| Pattern | Power signature | Useful-work signal | Typical recovery |
| --- | --- | --- | --- |
| Host-bound (CPU can’t feed GPU) | Moderate, with idle troughs | SMs idle large fraction of time | 20–80% (observed) |
| Small-batch | Sustained but inefficient | High launch/sync overhead | 30–100% (observed) |
| Memory-bound run as compute-bound | Below nameplate plateau | Bandwidth saturated before compute | 30–200% (observed) |
| Mixed-precision ignored (fp32 only) | 2–4× higher per useful FLOP | bf16/int8 path unused | 2–4× useful FLOPs (observed) |

Host-bound workloads pay for capacity of which very little is consumed. Small-batch workloads burn launch and synchronisation overhead; GPU power is sustained but useful work per joule is poor. Memory-bound kernels run at lower power than the silicon can sustain because bandwidth saturates before compute does; the throughput plateau is at a lower power point than the nameplate implies. Ignored mixed-precision opportunities are the fourth pattern: a workload running in fp32 where bf16 or int8 would deliver 2–4× throughput consumes 2–4× more power per useful FLOP than necessary. Each costs the data center money, and the four together typically account for the majority of stranded capacity we find in audits.

Should I procure additional GPU capacity or first profile the utilisation of what I have?

Profile first, always, before any additional power or accelerator procurement. A GPU performance audit measures actual utilisation and power per workload, identifies where capacity (and the associated power) is wasted, and quantifies the achievable improvement. In our experience, the audit typically recovers 30–80% of the headroom that procurement would otherwise have paid for, and the power-purchase-agreement implications often dwarf the hardware cost. The 30–80% range is an observed pattern; outliers in either direction exist when workloads are already well-tuned or, conversely, when no optimisation work has ever been done.

The audit also exposes when additional procurement is genuinely needed: the workload saturates well-optimised existing capacity, growth demands capacity beyond the optimisation envelope, or the new workload mix has a different power profile that the existing infrastructure cannot serve. Those cases become easier to defend to finance after the audit because the number rests on measurement, not on spec-sheet multiplication. For the broader conversation about why GPUs are the dominant capacity-planning unit for AI in the first place, our piece on why GPUs are the bottleneck unit for AI sits upstream of this one.

What cost savings are realistic from optimising utilisation versus renting more cloud GPUs?

Audit ranges, with the power axis explicit. Host-pipeline fixes (data loading, preprocessing placement, async transfers) recover 20–80% useful FLOPs without changing average GPU power, dropping power-per-useful-FLOP proportionally. Batch-size and kernel-fusion fixes deliver 30–100% useful FLOPs at moderately higher average power — net power-per-useful-FLOP improves substantially.

Mixed-precision adoption delivers 2–4× useful FLOPs at lower average power per FLOP (the multiply-accumulate units for fp16/bf16/int8 draw less than the fp32 equivalents). Memory-layout and bandwidth optimisations recover 30–200% on memory-bound kernels — power often stays flat as bandwidth saturates earlier, so power-per-useful-FLOP improves dramatically. In aggregate, audited fleets commonly absorb 2–3× the current workload without additional procurement, and the power-purchase-agreement implications make this the highest-leverage optimisation a CFO can fund. These ranges are observed across our engagements rather than externally benchmarked; the per-fleet number depends on starting state.
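The effect of these ranges on power per useful FLOP is simple arithmetic: the ratio of the power change to the throughput change. The helper below is an illustrative sketch with a hypothetical name; gains and power changes are passed as fractions (0.8 means +80%).

```python
def relative_power_per_useful_flop(flops_gain, power_change=0.0):
    """Power per useful FLOP after a fix, relative to before (1.0 = unchanged).

    flops_gain:   fractional increase in useful FLOPs, e.g. 0.8 for +80%
    power_change: fractional change in average power, e.g. 0.2 for +20%
    """
    if flops_gain <= -1.0:
        raise ValueError("flops_gain must be greater than -1.0")
    if power_change <= -1.0:
        raise ValueError("power_change must be greater than -1.0")
    return (1.0 + power_change) / (1.0 + flops_gain)


if __name__ == "__main__":
    # Host-pipeline range from the text, at flat average power.
    for gain in (0.2, 0.8):
        print(gain, relative_power_per_useful_flop(gain))
    # Same throughput gain with a hypothetical 20% rise in average power.
    print(relative_power_per_useful_flop(1.0, power_change=0.2))
```

At flat power, a 20% gain in useful FLOPs leaves power per useful FLOP at roughly 83% of its starting value and an 80% gain leaves roughly 56%, so the improvement is the reciprocal of the throughput gain rather than a one-for-one reduction.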

Limitations that remain

The power axis adds discipline to capacity planning but does not eliminate the rest of the work. Power profiling needs the same tooling (DCGM, Nsight Compute) and the same operational discipline as FLOPs profiling, and organisations new to GPU operations underestimate the ramp. Some workloads are genuinely close to optimal; for those, the audit simply confirms that procurement is needed. Cooling and PUE assumptions depend on data center context that the per-workload audit cannot fully model, so facilities engineering needs to be in the loop for the full picture. Mixed-precision adoption needs accuracy validation that some teams treat as optional; rushed adoption produces correctness regressions that look like model failures but are precision artefacts.

How TechnoLynx Can Help

TechnoLynx runs GPU performance audits that include the power-and-cooling axis explicitly, exposing the power-per-useful-FLOP picture before the next procurement or power-purchase-agreement signing. If you are about to commit power capacity based on nameplate TDP arithmetic, contact us for a workload-conditional audit first.

Frequently Asked Questions

How do I calculate the true cost of an underutilised GPU fleet?

Full TCO (hardware amortisation plus power, cooling, and operational overhead, annualised) divided by useful FLOPs measured from a one-to-two-week workload profile — DCGM for power, Nsight Compute for FLOPs. The composite figure is regularly 3–10× the back-of-envelope cost-per-GPU-hour teams quote during procurement, and the gap is dominated by the workload-conditional power and the fraction of purchased FLOPs that never become useful work.

What does GPU utilisation actually measure?

The nvidia-smi headline reports the share of the sampling window during which at least one kernel was executing; it says nothing about how many SMs were active or whether the kernel did useful work. Power-planning needs the composite picture: GPU power draw (DCGM), SM occupancy plus memory-bandwidth utilisation (Nsight Compute), and a roofline view of arithmetic intensity. Sizing power capacity against the nvidia-smi headline alone systematically misestimates both average and peak draw.

How do I compute TCO per useful FLOP rather than per purchased FLOP?

Useful FLOPs are the workload’s actual matmul, convolution, and attention work measured at kernel level, not the spec-sheet peak. Include power and cooling alongside hardware amortisation in the TCO numerator and divide by annualised useful FLOPs. Track useful-FLOPs-per-kWh as the headline efficiency number over time — it exposes whether optimisations move the cost-per-output needle or just shift work between bottlenecks.

Which workload patterns most often leave GPU capacity on the table?

Four patterns dominate: host-bound pipelines where the CPU cannot feed the GPU, small-batch workloads burning launch and synchronisation overhead, memory-bound kernels running as if compute-bound, and missed mixed-precision opportunities where fp32 is used in place of bf16 or int8. Each has a distinct power signature, and the four together account for the majority of stranded capacity we find in audits.

Should I procure additional capacity or first profile what I have?

Profile first, always, before signing additional power or accelerator commitments. The audit typically recovers 30–80% of headroom that procurement would have otherwise bought, and the power-purchase-agreement implications often dwarf the hardware cost. The audit also makes the case for genuine additional procurement defensible to finance, because the number rests on measurement rather than spec-sheet multiplication.

What cost savings are realistic from optimising versus renting more cloud GPUs?

Observed ranges across our audits: host-pipeline fixes recover 20–80% useful FLOPs at flat average power, batch-size and kernel-fusion fixes 30–100%, mixed-precision adoption 2–4× useful FLOPs at lower power per FLOP, and memory-layout work 30–200% on memory-bound kernels. Aggregated, audited fleets commonly absorb 2–3× their current workload without any additional procurement.

Image credits: Freepik
