Should I procure additional capacity or first profile what I have?

Profile first, always. The audit typically recovers 30–80% of the headroom that procurement would otherwise have bought, and the power-purchase-agreement implications often dwarf the hardware cost.

What cost savings are realistic from optimising versus renting more cloud GPUs?

Observed audit ranges: host-pipeline fixes recover 20–80% useful FLOPs, batch-size and kernel-fusion fixes 30–100%, mixed precision 2–4× useful FLOPs at lower power per FLOP, and memory-layout fixes 30–200% on memory-bound kernels. Audited fleets commonly absorb 2–3× the current workload without additional procurement.

AI Data Center Power: Why Nameplate TDP Is Not a Capacity Plan

Q: How do I calculate the true cost of an underutilised GPU fleet?

Full TCO (hardware amortisation plus power, cooling, and operational overhead, annualised) divided by useful FLOPs measured from a one-to-two-week workload profile — DCGM for power, Nsight Compute for FLOPs. The composite figure is regularly 3–10× the back-of-envelope cost-per-GPU-hour teams quote during procurement.

Q: What does GPU utilisation actually measure?

The nvidia-smi headline measures whether any SM had any warp scheduled during the sampling window — it says nothing about how many SMs were active or whether the kernel did useful work. Power-planning needs GPU power draw (DCGM), SM occupancy plus memory-bandwidth utilisation (Nsight Compute), and roofline analysis together.

Q: How do I compute TCO per useful FLOP rather than per purchased FLOP?

Useful FLOPs are the workload's actual matmul, convolution, and attention work measured at kernel level, not the spec-sheet peak. Include power and cooling alongside hardware amortisation in the TCO numerator and divide by annualised useful FLOPs. Track useful-FLOPs-per-kWh as the headline efficiency number.

Q: Which workload patterns most often leave GPU capacity on the table?

Four patterns dominate: host-bound pipelines, small-batch workloads, memory-bound kernels run as if compute-bound, and missed mixed-precision opportunities. Each has a distinct power signature, and together they account for the majority of stranded capacity we find in audits.

Introduction

A capacity plan built on accelerator nameplate TDP is a fiction. The number on the spec sheet is the maximum the silicon can draw under sustained worst-case load; the number the deployed workload actually pulls is workload-conditional, often off by 30–60% in either direction, and varies across the same hardware depending on which model is loaded and how the serving stack is configured. This article makes the case that AI data center power is a function of the workload first and the hardware second, walks through the failure modes that a nameplate-multiplied plan produces, and frames power as a capacity input that needs the same profiling discipline as GPU FLOPs. See the GPU engineering practice for the broader audit framework.

The naive read is “multiply count by TDP, add cooling overhead, done.” The expert read is that the same fleet runs at radically different power footprints across inference-bound and training-bound workloads, and that the cost of overbuilding capacity is paid forever in capex and stranded power-purchase agreements.

What this means in practice

Nameplate TDP overstates power for memory-bound inference; understates it for sustained training under aggressive utilisation tuning.
Cooling and PUE assumptions depend on the workload mix, not on the count of accelerators.
Profile the actual workload power draw before signing the next power-purchase agreement.
Track power per useful FLOP, not power per purchased FLOP — the metric scales with the procurement decision.

How do I calculate the true cost of an underutilised GPU fleet?

For the power axis specifically: cost is hardware amortisation plus power (kWh × tariff) plus cooling (kWh × tariff × (PUE−1)) plus operational overhead, divided by useful FLOPs (the model FLOPs actually consumed by the workload, not the GPU’s theoretical peak). The composite picture is regularly 3–10× the back-of-envelope estimate teams use when they pitch the next procurement cycle. This range is an observed pattern across our GPU audit engagements, not a benchmarked rate — the multiplier in any specific environment depends on tariff, PUE, and workload mix.
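The cost formula above can be sketched in a few lines. This is a minimal illustration, not an audit tool: the class name, field names, and the figures in the usage lines are assumptions made up for the example, and the cooling term follows the article's simplification of treating all PUE overhead as cooling.

```python
from dataclasses import dataclass


@dataclass
class FleetCosts:
    """Annualised cost inputs. All names and values are illustrative assumptions."""
    hardware_amortisation: float  # currency per year
    it_energy_kwh: float          # measured GPU/IT energy per year, in kWh
    tariff_per_kwh: float         # currency per kWh
    pue: float                    # facility PUE, expected to be >= 1.0
    operational_overhead: float   # currency per year


def annual_tco(c: FleetCosts) -> float:
    """Hardware amortisation + power + cooling + operational overhead."""
    power = c.it_energy_kwh * c.tariff_per_kwh
    # Cooling modelled as the PUE overhead on top of IT energy: kWh x tariff x (PUE - 1).
    cooling = c.it_energy_kwh * c.tariff_per_kwh * (c.pue - 1.0)
    return c.hardware_amortisation + power + cooling + c.operational_overhead


def cost_per_useful_pflop(c: FleetCosts, useful_flops_per_year: float) -> float:
    """Annual TCO divided by useful work, expressed per 1e15 useful FLOPs."""
    if useful_flops_per_year <= 0:
        raise ValueError("useful_flops_per_year must be positive")
    return annual_tco(c) / (useful_flops_per_year / 1e15)


# Hypothetical inputs, for illustration only.
example = FleetCosts(
    hardware_amortisation=1000.0,
    it_energy_kwh=100.0,
    tariff_per_kwh=0.5,
    pue=1.5,
    operational_overhead=200.0,
)
example_cost = cost_per_useful_pflop(example, useful_flops_per_year=2e15)
```

Running the same function once with measured useful FLOPs and once with the spec-sheet peak makes the gap between the two estimates explicit for a given tariff and PUE.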

The discipline that closes the gap is profiling. A representative two-week capture of the workload’s actual power draw (DCGM exports power-per-GPU at second-granularity) plus the useful FLOPs the workload consumed (via Nsight Compute or framework-level profilers) produces the per-useful-FLOP power number. Procurement decisions made against this number rather than against nameplate TDP avoid the 30–60% overbuild that becomes stranded power and capex.
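As a rough sketch of the capture step, per-GPU power samples can be integrated into energy and compared with nameplate TDP. The function names are hypothetical, the input is assumed to be a list of timestamps in seconds and matching power readings in watts (however they were exported), and trapezoidal integration is one reasonable choice rather than a prescribed method.

```python
def energy_joules(timestamps_s, power_w):
    """Trapezoidal integration of power samples (W) over time (s)."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        if dt < 0:
            raise ValueError("timestamps must be non-decreasing")
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def energy_kwh(timestamps_s, power_w):
    """Energy over the capture window in kWh (1 kWh = 3.6e6 J)."""
    return energy_joules(timestamps_s, power_w) / 3.6e6


def mean_power_w(timestamps_s, power_w):
    """Time-weighted mean power over the capture window."""
    duration = timestamps_s[-1] - timestamps_s[0]
    if duration <= 0:
        raise ValueError("capture window must have positive duration")
    return energy_joules(timestamps_s, power_w) / duration


def draw_vs_nameplate(timestamps_s, power_w, nameplate_tdp_w):
    """Measured mean draw as a fraction of nameplate TDP."""
    return mean_power_w(timestamps_s, power_w) / nameplate_tdp_w
```

The energy figure feeds the per-useful-FLOP power number, and the draw-versus-nameplate ratio shows how far a TDP-multiplied plan would be from the measured workload.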

What does GPU utilisation actually measure — and why is the GPU-busy percentage misleading?

nvidia-smi’s GPU-utilisation percentage reports the share of the sampling window during which at least one kernel was executing on the GPU; in effect, whether any Streaming Multiprocessor (SM) had work scheduled. It is silent on how many SMs were active, what fraction of memory bandwidth was used, and whether the active kernel did useful work. A kernel that occupies a single SM at 5% occupancy for the whole window shows up as “100% utilised.” Power draw correlates with the composite picture (SM occupancy, memory bandwidth, frequency state), not with the nvidia-smi headline.

For power-planning purposes, the useful triple is GPU power draw (DCGM), SM occupancy plus memory bandwidth utilisation (Nsight Compute), and arithmetic intensity vs roofline. These together explain the power footprint and predict how the footprint will change when the workload mix or model size shifts. Power-planning that uses only the GPU-busy headline systematically misestimates both the average and the peak draw — we see this pattern regularly in audits of fleets sized against nvidia-smi dashboards.
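The roofline part of that triple reduces to a short calculation. The sketch below assumes per-kernel FLOP and byte counts are already available from a profiler; the function names and the peak figures in the usage lines are hypothetical placeholders, not values for any particular GPU.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte moved to or from device memory."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    return flops / bytes_moved


def attainable_flops(intensity, peak_flops, peak_bandwidth):
    """Roofline ceiling: min(compute peak, intensity x memory bandwidth)."""
    return min(peak_flops, intensity * peak_bandwidth)


def classify_kernel(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Label a kernel by which side of the roofline ridge point it falls on."""
    ridge = peak_flops / peak_bandwidth  # FLOP/byte where the two ceilings meet
    ai = arithmetic_intensity(flops, bytes_moved)
    return "memory-bound" if ai < ridge else "compute-bound"


# Hypothetical device: 100 TFLOP/s peak compute, 2 TB/s peak memory bandwidth.
PEAK_FLOPS = 100e12
PEAK_BANDWIDTH = 2e12

# Hypothetical kernel: 1e9 FLOPs over 1e8 bytes, i.e. 10 FLOP/byte.
label = classify_kernel(1e9, 1e8, PEAK_FLOPS, PEAK_BANDWIDTH)
ceiling = attainable_flops(10.0, PEAK_FLOPS, PEAK_BANDWIDTH)
```

A kernel below the ridge point cannot reach the compute peak however busy the GPU looks, which is why its power plateau tends to sit under nameplate.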

How do I compute total cost of ownership per useful FLOP rather than per purchased FLOP?

Useful FLOPs are the FLOPs the workload’s actual computation requires (matmul, conv, attention) — measured at kernel level, not the spec sheet peak. For the power-conditional TCO: include power and cooling alongside hardware amortisation, divide by annualised useful FLOPs. Teams that adopt this metric typically discover their existing fleet has 2–5× the useful capacity the procurement plan assumed, which directly translates to deferred power-purchase commitments. The 2–5× is an observed range across our audit engagements; the upper bound is most common in fleets where mixed precision has not yet been adopted.

The operational version of the metric: instrument the workload to log per-kernel FLOPs and per-period GPU power draw; aggregate into useful-FLOPs-per-kWh as the headline efficiency number. This number, tracked over time, exposes whether optimisations actually move the cost-per-output needle or just shift work between bottlenecks.
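One way to keep that headline number over time is sketched below. It assumes each reporting period already has a useful-FLOPs total and an energy total in kWh; the function names, period labels, and figures are illustrative only.

```python
def useful_flops_per_kwh(periods):
    """Map each period label to useful FLOPs per kWh.

    `periods` is an iterable of (label, useful_flops, energy_kwh) tuples.
    """
    result = {}
    for label, useful_flops, kwh in periods:
        if kwh <= 0:
            raise ValueError(f"energy for period {label!r} must be positive")
        result[label] = useful_flops / kwh
    return result


def relative_change(before, after):
    """Fractional change of the metric between two periods."""
    if before <= 0:
        raise ValueError("baseline must be positive")
    return (after - before) / before


# Hypothetical two-week log: same energy, more useful work in the second week.
log = [
    ("week-1", 4e18, 2000.0),
    ("week-2", 6e18, 2000.0),
]
metric = useful_flops_per_kwh(log)
change = relative_change(metric["week-1"], metric["week-2"])
```

If an optimisation raises throughput in one kernel but the metric stays flat, the work has most likely moved to another bottleneck rather than become cheaper.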

Which workload patterns most often leave GPU capacity on the table?

Four patterns recur and each has a distinct power signature.

Pattern	Power signature	Useful-work signal	Typical recovery
Host-bound (CPU can’t feed GPU)	Moderate, with idle troughs	SMs idle large fraction of time	20–80% (observed)
Small-batch	Sustained but inefficient	High launch/sync overhead	30–100% (observed)
Memory-bound run as compute-bound	Below nameplate plateau	Bandwidth saturated before compute	30–200% (observed)
Mixed-precision ignored (fp32 only)	2–4× higher per useful FLOP	bf16/int8 path unused	2–4× useful FLOPs (observed)

Host-bound workloads pay for capacity they barely use: the SMs sit idle while the CPU struggles to feed them. Small-batch workloads burn launch and synchronisation overhead; GPU power is sustained but useful work per joule is poor. Memory-bound kernels run at lower power than the silicon can sustain because bandwidth saturates before compute does; the throughput plateau sits at a lower power point than the nameplate implies. Ignored mixed-precision opportunities are the fourth pattern: a workload running in fp32 where bf16 or int8 would deliver 2–4× throughput consumes 2–4× more power per useful FLOP than necessary. Each costs the data center money, and the four together typically account for the majority of stranded capacity we find in audits.

Should I procure additional GPU capacity or first profile the utilisation of what I have?

Profile first, always, before any additional power or accelerator procurement. A GPU performance audit measures actual utilisation and power per workload, identifies where capacity (and the associated power) is wasted, and quantifies the achievable improvement. In our experience, the audit typically recovers 30–80% of headroom that procurement would have spent to buy, and the power-purchase-agreement implications often dwarf the hardware cost. The 30–80% range is an observed pattern; outliers in either direction exist when workloads are already well-tuned or, conversely, when no optimisation work has ever been done.

The audit also exposes when additional procurement is genuinely needed: the workload saturates well-optimised existing capacity, growth demands capacity beyond the optimisation envelope, or the new workload mix has a different power profile that the existing infrastructure cannot serve. Those cases become easier to defend to finance after the audit because the number rests on measurement, not on spec-sheet multiplication.

There is one more axis the procure-versus-profile decision has to carry: where the wasted spend lands on the bill. On cloud (AWS, Azure), underutilisation shows up directly as billed instance-hours. You pay the published per-GPU-hour rate whether the silicon is saturated or idle, so a host-bound or small-batch workload bleeds money every hour the reservation is live. On-prem, the same waste is buried in already-committed capex and in the power-purchase agreement: the stranded power capacity is paid for regardless of draw, and the marginal cost of one underused GPU is harder to see because it is amortised. The practical consequence is that cloud waste is faster to detect (it is itemised) but on-prem waste is larger in aggregate (it is pre-committed). Either way, profiling the workload before scaling the reservation or the PPA is the move that converts an invisible recurring loss into a measured, fundable optimisation.

For the broader conversation about why GPUs are the dominant capacity-planning unit for AI in the first place, our piece on why GPUs are the bottleneck unit for AI sits upstream of this one.

What cost savings are realistic from optimising utilisation versus renting more cloud GPUs?

Audit ranges, with the power axis explicit. Host-pipeline fixes (data loading, preprocessing placement, async transfers) recover 20–80% useful FLOPs without changing average GPU power, dropping power-per-useful-FLOP proportionally. Batch-size and kernel-fusion fixes deliver 30–100% useful FLOPs at moderately higher average power — net power-per-useful-FLOP improves substantially.

Mixed-precision adoption delivers 2–4× useful FLOPs at lower average power per FLOP (the multiply-accumulate units for fp16/bf16/int8 draw less than the fp32 equivalents). Memory-layout and bandwidth optimisations recover 30–200% on memory-bound kernels — power often stays flat as bandwidth saturates earlier, so power-per-useful-FLOP improves dramatically. In aggregate, audited fleets commonly absorb 2–3× the current workload without additional procurement, and the power-purchase-agreement implications make this the highest-leverage optimisation a CFO can fund. These ranges are observed across our engagements rather than externally benchmarked; the per-fleet number depends on starting state.
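The arithmetic behind these ranges is simple: power per useful FLOP scales with the power ratio divided by the throughput ratio. The helper below is a hypothetical illustration of that relationship, and the power ratios in the usage lines are assumed inputs rather than measured values.

```python
def power_per_flop_ratio(flops_gain, power_ratio=1.0):
    """New power-per-useful-FLOP relative to the baseline.

    flops_gain:  fractional gain in useful FLOPs (0.8 means +80%).
    power_ratio: new average power divided by old average power.
    """
    if flops_gain <= -1.0:
        raise ValueError("flops_gain must be greater than -1")
    return power_ratio / (1.0 + flops_gain)


# Host-pipeline fix: +20% to +80% useful FLOPs at unchanged average power.
host_low = power_per_flop_ratio(0.2)
host_high = power_per_flop_ratio(0.8)

# Batch/fusion fix: +100% useful FLOPs at an assumed 20% higher average power.
batch_case = power_per_flop_ratio(1.0, power_ratio=1.2)
```

At flat power, an 80% gain in useful FLOPs leaves power per useful FLOP at roughly 56% of the baseline, and even a fix that raises average power can come out well ahead when the throughput gain is larger.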

Limitations that remained

The power axis adds discipline to capacity planning but does not eliminate the rest of the work. Power profiling needs the same tooling (DCGM, Nsight Compute) and the same operational discipline as FLOPs profiling — organisations new to GPU operations underestimate the ramp. Some workloads are genuinely close to optimal; the audit confirms procurement is needed. Cooling and PUE assumptions depend on data center context that the per-workload audit cannot fully model — facilities engineering needs to be in the loop for the full picture. Mixed-precision adoption needs accuracy validation that some teams treat as optional; rushed adoption produces correctness regressions that look like model failures but are precision artefacts.

How TechnoLynx Can Help

TechnoLynx runs GPU performance audits that include the power-and-cooling axis explicitly, exposing the power-per-useful-FLOP picture before the next procurement or power-purchase-agreement signing. If you are about to commit power capacity based on nameplate TDP arithmetic, contact us for a workload-conditional audit first.

Frequently Asked Questions

How does GPU underutilisation on cloud procurement differ from on-prem, and where does the wasted spend show up on each bill?

On cloud (AWS, Azure) the waste is itemised: you pay the published per-GPU-hour rate for every hour a reservation is live, so a host-bound or small-batch workload bleeds money directly into billed instance-hours whether the silicon is saturated or idle. On-prem, the same waste is buried in committed capex and in the power-purchase agreement — the stranded power capacity is paid for regardless of draw, and one underused GPU is hard to spot because its cost is amortised across the fleet. Cloud waste is faster to detect because it is line-itemed; on-prem waste tends to be larger in aggregate because it is pre-committed. In both cases the fix is the same: profile the workload before scaling the reservation or signing the next PPA.

How do I tell whether a cloud GPU reservation is overprovisioned before the next renewal?

Capture per-GPU power draw with DCGM and useful FLOPs with Nsight Compute over a representative one-to-two-week window, then compute power and capacity per useful FLOP against what you are billed for. If the workload is host-bound, small-batch, memory-bound, or stuck in fp32, the useful-work fraction will sit well below the instance-hours you pay for. Renewing or scaling the reservation against the nvidia-smi headline rather than this profile is what locks in the recurring overspend.
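A minimal sketch of that check on rented capacity follows. The function names are hypothetical, the reference throughput used as the denominator is an assumption (an attainable roofline ceiling is one option, the spec-sheet peak another), and the rate and hour figures are made up for illustration.

```python
def billed_cost(gpu_hours, rate_per_gpu_hour):
    """Total billed spend for the reservation window."""
    return gpu_hours * rate_per_gpu_hour


def useful_fraction(useful_flops, gpu_hours, reference_flops_per_s):
    """Useful FLOPs divided by what the billed hours could deliver
    at the chosen reference throughput."""
    capacity = gpu_hours * 3600.0 * reference_flops_per_s
    if capacity <= 0:
        raise ValueError("billed capacity must be positive")
    return useful_flops / capacity


def cloud_cost_per_useful_pflop(useful_flops, gpu_hours, rate_per_gpu_hour):
    """Billed spend per 1e15 useful FLOPs."""
    if useful_flops <= 0:
        raise ValueError("useful_flops must be positive")
    return billed_cost(gpu_hours, rate_per_gpu_hour) / (useful_flops / 1e15)


# Hypothetical window: 100 billed GPU-hours at 2.0 per hour,
# 3.6e18 useful FLOPs measured, 1e14 FLOP/s reference throughput.
fraction = useful_fraction(3.6e18, gpu_hours=100.0, reference_flops_per_s=1e14)
unit_cost = cloud_cost_per_useful_pflop(3.6e18, gpu_hours=100.0, rate_per_gpu_hour=2.0)
```

A low fraction ahead of renewal suggests the reservation is buying hours the workload does not convert into useful work, which is the case for optimising before scaling.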

Does the power-per-useful-FLOP metric still matter when GPUs are rented rather than owned?

Yes — on rented capacity the power cost is folded into the per-GPU-hour rate, so the metric reframes as useful work per billed hour, but the diagnosis is identical. A workload that wastes power per useful FLOP on-prem wastes billed hours on cloud, because both are paying for capacity the workload never converts into useful work. Tracking useful-FLOPs-per-kWh (or per billed hour) over time is what tells you whether an optimisation actually moved the cost-per-output needle or just shifted the bottleneck.

Image credits: Freepik