What does GPU utilisation actually measure — and why is GPU-busy percentage misleading?

nvidia-smi GPU-Util = % of sampling interval with any kernel running. Says nothing about capacity used. Kernel using 5% of SMs at 100% time = 100% util. Kernel using 100% SMs at 50% time = 50% util. First is profoundly underutilised. Real metrics: SM occupancy (warp slot fill), memory bandwidth (GB/s vs peak), tensor core occupancy, power vs TDP. Diagnostic: 100% nvidia-smi but 20% SM + 30% memory = inefficient — fix at workload level (batch, kernel, model, data pipeline) not procurement. Profilers: Nsight Compute/Systems (NVIDIA), ROCprof (AMD), VTune (SYCL/oneAPI).

How do I compute TCO per useful FLOP rather than per purchased FLOP?

Three steps: (1) total cost = capital amortisation + power + cooling + facilities + networking + ops engineering (or cloud bill directly); (2) useful FLOPs = workload throughput in ops/sec or tokens/sec × time, per-workload accounting; (3) divide = dollars per useful petaFLOP. Comparable across generations, cloud vs on-premise, teams. Exposes hidden decisions: 2x peak GPU may deliver only 1.2x useful for bottlenecked workload; cloud rental expensive per hour may be cheap per useful FLOP vs on-premise idle 70%. Compute monthly to catch underutilisation as it develops, not annually after procurement.

Training a Language Model on a Single GPU in one day

Q: How do I calculate the true cost of an underutilised GPU fleet?

Three components: capital amortised over 3-5y useful life, operating (power/cooling/datacentre or cloud/networking), opportunity cost of denied or absorbed workloads. Divide sum by useful FLOPs delivered (not purchased). Example: H100 with 1979 TF/s peak delivering 600 TF/s average = 3.3x cost per useful FLOP vs per purchased FLOP. $30k H100 at 30% useful utilisation = $100k spent per H100-equivalent of useful capacity. Procurement looks different: second H100 doubles purchased FLOPs but may add only marginal useful FLOPs if bottleneck is scheduling/pipeline not compute.

Q: Which workload patterns most often leave GPU capacity on the table?

Small batches on large GPUs (H100 at batch=1): fix with continuous batching (vLLM, Triton) or right-sizing (L4/T4). Burst workloads on dedicated capacity (4hr training twice a week on 24/7 GPU = 4-5% use): shared scheduling (K8s/Slurm), spot/preemptible. Data pipeline bottlenecks (GPU waits on CPU pre-proc): NVIDIA DALI, multi-worker loaders, GPU pre-proc. Poor model-server concurrency: server-level batching. Idle dev capacity: shared dev environments. Sequential training: parallelisation, eliminate resource waits.

Q: Should I procure additional GPU capacity or first profile what I have?

Profile first, always. Cost of a representative-week profile is negligible vs procurement; information is essential. Output answers: per-workload capacity consumption, bottlenecks (compute/memory/loading/scheduling), realistic utilisation with current architecture vs rework needed. Drives decision: optimise bottleneck (often cheaper than procurement), procure right hardware (often different from assumed), or both. Procurement without profiling commits to wrong choice — buy H100s when L4s were correct, buy GPUs when scheduling rework would unlock capacity. Exception: genuine demand spikes (launch, customer commitment) — procure then profile.

Q: What cost savings are realistic from optimising utilisation versus renting more cloud GPUs?

Realistic: continuous batching LLM inference 2-4x throughput per GPU; right-sizing (move small-batch off A100/H100 to L4/T4) 30-60% cost cut; eliminating data bottlenecks 10-30%; shared scheduling for dev/burst 30-50% fleet reduction; quantisation (FP8/INT8/distillation) 1.5-3x. Stacked: 3-5x inference effective capacity, 30-50% training. 10-GPU fleet at 3x = 30 effective GPUs without procurement. Trade-off: optimisation = engineering time + maintenance; rental = money no engineering. Crossover depends on scale — small workloads rent, large workloads optimise pays in months. Calculable from TCO-per-useful-FLOP.

Introduction

Training a language model on a single GPU in a day is possible because that GPU runs near its peak FLOP budget for that day. Most GPU fleets do not. The hidden cost of underutilisation is the gap between purchased FLOPs and useful FLOPs, a gap that often runs 40-70% in production inference fleets and 20-40% in training clusters. Buying more GPUs to solve a workload problem when the existing fleet is half idle is the most common procurement mistake in 2026 AI infrastructure. See GPU engineering for the broader topic area this article belongs to.

The honest 2026 picture: most teams that procure additional GPU capacity discover, after profiling, that their existing capacity is underutilised in ways that more hardware will not fix.

What this means in practice

Underutilisation costs accumulate per useful FLOP, not per purchased FLOP.
GPU-busy percentage from nvidia-smi is misleading; SM occupancy and memory bandwidth are the real signals.
Burst, low-batch, and small-model inference workloads leave the most capacity idle.
Profile first, procure second; the savings from optimisation routinely exceed the cost of new hardware.

How do I calculate the true cost of an underutilised GPU fleet?

True cost has three components: capital cost amortised over useful life (typically 3-5 years), operating cost (power, cooling, datacentre or cloud rental, networking), and opportunity cost (the workloads that could have run on the idle capacity but were either denied or absorbed by additional procurement). The sum divided by useful FLOPs delivered gives cost per useful FLOP.

Useful FLOPs is the operative term. Purchased FLOPs is what the spec sheet promised; useful FLOPs is what the workload actually consumed productively. If an H100 with a spec-sheet peak of 1979 TF/s at FP16 (the tensor-core figure, which is quoted with sparsity) produces output at an effective 600 TF/s averaged across the year, the cost per useful FLOP is 3.3x the cost per purchased FLOP. The team that procured the H100 paid for 1979 TF/s; they used 600.

The calculation surfaces uncomfortable numbers. A team running a $30k H100 at 30% useful utilisation is effectively paying $100k for each H100's worth of useful capacity ($30k divided by 0.30). The procurement decision looks different when expressed in those terms: buying a second H100 to “double capacity” doubles purchased FLOPs but may add only marginal useful FLOPs if the bottleneck is scheduling or pipeline architecture rather than raw compute.
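The arithmetic behind those two figures can be sketched in a few lines of Python. This is a minimal illustration only; the function names `cost_multiplier` and `cost_per_fully_used_gpu` are hypothetical, and the inputs are the example numbers from the text rather than measurements.

```python
def cost_multiplier(peak_tflops: float, useful_tflops: float) -> float:
    """How many times more a useful FLOP costs than a purchased FLOP."""
    if not 0 < useful_tflops <= peak_tflops:
        raise ValueError("useful throughput must be positive and at most peak")
    return peak_tflops / useful_tflops


def cost_per_fully_used_gpu(price: float, useful_fraction: float) -> float:
    """Spend needed to obtain one GPU's worth of useful capacity."""
    if not 0 < useful_fraction <= 1:
        raise ValueError("useful_fraction must be in (0, 1]")
    return price / useful_fraction


# Example figures from the text: 1979 TF/s peak, 600 TF/s delivered,
# and a $30k card at 30% useful utilisation.
multiplier = cost_multiplier(1979, 600)
equivalent = cost_per_fully_used_gpu(30_000, 0.30)
print(f"{multiplier:.1f}x cost per useful FLOP")
print(f"${equivalent:,.0f} per GPU-equivalent of useful capacity")
```

Both functions only restate the ratio of purchased to useful capacity; the hard part in practice is measuring the useful figure, not dividing by it.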

What does “GPU utilisation” actually measure — and why is the GPU-busy percentage misleading?

The number reported by nvidia-smi as “GPU-Util” is the percentage of the last sampling interval during which at least one kernel was executing on the GPU. It says nothing about how much of the GPU’s compute capacity that kernel was using. A kernel that uses 5% of the SMs for 100% of the time reports 100% utilisation. A kernel that uses 100% of the SMs for 50% of the time reports 50%. The first is a profoundly underutilised GPU; the second may be well-utilised but scheduling-limited.
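A toy model makes the difference concrete. The sketch below does not call nvidia-smi or any NVIDIA library; it assumes a hypothetical list of per-tick samples, each giving the fraction of SMs that had work during that tick, and compares a "busy" percentage (any kernel running) with a time-averaged SM fraction. The second number is a simplified stand-in for capacity used, not the profiler's occupancy metric.

```python
def gpu_busy_percent(sm_fraction_per_tick):
    """Percent of ticks with any kernel running (the GPU-Util style number)."""
    busy_ticks = sum(1 for f in sm_fraction_per_tick if f > 0)
    return 100.0 * busy_ticks / len(sm_fraction_per_tick)


def sm_time_average_percent(sm_fraction_per_tick):
    """Percent of SM capacity that had work, averaged over all ticks."""
    return 100.0 * sum(sm_fraction_per_tick) / len(sm_fraction_per_tick)


# The two cases from the text, over ten hypothetical sampling ticks.
narrow = [0.05] * 10             # 5% of the SMs, all of the time
bursty = [1.0] * 5 + [0.0] * 5   # all of the SMs, half of the time

for name, ticks in (("narrow", narrow), ("bursty", bursty)):
    print(name, gpu_busy_percent(ticks), round(sm_time_average_percent(ticks), 1))
```

The narrow kernel scores higher on the busy metric while using a tenth of the capacity the bursty one does, which is the inversion the text describes.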

The metrics that matter. SM occupancy: the fraction of available warp slots that are filled across the streaming multiprocessors. Memory bandwidth utilisation: actual GB/s consumed versus theoretical peak. Tensor core occupancy for ML workloads: how much of the matrix-multiply throughput is being used. Power draw versus thermal design power: a partial proxy for actual work performed.

The diagnostic pattern. A GPU showing 100% nvidia-smi utilisation but 20% SM occupancy and 30% memory bandwidth is being used inefficiently — the workload is keeping the GPU technically busy without exercising its capacity. The fix is at the workload level (batch size, kernel choice, model architecture, data pipeline), not at the procurement level.
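The diagnostic pattern can be written down as a simple rule. The sketch below is an assumption-laden illustration: `GpuSample`, `diagnose`, the label strings, and both thresholds are hypothetical, and the three input percentages are taken to come from nvidia-smi and a profiler run rather than from any specific API.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    gpu_util: float      # nvidia-smi GPU-Util, percent
    sm_occupancy: float  # percent of warp slots filled (from a profiler)
    mem_bw_util: float   # percent of peak memory bandwidth (from a profiler)


def diagnose(sample, busy_threshold=90.0, capacity_threshold=50.0):
    """Classify a sample; both thresholds are illustrative, not standards."""
    if sample.gpu_util < busy_threshold:
        # Kernels are absent for part of the interval: look at scheduling.
        return "idle-gaps"
    if max(sample.sm_occupancy, sample.mem_bw_util) < capacity_threshold:
        # Busy by the clock, but neither compute nor memory is stressed.
        return "busy-but-underused"
    return "capacity-bound"


print(diagnose(GpuSample(gpu_util=100.0, sm_occupancy=20.0, mem_bw_util=30.0)))
```

Only the last label is evidence that more hardware might help; the other two point back at the workload or the scheduler.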

Profilers expose these metrics: Nsight Compute and Nsight Systems for NVIDIA, ROCprof for AMD, Intel VTune for SYCL/oneAPI. The teams that procure intelligently use these regularly; the teams that over-procure read GPU-busy from nvidia-smi and assume capacity is exhausted.

How do I compute total cost of ownership per useful FLOP rather than per purchased FLOP?

Three-step calculation. First, total cost over the period: capital amortisation + power + cooling + facilities + networking + operations engineering allocated to the fleet. Cloud-equivalent calculation uses the cloud bill directly.

Second, useful FLOPs delivered: the throughput of the workloads run on the fleet, measured in operations-per-second or tokens-per-second equivalents, multiplied by the time spent producing them. The measurement requires per-workload accounting — what was actually computed, not what was technically running.

Third, divide. The result is dollars-per-useful-petaFLOP or whatever unit fits the workload. The number is comparable across hardware generations, cloud vs on-premise, and across teams.

The TCO-per-useful-FLOP metric exposes decisions that the TCO-per-purchased-FLOP metric hides. A newer GPU with 2x peak performance might deliver 1.2x useful performance for a given workload because the workload is bottlenecked elsewhere; the procurement that looked obviously beneficial at peak metrics looks marginal at useful metrics. Cloud rental that looks expensive per hour might look cheap per useful FLOP if the on-premise alternative sits idle 70% of the time.

The calculation requires discipline. Teams that compute it monthly catch underutilisation as it develops. Teams that compute it annually catch it after procurement decisions have already been made.
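The three steps map directly onto a small calculation. The following is a minimal sketch under stated assumptions: the `PeriodCosts` fields mirror the cost components listed above, useful work is recorded per workload as a hypothetical (sustained TFLOP/s, seconds) pair, a "petaFLOP" here means 10^15 operations performed, and every dollar and throughput figure is made up for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class PeriodCosts:
    """Step 1: cost components for one accounting period, in dollars."""
    capital_amortisation: float
    power: float
    cooling: float
    facilities: float
    networking: float
    ops_engineering: float

    def total(self) -> float:
        return sum(getattr(self, f.name) for f in fields(self))


def useful_petaflops(workloads):
    """Step 2: sum sustained TFLOP/s x seconds per workload, in petaFLOPs."""
    return sum(tflops * seconds for tflops, seconds in workloads) / 1000.0


def cost_per_useful_petaflop(costs, workloads):
    """Step 3: divide total cost by useful work delivered."""
    work = useful_petaflops(workloads)
    if work <= 0:
        raise ValueError("no useful work recorded for this period")
    return costs.total() / work


# Illustrative month; every figure below is made up.
month = PeriodCosts(700.0, 150.0, 50.0, 40.0, 30.0, 230.0)
jobs = [(600.0, 1_000_000), (200.0, 500_000)]  # (sustained TFLOP/s, seconds)
print(f"${cost_per_useful_petaflop(month, jobs):.6f} per useful petaFLOP")
```

For a cloud fleet, the cost side would collapse to the bill for the period; the per-workload list is the part that needs instrumentation. Running the same function monthly gives the trend the text recommends tracking.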

Which workload patterns most often leave GPU capacity on the table?

Pattern 1: small batches on large GPUs. An H100 designed for batch sizes of hundreds of sequences running inference at batch size 1 uses a fraction of its compute. The remedy is batching at the request level (vLLM-style continuous batching, dynamic batching in Triton) or matching GPU to workload (use an L4 or T4 for small-batch inference).

Pattern 2: burst workloads on dedicated capacity. A training job that runs for 4 hours twice a week on a GPU that exists 24/7 uses 4-5% of the capacity. The remedy is shared scheduling (Kubernetes with GPU scheduling, Slurm), spot/preemptible cloud instances for non-time-critical work, or smaller dedicated GPUs with elastic capacity.
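The 4-5% figure follows from the hours in a week. A short sketch, with hypothetical function names and the simplifying assumption that jobs sharing a GPU never overlap in time:

```python
HOURS_PER_WEEK = 24 * 7


def duty_cycle(hours_used_per_week: float) -> float:
    """Fraction of the week a dedicated GPU is actually in use."""
    return hours_used_per_week / HOURS_PER_WEEK


def jobs_per_shared_gpu(hours_per_job_per_week: float) -> int:
    """Upper bound on such jobs one GPU could host with perfect packing."""
    return int(HOURS_PER_WEEK // hours_per_job_per_week)


burst_hours = 4 * 2  # a 4-hour job, twice a week
print(f"{duty_cycle(burst_hours):.1%} of the week in use")
print(jobs_per_shared_gpu(burst_hours), "such jobs fit on one GPU at most")
```

Real schedulers will not reach the packing bound, but the gap between one job and the bound shows how much room shared scheduling has to work with.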

Pattern 3: data pipeline bottlenecks. The GPU waits for data because the pre-processing pipeline runs on CPU and cannot keep up. SM occupancy looks low; nvidia-smi looks moderate. The remedy is faster data loading (NVIDIA DALI, multi-worker dataloaders), pre-processing on GPU where possible, or sufficient CPU and disk I/O to feed the GPU at its consumption rate.
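A back-of-envelope model shows why loader workers matter. This sketch assumes loading fully overlaps with compute and that loader throughput scales linearly with worker count, neither of which holds exactly in practice; the function names and the millisecond figures are hypothetical.

```python
import math


def gpu_busy_fraction(gpu_step_ms: float, load_step_ms: float, workers: int = 1) -> float:
    """Steady-state share of time the GPU computes rather than waits."""
    effective_load_ms = load_step_ms / workers
    step_ms = max(gpu_step_ms, effective_load_ms)  # the slower stage sets the pace
    return gpu_step_ms / step_ms


def workers_needed(gpu_step_ms: float, load_step_ms: float) -> int:
    """Fewest workers at which loading stops being the bottleneck."""
    return math.ceil(load_step_ms / gpu_step_ms)


# Hypothetical: 50 ms of GPU work per batch, 200 ms of CPU pre-processing.
print(gpu_busy_fraction(50, 200, workers=1))
print(workers_needed(50, 200))
print(gpu_busy_fraction(50, 200, workers=workers_needed(50, 200)))
```

If the worker count the model asks for exceeds the available CPU cores, that is a signal to move pre-processing onto the GPU or to provision more CPU rather than more GPUs.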

Pattern 4: model serving with poor concurrency. A model server that accepts one request at a time underutilises the GPU. The remedy is concurrent request handling with batching at the server level.
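Server-level batching can be illustrated with a deliberately simple grouping function. This is a static sketch, not how vLLM's continuous batching or Triton's dynamic batcher are implemented; `drain_batches` and its parameters are hypothetical, and a production server would also bound the time a request waits in the queue.

```python
from collections import deque


def drain_batches(pending, max_batch_size: int):
    """Group queued requests into batches, preserving arrival order."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    queue = deque(pending)
    batches = []
    while queue:
        take = min(max_batch_size, len(queue))
        batches.append([queue.popleft() for _ in range(take)])
    return batches


requests = list(range(10))  # ten queued request ids
one_at_a_time = drain_batches(requests, max_batch_size=1)
batched = drain_batches(requests, max_batch_size=4)
print(len(one_at_a_time), "forward passes without batching")
print(len(batched), "forward passes with batches of up to 4")
```

Each batch corresponds to one forward pass, so the same ten requests cost ten GPU launches in the first case and three in the second.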

Pattern 5: idle development capacity. Developers hold dedicated GPUs that they use for only 2-3 hours a day. The remedy is shared development environments with on-demand GPU allocation, or smaller per-developer GPUs with shared larger GPUs for heavy work.

Pattern 6: poorly scheduled training. Multiple training jobs are sequenced rather than parallelised, or training jobs wait on shared resources (data, parameter servers, distributed coordination) for significant fractions of wall-clock time. The remedy is to run independent jobs in parallel where capacity allows and to eliminate the resource waits.

Should I procure additional GPU capacity or first profile the utilisation of what I have?

Profile first, always. The cost of running a profiler over a representative week of workloads is negligible compared to GPU procurement. The information it produces is essential for any informed procurement decision.

The profiling output should answer: which workloads consume which fraction of fleet capacity, where the bottlenecks are (compute, memory bandwidth, data loading, scheduling), and what utilisation is realistic with the current architecture versus what would require workload-level rework. The output then drives the decision: optimise the bottleneck (often cheaper than procurement), procure the right hardware (often different from what was initially assumed), or both.

Procurement without profiling commits to a hardware choice based on assumed utilisation that often turns out wrong. The team buys more H100s for inference and discovers that L4s would have been more cost-effective, or buys more GPUs when scheduling rework would have unlocked existing capacity.

The exception. Genuine demand spikes (a new product launch, a customer commitment) may not allow time for profiling. In these cases procure to meet the commitment, then profile afterwards to inform the next round.

What cost savings are realistic from optimising utilisation versus renting more cloud GPUs?

Realistic savings from optimisation. Continuous batching for LLM inference: typically 2-4x throughput per GPU. Right-sizing GPU choice (moving small-batch inference off A100/H100 to L4/T4): 30-60% cost reduction at equivalent throughput. Eliminating data pipeline bottlenecks: 10-30% throughput improvement on data-bound workloads. Shared scheduling for development and burst workloads: 30-50% fleet capacity reduction. Quantisation and model optimisation (FP8, INT8, distillation): 1.5-3x throughput depending on accuracy tolerance.

Stacked optimisations frequently produce 3-5x effective capacity improvement on inference fleets and 30-50% on training clusters. The cost equivalent is significant: a team running a 10-GPU inference fleet that captures 3x throughput per GPU has effectively expanded to 30 GPUs of capacity without procurement.

The trade-off. Optimisation requires engineering time and ongoing maintenance. Renting more cloud GPUs requires money and no engineering. The economic crossover depends on workload scale: for small workloads, rental is cheaper than optimisation engineering; for large workloads, optimisation pays back within months. The decision is calculable from the TCO-per-useful-FLOP metric.
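The crossover can be estimated as a payback period. The sketch below assumes the extra capacity would otherwise have been rented at a flat monthly rate, which is only true if demand for it exists; the function names and every dollar figure are hypothetical, while the 10-GPU fleet and 3x multiplier come from the example above.

```python
def effective_gpus(fleet_size: int, throughput_multiplier: float) -> float:
    """Capacity of the optimised fleet, in unoptimised-GPU equivalents."""
    return fleet_size * throughput_multiplier


def payback_months(engineering_cost, fleet_size, throughput_multiplier,
                   rental_per_gpu_month, maintenance_per_month=0.0):
    """Months until avoided rental covers the optimisation work."""
    extra_gpus = effective_gpus(fleet_size, throughput_multiplier) - fleet_size
    net_saving = extra_gpus * rental_per_gpu_month - maintenance_per_month
    if net_saving <= 0:
        return float("inf")  # optimisation never pays back at this scale
    return engineering_cost / net_saving


# 10-GPU fleet at 3x throughput; dollar figures are made up.
print(effective_gpus(10, 3), "effective GPUs")
print(payback_months(engineering_cost=140_000, fleet_size=10,
                     throughput_multiplier=3, rental_per_gpu_month=2_000,
                     maintenance_per_month=5_000), "months to pay back")
```

Shrinking `fleet_size` while holding the engineering cost fixed lengthens the payback period, which is the scale effect the text describes: the same optimisation work is spread over less avoided rental.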

Limitations that remain

GPU utilisation measurement remains fragmented across vendor toolchains and harder to consolidate across heterogeneous fleets. Workload-level accounting (which job produced which useful FLOPs) requires instrumentation that many teams do not have. The savings from optimisation degrade as workloads change — gains captured today require ongoing engineering to retain as model versions, batch sizes, and traffic patterns evolve. Cloud spot/preemptible pricing changes the rental-vs-own calculation in ways that require monthly re-evaluation. The honest framing: profiling and optimisation are continuous engineering practices, not one-time projects.

How TechnoLynx Can Help

TechnoLynx works on production GPU fleet optimisation — profiling utilisation across compute, memory bandwidth, and scheduling, computing TCO per useful FLOP for honest procurement decisions, and implementing the batching, scheduling, and pipeline changes that recover capacity. If your team is considering GPU procurement or suspects existing capacity is underutilised, contact us.
