Introduction

Training a language model on a single GPU in a day is possible because that GPU runs near its peak FLOP budget for that day. Most GPU fleets do not. The hidden cost of underutilisation is the gap between purchased FLOPs and useful FLOPs, a gap that often runs 40-70% in production inference fleets and 20-40% in training clusters. Buying more GPUs to solve a workload problem when the existing fleet is half idle is the most common procurement mistake in 2026 AI infrastructure. See GPU engineering for the broader landing page this article serves.

The honest 2026 picture: most teams that procure additional GPU capacity discover, after profiling, that their existing capacity is underutilised in ways that more hardware will not fix.

What this means in practice

Underutilisation costs accumulate per useful FLOP, not per purchased FLOP.
GPU-busy percentage from nvidia-smi is misleading; SM occupancy and memory bandwidth are the real signals.
Burst, low-batch, and small-model inference workloads leave the most capacity idle.
Profile first, procure second; the savings from optimisation routinely exceed the cost of new hardware.

How do I calculate the true cost of an underutilised GPU fleet?

True cost has three components: capital cost amortised over useful life (typically 3-5 years), operating cost (power, cooling, datacentre or cloud rental, networking), and opportunity cost (the workloads that could have run on the idle capacity but were either denied or absorbed by additional procurement). The sum divided by useful FLOPs delivered gives cost per useful FLOP.

Useful FLOPs is the operative term. Purchased FLOPs is what the spec sheet promised; useful FLOPs is what the workload actually consumed productively. If an H100 with 1979 TF/s of spec-sheet FP16 capacity (a tensor-core figure quoted with sparsity) produces output at an effective 600 TF/s averaged across the year, the cost per useful FLOP is 3.3x the cost per purchased FLOP (1979 / 600). The team that procured the H100 paid for 1979 TF/s; they used 600.
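The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a costing tool: the `AnnualCost` class, the `cost_per_pflop` function, and the dollar figures (a $30k card amortised over 3 years, $5k per year to run, zero opportunity cost) are assumptions made for the example. Only the 1979 and 600 TF/s figures come from the text.

```python
from dataclasses import dataclass

SECONDS_PER_YEAR = 365 * 24 * 3600


@dataclass
class AnnualCost:
    """Yearly cost of one GPU or a whole fleet, in dollars (illustrative fields)."""
    capital: float      # purchase price divided by amortisation years
    operating: float    # power, cooling, rental, networking
    opportunity: float  # estimated value of work denied or pushed to new procurement

    def total(self) -> float:
        return self.capital + self.operating + self.opportunity


def cost_per_pflop(cost: AnnualCost, avg_tflops: float) -> float:
    """Dollars per petaFLOP delivered over one year at an average TFLOP/s rate."""
    pflop_delivered = avg_tflops * SECONDS_PER_YEAR / 1000.0
    return cost.total() / pflop_delivered


# Hypothetical figures: a $30k card amortised over 3 years plus $5k/year to run.
cost = AnnualCost(capital=10_000, operating=5_000, opportunity=0)
purchased = cost_per_pflop(cost, avg_tflops=1979)  # spec-sheet peak
useful = cost_per_pflop(cost, avg_tflops=600)      # measured yearly average
print(round(useful / purchased, 1))  # 3.3
```

The ratio depends only on peak versus achieved throughput, so it holds whatever the absolute costs are; the cost inputs matter once two fleets or two procurement options are compared.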
The calculation surfaces uncomfortable numbers. A team running a $30k H100 at 30% useful utilisation is effectively paying $100k for each H100's worth of useful capacity ($30k / 0.3). The procurement decision looks different when expressed in those terms: buying a second H100 to “double capacity” doubles purchased FLOPs but may add only marginal useful FLOPs if the bottleneck is scheduling or pipeline architecture rather than raw compute.

What does “GPU utilisation” actually measure, and why is the GPU-busy percentage misleading?

The number reported by nvidia-smi as “GPU-Util” is the percentage of the last sampling interval during which at least one kernel was executing on the GPU. It says nothing about how much of the GPU’s compute capacity that kernel was using. A kernel that uses 5% of the SMs for 100% of the time reports 100% utilisation. A kernel that uses 100% of the SMs for 50% of the time reports 50%. The first is a profoundly underutilised GPU; the second may be well-utilised but scheduling-limited.

The metrics that matter.
SM occupancy: the fraction of available warp slots that are filled across the streaming multiprocessors.
Memory bandwidth utilisation: actual GB/s consumed versus theoretical peak.
Tensor core occupancy for ML workloads: how much of the matrix-multiply throughput is being used.
Power draw versus thermal design power: a partial proxy for actual work performed.

The diagnostic pattern. A GPU showing 100% nvidia-smi utilisation but 20% SM occupancy and 30% memory bandwidth is being used inefficiently: the workload is keeping the GPU technically busy without exercising its capacity. The fix is at the workload level (batch size, kernel choice, model architecture, data pipeline), not at the procurement level.

Profilers expose these metrics: Nsight Compute and Nsight Systems for NVIDIA, rocprof for AMD, Intel VTune for SYCL/oneAPI.
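The diagnostic pattern described above can be written down as a small classifier. This is a sketch under stated assumptions: the `GpuSample` type, the `diagnose` function, and the 90% and 40% thresholds are all hypothetical choices for illustration, and the input values are assumed to have been collected separately from nvidia-smi and a profiler rather than read by this code.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    gpu_util_pct: float      # nvidia-smi "GPU-Util": % of interval with any kernel running
    sm_occupancy_pct: float  # filled warp slots, as reported by a profiler
    mem_bw_pct: float        # achieved memory bandwidth as % of theoretical peak


def diagnose(s: GpuSample, busy: float = 90.0, low: float = 40.0) -> str:
    """Classify one sample; both thresholds are illustrative assumptions."""
    if s.gpu_util_pct < busy:
        # The GPU spends part of the interval with no kernel running at all.
        return "scheduling_limited"
    if s.sm_occupancy_pct < low and s.mem_bw_pct < low:
        # Kernels are always running but exercise little of the hardware.
        return "busy_but_inefficient"
    return "well_used"


# The example from the text: 100% busy, 20% SM occupancy, 30% memory bandwidth.
print(diagnose(GpuSample(100, 20, 30)))  # busy_but_inefficient
```

A sample that is busy with high memory bandwidth but low SM occupancy is classified as well used here, since a bandwidth-bound kernel can legitimately saturate the GPU; a stricter rule is a reasonable alternative.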
The teams that procure intelligently use these profilers regularly; the teams that over-procure read GPU-busy from nvidia-smi and assume capacity is exhausted.

How do I compute total cost of ownership per useful FLOP rather than per purchased FLOP?

It is a three-step calculation.

First, total cost over the period: capital amortisation + power + cooling + facilities + networking + operations engineering allocated to the fleet. For cloud fleets, use the cloud bill directly.

Second, useful FLOPs delivered: the throughput of the workloads run on the fleet, measured in operations-per-second or tokens-per-second equivalents, multiplied by the time spent producing them. The measurement requires per-workload accounting of what was actually computed, not what was technically running.

Third, divide. The result is dollars-per-useful-petaFLOP or whatever unit fits the workload. The number is comparable across hardware generations, between cloud and on-premise, and across teams.

The TCO-per-useful-FLOP metric exposes decisions that the TCO-per-purchased-FLOP metric hides. A newer GPU with 2x peak performance might deliver 1.2x useful performance for a given workload because the workload is bottlenecked elsewhere; the procurement that looked obviously beneficial at peak metrics looks marginal at useful metrics. Cloud rental that looks expensive per hour might look cheap per useful FLOP if the on-premise alternative sits idle 70% of the time.

The calculation requires discipline. Teams that compute it monthly catch underutilisation as it develops. Teams that compute it annually catch it after procurement decisions have already been made.

Which workload patterns most often leave GPU capacity on the table?

Pattern 1: small batches on large GPUs. An H100 designed for batch sizes of hundreds of sequences running inference at batch size 1 uses a fraction of its compute.
The remedy is batching at the request level (vLLM-style continuous batching, dynamic batching in Triton) or matching GPU to workload (use an L4 or T4 for small-batch inference).

Pattern 2: burst workloads on dedicated capacity. A training job that runs for 4 hours twice a week on a GPU that is provisioned 24/7 uses about 5% of the capacity (8 of 168 hours). The remedy is shared scheduling (Kubernetes with GPU scheduling, Slurm), spot/preemptible cloud instances for non-time-critical work, or smaller dedicated GPUs with elastic capacity.

Pattern 3: data pipeline bottlenecks. The GPU waits for data because the pre-processing pipeline runs on CPU and cannot keep up. SM occupancy looks low; nvidia-smi looks moderate. The remedy is faster data loading (NVIDIA DALI, multi-worker dataloaders), pre-processing on GPU where possible, or sufficient CPU and disk I/O to feed the GPU at its consumption rate.

Pattern 4: model serving with poor concurrency. A model server that accepts one request at a time underutilises the GPU. The remedy is concurrent request handling with batching at the server level.

Pattern 5: idle development capacity. Developers have dedicated GPUs that they use for 2-3 hours a day. The remedy is shared development environments with on-demand GPU allocation, or smaller per-developer GPUs with shared larger GPUs for heavy work.

Pattern 6: poorly scheduled training. Multiple training jobs are sequenced rather than parallelised, or training jobs wait on shared resources (data, parameter servers, distributed coordination) for significant fractions of wall-clock time.

Should I procure additional GPU capacity or first profile the utilisation of what I have?

Profile first, always. The cost of running a profiler over a representative week of workloads is negligible compared to GPU procurement. The information it produces is essential for any informed procurement decision.
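As a rough sketch of what a representative week of accounting might be reduced to, the snippet below turns per-workload GPU-hours into shares of fleet capacity. The `capacity_share` function and the record format are hypothetical; real data would come from a scheduler or profiler export. GPU-hours here count wall-clock time a workload held a GPU, which is an upper bound on useful work, not a measure of useful FLOPs.

```python
from collections import defaultdict

HOURS_PER_WEEK = 7 * 24


def capacity_share(records, n_gpus):
    """Return each workload's share of one week of fleet capacity.

    records: iterable of (workload_name, gpu_hours) pairs (assumed format).
    The remainder is reported under the key "idle".
    """
    capacity = n_gpus * HOURS_PER_WEEK
    used = defaultdict(float)
    for name, gpu_hours in records:
        used[name] += gpu_hours
    shares = {name: hours / capacity for name, hours in used.items()}
    shares["idle"] = 1.0 - sum(shares.values())
    return shares


# The burst pattern from the text: one GPU, a 4-hour training job twice a week.
week = [("training", 4.0), ("training", 4.0)]
for name, share in capacity_share(week, n_gpus=1).items():
    print(f"{name}: {share:.1%}")
```

A large idle share from this kind of summary points at scheduling and sharing; a small idle share says nothing yet about efficiency and still needs the profiler metrics.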
The profiling output should answer: which workloads consume which fraction of fleet capacity, where the bottlenecks are (compute, memory bandwidth, data loading, scheduling), what utilisation is realistic with the current architecture, and what would require workload-level rework. The output then drives the decision: optimise the bottleneck (often cheaper than procurement), procure the right hardware (which may differ from what was initially assumed), or both.

Procurement without profiling commits to a hardware choice based on assumed utilisation that often turns out wrong. The team buys more H100s for inference and discovers that L4s would have been more cost-effective, or buys more GPUs when scheduling rework would have unlocked existing capacity.

The exception. Genuine demand spikes (a new product launch, a customer commitment) may not allow time for profiling. In these cases procure to meet the commitment, then profile afterwards to inform the next round.

What cost savings are realistic from optimising utilisation versus renting more cloud GPUs?

Realistic savings from optimisation.
Continuous batching for LLM inference: typically 2-4x throughput per GPU.
Right-sizing GPU choice (moving small-batch inference off A100/H100 to L4/T4): 30-60% cost reduction at equivalent throughput.
Eliminating data pipeline bottlenecks: 10-30% throughput improvement on data-bound workloads.
Shared scheduling for development and burst workloads: 30-50% fleet capacity reduction.
Quantisation and model optimisation (FP8, INT8, distillation): 1.5-3x throughput depending on accuracy tolerance.

Stacked optimisations frequently produce 3-5x effective capacity improvement on inference fleets and 30-50% on training clusters. The cost equivalent is significant: a team running a 10-GPU inference fleet that captures 3x throughput per GPU has effectively expanded to 30 GPUs of capacity without procurement.

The trade-off. Optimisation requires engineering time and ongoing maintenance.
Renting more cloud GPUs requires money but no engineering. The economic crossover depends on workload scale: for small workloads, rental is cheaper than optimisation engineering; for large workloads, optimisation pays back within months. The decision is calculable from the TCO-per-useful-FLOP metric.

Limitations that remain

GPU utilisation measurement remains fragmented across vendor toolchains and is harder to consolidate across heterogeneous fleets. Workload-level accounting (which job produced which useful FLOPs) requires instrumentation that many teams do not have. The savings from optimisation degrade as workloads change: gains captured today require ongoing engineering to retain as model versions, batch sizes, and traffic patterns evolve. Cloud spot/preemptible pricing changes the rental-vs-own calculation in ways that require monthly re-evaluation.

The honest framing: profiling and optimisation are continuous engineering practices, not one-time projects.

How TechnoLynx Can Help

TechnoLynx works on production GPU fleet optimisation: profiling utilisation across compute, memory bandwidth, and scheduling, computing TCO per useful FLOP for honest procurement decisions, and implementing the batching, scheduling, and pipeline changes that recover capacity. If your team is considering GPU procurement or suspects existing capacity is underutilised, contact us.

Image credits: Freepik