AI GPU Utilization Testing: What GPU-Util Means and What It Misses

GPU utilization percentage is a poor AI performance metric

nvidia-smi reports GPU utilization as a percentage. Teams running AI workloads often treat this number as a performance indicator: high utilization equals good, low utilization equals wasted compute. That interpretation is wrong often enough to cause real engineering and procurement mistakes — and it is a clean example of why synthetic benchmarks systematically fail to match real production workloads.

GPU utilization percentage measures the fraction of the last sampling window during which the GPU was executing at least one kernel. The window length depends on the product; NVIDIA documents it as between 1/6 second and 1 second. A GPU running a single inefficient kernel 100% of the time shows 100% utilization. A GPU running the same workload 10× faster also shows 100% utilization. The headline numbers are identical; the performance is not. This is a benchmark-realism problem disguised as a metric: the number is honest, but the question it answers is not the one most teams think they are asking.

What does nvidia-smi GPU utilization actually measure?

From NVIDIA’s published documentation, the GPU utilization metric is “Percent of time over the past sample period during which one or more kernels was executing on the GPU.” That is a presence measure: it records whether the GPU was doing anything during the sample period, not how efficiently it was working, how many of its compute units were active, or whether the work was useful for the model being served.

The metric was designed for fleet monitoring — was this card idle, or did something run on it? — and not for performance characterisation. Reading it as a performance metric is a category error. The card can be 100% busy moving memory around, or 100% busy executing a poorly fused custom kernel, and nvidia-smi cannot tell the difference between either case and a well-optimised compute-bound training step.
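The difference between a presence signal and a productivity signal can be sketched with a toy model. This is not how the driver computes the number; it is a simulation under stated assumptions: each kernel is a hypothetical (start, end, fraction of SMs kept busy) tuple, kernels in the weighted calculation do not overlap, and the window length and SM fractions are illustrative values rather than measurements.

```python
# Toy model: "was any kernel running?" versus "how much of the GPU was working?"
# Kernel tuples are (start_s, end_s, sm_fraction); all values are illustrative.

def presence_utilization(kernels, window_s):
    """Percent of the window during which at least one kernel was running."""
    busy = 0.0
    covered_until = 0.0
    for start, end, _ in sorted(kernels):
        start = max(start, covered_until)  # skip time already counted
        end = min(end, window_s)           # clip to the sampling window
        if end > start:
            busy += end - start
            covered_until = end
    return 100.0 * busy / window_s


def sm_weighted_activity(kernels, window_s):
    """Percent of SM-time actually used; assumes kernels do not overlap."""
    used = 0.0
    for start, end, sm_fraction in kernels:
        duration = max(0.0, min(end, window_s) - max(start, 0.0))
        used += duration * sm_fraction
    return 100.0 * used / window_s


WINDOW_S = 1.0
efficient_kernel = [(0.0, 1.0, 0.95)]  # keeps most SMs busy
poor_kernel = [(0.0, 1.0, 0.10)]       # leaves most SMs idle

for name, kernels in [("efficient", efficient_kernel), ("poor", poor_kernel)]:
    print(name,
          presence_utilization(kernels, WINDOW_S),
          round(sm_weighted_activity(kernels, WINDOW_S), 1))
```

Both kernels report the same presence-based utilization, while the SM-weighted figure separates them. Only the first kind of number is what nvidia-smi reports.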

Where the utilization interpretation breaks

A small table sharpens the failure modes. Each row shows a real workload shape that produces a misleading headline number, and what the actual state of the hardware is underneath.

Situation	GPU utilization (nvidia-smi)	Actual state
Training well-optimised large model	~100%	Efficient — compute-bound
Training with data-loading bottleneck	~100% (during compute)	Inefficient — bubbles between compute bursts
Inference at batch=1	Often 40–70%	Expected for latency-optimised serving
Memory-bandwidth-bound operation	~100%	Expected — limited by memory, not compute
Poorly optimised custom kernel	~100%	Inefficient — many compute units idle

100% GPU utilization can mean either “the hardware is being used efficiently” or “there is a kernel running that is not efficiently using the GPU’s compute units.” The metric does not distinguish. This is the structural mismatch that benchmark consumers most often miss: utilization is a presence signal, not a productivity signal, and a benchmark that reports only utilization is benchmark-shaped without being workload-shaped.

Better metrics for AI GPU performance

When the question is “is this GPU doing useful work for my model?”, the answer has to come from a different layer of the stack. The metrics below are what we collect alongside utilization in any serious benchmarking pass.

Metric	What it measures	How to get it
MFU (Model FLOPs Utilization)	Fraction of theoretical peak FLOP/s achieved	Calculated manually from throughput
SM occupancy	Ratio of active warps per SM to the maximum the SM supports	Nsight Compute
Memory bandwidth utilization	Fraction of peak bandwidth used	Nsight Compute / DCGM
Actual throughput (items/sec)	The outcome the system exists for	Application-level measurement

Reporting any single one of these in isolation is also misleading — that is the broader benchmark-realism point — but together they form a triangulation that exposes which resource is actually the bottleneck.
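The manual MFU calculation can be sketched as follows. It assumes the common approximation of about 6 FLOPs per parameter per token for dense transformer training (forward plus backward) and ignores attention FLOPs; forward-only inference would use roughly 2 instead. The model size and throughput are illustrative, not measured, and the peak figure (312 TFLOP/s, the dense BF16 number NVIDIA publishes for the A100) should be replaced with the datasheet value for your own hardware and precision.

```python
def model_flops_utilization(n_params, tokens_per_sec, peak_flops_per_sec,
                            flops_per_param_per_token=6.0):
    """Achieved model FLOP/s divided by the hardware's theoretical peak.

    flops_per_param_per_token=6.0 is an approximation for dense transformer
    training; it is an assumption, not a measured quantity.
    """
    achieved_flops_per_sec = flops_per_param_per_token * n_params * tokens_per_sec
    return achieved_flops_per_sec / peak_flops_per_sec


# Illustrative inputs (assumptions): 7B parameters, 3,000 tokens/s per GPU.
mfu = model_flops_utilization(
    n_params=7e9,
    tokens_per_sec=3000,
    peak_flops_per_sec=312e12,
)
print(f"MFU: {mfu:.1%}")
```

With these inputs the result is roughly 40%, a figure that can sit alongside a utilization reading near 100% without any contradiction.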

Why does high GPU utilization not mean high performance?

A GPU showing 100% utilization can still be performing poorly. The utilization metric from nvidia-smi indicates that at least one CUDA kernel was active during each sampling period — it says nothing about what that kernel was doing. A memory-copy kernel, a poorly parallelised custom kernel, or an inefficient attention implementation all show as 100% utilization while leaving most of the GPU’s compute units idle.

The distinction matters for capacity planning. A system reporting 95% GPU utilization appears to have no headroom, but profiling can reveal that a large fraction of that time is spent on suboptimal kernels that could be replaced with fused or vendor-optimised alternatives. In our own engagements we have seen cases where replacing a hand-written CUDA kernel with a cuDNN-optimised equivalent cut inference time substantially with no change in the headline GPU utilization percentage. The card stayed near 100% busy; each unit of busy time simply processed more useful tokens. This is a pattern observed in project work, not a general benchmark claim.

For benchmark testing, GPU utilization should always be reported alongside throughput in samples per second or tokens per second. If two configurations both show roughly 98% utilization but configuration A processes meaningfully more samples per second than configuration B, configuration A is more efficient despite identical utilization. This is the typical shape of a result when one configuration uses optimised paths — FlashAttention, torch.compile, TensorRT, fused kernels via Triton — that extract more useful work from each GPU cycle.
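A minimal sketch of an application-level throughput measurement that could be reported next to utilization. The helper name and its parameters are hypothetical. The two "configurations" here are CPU stand-ins so the example runs anywhere; on a GPU, `sync_fn` would need to be a real synchronisation call such as `torch.cuda.synchronize`, because kernel launches are asynchronous and an unsynchronised timer measures launch time rather than execution time.

```python
import time


def measure_throughput(step_fn, items_per_step, warmup_steps=3,
                       timed_steps=20, sync_fn=None):
    """Return items processed per second for step_fn.

    sync_fn, if given, is called before starting and before stopping the
    timer so that asynchronous device work is included in the measurement.
    """
    for _ in range(warmup_steps):  # exclude one-off startup costs
        step_fn()
    if sync_fn is not None:
        sync_fn()
    start = time.perf_counter()
    for _ in range(timed_steps):
        step_fn()
    if sync_fn is not None:
        sync_fn()
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against zero
    return items_per_step * timed_steps / elapsed


# CPU stand-ins for two configurations doing the same logical work per step.
def config_a():
    return sum(range(10_000))


def config_b():
    return sum(range(100_000))


throughput_a = measure_throughput(config_a, items_per_step=32)
throughput_b = measure_throughput(config_b, items_per_step=32)
print(f"A: {throughput_a:.0f} items/s, B: {throughput_b:.0f} items/s")
```

Recording this number for every configuration, at a declared batch size and concurrency, is what lets two runs with matching utilization be told apart.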

Workload shape dominates the signature you see

Different AI workload types produce different utilization signatures, and understanding these signatures is what makes a benchmark a useful proxy for production rather than a synthetic artefact.

Training workloads typically show high, steady GPU utilization with periodic dips that correspond to gradient synchronisation in distributed training. Shallow dips suggest efficient collective communication via NCCL or NVLink; deep dips point at a communication bottleneck, often a PCIe topology issue or a poor sharding choice. The dip shape, not the average, carries the signal.
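Reading the dip shape rather than the average can be sketched as a small trace analysis. This assumes a utilization trace sampled finely enough to resolve synchronisation dips; the threshold and the trace values are illustrative, not measurements.

```python
def find_dips(trace, threshold_pct=90.0):
    """Return (start_index, length_in_samples, minimum_value) for every
    run of consecutive samples that falls below threshold_pct."""
    dips = []
    start = None
    for i, value in enumerate(trace):
        if value < threshold_pct:
            if start is None:
                start = i  # a new dip begins here
        elif start is not None:
            dips.append((start, i - start, min(trace[start:i])))
            start = None
    if start is not None:  # trace ended while still inside a dip
        dips.append((start, len(trace) - start, min(trace[start:])))
    return dips


# Illustrative trace: one shallow, short dip and one deep, longer dip.
trace = [99, 98, 85, 99, 99, 40, 35, 45, 99, 98]
average = sum(trace) / len(trace)

for start, length, depth in find_dips(trace):
    print(f"dip at sample {start}: {length} samples long, minimum {depth}%")
print(f"average utilization: {average:.1f}%")
```

The average hides the fact that the second dip is three times longer and far deeper than the first, which is the detail that separates an efficient collective from a communication bottleneck.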

Inference serving behaves differently. Utilization tracks request load: at low concurrency, both utilization and latency are low; as concurrency rises, utilization climbs while latency stays roughly flat until a saturation point, beyond which utilization plateaus near 100% and tail latency rises sharply. The benchmark question is not “what is the peak utilization?” but “where is the saturation knee, and what is the tail latency just before it?” That knee is workload-specific, and a single-stream synthetic benchmark cannot see it at all. The realism question here is not binary; it is about whether the test reproduces the concurrency, queuing, and request-size variability the production system actually experiences.
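One simple way to locate the saturation knee from a concurrency sweep is sketched below. The rule used here (the highest concurrency whose p99 latency stays within a fixed multiple of the lowest-concurrency p99) is an assumption chosen for illustration, not a standard definition, and the sweep values are made up to show the shape.

```python
def find_saturation_knee(points, latency_factor=2.0):
    """points: iterable of (concurrency, p99_latency_ms).

    Returns the last (concurrency, p99) point, in order of increasing
    concurrency, whose p99 is still within latency_factor times the p99
    measured at the lowest concurrency.
    """
    ordered = sorted(points)
    baseline_p99 = ordered[0][1]
    knee = ordered[0]
    for concurrency, p99 in ordered:
        if p99 > latency_factor * baseline_p99:
            break  # tail latency has left the flat region
        knee = (concurrency, p99)
    return knee


# Illustrative sweep: latency is flat, then rises sharply past saturation.
sweep = [(1, 20), (2, 21), (4, 22), (8, 24), (16, 30), (32, 95), (64, 400)]
knee_concurrency, knee_p99 = find_saturation_knee(sweep)
print(f"knee at concurrency {knee_concurrency}, p99 {knee_p99} ms")
```

A single-stream benchmark would only ever produce the first point of this sweep, so it has no way to reveal where the knee sits.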

For teams setting up GPU benchmarking practices, the practical rule is to collect all three metrics — utilization percentage, achieved memory bandwidth, achieved arithmetic throughput — from the first benchmark run, even if only one seems relevant. The complete dataset enables retrospective analysis when a regression appears months later. Collecting incomplete metrics initially and bolting on more later produces a fragmented history with no consistent baseline across time periods, which is one of the most common procurement-grade benchmark failures we encounter.
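A fixed record schema from the first run is one way to keep the history consistent. The sketch below uses only the standard library; the field names, units, and values are hypothetical choices for illustration, and in practice the stream would be a file opened in append mode rather than an in-memory buffer.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class BenchmarkRecord:
    """One row per benchmark run; field names and units are assumptions."""
    run_id: str
    gpu_utilization_pct: float
    memory_bandwidth_gbps: float
    achieved_tflops: float
    throughput_items_per_sec: float


def append_records(stream, records, write_header):
    """Write records as CSV rows, always with the same column order."""
    writer = csv.DictWriter(
        stream, fieldnames=[f.name for f in fields(BenchmarkRecord)]
    )
    if write_header:
        writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))


# Illustrative values only, not measurements.
buffer = io.StringIO()
append_records(
    buffer,
    [BenchmarkRecord("run-001", 98.0, 1200.0, 125.0, 3000.0)],
    write_header=True,
)
print(buffer.getvalue())
```

Because every run carries every column, a regression found months later can be compared against the earliest baseline without gaps.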

“GPU utilization is not performance” walks through the full reasoning behind why the headline number misleads. LynxBench AI treats GPU utilization as one of three axes (utilization, sustained throughput, and effective work per cycle) that must be reported together, because the headline utilization percentage can rise while the useful work the GPU is doing for the model falls.

A correct interpretation pattern

When diagnosing AI performance, the order of operations matters:

1. Measure actual throughput first (tokens/sec, images/sec, requests/sec at a declared concurrency).
2. Check whether memory bandwidth is saturated, using Nsight Compute or DCGM.
3. Only then interpret GPU utilization as context, never as the headline metric.
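The order of operations above can be sketched as a small decision function. The function name, the thresholds (80% of peak bandwidth counted as saturated, 90% utilization counted as busy), and the wording of the diagnoses are all hypothetical and would need tuning for a real system; the point is only the order in which the signals are consulted.

```python
def diagnose(throughput, target_throughput, mem_bw_fraction, gpu_util_pct,
             bw_saturated_at=0.8, busy_at=90.0):
    """Consult throughput first, memory bandwidth second, utilization last.

    mem_bw_fraction is achieved bandwidth divided by peak (0.0 to 1.0).
    Thresholds are illustrative assumptions, not recommended values.
    """
    # Step 1: the outcome the system exists for.
    if throughput >= target_throughput:
        return "meets target: treat utilization as context only"
    # Step 2: is memory bandwidth the limiting resource?
    if mem_bw_fraction >= bw_saturated_at:
        return "memory-bandwidth-bound: reduce data movement or fuse kernels"
    # Step 3: only now does utilization add information.
    if gpu_util_pct >= busy_at:
        return "busy but slow: profile kernels for inefficient implementations"
    return "GPU underfed: check data loading, batching, or host-side stalls"


print(diagnose(600, 1000, mem_bw_fraction=0.4, gpu_util_pct=99))
```

Note that utilization never appears in the first two branches: a high reading cannot override a throughput shortfall, and a low reading does not matter if the target is met.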

The question to put to any GPU-utilization-driven performance claim is whether the utilization number is paired with throughput on the actual workload at a declared operating point, or whether it is being read as a proxy for productive work it does not measure. Utilization percentage is one of the cleanest examples of this failure class: a number that is technically correct, widely reported, and almost always read as something it is not. Does the GPU utilization percentage actually tell you anything about the throughput-per-watt your workload delivers under sustained load, or is it a number that answers a different question entirely?

Frequently Asked Questions

What does the nvidia-smi GPU utilization percentage actually measure?

It reports the percent of time over the last sampling window during which at least one CUDA kernel was executing on the GPU. That is a binary presence signal — was the card doing anything — not a measure of how efficiently its compute units were used or whether the work was useful for the model being served. A card 100% busy on a memory-copy kernel and a card 100% busy on a well-fused compute-bound step both read the same.

Which metrics should I collect alongside GPU utilization in a benchmark run?

Collect actual throughput (tokens/sec, images/sec, or requests/sec at a declared concurrency), achieved memory bandwidth and SM occupancy from Nsight Compute or DCGM, and Model FLOPs Utilization calculated from throughput. We recommend capturing all of these from the very first run rather than bolting them on later, because a fragmented metric history has no consistent baseline when a regression surfaces months down the line. Together they triangulate which resource is the real bottleneck.

How can a GPU show 95–98% utilization and still leave performance on the table?

High utilization only means a kernel was active, not that the kernel was efficient. Profiling can reveal that much of that busy time is spent on suboptimal kernels that fused or vendor-optimised alternatives — cuDNN, FlashAttention, TensorRT, Triton-fused paths — would replace with more useful work per cycle. We have seen inference time drop substantially after such a swap with no change in the headline utilization number.

What utilization signature should I expect from training versus inference serving?

Training workloads typically show high, steady utilization with periodic dips at gradient synchronisation; shallow dips indicate efficient NCCL or NVLink collectives, deep dips point at a communication or sharding bottleneck. Inference serving instead tracks request load — utilization climbs with concurrency while latency stays flat until a saturation knee, beyond which tail latency rises sharply. The diagnostic question for serving is where that knee sits and what tail latency looks like just before it, which a single-stream benchmark cannot reveal.