Performance Engineering for Scalable Deep Learning Systems

Performance engineering for deep learning is mostly an exercise in honesty about where the time actually goes. Most teams scaling PyTorch or TensorFlow workloads do not have a hardware shortage — they have a utilisation problem. The accelerators are present, the budget is approved, and yet the wall-clock time per epoch refuses to drop in proportion to the cluster size. Before you provision another node, profile the one you already paid for.

This is the unglamorous part of the discipline. It is also where the cost compounds, because every monthly cloud invoice quietly charges for capacity the workload never touched.

What performance engineering actually optimises

The phrase “performance engineering” gets used loosely. In a deep-learning context it has a tighter meaning: the systematic work of aligning a training or inference workload with the compute, memory, and interconnect resources it runs on. Throughput is the visible target. The real target is useful FLOPs per dollar — the share of purchased compute that ends up advancing the gradient.

Three layers carry the work:

Algorithmic level. Mixed precision, gradient checkpointing (activation recomputation), sharding strategy. These trade memory for compute or vice versa.
Framework level. Kernel fusion, graph compilation (torch.compile, XLA), data-loader parallelism, communication overlap. PyTorch and TensorFlow expose these knobs; few teams turn them all.
System level. NCCL topology, NVLink vs PCIe paths, NUMA placement, storage bandwidth into the data loader, batch sizing relative to memory bandwidth.

A single slow layer dominates the rest. If your data loader cannot keep the GPU fed, no amount of FlashAttention will help. We see this pattern regularly in audits: GPU-busy percentage looks fine, SM occupancy is low, and the kernels themselves are stalled waiting on HBM. The dashboard reports green; the wallet disagrees.

Why “GPU utilisation” is the wrong number to trust

The most common failure mode is taking the GPU-busy percentage at face value. nvidia-smi reports that a kernel is running. It does not report whether that kernel is doing useful work or stalling on a memory-bound load. In our experience across GPU infrastructure engagements, “100% utilisation” frequently coexists with achieved throughput below 30% of peak FLOP/s. We explore the structural reasons in the hidden cost of GPU underutilisation, and the diagnostic pattern in utilisation, bottlenecks, and the illusion of idle GPUs.

The reliable signals are different:

Signal	What it tells you	Tool
Achieved TFLOP/s vs peak	Whether compute is the bottleneck at all	Nsight Compute, PyTorch Profiler
HBM read/write bandwidth	Memory-bound vs compute-bound regime	Nsight Compute, `dcgmi`
NCCL collective time / step	Communication overhead in distributed training	NCCL traces, PyTorch Profiler
Data loader idle time	Whether the host pipeline starves the device	`torch.profiler` with CPU activities

The sub-30% figure is an observed pattern from operational measurements, not a published benchmark. Every workload behaves differently, and the only way to know yours is to profile it.
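To make the memory-bound versus compute-bound distinction concrete, here is a minimal roofline-style sketch. It assumes you have already read FLOPs, HBM bytes, and duration for a kernel from a profiler. The function name, the device peaks, and the kernel counters are all hypothetical values chosen for illustration, not measurements or spec-sheet numbers.

```python
def roofline_report(flops, hbm_bytes, seconds, peak_flops, peak_bandwidth):
    """Classify one kernel against a simple two-roof model (compute and HBM)."""
    intensity = flops / hbm_bytes            # arithmetic intensity, FLOPs per byte
    ridge = peak_flops / peak_bandwidth      # intensity at which the two roofs meet
    attainable = min(peak_flops, intensity * peak_bandwidth)
    achieved = flops / seconds               # achieved FLOP/s
    return {
        "regime": "memory-bound" if intensity < ridge else "compute-bound",
        "intensity": intensity,
        "ridge": ridge,
        "fraction_of_peak": achieved / peak_flops,
        "fraction_of_attainable": achieved / attainable,
    }


# Hypothetical device: 100 TFLOP/s peak compute, 2 TB/s HBM bandwidth.
PEAK_FLOPS = 100e12
PEAK_BANDWIDTH = 2e12

# Hypothetical FP32 elementwise add over 1e9 elements:
# 1 FLOP per element, 12 bytes per element (two 4-byte reads, one 4-byte write).
report = roofline_report(
    flops=1e9, hbm_bytes=12e9, seconds=0.008,
    peak_flops=PEAK_FLOPS, peak_bandwidth=PEAK_BANDWIDTH,
)
print(report["regime"])
print(f"{report['fraction_of_peak']:.4%} of peak compute")
print(f"{report['fraction_of_attainable']:.0%} of what the memory roof allows")
```

In this made-up case the kernel keeps the device busy and runs close to the memory roof, yet delivers a tiny fraction of peak FLOP/s. That is the situation a busy percentage cannot show.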

Where distributed training quietly bleeds

Scaling from one device to dozens introduces failure modes that do not exist on a single GPU. Three keep recurring:

Collective communication stalls. All-reduce time grows with cluster size and with payload. If gradients are not bucketed correctly, or if the topology forces traffic across slower PCIe links instead of NVLink, gradient synchronisation becomes the bottleneck. NCCL’s NCCL_DEBUG=INFO logging and a careful look at the topology file are the starting point.
Load imbalance under sharding. Tensor parallelism and pipeline parallelism both assume balanced shards. Real models — especially those with mixture-of-experts layers or uneven attention heads — produce stragglers. One slow shard sets the step time for the whole cluster.
Storage and host-side starvation. Distributed jobs amplify any I/O weakness. A data loader that copes with one GPU collapses when eight workers hit the same NFS mount. Sharded data loading, pre-tokenised on-disk formats (WebDataset, MosaicML’s streaming format), and pinned-memory transfers are the standard mitigations.

None of these are exotic. They are the boring middle of the work, which is why they get skipped.
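A rough way to reason about the first two failure modes is the standard latency-bandwidth (alpha-beta) estimate for a ring all-reduce, plus the observation that a synchronous step finishes only when its slowest shard does. The sketch below is a back-of-envelope model, not a description of what NCCL actually schedules. The link bandwidths and latency are illustrative placeholders, not hardware specifications.

```python
def ring_allreduce_seconds(n_gpus, payload_bytes, link_bytes_per_s, latency_s):
    """Alpha-beta estimate for a ring all-reduce over n_gpus devices."""
    if n_gpus < 2:
        return 0.0
    steps = 2 * (n_gpus - 1)        # reduce-scatter phase plus all-gather phase
    chunk = payload_bytes / n_gpus  # each step moves one chunk per device
    return steps * (latency_s + chunk / link_bytes_per_s)


def synchronous_step_seconds(shard_seconds):
    """A synchronous step ends when the slowest shard ends."""
    return max(shard_seconds)


GRADIENT_BYTES = 1e9   # hypothetical gradient payload per step
LATENCY_S = 5e-6       # hypothetical per-hop latency
FAST_LINK = 300e9      # illustrative fast-interconnect bandwidth, bytes/s
SLOW_LINK = 25e9       # illustrative slow-path bandwidth, bytes/s

fast = ring_allreduce_seconds(8, GRADIENT_BYTES, FAST_LINK, LATENCY_S)
slow = ring_allreduce_seconds(8, GRADIENT_BYTES, SLOW_LINK, LATENCY_S)
print(f"8 GPUs, fast links: {fast * 1e3:.2f} ms")
print(f"8 GPUs, slow links: {slow * 1e3:.2f} ms")

# One straggling shard sets the pace for everyone.
print(synchronous_step_seconds([0.20, 0.21, 0.20, 0.34]))
```

Under these assumptions the same payload costs roughly an order of magnitude more over the slow path, which is why topology is worth checking before tuning anything else.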

How does framework-level tuning actually help?

The frameworks ship with capable defaults, but the defaults are conservative. A handful of changes pay back quickly on most workloads:

Mixed precision (FP16 or BF16) roughly doubles arithmetic throughput on tensor-core hardware and halves activation memory. The accuracy cost is usually within noise; see mixed precision works by exploiting numerical tolerance for why.
Graph compilation (torch.compile, tf.function with XLA) fuses small kernels into larger ones and reduces Python-side dispatch overhead. The gain is largest on models with many small operators.
Gradient checkpointing trades a recomputation pass for a reduction in activation memory, which lets you grow batch size and reclaim arithmetic intensity.
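Communication overlap starts the all-reduce on early gradients before the backward pass finishes. PyTorch’s DistributedDataParallel does this by default through gradient bucketing; other wrappers and custom training loops may need it enabled or tuned explicitly, so check rather than assume.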

The right combination depends on whether the workload is compute-, memory-, or communication-bound. Profile first.
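One way to turn "profile first" into a decision is to split a measured step into buckets and look at the largest one. The sketch below assumes you have already extracted four numbers from a trace: time waiting on input, exposed (non-overlapped) communication time, total kernel time, and the share of kernel time spent in memory-bound kernels. The bucket names and the knob mapping are a loose reading of the list above, not a rule from any framework.

```python
KNOBS = {
    "input": ["more data-loader workers", "sharded loading", "pinned-memory transfers"],
    "communication": ["overlap all-reduce with backward", "check gradient bucketing", "check link topology"],
    "memory": ["graph compilation to fuse small kernels", "mixed precision", "gradient checkpointing plus a larger batch"],
    "compute": ["mixed precision on tensor-core hardware"],
}


def classify_step(data_wait_s, exposed_comm_s, kernel_s, memory_bound_share):
    """Return the dominant bucket of a step and the knobs worth trying first.

    memory_bound_share is the fraction (0..1) of kernel time spent in
    memory-bound kernels, as judged from profiler counters.
    """
    if not 0.0 <= memory_bound_share <= 1.0:
        raise ValueError("memory_bound_share must be between 0 and 1")
    buckets = {
        "input": data_wait_s,
        "communication": exposed_comm_s,
        "memory": kernel_s * memory_bound_share,
        "compute": kernel_s * (1.0 - memory_bound_share),
    }
    dominant = max(buckets, key=buckets.get)
    return dominant, KNOBS[dominant]


# Hypothetical step: 20 ms input wait, 50 ms exposed communication,
# 200 ms of kernels of which 70% is memory-bound.
dominant, knobs = classify_step(0.02, 0.05, 0.20, 0.7)
print(dominant, knobs)
```

The value of the exercise is the ordering, not the exact thresholds: it stops you from tuning kernels on a step that is mostly waiting on input.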

A practical sequence before procuring more capacity

When an infrastructure lead asks whether the next quarter needs more GPUs, the answer is rarely “no” outright — but it is rarely “yes” without this sequence either:

Capture a profiler trace of a representative training step on the current fleet.
Compute achieved TFLOP/s and compare to the device’s peak for the relevant precision.
If achieved is below ~40% of peak, the next dollar belongs to optimisation, not procurement.
Quantify the gap. The cost framing — TCO per useful FLOP, not per purchased FLOP — is the conversation we develop in the hidden cost of GPU underutilisation.
Only then size the additional capacity, using the corrected throughput as the planning baseline.

This is not a universal rule. Some workloads are already well-tuned and genuinely need more silicon. The point is that the sequence keeps you from paying twice for the same shortfall.
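The middle steps of the sequence reduce to a few lines of arithmetic. The sketch below assumes a dense transformer and uses the common approximation of about 6 FLOPs per parameter per token for a forward and backward pass; other architectures need their own FLOP count. The model size, token count, step time, and peak figure are hypothetical, and the 40% threshold is the rough guide from the list above.

```python
def training_flops_per_step(n_params, tokens_per_step):
    """Approximate training FLOPs per step.

    Assumption: about 6 FLOPs per parameter per token (forward plus backward)
    for a dense transformer.
    """
    return 6.0 * n_params * tokens_per_step


def procurement_check(n_params, tokens_per_step, step_seconds, n_gpus,
                      peak_tflops_per_gpu, threshold=0.40):
    """Return achieved TFLOP/s per GPU, fraction of peak, and a verdict."""
    flops = training_flops_per_step(n_params, tokens_per_step)
    achieved_tflops = flops / step_seconds / n_gpus / 1e12
    fraction = achieved_tflops / peak_tflops_per_gpu
    verdict = "optimise first" if fraction < threshold else "size additional capacity"
    return achieved_tflops, fraction, verdict


# Hypothetical job: 7e9 parameters, 2e6 tokens per step, 10 s per step,
# 32 GPUs, and a made-up peak of 1000 TFLOP/s per GPU at the training precision.
achieved, fraction, verdict = procurement_check(
    n_params=7e9, tokens_per_step=2e6, step_seconds=10.0,
    n_gpus=32, peak_tflops_per_gpu=1000.0,
)
print(f"{achieved:.1f} TFLOP/s per GPU, {fraction:.1%} of peak: {verdict}")
```

If gradient checkpointing is on, the hardware executes extra recomputation FLOPs that this count deliberately leaves out, since they do not advance the gradient.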

FAQ

How do I calculate the true cost of an underutilised GPU fleet? Multiply hourly cost by hours running to get the bill, then divide by the FLOPs the workload actually executed (achieved TFLOP/s multiplied by seconds running) rather than by the FLOPs the device could have executed at peak. The ratio of achieved to peak tells you what fraction of the bill is actually buying computation. The rest is waste.

What does “GPU utilisation” actually measure, and why is the GPU-busy percentage misleading? The busy percentage reports whether a kernel is scheduled on the device. It does not report achieved FLOP/s or achieved memory bandwidth. A memory-bound kernel can pin the busy percentage at 100% while delivering a small fraction of peak FLOPs.

How do I compute total cost of ownership per useful FLOP rather than per purchased FLOP? Take the standard TCO components — hardware amortisation or cloud rental, power, cooling, operations — and divide by the operationally measured throughput of your actual workload, not the spec-sheet peak. The gap between the two ratios is the optimisation headroom.
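As a worked illustration of the per-useful-FLOP framing, the sketch below sums a set of cost components and divides by FLOPs delivered at the measured rate and at the spec-sheet rate. Every figure is hypothetical: the cost breakdown, fleet size, hours, and both throughput numbers are placeholders to show the arithmetic, not reference prices.

```python
def cost_per_exaflop(total_cost, tflops_per_gpu, n_gpus, hours):
    """Cost divided by total FLOPs delivered at the given rate, per 1e18 FLOPs."""
    total_flops = tflops_per_gpu * 1e12 * n_gpus * hours * 3600.0
    return total_cost / (total_flops / 1e18)


# Hypothetical monthly cost components, in dollars.
monthly_tco = {
    "rental_or_amortisation": 60_000.0,
    "power_and_cooling": 5_000.0,
    "operations": 10_000.0,
}
total_cost = sum(monthly_tco.values())

N_GPUS = 32
HOURS = 720.0
PEAK_TFLOPS = 1000.0      # made-up spec-sheet peak per GPU
ACHIEVED_TFLOPS = 262.5   # made-up measured throughput per GPU

per_purchased = cost_per_exaflop(total_cost, PEAK_TFLOPS, N_GPUS, HOURS)
per_useful = cost_per_exaflop(total_cost, ACHIEVED_TFLOPS, N_GPUS, HOURS)
useful_share = ACHIEVED_TFLOPS / PEAK_TFLOPS

print(f"per purchased EFLOP: ${per_purchased:.2f}")
print(f"per useful EFLOP:    ${per_useful:.2f}")
print(f"share of bill buying computation: {useful_share:.1%}")
print(f"spend not buying computation: ${total_cost * (1 - useful_share):,.0f}")
```

The ratio between the two per-EFLOP figures is simply peak divided by achieved, so the cost conversation and the profiling conversation are the same conversation in different units.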

Which workload patterns most often leave GPU capacity on the table? Small batch sizes relative to memory capacity, data-loader starvation, unfused small kernels, communication-bound distributed training, and training left on FP32 defaults instead of mixed precision. Each is a recurring pattern in our audits.

Should I procure additional GPU capacity or first profile the utilisation of what I have? Profile first when achieved utilisation has never been measured. The cost of a profiling pass is hours; the cost of a wrong procurement decision is months of cloud spend or a fixed-asset commitment that does not match the workload.

What cost savings are realistic from optimising utilisation versus renting more cloud GPUs? We avoid a universal number — savings vary with the starting state of the workload. The honest framing is that workloads which have never been profiled almost always have headroom worth measuring before the next procurement cycle.

Performance engineering is less about finding clever optimisations and more about refusing to scale a workload you have not measured. The audit comes first.
