Increasing GPU performance for AI workloads is not primarily about changing hardware — it’s about using the hardware you have more effectively. In our experience, most production AI inference systems operate at 30–60% GPU utilization when first deployed. Getting to 80–90% is almost always an engineering problem, not a budget problem. The techniques that actually move the needle, in rough order of impact, are: batch sizing, operator fusion, memory access optimization, and kernel occupancy tuning. Each addresses a different constraint. Applying the wrong fix for the bottleneck produces no improvement.

## Profile First

This cannot be overstated. The approaches below target different bottlenecks — memory bandwidth, compute throughput, launch overhead, CPU synchronization. Without profiling, you don’t know which one limits your workload. The profiling workflow using Nsight Systems and Nsight Compute is covered in GPU Accelerating RF Signal Propagation Simulation, and the same methodology applies to any AI workload.

The one-line decision: run Nsight Systems and confirm GPU utilization, then check whether the idle periods are compute gaps, memory transfer stalls, or CPU-side overhead.

## Batch Sizing

What does this mean in practice? For most AI inference workloads, increasing batch size is the single highest-impact change for throughput. GPU hardware is designed for massively parallel execution. A batch of 1 leaves the vast majority of compute units idle; a batch of 32 amortizes kernel launch overhead and fills more of the available warp slots. The relationship between batch size and throughput follows a curve:

- **Batch 1–4:** Typically memory-bandwidth-limited, with low arithmetic intensity. Most of the GPU is idle.
- **Batch 8–32:** Throughput increases near-linearly for many models. This is the efficient operating region for many inference scenarios.
- **Batch 64–256:** Compute-bound for most transformer models. Throughput gains slow as arithmetic intensity climbs past the memory-bandwidth roof and peak compute becomes the limit.
- **Batch >256:** For LLMs, typically memory-bound again as KV cache growth dominates; CNN architectures remain compute-bound.

The constraint is latency: higher batch size increases time-to-first-response. For latency-sensitive APIs (p99 < 100ms SLA), practical batch sizes are limited. For throughput-optimized offline inference, batch size should be pushed to the memory limit.

How to find the optimal batch size:

```python
import time

import torch

# Assumes `model` and `input_shape` are already defined.
model = model.cuda().eval()

with torch.no_grad():  # skip autograd bookkeeping for a fair inference benchmark
    for batch_size in [1, 2, 4, 8, 16, 32, 64, 128]:
        x = torch.randn(batch_size, *input_shape).cuda()

        # Warmup: absorb one-time costs (allocator growth, autotuning)
        for _ in range(5):
            _ = model(x)
        torch.cuda.synchronize()

        # Kernel launches are asynchronous, so synchronize before and
        # after the timed region to measure actual GPU time.
        t0 = time.perf_counter()
        for _ in range(50):
            _ = model(x)
        torch.cuda.synchronize()
        elapsed = (time.perf_counter() - t0) / 50

        print(f"Batch {batch_size}: {elapsed*1000:.1f}ms, "
              f"{batch_size/elapsed:.0f} samples/s")
```

## Operator Fusion

Unfused inference executes each operation as a separate GPU kernel: linear projection, activation, another linear projection, layer norm — each reads from and writes back to HBM. Fused kernels chain these operations, keeping intermediate results in registers or shared memory and eliminating multiple HBM round-trips.
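To see what fusion removes, here is a minimal sketch, assuming PyTorch 2.x on a CUDA device; the `mlp_unfused` function and the tensor shapes are hypothetical, chosen only to illustrate the pattern. Run eagerly, each operation is its own kernel and each intermediate makes an HBM round-trip; compiled, the element-wise activation can be fused with its neighbors:

```python
import torch

def mlp_unfused(x, w1, w2):
    h = x @ w1                        # kernel 1: GEMM, writes h to HBM
    a = torch.nn.functional.silu(h)   # kernel 2: reads h, writes a
    return a @ w2                     # kernel 3: reads a

# torch.compile (inductor backend) can fuse the element-wise activation
# with its neighbors so the intermediate stays out of HBM.
mlp_fused = torch.compile(mlp_unfused)

# Illustrative sizes, not taken from any specific model.
x = torch.randn(32, 4096, device="cuda")
w1 = torch.randn(4096, 11008, device="cuda")
w2 = torch.randn(11008, 4096, device="cuda")

y = mlp_fused(x, w1, w2)  # first call compiles; later calls reuse the fused kernels
```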
Concrete fusion opportunities for transformer inference:

| Unfused operations | Fused version | Benefit |
| --- | --- | --- |
| Q, K, V projection → attention → softmax → weighted sum | FlashAttention | 2–4x attention kernel speedup |
| LayerNorm → linear projection | Fused kernel (custom or Triton) | 1.3–1.8x |
| Element-wise activation + gate multiply (SwiGLU) | Fused kernel | 1.5–2x for this operation |
| Residual add + LayerNorm | apex `FusedLayerNorm`, or `torch.compile` | 1.2–1.5x |

`torch.compile` with `mode="reduce-overhead"` or `mode="max-autotune"` performs automatic fusion using the inductor backend. This is the first thing to try before writing custom fused kernels:

```python
model = torch.compile(model, mode="max-autotune")
```

In our experience, `torch.compile` delivers 15–40% throughput improvement on modern transformer architectures with no code changes beyond this single line.

## Kernel Occupancy

Occupancy is the ratio of active warps to the maximum number of warps an SM can support. Low occupancy means the SM has idle cycles it cannot fill due to resource constraints (registers, shared memory, or block configuration). Check occupancy with Nsight Compute: look at the “Achieved Occupancy” metric and compare it to the theoretical maximum. If achieved occupancy is below 50%, investigate:

- **Register pressure:** Too many registers per thread limits how many threads can reside on an SM simultaneously. Compile with `-maxrregcount=64` to cap registers and check whether spilling occurs.
- **Shared memory per block:** Large shared memory allocations limit concurrent blocks per SM. Check with `--ptxas-options=-v`.
- **Block size:** Very small block sizes (e.g., 32 threads) waste scheduler slots. 128–256 threads per block is a common starting point.

Occupancy is not always the binding constraint — a memory-bound kernel at 50% occupancy may already be saturating HBM bandwidth. Increasing occupancy improves throughput only for compute-bound kernels with insufficient warps to hide latency.

## Memory Coalescing

For custom kernels, or cases where profiling shows low memory throughput, check memory access coalescing. A coalesced access occurs when the 32 threads of a warp access 32 consecutive memory addresses — the GPU satisfies this with a single memory transaction.

Signs of uncoalesced access:

- Nsight Compute shows an L1/TEX cache hit rate near 0% with the L2 cache hit rate also low
- Memory throughput percentage far below the bandwidth ceiling despite a memory-bound classification
- Global Memory Load Efficiency metric below 50%

Coalescing problems are common when row-major matrices are accessed in column-major traversal order, and in transpose operations without shared-memory tiling. The fix is to reorganize the data layout or use shared memory as a staging buffer with coalesced loads.

## Asynchronous Data Loading

A common bottleneck that profiling reveals but developers overlook: the GPU is idle because the next batch isn’t ready yet. The CPU is busy preprocessing or loading data while the GPU waits. Fix with CUDA streams and prefetching:

```python
from torch.utils.data import DataLoader

# PyTorch DataLoader with pinned memory enables async H2D transfer
loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,
    pin_memory=True,
    prefetch_factor=2,
)
```

`pin_memory=True` allocates host memory as pinned (non-pageable) memory, enabling faster DMA transfers. `prefetch_factor=2` has each worker load batches ahead of consumption, so the next batch is ready while the GPU processes the current one.
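On the consumption side, pinned memory only pays off if the copy is issued asynchronously and nothing forces an early synchronization. Here is a minimal sketch of the consuming loop, assuming the `loader` above; `model` and `loss_fn` are hypothetical stand-ins:

```python
import torch

results = []
with torch.inference_mode():
    for x, y in loader:
        # non_blocking=True overlaps the H2D copy with GPU compute;
        # it is only truly asynchronous because pin_memory=True is set.
        x = x.cuda(non_blocking=True)
        y = y.cuda(non_blocking=True)

        out = model(x)

        # Anti-pattern: calling .item() or .numpy() here would force a
        # CPU-GPU synchronization every iteration. Keep values on the
        # device and synchronize once at the end instead.
        results.append(loss_fn(out, y).detach())

mean_loss = torch.stack(results).mean().item()  # single sync point
```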
## Performance Improvement Checklist

1. Profile with Nsight Systems — confirm GPU utilization and identify idle gaps
2. Increase batch size to the maximum allowed by latency SLA and VRAM
3. Apply `torch.compile(model, mode="max-autotune")` as the first code change
4. Enable pinned memory and prefetching in the data pipeline
5. Check for synchronous CUDA operations blocking the CPU (`.item()`, `.numpy()` on GPU tensors)
6. Profile specific slow kernels with Nsight Compute — check occupancy and memory efficiency
7. For custom kernels: verify memory coalescing with the Global Memory Load Efficiency metric
8. Consider operator fusion for repeated sequences of element-wise operations

## In brief

GPU performance improvement for AI starts with batch sizing (highest leverage, zero kernel work), proceeds through `torch.compile`-based operator fusion (low effort, significant gain), and then addresses the specific kernel bottlenecks identified by profiling. Occupancy tuning and memory coalescing are meaningful only for compute-bound kernels where the profiling data confirms those are the binding constraints. Profiling before every optimization is not optional — it’s how you avoid spending a week optimizing the wrong kernel.