# Burst benchmarks overstate AI capacity

This is something we pay close attention to in our benchmarking work. When teams benchmark a PC for AI workloads, they typically run a short test (30–120 seconds) and record the throughput. The result — peak burst performance — overstates the sustained capacity that matters for production workloads. AI training runs for hours. Inference servers run continuously. The relevant performance metric for capacity planning is steady-state throughput: what the system delivers over an extended run, after thermals have stabilized and transient effects have dissipated.

## Why steady-state differs from burst

- **Thermal throttling:** GPUs and CPUs boost clock speeds when cool, then reduce them as die temperatures rise. A typical GPU reaches thermal equilibrium after 5–15 minutes. The performance at equilibrium — which may be 10–30% lower than peak — is the capacity planning number.
- **Memory pressure:** Extended training runs fill GPU memory with gradients, optimizer state, and activation buffers. Memory allocation patterns at steady state differ from those of the first few iterations.
- **Data loading equilibrium:** I/O pipelines take time to saturate. Benchmark samples from the first 60 seconds may include the startup ramp before data loading is fully pipelined.

## Steady-state benchmark protocol

```python
import subprocess
import time

import torch

def get_gpu_temp():
    """Read the current GPU die temperature via nvidia-smi."""
    result = subprocess.run(
        ['nvidia-smi', '--query-gpu=temperature.gpu',
         '--format=csv,noheader,nounits'],
        capture_output=True, text=True)
    return float(result.stdout.strip())

# Load model (load_your_model, sample_input, and batch_size are
# placeholders for your own model, input batch, and batch size)
model = load_your_model().cuda().half()
model.eval()

# Warmup phase (allow thermals to stabilize)
print("Warming up...")
start_warmup = time.time()
while time.time() - start_warmup < 300:  # 5 minutes warmup
    with torch.no_grad():
        output = model(sample_input)

torch.cuda.synchronize()  # drain queued GPU work before sampling
temp_at_steady_state = get_gpu_temp()
print(f"GPU temperature: {temp_at_steady_state}°C")

# Measurement phase
print("Measuring steady-state throughput...")
samples_processed = 0
start_measure = time.time()
while time.time() - start_measure < 600:  # 10 minutes measurement
    with torch.no_grad():
        output = model(sample_input)
    samples_processed += batch_size

torch.cuda.synchronize()  # ensure all measured work has finished
elapsed = time.time() - start_measure
steady_throughput = samples_processed / elapsed
print(f"Steady-state throughput: {steady_throughput:.0f} samples/sec")
```

## Recording the benchmark

A complete steady-state benchmark report should include:

| Metric | Why it matters |
| --- | --- |
| Burst throughput (first minute) | Context for comparison |
| Steady-state throughput (after 5+ min warmup) | Capacity planning number |
| Throughput ratio (steady/burst) | Throttling severity |
| GPU temperature at steady state | Thermal headroom |
| GPU power consumption (watts) | Operating cost |
| VRAM utilization | Model fit margin |

A throttling ratio below 0.85 (steady-state throughput below 85% of burst) indicates significant thermal constraints that may require active cooling improvements before the machine is deployed as production infrastructure.

*Steady-state performance, cost, and capacity planning* covers how to translate steady-state performance measurements into accurate capacity planning decisions.

## How long should a steady-state benchmark run?

The minimum useful duration for a steady-state AI benchmark is 20 minutes from cold start. The first 5–10 minutes represent the transient phase: GPU clocks ramp to boost frequency, thermal management activates, and power delivery stabilises. Data collected during this phase does not represent production performance.

The steady-state window begins when throughput variation drops below 3% between consecutive 60-second measurement intervals. For most desktop and workstation GPUs, this occurs between 5 and 10 minutes from cold start. For data centre GPUs with active liquid cooling, steady state may arrive within 3 minutes. For laptops with constrained cooling, it may take 15 minutes or longer as thermal throttling progressively reduces clock speeds.

We collect three data series during the benchmark: throughput (samples/second or tokens/second), GPU temperature (°C), and GPU power draw (watts). Plotting all three on the same time axis reveals the thermal story: steady temperature with steady throughput indicates adequate cooling, rising temperature with declining throughput indicates thermal throttling, and power limit capping (visible as a flat ceiling in the power trace) indicates that power delivery, not cooling, is constraining performance.
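To make the 3% criterion and the three data series concrete, here is a minimal monitoring sketch. It assumes an NVIDIA GPU with `nvidia-smi` on the PATH; `run_batch` and `batch_size` are hypothetical placeholders for one benchmark iteration and its batch size, and the 60-second interval, 3% threshold, and 0.85 ratio simply mirror the numbers used in this article.

```python
import subprocess
import time

def query_gpu(fields):
    """Return the requested nvidia-smi fields as a list of floats."""
    out = subprocess.run(
        ['nvidia-smi', f'--query-gpu={fields}',
         '--format=csv,noheader,nounits'],
        capture_output=True, text=True).stdout
    return [float(v) for v in out.strip().split(',')]

def benchmark(run_batch, batch_size, interval=60, max_minutes=30):
    """Sample throughput, temperature, and power once per interval;
    stop when consecutive intervals agree within 3% (steady state)."""
    series = []  # (minute, samples_per_sec, temp_c, power_w)
    prev_rate = None
    for minute in range(max_minutes):
        samples, start = 0, time.time()
        while time.time() - start < interval:
            run_batch()           # placeholder: one forward pass
            samples += batch_size
        rate = samples / (time.time() - start)
        temp_c, power_w = query_gpu('temperature.gpu,power.draw')
        series.append((minute, rate, temp_c, power_w))
        print(f"min {minute}: {rate:.0f}/s {temp_c:.0f}°C {power_w:.0f}W")
        if prev_rate is not None and abs(rate - prev_rate) / prev_rate < 0.03:
            break                 # steady-state window reached
        prev_rate = rate
    burst, steady = series[0][1], series[-1][1]
    print(f"steady/burst ratio: {steady / burst:.2f}"
          + (" (throttling: below 0.85)" if steady / burst < 0.85 else ""))
    return series
```

Plotting the returned series on a shared time axis gives the temperature, power, and throughput picture described above, and the final ratio feeds directly into the benchmark report table.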
The steady-state throughput number is what we use for capacity planning. If a server needs to handle 1,000 inference requests per second and the steady-state benchmark shows 250 requests/second per GPU, you need at minimum 4 GPUs — plus headroom for traffic spikes. Using burst throughput for this calculation would undersize the deployment by 10–25%.

## Interpreting thermal behaviour during benchmarks

The temperature curve during a steady-state benchmark reveals cooling adequacy. Three patterns to recognise:

- A stable temperature below 80°C indicates good cooling — the system can sustain this workload indefinitely.
- A temperature rising to 83–85°C and then stabilising indicates adequate but marginal cooling — the GPU is thermally limited but not throttling.
- A temperature rising above 85°C with throughput declining indicates thermal throttling — the cooling system cannot dissipate the GPU's heat output at full power.

For workstation deployments, we target steady-state temperatures below 80°C to provide thermal headroom for ambient temperature variation (a data centre at 25°C is cooler than a desk-side workstation in a 28°C office). For data centre deployments with controlled ambient temperature, steady-state temperatures up to 83°C are acceptable.

Power consumption during steady state also reveals whether the GPU is operating at its configured power limit or throttling below it. An RTX 4090 at its default 450W TDP that reports only 380W during a sustained AI workload is being limited by something other than the power configuration — typically thermal throttling or a PSU limitation.
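As a closing sketch of that check (again assuming `nvidia-smi`; the 97% margin for "at the power limit" is an assumption of ours, not a documented threshold), comparing reported draw against the enforced limit separates power capping from thermal throttling:

```python
import subprocess

# Sketch: distinguish power capping from thermal throttling at steady state.
# Temperature thresholds follow the guidance in this article; the 3% margin
# on the power limit is an arbitrary assumption.
out = subprocess.run(
    ['nvidia-smi', '--query-gpu=temperature.gpu,power.draw,power.limit',
     '--format=csv,noheader,nounits'],
    capture_output=True, text=True).stdout
temp_c, power_w, limit_w = (float(v) for v in out.strip().split(','))

if power_w >= 0.97 * limit_w:
    print(f"Power-limited: {power_w:.0f}W against a {limit_w:.0f}W cap")
elif temp_c > 85:
    print(f"Thermal throttling: {temp_c:.0f}°C, only {power_w:.0f}W drawn")
elif temp_c >= 83:
    print(f"Marginal cooling: {temp_c:.0f}°C (thermally limited, not throttling)")
else:
    print(f"Cooling OK: {temp_c:.0f}°C at {power_w:.0f}W of {limit_w:.0f}W; "
          "if throughput is still low, suspect the PSU or clock configuration")
```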