The upgrade that fixed nothing
An infrastructure team provisions new GPUs after months of justification. The hardware arrives, gets racked, passes burn-in. The ML team migrates their training pipeline. Throughput barely improves. The infrastructure team insists the hardware is performing to spec. The ML team insists the workload should be faster. Both are correct, and neither can fix the problem alone.
This scenario recurs across organizations because of a structural mismatch: AI performance is a cross-boundary property, but organizational ownership is usually siloed. Hardware teams own procurement, provisioning, and physical infrastructure. Software teams own models, frameworks, and application code. Performance — the thing the business actually cares about — lives in the interaction between these layers, in a gap that neither team is chartered to own.
Why attribution fails
When a training job runs slower than expected, the diagnostic instinct is to isolate the cause. Is the GPU underperforming? Is the framework misconfigured? Is the data pipeline starved? These are reasonable questions, but they assume the root cause lives in one place. For AI workloads, it usually doesn’t.
A training pipeline’s throughput is jointly determined by GPU kernel efficiency, data loading speed, gradient synchronization latency, memory allocation patterns, framework-level scheduling decisions, and driver behavior. Changing any one of these changes the observed performance. The “cause” of poor performance is rarely a single component failure — it’s a system configuration that produces an unfavorable interaction between components.
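This joint determination can be sketched with a toy step-time model (all numbers are illustrative assumptions, not measurements): stages that overlap, such as prefetched data loading and GPU compute, contribute only their maximum, while stages on the critical path add up.

```python
# Toy model of per-step training time. Every number below is an
# illustrative assumption chosen to show the interaction, nothing more.

def step_time_ms(data_load, gpu_compute, grad_sync, framework_overhead):
    # Data loading overlaps with compute via prefetching: the slower stage wins.
    overlapped = max(data_load, gpu_compute)
    # Gradient sync and framework overhead sit on the critical path here.
    return overlapped + grad_sync + framework_overhead

baseline = step_time_ms(data_load=40, gpu_compute=50, grad_sync=15, framework_overhead=5)
# Doubling GPU speed only helps until data loading becomes the binding stage.
upgraded = step_time_ms(data_load=40, gpu_compute=25, grad_sync=15, framework_overhead=5)

print(baseline, upgraded)  # 70 vs 60: a 2x faster GPU yields ~14% faster steps
```

The point of the sketch is that no single term "owns" the result: halving `gpu_compute` moves the bottleneck to `data_load`, and further GPU improvements change nothing.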
We’ve explored this dynamic from the measurement perspective in “why performance emerges from the full hardware-software stack”: a benchmark result reflects the entire execution context, not just the hardware or just the software. The same principle applies to performance problems. They reflect the entire execution context, and attributing them to one layer is usually a simplification.
Hardware upgrades don’t fix software-limited systems
This is the most expensive version of the attribution problem — and the pattern extends well beyond the single-team scenario above.
Common scenarios:
A training job is bottlenecked on CPU-side data preprocessing. The GPUs are idle between batches, waiting for the data loader. New GPUs idle faster, but the throughput improvement is negligible because the constraint was always host-side.
An inference service has high tail latency, but the GPU kernel completes in under 2ms. The latency comes from request queuing, tokenization, and framework overhead — all CPU-bound operations that don’t benefit from faster accelerators.
A distributed training job scales poorly because gradient synchronization saturates the network interconnect. Faster GPUs complete the compute phase sooner, which means they spend an even larger fraction of wall-clock time waiting for communication. Scaling “efficiency” actually decreases with the upgrade.
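The third scenario can be made concrete with a toy calculation (the millisecond figures are assumptions for illustration): if gradient synchronization time is fixed by the interconnect, faster compute enlarges the communication share of every step.

```python
# Illustrative numbers, not measurements: per-step compute time, plus a
# gradient-synchronization time that does not shrink when GPUs get faster.

def comm_fraction(compute_ms, sync_ms):
    # Fraction of each step spent on communication rather than compute.
    return sync_ms / (compute_ms + sync_ms)

old = comm_fraction(compute_ms=100, sync_ms=25)  # 25/125 = 20% communication
new = comm_fraction(compute_ms=50, sync_ms=25)   # 25/75 ≈ 33% communication

# The upgrade halves compute time, but step time only drops from 125ms to
# 75ms, and a larger share of each step now waits on the interconnect.
print(f"comm fraction: {old:.0%} -> {new:.0%}")
```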
In each case, the hardware procurement was defensible on paper. The GPU was older, and newer GPUs are faster. But “the GPU is faster” and “the system will be faster” are different claims, and the gap between them is where budget gets wasted.
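Closing that gap starts with attribution you can trust. One minimal diagnostic, sketched below under simplifying assumptions (the loader and step function are stand-ins for a real pipeline), is to time the training loop's wait-for-data separately from its compute:

```python
import time

def profile_loop(batches, train_step):
    """Attribute wall-clock time to data loading vs. compute.

    `batches` is any iterable of training batches and `train_step` the
    per-batch work; both are hypothetical stand-ins for a real pipeline.
    In a real framework, insert a device synchronize inside train_step
    so asynchronous GPU work is actually counted.
    """
    data_wait = compute = 0.0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)      # time spent waiting on the data pipeline
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)         # time spent in compute
        t2 = time.perf_counter()
        data_wait += t1 - t0
        compute += t2 - t1
    return data_wait, compute

# Simulate a 5ms-per-batch loader feeding a 1ms step: the loader dominates,
# so a faster accelerator could not meaningfully improve throughput.
def slow_loader(n, delay):
    for i in range(n):
        time.sleep(delay)
        yield i

waited, computed = profile_loop(slow_loader(20, 0.005), lambda b: time.sleep(0.001))
print(f"data wait {waited*1000:.0f}ms vs compute {computed*1000:.0f}ms")
```

If `data_wait` dominates, the defensible-on-paper GPU purchase buys almost nothing; the same two numbers, measured before procurement, would have redirected the budget.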
Performance engineering is a discipline, not a role
The organizational response to cross-boundary performance problems is often to assign ownership: “Make the ML platform team responsible for end-to-end performance.” This works on an org chart, but it doesn’t work in practice unless that team has the authority and expertise to intervene across the full stack — from kernel tuning and framework configuration to interconnect topology and driver versions.
Performance engineering in AI systems requires a specific combination of skills: understanding GPU architecture well enough to interpret profiler output, understanding the framework well enough to diagnose dispatch and scheduling behavior, understanding the system well enough to identify data movement bottlenecks, and understanding the workload well enough to know what “good performance” actually looks like.
That skill set rarely lives in a single team. It’s distributed across infrastructure, platform, ML engineering, and operations. When those teams collaborate effectively — sharing profiling data, aligning on performance targets, and diagnosing problems jointly rather than throwing them over organizational walls — performance improves. When they don’t, the organization gets the “upgrade that fixed nothing” pattern on repeat.
As discussed in the context of GPUs as components within larger systems, the accelerator is one element of a system whose performance depends on balance and integration. The organizational structure that manages that system needs to reflect the same integration.
What this implies for benchmarking
If performance is a cross-boundary property, then evaluating performance is also a cross-boundary activity. A benchmark that measures GPU compute throughput in isolation tells you something real about the hardware, but it tells you very little about what the system will deliver under production conditions.
Useful performance measurement requires capturing the interactions: how does throughput change as the data pipeline scales? What happens to latency when the system is under concurrent load? How does gradient synchronization interact with compute overlap at different model sizes?
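The concurrent-load question in particular is cheap to probe. A minimal closed-loop sketch, under the assumption of a single serialized accelerator (the lock and the 2ms `fake_infer` are stand-ins, not a real serving stack), shows how tail latency under concurrency reflects queuing rather than kernel time:

```python
import statistics
import threading
import time
from concurrent.futures import ThreadPoolExecutor

gpu_lock = threading.Lock()

def fake_infer(_):
    # Hypothetical stand-in for a real inference call; the lock models a
    # single accelerator that serializes requests.
    with gpu_lock:
        time.sleep(0.002)

def tail_latency(concurrency, requests=200):
    latencies = []
    def one(i):
        t0 = time.perf_counter()
        fake_infer(i)
        latencies.append(time.perf_counter() - t0)
    with ThreadPoolExecutor(max_workers=concurrency) as ex:
        list(ex.map(one, range(requests)))
    return statistics.quantiles(latencies, n=100)[98]  # ~p99, in seconds

for c in (1, 8, 32):
    print(f"concurrency {c}: p99 ≈ {tail_latency(c)*1000:.1f} ms")
```

The kernel time never changes, yet p99 grows with concurrency because requests queue behind the serialized resource: exactly the kind of system-level behavior an isolated compute benchmark cannot show.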
These are system-level questions, and answering them requires cooperation between the people who understand the hardware, the people who understand the software, and the people who understand the workload. The measurement methodology, just like the performance it measures, spans the full stack.
The gap is the job
In mature organizations, the gap between hardware and software teams isn’t a problem to be solved by restructuring — it’s the actual domain of performance engineering. The engineers who produce the best outcomes are the ones who can work across that gap: reading GPU profiler traces alongside framework-level timelines, correlating network utilization with training throughput curves, connecting application-level SLA requirements to infrastructure-level capacity decisions.
Acknowledging that performance ownership spans teams is the prerequisite for building the diagnostic habits and collaborative practices that actually improve outcomes. Hardware alone won’t do it. Software alone won’t do it. The intersection is where the leverage lives.