A driver update changes kernel scheduling. Throughput drops 12%. Nobody touched the model.
This is the kind of event that makes sense only after you abandon one mental model and replace it with another. Under the old model — “hardware has performance, software just uses it” — a driver update shouldn’t change throughput by double digits. Under the replacement model, it’s not only possible but predictable, because performance was never a property of the hardware alone.
AI performance is an emergent property of the hardware, the software stack that drives it, and the workload that shapes what both of them do. You don’t get a performance outcome by summing up layer contributions. You get it from interactions — and those interactions can be surprisingly sensitive to changes in any single layer. The stack, not the device, is the correct unit of performance reasoning for AI systems.
Emergence, not aggregation
When we say performance “emerges” from the hardware × software stack, we mean something specific: the final throughput, latency, or efficiency number is not decomposable into independent contributions from hardware and software. You can’t say “the GPU contributed 70% of the performance and PyTorch contributed 30%.” That’s not how the system works.
Consider what happens during a single forward pass of a transformer model. The framework decides how to lower the computation graph — which operators to fuse, which memory layout to use, which kernels to dispatch. The CUDA runtime schedules those kernels onto the device. cuDNN or a custom attention kernel (say, FlashAttention) determines the actual execution path on the hardware. The memory subsystem serves data according to access patterns that the software stack created. Thread scheduling, synchronization points, and stream management all shape how the hardware’s resources are actually consumed.
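The dispatch step above can be sketched as a toy model. This is an illustration, not any real framework's dispatcher: the names and eligibility rules are hypothetical, but they mimic the kind of checks a real one performs, and they show how "which kernel runs" is a stack decision rather than a hardware property.

```python
# Toy model (hypothetical, illustration only): the framework picks an
# attention kernel based on stack state, not on hardware identity alone.
from dataclasses import dataclass

@dataclass(frozen=True)
class StackState:
    dtype: str             # e.g. "fp16", "fp32"
    head_dim: int          # attention head dimension
    flash_available: bool  # did the installed library ship this kernel?

def select_attention_kernel(s: StackState) -> str:
    """Mimics the eligibility checks a real dispatcher performs."""
    if s.flash_available and s.dtype == "fp16" and s.head_dim <= 128:
        return "flash_attention"      # fused, memory-efficient path
    if s.dtype in ("fp16", "fp32"):
        return "math_attention"       # unfused fallback: more HBM traffic
    return "reference_attention"      # slowest, most general path

# Same GPU, same model: a library upgrade alone flips the executed kernel.
before = select_attention_kernel(StackState("fp16", 128, flash_available=False))
after = select_attention_kernel(StackState("fp16", 128, flash_available=True))
print(before, "->", after)  # math_attention -> flash_attention
```

Nothing about the device changed between the two calls; only the installed library did, and the execution path followed the library.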
Change any one of those decisions — swap a kernel, alter a graph transformation pass, modify the runtime’s allocation strategy — and you’re not just “tuning.” You’re changing what the hardware actually does, which changes where the bottleneck lands, which changes the measured outcome. The coupling between layers is where the performance story lives, not in any single layer’s properties.
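A minimal roofline-style sketch makes the bottleneck shift concrete. The device figures below are hypothetical round numbers, chosen only to show the mechanism: a fusion pass raises arithmetic intensity, which moves the same kernel from the memory-bound region to the compute-bound region.

```python
# Minimal roofline sketch (hypothetical numbers): attainable throughput is
# bounded by whichever resource the executed code actually saturates.
def attainable_tflops(peak_tflops, bandwidth_tbs, arithmetic_intensity):
    """arithmetic_intensity: FLOPs performed per byte moved from memory."""
    return min(peak_tflops, bandwidth_tbs * arithmetic_intensity)

PEAK, BW = 300.0, 2.0  # hypothetical device: 300 TFLOPS peak, 2 TB/s HBM

# Unfused kernels re-read intermediates: low FLOPs/byte, memory-bound.
unfused = attainable_tflops(PEAK, BW, arithmetic_intensity=40)    # 80.0
# A fusion pass keeps data in registers: higher intensity, compute-bound.
fused = attainable_tflops(PEAK, BW, arithmetic_intensity=200)     # 300.0
print(unfused, fused)
```

The hardware's peak numbers never changed; a software graph transformation changed which of them governs the result.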
Why hardware-only reasoning keeps producing surprises
Hardware matters. Nobody serious argues otherwise, and framing the argument as “hardware vs. software” misses the point entirely. The problem is more specific: treating hardware as the explanatory unit for performance outcomes.
That looks like “we upgraded from GPU A to GPU B — we should see a proportional speedup” and then being puzzled when the gain is smaller than expected, varies by model, or disappears under sustained load. It looks like comparing two systems by their theoretical FLOPS ratio and finding that the actual throughput ratio looks nothing like it. It looks like a team purchasing hardware based on a single spec-sheet advantage and then discovering that the workload spends most of its time somewhere the spec sheet doesn’t describe.
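A worked example shows why the spec-sheet ratio misleads. All figures here are hypothetical: suppose GPU B advertises 4x the FLOPS of GPU A but only 1.5x the memory bandwidth, and the workload spends 80% of its time memory-bound. An Amdahl-style blend of the two ratios predicts the realistic speedup.

```python
# Hypothetical comparison: GPU B has 4x the FLOPS of GPU A but only 1.5x
# the memory bandwidth. If 80% of runtime is memory-bound, the spec-sheet
# FLOPS ratio badly overpredicts the real speedup.
def expected_speedup(mem_fraction, bw_ratio, flops_ratio):
    """Amdahl-style blend: each time fraction shrinks by its own ratio."""
    compute_fraction = 1.0 - mem_fraction
    return 1.0 / (mem_fraction / bw_ratio + compute_fraction / flops_ratio)

s = expected_speedup(mem_fraction=0.8, bw_ratio=1.5, flops_ratio=4.0)
print(round(s, 2))  # 1.71 -- nowhere near the 4x FLOPS headline
```

The 4x headline number is real, but the workload only touches it for 20% of its runtime; the other 80% is governed by a spec the purchase decision never examined.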
We’ve found that when a team is surprised by an AI performance result, the explanation almost always involves a software or system-level interaction that the hardware-only model can’t account for. The hardware didn’t fail to perform — the hardware was never the whole story.
The stack is the performance definition
A pragmatic shift is to stop asking “what GPU is this?” and start asking “what execution stack is this?” — because the stack determines what actually runs.
The drivers and runtime decide how work is scheduled, synchronized, and allocated. The framework decides which operators execute and how graphs are partitioned. Libraries and kernels decide what instructions actually hit the device. The system topology — PCIe layout, NUMA configuration, NVLink connectivity — decides how data moves between components. And the workload itself decides what gets stressed, for how long, and in what pattern. Taken together, the list is a reminder that GPUs are part of a larger system, not isolated performance islands.
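In practice, “asking what execution stack is this” can be as simple as recording stack context next to every benchmark number. The sketch below is one way to do that; the `torch` import and the `nvidia-smi` call are guarded because both are assumptions about what is installed on a given machine.

```python
# Sketch: record execution-stack context alongside a benchmark result.
# The torch import and nvidia-smi invocation are optional dependencies
# and may be absent; the function degrades gracefully without them.
import platform
import subprocess
import sys

def capture_stack_context() -> dict:
    ctx = {
        "os": platform.platform(),
        "python": sys.version.split()[0],
    }
    try:
        import torch  # framework + CUDA runtime versions, if installed
        ctx["torch"] = torch.__version__
        ctx["cuda"] = torch.version.cuda or "cpu-only build"
    except ImportError:
        ctx["torch"] = "not installed"
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version",
             "--format=csv,noheader"],
            capture_output=True, text=True, timeout=5,
        )
        ctx["driver"] = out.stdout.strip() or "unknown"
    except (FileNotFoundError, subprocess.TimeoutExpired):
        ctx["driver"] = "unknown"
    return ctx

print(capture_stack_context())
```

A result stored without this context is the “datapoint without interpretation” described above: six months later, nobody can say which stack produced it.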
None of that is optional detail. Those are the mechanisms that produce the number. When someone presents a performance result without stack context, the number isn’t wrong — the claim is just incomplete, in the same way that a benchmark result with hidden methodology is a datapoint without interpretation. As we explored in thinking about why identical GPUs can produce different results, the execution context is frequently the dominant source of variance, not the hardware identity.
What this means for performance claims and comparisons
If performance is a stack property, then performance claims need stack context to be meaningful. “This GPU delivers 500 tokens per second” is not a hardware statement — it’s a system statement, and the system includes the framework version, the CUDA runtime, the kernel libraries, the model configuration, the batch size, and the operating regime.
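One way to enforce that discipline is to make the stack context part of the claim's data structure, so a bare number can't be recorded at all. The field names and version strings below are illustrative, not a standard schema.

```python
# Illustrative schema: a throughput number only becomes a claim once its
# stack context is attached. Field names and example values are made up.
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class PerfClaim:
    tokens_per_sec: float
    gpu: str
    framework: str     # e.g. "torch==2.3.1"
    runtime: str       # e.g. "CUDA 12.4, driver 550.54"
    kernels: str       # e.g. "flash-attn 2.5.8"
    model_config: str  # model size and precision
    batch_size: int
    regime: str        # e.g. "steady-state, 10 min sustained"

claim = PerfClaim(500.0, "hypothetical-gpu", "torch==2.3.1",
                  "CUDA 12.4", "flash-attn 2.5.8",
                  "7B fp16", 32, "steady-state")
print(asdict(claim))
```

Two claims with the same `tokens_per_sec` but different stack fields are not the same result, and the schema makes that visible instead of leaving it implicit.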
Teams that adopt the stack model gain a practical advantage: their performance discussions get calmer and more accurate, because performance ownership spans hardware and software teams. Discrepancies stop looking like mysteries and start looking like the natural consequence of running different stacks. Vendors’ claims become interpretable rather than confusing. And capacity planning shifts from “buy the GPU with the best number” to “validate performance under our actual execution context” — which is harder, but much more likely to produce useful results.
For a deeper look at how the software layer specifically creates performance ceilings and pathways, see our discussion of the software stack as a first-class performance component.
Not reductionism in reverse
One misreading of this argument is “so you’re saying hardware doesn’t matter and it’s all software.” That’s wrong in the opposite direction.
Hardware determines the envelope of what’s possible. A GPU with more memory can serve larger models; a device with higher bandwidth can move data faster when the access patterns are favorable; architectural features like hardware support for specific precision formats create capabilities the software stack can exploit. None of that is diminished by recognizing that the software stack mediates how those capabilities are realized.
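The memory-envelope point is easy to make quantitative with back-of-envelope arithmetic. The figures below are hypothetical but representative: a 70B-parameter model in fp16 needs roughly 140 GB for weights alone, which no amount of software tuning will fit into an 80 GB device.

```python
# Back-of-envelope envelope check (hypothetical figures): do the model's
# weights even fit in device memory, before any software tuning?
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weights_fit(n_params, precision, device_mem_gb, overhead_gb=10):
    """Weights only; activations and KV cache need more (hence the pad)."""
    need_gb = n_params * BYTES_PER_PARAM[precision] / 1e9 + overhead_gb
    return need_gb <= device_mem_gb

# 70B params in fp16 is ~140 GB of weights alone:
print(weights_fit(70e9, "fp16", 80))  # False: exceeds an 80 GB device
print(weights_fit(70e9, "int4", 80))  # True: ~35 GB + overhead fits
```

Note the interaction even here: the int4 case only works if the software stack supports that precision, which is exactly the hardware-capability-mediated-by-software coupling the section describes.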
The point is not that one layer matters more than the other. It’s that the outcome belongs to the interaction, and reasoning about the layers in isolation produces incomplete and often misleading conclusions. If your performance model doesn’t include the stack, it’s not a performance model — it’s a hardware description with an implicit hope that everything else will be fine. In our experience, that hope is not a reliable engineering strategy.