A PyTorch version bump changed which attention kernel was selected. Throughput moved by 30%.
Nothing else changed — same model, same hardware, same configuration, same data. The framework updated, the kernel dispatch logic chose a different code path, and the measured throughput shifted by nearly a third. If you didn’t know about the version change, you’d think the hardware regressed.
This kind of event is not rare. It’s the normal texture of running AI workloads on real software stacks, and it illustrates something that many performance discussions still treat as a footnote: the software stack is not plumbing. It’s a first-class performance component — one that routinely determines whether the hardware’s capabilities are exploited, partially used, or effectively walled off.
Software doesn’t just “use” the GPU — it decides what runs
In performance conversations, hardware gets the headline and software gets treated as background plumbing. “This GPU achieves X throughput” — as if the throughput were a property of the silicon that the software merely observed.
But the software stack makes the decisions that shape execution — and those decisions are where CUDA, frameworks, and ecosystem lock-in become strategically important. The framework decides how the computational graph is constructed, which operators exist, and what fusion opportunities are available. A graph compiler like torch.compile or XLA decides which operations to merge, what memory layout to use, and how to schedule work. The CUDA runtime manages kernel launches, memory allocation, and stream synchronization. And libraries like cuDNN or FlashAttention provide the actual kernel implementations that run on the hardware.
These aren’t incidental choices — they’re the mechanics that determine which execution path the workload takes through the hardware. Change any one of them and you can shift the bottleneck from compute to memory, change the memory access pattern from cache-friendly to cache-hostile, or replace a fast fused kernel with a sequence of slower unfused ones. The hardware is the same. The outcome isn’t.
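To make the mechanics concrete, here is a deliberately toy sketch of version-dependent kernel dispatch. The rules, version cutoffs, and kernel names are invented for illustration — no real framework dispatches this simply — but the structure is the point: the chosen code path is a function of software-visible state, not of the silicon.

```python
# Toy model of kernel dispatch. The heuristics and names below are
# invented for illustration; real frameworks use far richer logic.

def select_kernel(framework_version: tuple, head_dim: int) -> str:
    """Pick an attention kernel the way a dispatcher might: purely
    from software state, with heuristics that change across versions."""
    if framework_version >= (2, 2) and head_dim <= 128:
        return "flash_attention_v2"   # fused, memory-efficient path
    if head_dim <= 128:
        return "mem_efficient_sdpa"   # older fused path
    return "unfused_math_sdpa"        # generic fallback

# Same model (head_dim=128), same "hardware" -- only the version differs.
old = select_kernel((2, 1), head_dim=128)
new = select_kernel((2, 2), head_dim=128)
print(old, "->", new)  # a version bump alone changes the code path
```

A version bump moves the workload from one branch to another, which is exactly the shape of the 30% throughput shift described above.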
Software creates real performance ceilings
The strongest version of this argument is that software doesn’t just influence performance — it can define the ceiling.
If the stack can’t generate an efficient execution plan for your workload, the GPU’s peak capability becomes irrelevant in a very practical sense — another expression of the fact that GPUs are part of a larger system. You’re not “leaving performance on the table” as some optimization opportunity for later; you’re running into a software-defined limit that the hardware cannot override, because the hardware only executes what the software gives it.
We see this concretely when a framework lacks an optimized kernel for a particular operation — a custom attention variant, an unusual normalization, a non-standard activation function. The fallback path might be functionally correct but 5× slower than what the hardware could theoretically sustain. The GPU has the capability. The software doesn’t use it. The measured throughput reflects the software’s ceiling, not the hardware’s.
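The ceiling effect reduces to simple arithmetic. In this sketch the efficiency figures are hypothetical — chosen only to mirror the 5× gap described above — but the relationship holds: delivered throughput is the hardware peak scaled by whatever fraction the selected kernel can sustain.

```python
def effective_ceiling(hw_peak_tokens_per_s: float, kernel_efficiency: float) -> float:
    """Throughput the stack can actually deliver: the hardware peak
    scaled by the fraction of it the selected kernel sustains."""
    return hw_peak_tokens_per_s * kernel_efficiency

HW_PEAK = 10_000.0  # hypothetical hardware peak, tokens/s

# Efficiency values are invented for illustration.
optimized = effective_ceiling(HW_PEAK, kernel_efficiency=0.60)  # tuned fused kernel
fallback = effective_ceiling(HW_PEAK, kernel_efficiency=0.12)   # generic fallback, ~5x slower

print(f"optimized ceiling: {optimized:.0f} tok/s")
print(f"fallback ceiling:  {fallback:.0f} tok/s")
```

Buying a faster GPU raises `hw_peak_tokens_per_s`; only the software can raise `kernel_efficiency`.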
Two frameworks can produce meaningfully different performance on the same hardware with the same model precisely because of this: they’re making different execution decisions, and those decisions create different ceilings.
Drivers and runtimes: invisible and non-trivial
People tend to underestimate the lower layers because they’re invisible at the model level. You define your model in Python, call model.forward(), and something happens on the GPU. The details of kernel scheduling, memory allocation strategy, stream synchronization, and launch overhead feel like someone else’s problem.
They are — until they become your performance bottleneck.
CUDA driver updates change scheduling heuristics. Runtime versions alter memory pool behavior, kernel selection priorities, and synchronization semantics. Even without changing a single line of application code, a driver upgrade can shift where time is spent during execution. We’ve seen cases where a driver update improved throughput on one model family and degraded it on another, because the scheduling change favored one operator pattern at the expense of a different one.
This doesn’t mean newer is always worse (or always better). It means that stability assumptions about the lower stack layers are a form of implicit performance claim — and like all performance claims, they need to be validated rather than assumed. As the discussion of why identical GPUs diverge illustrates, software version differences are one of the most common and most underestimated sources of performance variance in real deployments.
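One practical way to validate rather than assume is to record a fingerprint of the lower stack next to every measurement. A minimal sketch follows — the field names are my own choice, not a standard, and it degrades gracefully on machines without PyTorch or a GPU:

```python
import platform

def stack_fingerprint() -> dict:
    """Capture the software-stack versions that can silently move a
    benchmark. Degrades gracefully when torch or CUDA is absent."""
    fp = {"python": platform.python_version()}
    try:
        import torch
        fp["torch"] = torch.__version__
        fp["cuda_runtime"] = torch.version.cuda or "cpu-only build"
        fp["cudnn"] = str(torch.backends.cudnn.version())
        if torch.cuda.is_available():
            fp["device"] = torch.cuda.get_device_name(0)
    except ImportError:
        fp["torch"] = "unavailable"
    return fp

print(stack_fingerprint())
```

Attach this dictionary to every benchmark result, and a later divergence can be traced to a version delta instead of being blamed on the hardware.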
Framework-level decisions change what runs, not just how
At the framework level, the effects become even more visible. Frameworks don’t just execute your model — they transform it.
Graph-level optimizations decide which operators to fuse, how to lay out tensors in memory, whether to batch certain operations, and which backend kernels to dispatch. A PyTorch model running with torch.compile may take a fundamentally different execution path than the same model running eagerly. A switch from a standard SDPA implementation to FlashAttention v2 can change memory access patterns, reduce peak memory usage, and substantially alter throughput — all without changing the model’s mathematical definition.
So when a benchmark reports “X tokens per second on model Y,” the implicit context includes which framework version was running, what compilation or optimization passes were applied, and which kernels ended up executing. All of that is part of the measurement. If it’s not part of the report, the number is incomplete.
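As a sketch of what “the number plus its context” could look like as data — the required fields below are one reasonable choice, not a standard, and the example values are hypothetical:

```python
# One reasonable (not standardized) set of context fields a throughput
# number needs before it can be interpreted.
REQUIRED_CONTEXT = {
    "framework_version", "compile_settings", "kernel_libraries",
    "driver_version", "model", "hardware",
}

def is_complete(report: dict) -> bool:
    """A throughput number is only interpretable with its software context."""
    return REQUIRED_CONTEXT <= report.keys()

# Example values are hypothetical, for illustration only.
report = {
    "model": "model-Y",
    "hardware": "GPU-A",
    "tokens_per_second": 4_200,   # the headline number
    "framework_version": "torch 2.2.1",
    "compile_settings": "torch.compile, mode=default",
    "kernel_libraries": "FlashAttention v2, cuDNN 8.9",
    "driver_version": "535.104",
}
print("complete" if is_complete(report) else "incomplete")
```

A report that fails this check is still an observation — it just can't support the comparison it implies.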
What this means for credible comparisons
If the software stack is part of the performance outcome — and it is — then credible comparisons need to include it. Operationally, that means performance ownership spans hardware and software teams.
A comparison that says “GPU A is faster than GPU B on model X” but doesn’t report the framework version, compilation settings, kernel libraries, and driver versions for both systems has told you which system was faster under those unspecified conditions. It hasn’t told you why, and it hasn’t told you whether the result would hold if you controlled the software layer.
This isn’t an argument for dismissing comparisons that don’t report every detail. It’s an argument for interpreting them correctly: as observations of specific executions, not as universal claims about hardware. The more context is visible, the more useful the comparison becomes. And the most useful context to make visible is usually the software stack — because it’s the part that people most often treat as background, and it’s the part that most often explains divergent results.
Building on the principle that performance emerges from the full stack, treating the software layer as first-class isn’t an ideology — it’s a practical requirement for avoiding conclusions that won’t survive a deployment.