A PyTorch version bump changed which attention kernel was selected. Throughput moved by 30%.
Nothing else changed — same model, same hardware, same configuration, same data. The framework updated, the kernel dispatch logic chose a different code path, and the measured throughput shifted by nearly a third. If you didn’t know about the version change, you’d think the hardware regressed.
This kind of event is not rare. It’s the normal texture of running AI workloads on real software stacks, and it illustrates something that many performance discussions still treat as a footnote: the software stack is not plumbing. It’s a first-class performance component — one that routinely determines whether the hardware’s capabilities are exploited, partially used, or effectively walled off.
Software doesn’t just “use” the GPU — it decides what runs
In performance conversations, hardware gets the headline and software gets treated as background plumbing. “This GPU achieves X throughput” — as if the throughput were a property of the silicon that the software merely observed.
But the software stack makes the decisions that shape execution — and those decisions are where CUDA, frameworks, and ecosystem lock-in become strategically important. The framework decides how the computational graph is constructed, which operators exist, and what fusion opportunities are available. A graph compiler like torch.compile or XLA decides which operations to merge, what memory layout to use, and how to schedule work. The CUDA runtime manages kernel launches, memory allocation, and stream synchronization. And libraries like cuDNN or FlashAttention provide the actual kernel implementations that run on the hardware.
These aren’t incidental choices — they’re the mechanics that determine which execution path the workload takes through the hardware. Change any one of them and you can shift the bottleneck from compute to memory, change the memory access pattern from cache-friendly to cache-hostile, or replace a fast fused kernel with a sequence of slower unfused ones. The hardware is the same. The outcome isn’t.
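To make the mechanics concrete, here is a deliberately toy sketch of version-dependent kernel dispatch. The rules, version cutoffs, and kernel names are invented for illustration — no real framework dispatches this simply — but the structure is the point: the chosen code path is a function of software-visible state, not of the silicon.

```python
# Toy model of kernel dispatch. The heuristics and names below are
# invented for illustration; real frameworks use far richer logic.

def select_kernel(framework_version: tuple, head_dim: int) -> str:
    """Pick an attention kernel the way a dispatcher might: purely
    from software state, with heuristics that change across versions."""
    if framework_version >= (2, 2) and head_dim <= 128:
        return "flash_attention_v2"   # fused, memory-efficient path
    if head_dim <= 128:
        return "mem_efficient_sdpa"   # older fused path
    return "unfused_math_sdpa"        # generic fallback

# Same model (head_dim=128), same "hardware" -- only the version differs.
old = select_kernel((2, 1), head_dim=128)
new = select_kernel((2, 2), head_dim=128)
print(old, "->", new)  # a version bump alone changes the code path
```

A version bump moves the workload from one branch to another, which is exactly the shape of the 30% throughput shift described above.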
Software creates real performance ceilings
The strongest version of this argument is that software doesn’t just influence performance — it can define the ceiling.
If the stack can’t generate an efficient execution plan for your workload, the GPU’s peak capability becomes irrelevant in a very practical sense — another expression of the fact that GPUs are part of a larger system. You’re not “leaving performance on the table” as some optimization opportunity for later; you’re running into a software-defined limit that the hardware cannot override, because the hardware only executes what the software gives it.
We see this concretely when a framework lacks an optimized kernel for a particular operation — a custom attention variant, an unusual normalization, a non-standard activation function. The fallback path might be functionally correct but 5× slower than what the hardware could theoretically sustain. The GPU has the capability. The software doesn’t use it. The measured throughput reflects the software’s ceiling, not the hardware’s.
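The ceiling effect reduces to simple arithmetic. In this sketch the efficiency figures are hypothetical — chosen only to mirror the 5× gap described above — but the relationship holds: delivered throughput is the hardware peak scaled by whatever fraction the selected kernel can sustain.

```python
def effective_ceiling(hw_peak_tokens_per_s: float, kernel_efficiency: float) -> float:
    """Throughput the stack can actually deliver: the hardware peak
    scaled by the fraction of it the selected kernel sustains."""
    return hw_peak_tokens_per_s * kernel_efficiency

HW_PEAK = 10_000.0  # hypothetical hardware peak, tokens/s

# Efficiency values are invented for illustration.
optimized = effective_ceiling(HW_PEAK, kernel_efficiency=0.60)  # tuned fused kernel
fallback = effective_ceiling(HW_PEAK, kernel_efficiency=0.12)   # generic fallback, ~5x slower

print(f"optimized ceiling: {optimized:.0f} tok/s")
print(f"fallback ceiling:  {fallback:.0f} tok/s")
```

Buying a faster GPU raises `hw_peak_tokens_per_s`; only the software can raise `kernel_efficiency`.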
Two frameworks can produce meaningfully different performance on the same hardware with the same model precisely because of this: they’re making different execution decisions, and those decisions create different ceilings.
Drivers and runtimes: invisible and non-trivial
People tend to underestimate the lower layers because they’re invisible at the model level. You define your model in Python, call model.forward(), and something happens on the GPU. The details of kernel scheduling, memory allocation strategy, stream synchronization, and launch overhead feel like someone else’s problem.
They are — until they become your performance bottleneck.
CUDA driver updates change scheduling heuristics. Runtime versions alter memory pool behavior, kernel selection priorities, and synchronization semantics. Even without changing a single line of application code, a driver upgrade can shift where time is spent during execution. We’ve seen cases where a driver update improved throughput on one model family and degraded it on another, because the scheduling change favored one operator pattern at the expense of a different one.
This doesn’t mean newer is always worse (or always better). It means that stability assumptions about the lower stack layers are a form of implicit performance claim — and like all performance claims, they need to be validated rather than assumed. As the discussion of why identical GPUs diverge illustrates, software version differences are one of the most common and most underestimated sources of performance variance in real deployments.
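One practical way to validate rather than assume is to record a fingerprint of the lower stack next to every measurement. A minimal sketch follows — the field names are my own choice, not a standard, and it degrades gracefully on machines without PyTorch or a GPU:

```python
import platform

def stack_fingerprint() -> dict:
    """Capture the software-stack versions that can silently move a
    benchmark. Degrades gracefully when torch or CUDA is absent."""
    fp = {"python": platform.python_version()}
    try:
        import torch
        fp["torch"] = torch.__version__
        fp["cuda_runtime"] = torch.version.cuda or "cpu-only build"
        fp["cudnn"] = str(torch.backends.cudnn.version())
        if torch.cuda.is_available():
            fp["device"] = torch.cuda.get_device_name(0)
    except ImportError:
        fp["torch"] = "unavailable"
    return fp

print(stack_fingerprint())
```

Attach this dictionary to every benchmark result, and a later divergence can be traced to a version delta instead of being blamed on the hardware.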
Framework-level decisions change what runs, not just how
At the framework level, the effects become even more visible. Frameworks don’t just execute your model — they transform it.
Graph-level optimizations decide which operators to fuse, how to lay out tensors in memory, whether to batch certain operations, and which backend kernels to dispatch. A PyTorch model running with torch.compile may take a fundamentally different execution path than the same model running eagerly. A switch from a standard SDPA implementation to FlashAttention v2 can change memory access patterns, reduce peak memory usage, and substantially alter throughput — all without changing the model’s mathematical definition.
So when a benchmark reports “X tokens per second on model Y,” the implicit context includes which framework version was running, what compilation or optimization passes were applied, and which kernels ended up executing. All of that is part of the measurement. If it’s not part of the report, the number is incomplete.
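As a sketch of what “the number plus its context” could look like as data — the required fields below are one reasonable choice, not a standard, and the example values are hypothetical:

```python
# One reasonable (not standardized) set of context fields a throughput
# number needs before it can be interpreted.
REQUIRED_CONTEXT = {
    "framework_version", "compile_settings", "kernel_libraries",
    "driver_version", "model", "hardware",
}

def is_complete(report: dict) -> bool:
    """A throughput number is only interpretable with its software context."""
    return REQUIRED_CONTEXT <= report.keys()

# Example values are hypothetical, for illustration only.
report = {
    "model": "model-Y",
    "hardware": "GPU-A",
    "tokens_per_second": 4_200,   # the headline number
    "framework_version": "torch 2.2.1",
    "compile_settings": "torch.compile, mode=default",
    "kernel_libraries": "FlashAttention v2, cuDNN 8.9",
    "driver_version": "535.104",
}
print("complete" if is_complete(report) else "incomplete")
```

A report that fails this check is still an observation — it just can't support the comparison it implies.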
What this means for credible comparisons
If the software stack is part of the performance outcome — and it is — then credible comparisons need to include it. Operationally, that means performance ownership spans hardware and software teams.
A comparison that says “GPU A is faster than GPU B on model X” but doesn’t report the framework version, compilation settings, kernel libraries, and driver versions for both systems has told you which system was faster under those unspecified conditions. It hasn’t told you why, and it hasn’t told you whether the result would hold if you controlled the software layer.
This isn’t an argument for dismissing comparisons that don’t report every detail. It’s an argument for interpreting them correctly: as observations of specific executions, not as universal claims about hardware. The more context is visible, the more useful the comparison becomes. And the most useful context to make visible is usually the software stack — because it’s the part that people most often treat as background, and it’s the part that most often explains divergent results.
Building on the principle that performance emerges from the full stack, treating the software layer as first-class isn’t an ideology — it’s a practical requirement for avoiding conclusions that won’t survive a deployment.