There is only one defensible basis for AI performance claims
You can read spec sheets. You can study benchmark leaderboards. You can talk to vendors, compare theoretical peak numbers, and build spreadsheets that make the comparison look tidy. All of those activities can be informative, and none of them constitute a performance measurement.
Performance is what happens when your workload runs through your stack, on your system, under your operating conditions. If you haven’t executed that — or something genuinely representative of it — you don’t have performance data. You have expectations, estimates, and in some cases well-informed guesses. But a guess, however well-informed, is not the same thing as an observation.
This might sound like stating the obvious, and in some engineering disciplines it would be. But in AI infrastructure decisions, the gap between “we estimated performance from external data” and “we measured performance under representative conditions” is where a large fraction of procurement regrets and deployment surprises originates.
“Representative” matters more than “benchmark”
When people hear “empirical measurement,” the immediate response is usually “we need a benchmark.” That’s not wrong, but it skips the harder part.
The hard question isn’t how to measure — tooling for running workloads and collecting metrics exists. The hard question is whether the thing you measured tells you anything about the thing you actually care about. A benchmark that exercises a workload profile, batch configuration, and precision mode that differ meaningfully from your production regime can produce a perfectly valid number that has no bearing on your actual outcome.
We see this pattern regularly: a team evaluates GPU options using a standard model at a standard batch size, gets clean comparative results, makes a procurement decision, and then discovers that their real workload — with its particular sequence length distribution, its concurrency pattern, its framework-specific graph transformations via torch.compile or TensorRT — behaves nothing like the evaluation did. The measurement was real. The representativeness was not.
The lesson isn’t “don’t benchmark.” It’s “make sure the benchmark exercises the regime you’ll actually operate in.” That takes more effort than downloading a standard test, but it’s the difference between information and false confidence.
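One concrete way the representativeness gap shows up is batching: a standard benchmark at a fixed sequence length sees no padding waste, while a production length distribution can spend a substantial fraction of compute on padding. The sketch below uses synthetic, illustrative lengths (an assumption — you would substitute your own logged distribution) and a simplified pad-to-longest-in-batch model:

```python
import random

# Hypothetical logged production sequence lengths. These are synthetic
# placeholders drawn from a log-normal distribution, not real data.
random.seed(0)
production_lengths = [max(1, int(random.lognormvariate(5.0, 0.8)))
                      for _ in range(10_000)]

def padded_tokens(lengths, batch_size):
    """Tokens actually processed when each batch is padded to its
    longest sequence -- a crude proxy for batching overhead."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += max(batch) * len(batch)
    return total

def overhead(lengths, batch_size=32):
    """Ratio of processed tokens to useful tokens; 1.0 means no waste."""
    return padded_tokens(lengths, batch_size) / sum(lengths)

# A fixed-length benchmark measures zero padding overhead by construction...
print(f"fixed-length benchmark: {overhead([128] * len(production_lengths)):.2f}x")
# ...while the real distribution pays for every short sequence in a batch.
print(f"production distribution: {overhead(production_lengths):.2f}x")
```

Both runs are "valid measurements" of the same batching code; only the second says anything about the production regime.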
Why synthetic and peak measurements aren’t enough on their own
Peak metrics and synthetic microbenchmarks have their uses — they can reveal hardware limits, isolate particular subsystems, and help debug specific bottlenecks. What they can’t do is stand in for workload-level performance.
A synthetic memory bandwidth test tells you how fast the memory subsystem can move data under idealized access patterns. It doesn’t tell you how fast your transformer model’s attention mechanism will access memory through the actual kernel your framework selects. A peak FLOPS benchmark tells you the arithmetic ceiling; it doesn’t tell you whether your workload even gets close to that ceiling or spends most of its time limited by something else entirely.
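The gap between the arithmetic ceiling and achieved throughput can be made concrete with a back-of-envelope roofline calculation. The peak numbers below are illustrative placeholders, not any specific GPU's specs:

```python
# Illustrative spec-sheet peaks (placeholders, not a specific device):
peak_tflops = 300.0   # dense FP16 arithmetic peak, TFLOP/s
peak_bw_tbps = 2.0    # memory bandwidth, TB/s

# Ridge point of the roofline: the arithmetic intensity (FLOPs per byte
# moved) below which a workload is memory-bound regardless of ALU speed.
ridge = (peak_tflops * 1e12) / (peak_bw_tbps * 1e12)  # FLOPs / byte
print(f"compute-bound only above ~{ridge:.0f} FLOPs/byte")  # ~150

# A decode-phase matrix-vector product does roughly 2 FLOPs per 2-byte
# FP16 weight read: about 1 FLOP/byte, far below the ridge point.
gemv_intensity = 2 / 2  # FLOPs per byte
attained_tflops = min(peak_tflops, gemv_intensity * peak_bw_tbps)
print(f"GEMV ceiling: ~{attained_tflops:.0f} of {peak_tflops:.0f} TFLOP/s peak")
```

Under these assumed numbers, the memory-bound operation can reach less than 1% of the advertised FLOPS peak, which is why the peak number alone predicts nothing about it.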
The mistake isn’t running these tests. The mistake is stopping there and acting as though you’ve learned the thing you needed to learn. The envelope and the achieved behavior are related, but the relationship is contingent on the full execution context — and as we discussed in the context of spec-sheet limitations, that contingency is where most of the surprises hide.
Performance is workload-bound, not device-bound
A lot of the confusion in AI performance discussions comes from a single implicit assumption: that performance is a stable property of the device that transfers across contexts. “This GPU delivers X TFLOPS” or “this card does Y tokens per second” — these statements sound like device properties, but they’re actually outcomes of specific executions.
Different workloads stress different subsystems. The same workload behaves differently under different batch sizes, sequence lengths, or precision modes. Small changes in the software stack — a framework upgrade that changes which CUDA kernels are dispatched, a driver update that alters scheduling policy — can move the workload between operating regimes without being visible at the configuration level.
In practice, “general performance” is an unreliable abstraction because it assumes stability across contexts that AI workloads don’t provide. When someone tells you a system is “fast,” the right follow-up isn’t skepticism or acceptance — it’s “fast at what, under which stack, measured how?”
Measurement discipline: unglamorous and essential
Empirical measurement is necessary. Measurement discipline is what makes it useful.
Two teams can run “the same model” and get different outcomes because they didn’t measure the same thing — and usually the divergence is in details that seem minor until they aren’t. One run includes warmup in the measurement window, the other excludes it. One captures a transient compilation phase, the other starts timing after graph capture is complete. Caching effects make the first iteration slower and later iterations faster. Batching policy changes under load. Sequence lengths drift across requests. Memory pressure shifts behavior mid-run.
None of these are exotic scenarios. They’re the ordinary texture of executed systems, and they determine whether your measured number is a stable characterization of the system or a particular snapshot that might not reproduce.
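A minimal timing harness makes these choices explicit rather than accidental: warmup iterations are run but excluded from the measurement window, and the result is reported as a distribution rather than a single mean. This is a sketch for CPU-side work; for GPU workloads you would additionally need to synchronize the device before reading the clock (e.g., via your framework's synchronization call):

```python
import time
import statistics

def measure(fn, *, warmup=10, iters=100):
    """Time fn per call, explicitly excluding warmup so that one-time
    costs (compilation, cache population, allocator growth) do not
    leak into the reported numbers."""
    for _ in range(warmup):  # executed, but outside the measurement window
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    # Report percentiles, not just a mean: the mean hides tail behavior.
    return {
        "p50_ms": statistics.median(samples) * 1e3,
        "p95_ms": statistics.quantiles(samples, n=20)[-1] * 1e3,
        "warmup_excluded": warmup,
        "iters": iters,
    }

# Toy CPU workload standing in for an inference call.
stats = measure(lambda: sum(i * i for i in range(50_000)))
print(stats)
```

Note that the returned dictionary records what was excluded alongside what was measured; that metadata is as much a part of the result as the latencies themselves.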
If you want defensible results, you need to be able to answer — clearly, honestly — what was executed, what was counted, and what was excluded. A performance claim that can’t state its assumptions isn’t a claim. It’s a vibe.
From claims to decisions
This isn’t a prescription for a specific benchmark suite or a step-by-step evaluation protocol — those shortcuts are exactly how performance evaluation turns into cargo-cult ritual. Different organizations, workloads, and constraints call for different approaches.
But there’s a posture that’s defensible regardless of tooling: treat performance conclusions as claims that require explicitly stated assumptions. What does “good” mean for your situation? What workload family are you evaluating against? What stack and system constraints are non-negotiable? What operating regime (latency-optimized, throughput-optimized, cost-constrained) are you targeting?
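Those questions can be forced into the open by making the answers a required artifact of every evaluation. The record below is a hypothetical sketch — the field names are illustrative, not a standard schema — but the discipline of filling it in before publishing a number is the point:

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class EvalSpec:
    """Hypothetical record of what a performance claim assumes.
    Field names are illustrative, not an established standard."""
    workload: str   # what was executed (model, data distribution)
    stack: str      # framework / runtime / driver versions
    regime: str     # latency-, throughput-, or cost-optimized
    counted: str    # what the measurement window includes
    excluded: str   # warmup, compilation, cold caches, ...

# Example values are placeholders for one team's hypothetical setup.
spec = EvalSpec(
    workload="chat decode, logged sequence-length distribution",
    stack="torch 2.x + CUDA, torch.compile enabled",
    regime="latency-optimized, batch size <= 8",
    counted="steady-state per-token latency, p50/p95",
    excluded="warmup, graph capture, first-iteration caching",
)
print(asdict(spec))
```

A frozen dataclass is a deliberate choice here: the assumptions travel with the result and cannot be quietly edited after the fact.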
Once those parameters are stated, the evaluation becomes tractable and the conclusions become auditable. Without them, you’re optimizing in the dark, and no amount of precision in the measurement mechanics can compensate for ambiguity in what you’re trying to learn.
The gap between what benchmarks report and what they mean is almost always a gap in stated assumptions. Closing that gap is the real work of performance evaluation — and it’s work that no spec sheet or leaderboard can do for you.