TOPS Performance Across the Hardware-Software Stack: Why Identical TOPS Deliver Different Throughput

TOPS is a marketing metric that obscures hardware comparison

TOPS — Tera Operations Per Second — appears on every AI chip spec sheet. It sounds like a direct measure of AI processing capability. It is not. TOPS is a peak throughput figure for integer operations (INT8 or INT4) measured under ideal conditions that no real workload achieves.

This piece focuses on the hardware-software stack as the reason identical TOPS scores deliver different throughput in deployment. Two adjacent questions live in companion articles: what TOPS on the spec sheet measures (and why no transformation of the headline number predicts performance) is covered in "AI TOPS on the spec sheet"; how TOPS interacts with GPU utilization as a metric is covered in "AI TOPS and GPU utilization". This piece treats the stack (kernels, runtime, memory hierarchy, batching) as the dominant explanation for the gap between TOPS and observed tokens-per-second.

The metric is useful for understanding the theoretical ceiling of a chip’s integer compute. It is not a basis for comparing chips, selecting hardware, or predicting inference speed. We see this confusion regularly in procurement conversations: a buyer compares two accelerators by their headline TOPS, picks the larger number, and then discovers months later that the deployed model runs slower than on the rejected option. The number was real. The framing was wrong.

What does TOPS actually measure?

TOPS counts integer multiply-accumulate operations per second at the chip’s rated precision and power consumption. The calculation is:

TOPS = (peak_INT8_ops_per_cycle_per_core × clock_frequency_in_Hz × core_count) / 10^12

This measures theoretical peak throughput assuming 100% utilization of all compute units simultaneously, zero memory bandwidth bottleneck, perfect data distribution across compute units, and no overhead for data movement, control flow, or communication. None of those assumptions hold in real workloads — and the gap between them and reality is where the hardware × software stack reasserts itself. Performance is not a property of the silicon alone; it is what emerges when a particular model, on a particular runtime (PyTorch with torch.compile, TensorRT, ONNX Runtime), feeds a particular memory hierarchy under realistic batch and context sizes.
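As a minimal sketch of the arithmetic behind a headline figure, the snippet below multiplies per-core operations per cycle by clock rate and core count, then scales to tera-operations. The function name and the example part (1024 ops per cycle per core, 1.5 GHz, 64 cores) are hypothetical and do not describe any real chip.

```python
def peak_tops(ops_per_cycle_per_core: float, clock_hz: float, core_count: int) -> float:
    """Theoretical peak in tera-operations per second (1 TOPS = 1e12 ops/s)."""
    return ops_per_cycle_per_core * clock_hz * core_count / 1e12


# Hypothetical part (not a real product): 1024 INT8 ops per cycle per core,
# 1.5 GHz clock, 64 cores.
hypothetical_peak = peak_tops(1024, 1.5e9, 64)
print(f"peak: {hypothetical_peak:.1f} TOPS")  # prints "peak: 98.3 TOPS"
```

Nothing in this calculation refers to memory, kernels, or the runtime, which is why the result is a ceiling rather than a prediction.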

Why TOPS comparisons mislead

There are four recurring patterns we see when teams reason from TOPS numbers in isolation.

Precision mismatches. 100 TOPS at INT8 is not the same as 100 TOPS at FP16 or BF16. Many TOPS figures are quoted at INT4, which provides roughly 2× the TOPS of INT8 but requires more aggressive quantization that may not be acceptable for your model quality requirements. A headline number stripped of its precision regime is decorative, not diagnostic.

Utilization ceilings. Observed real-world utilization typically sits in the 30–70% range of peak TOPS depending on workload structure, memory access patterns, and model architecture (observed pattern across our infrastructure engagements; not a benchmarked rate). The remaining headroom rarely closes without substantial kernel-level work — FlashAttention, kernel fusion in TensorRT or XLA, NCCL topology tuning — and even then it closes asymmetrically across operator types.

Memory bandwidth ignored. A chip with 200 TOPS and 50 GB/s memory bandwidth will be memory-bandwidth-bound for most LLM inference workloads long before it approaches the compute ceiling. This is the most common mismatch in 2025–2026 procurement: transformer decode is dominated by KV-cache reads, and additional TOPS on a bandwidth-starved part are inert.
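A rough way to see the bandwidth ceiling on decode is to divide memory bandwidth by the bytes that must be read per decode step. The sketch below assumes a deliberately simple traffic model (every weight read once per step, each sequence's full KV-cache read once per step) and a hypothetical model shape; the function names and all numbers are illustrative assumptions, not measurements.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes of keys plus values held for one sequence at the given context."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem


def decode_ceiling_tokens_per_s(bandwidth_gb_s: float, weight_bytes: float,
                                kv_bytes_per_seq: float, batch_size: int = 1) -> float:
    """Upper bound on decode throughput if memory traffic is the only limit.

    Simplifying assumption: each decode step reads every weight once (shared
    by the whole batch) and each sequence's entire KV-cache once.
    """
    bytes_per_step = weight_bytes + batch_size * kv_bytes_per_seq
    steps_per_s = bandwidth_gb_s * 1e9 / bytes_per_step
    return steps_per_s * batch_size  # one token per sequence per step


# Hypothetical model: 32 layers, 8 KV heads of dimension 128, FP16 cache,
# 7e9 parameters stored at one byte each (INT8 weights), 8k context.
kv = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, context_len=8192)
for bandwidth in (50, 400):
    ceiling = decode_ceiling_tokens_per_s(bandwidth, 7e9, kv, batch_size=16)
    print(f"{bandwidth} GB/s -> at most {ceiling:.1f} tokens/s across the batch")
```

TOPS does not appear anywhere in this bound. Under these assumptions the ceiling scales linearly with bandwidth, so extra compute on the 50 GB/s part changes nothing.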

Different operation types. The operations that dominate transformer inference — large matrix multiplications, attention with growing KV-cache, all-reduce across NVLink or PCIe — stress compute and memory bandwidth differently than the small dot products that dominate published INT8 benchmarks.

A worked TOPS comparison

Chip	Quoted spec	Reality on LLM inference at 8k context
Chip A	100 TOPS INT8, 100 GB/s bandwidth	Memory-bound; achieved fraction of peak well below 30%
Chip B	60 TOPS FP16, 400 GB/s bandwidth	Compute-bound; materially faster end-to-end

Chip B has the smaller TOPS number and wins. The reason is structural, not anecdotal: at 8k context, decode is dominated by streaming the KV-cache through HBM. Bandwidth, not TOPS, sets the ceiling. The performance-emerges-from-the-stack framing explains why any single-number metric (TOPS, FLOPS, even memory bandwidth in isolation) is an incomplete characterization of AI hardware performance.

The deeper point is that the same TOPS number lands differently depending on the software stack sitting on top of it. A fixed GPU runs at very different effective throughput depending on the driver version, the runtime (TensorRT-LLM, vLLM, ONNX Runtime), the compiler path (torch.compile, XLA), and the framework’s memory and scheduling behaviour. Swap the runtime under the same silicon and the achieved-over-peak ratio moves, sometimes by a factor of two, without a single hardware change. That is why performance has to be reasoned about as a stack rather than read off a spec line: the layers interact, and the interaction is where real-world AI performance is actually decided.
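The two-chip comparison can be sketched with a simple roofline bound: attainable throughput is the lower of the compute peak and bandwidth multiplied by arithmetic intensity. The intensity of 200 operations per byte used here is a hypothetical value chosen for illustration, not a measured property of any workload, and the function names are likewise illustrative.

```python
def bandwidth_roof_tops(bandwidth_gb_s: float, ops_per_byte: float) -> float:
    """Throughput the memory system can feed, in tera-operations per second."""
    return bandwidth_gb_s * 1e9 * ops_per_byte / 1e12


def attainable_tops(peak_tops: float, bandwidth_gb_s: float, ops_per_byte: float) -> float:
    """Roofline bound: the lower of the compute roof and the bandwidth roof."""
    return min(peak_tops, bandwidth_roof_tops(bandwidth_gb_s, ops_per_byte))


def limiting_resource(peak_tops: float, bandwidth_gb_s: float, ops_per_byte: float) -> str:
    roof = bandwidth_roof_tops(bandwidth_gb_s, ops_per_byte)
    return "memory-bound" if roof < peak_tops else "compute-bound"


ASSUMED_OPS_PER_BYTE = 200  # hypothetical arithmetic intensity, for illustration only

# (peak TOPS, bandwidth in GB/s) taken from the comparison table above
chips = {"Chip A": (100, 100), "Chip B": (60, 400)}
for name, (peak, bandwidth) in chips.items():
    reach = attainable_tops(peak, bandwidth, ASSUMED_OPS_PER_BYTE)
    bound = limiting_resource(peak, bandwidth, ASSUMED_OPS_PER_BYTE)
    print(f"{name}: {reach:.0f} of {peak} peak ({reach / peak:.0%}), {bound}")
```

At this assumed intensity Chip A reaches 20% of its peak and is memory-bound, while Chip B reaches its full 60 and is compute-bound. At lower intensities both parts sit on the bandwidth roof, and Chip B's bound stays four times higher because its bandwidth is four times higher.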

When does TOPS mislead hardware decisions?

TOPS specifications mislead most frequently when comparing chips designed for different workload profiles. A part optimised for INT8 edge inference and a part optimised for FP16 training cannot be compared by their TOPS numbers alone. The INT8 part may report 4× the TOPS of the FP16 part, but if your workload requires FP16 precision, the INT8 TOPS figure is irrelevant — the comparison is not just unfavourable, it is undefined.

The second common trap is the gap between peak TOPS and achieved TOPS. Peak assumes full utilisation of all compute units simultaneously, which requires perfectly parallelisable workloads with no memory stalls, no synchronisation overhead, and no idle cycles. In our experience, AI workloads on well-optimised hardware tend to achieve roughly 30–60% of peak TOPS; on poorly optimised deployments the achieved fraction can sit below 15% (observed pattern, not a published benchmark). The ratio itself (achieved divided by peak, on the target workload at the target precision) is the diagnostic signal. A ratio below 0.3 on a well-maintained system usually indicates a memory-bandwidth bottleneck rather than insufficient compute, in which case adding more TOPS would not help.
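The achieved-over-peak ratio is simple to compute once the operation count and wall time of a run are known. The sketch below is one way to do it; the function names, the example operation count, and the wording of the interpretation strings are assumptions for illustration, while the 0.3 threshold is the rule of thumb stated above.

```python
def achieved_tops(total_ops: float, elapsed_s: float) -> float:
    """Operations actually executed per second, in tera-operations."""
    return total_ops / elapsed_s / 1e12


def achieved_over_peak(achieved: float, peak: float) -> float:
    if peak <= 0:
        raise ValueError("peak must be positive")
    return achieved / peak


def interpret(ratio: float) -> str:
    # 0.3 is the rule-of-thumb threshold from the text, not a hard limit.
    if ratio < 0.3:
        return "below 0.3: check memory bandwidth before buying more TOPS"
    return "0.3 or above: in or above the range described for well-optimised deployments"


# Hypothetical run: 2.4e15 operations in 100 seconds on a part rated at 100 TOPS.
ratio = achieved_over_peak(achieved_tops(2.4e15, 100.0), peak=100.0)
print(f"achieved/peak = {ratio:.2f} -> {interpret(ratio)}")
```

Both numbers must be taken at the same precision and on the target workload, otherwise the ratio compares unlike quantities.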

For procurement decisions, we recommend specifying required performance in application-level terms — tokens per second, images per second, P99 inference latency under realistic batch and context — rather than TOPS. Two GPUs with identical TOPS specifications but different HBM bandwidth and NVLink topology will deliver materially different performance on memory-bound workloads.
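An application-level requirement can be written down as a check against measured data rather than a spec line. The sketch below assumes you have per-request latencies and a token count from a benchmark run on the target runtime. The function names, the thresholds, and the synthetic latency list are all hypothetical, and the percentile uses the nearest-rank definition.

```python
def p99(latencies_ms: list[float]) -> float:
    """99th percentile by the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer arithmetic
    return ordered[rank - 1]


def meets_requirement(latencies_ms: list[float], tokens_generated: int,
                      wall_seconds: float, min_tokens_per_s: float,
                      max_p99_ms: float) -> bool:
    """True only if both the throughput floor and the latency ceiling hold."""
    tokens_per_s = tokens_generated / wall_seconds
    return tokens_per_s >= min_tokens_per_s and p99(latencies_ms) <= max_p99_ms


# Synthetic data for illustration: 100 requests with latencies 1..100 ms,
# 50,000 tokens generated in 20 seconds of wall time.
samples = [float(ms) for ms in range(1, 101)]
print(p99(samples))  # prints 99.0
print(meets_requirement(samples, 50_000, 20.0, min_tokens_per_s=2000, max_p99_ms=120))
```

A requirement phrased this way can be rerun unchanged on each candidate accelerator, which is what makes it comparable where TOPS is not.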

TOPS at different precisions are not comparable

Hardware specifications often list TOPS at multiple precisions: INT8, FP16, BF16, FP32. These numbers are not interchangeable. A chip rated at 400 TOPS INT8 and 200 TFLOPS FP16 does not offer equivalent performance for INT8 and FP16 workloads — the 2× ratio reflects the arithmetic simplicity of INT8 operations, not a quality–performance tradeoff that you can dial in freely.

The practical question is which precision your workload actually requires. LLM inference at acceptable quality typically requires FP16 or BF16, sometimes with selective INT8 weight quantization through TensorRT-LLM or vLLM. Vision model inference often works well at INT8 after quantisation-aware training. Edge accelerators optimised for INT8 — Google Edge TPU, Intel Movidius, the NPUs in current laptop SoCs — deliver impressive TOPS numbers that are irrelevant if your model requires FP16; they either cannot run FP16 at all, or run it at dramatically lower throughput.

We evaluate hardware at the precision our target workload requires. A chip with 200 TOPS INT8 but no FP16 support is not useful for LLM serving. A chip with 50 TFLOPS FP16 but mediocre INT8 throughput is not ideal for edge vision deployment. Matching precision capability to workload requirement is the first filter before TOPS numbers become comparable at all.
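The precision-first filter can be sketched as a two-step selection: drop every part that lacks the required precision, then rank the survivors at that precision only. The `Accelerator` class and the two example parts are hypothetical, with figures loosely mirroring the ones quoted above; the INT8 figure for the serving part is invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Accelerator:
    name: str
    peak_by_precision: dict[str, float]  # tera-ops (or TFLOPS) per precision


def candidates_for(parts: list[Accelerator], required_precision: str) -> list[Accelerator]:
    """Keep parts that support the precision, best peak at that precision first."""
    supported = [p for p in parts if required_precision in p.peak_by_precision]
    return sorted(supported,
                  key=lambda p: p.peak_by_precision[required_precision],
                  reverse=True)


# Hypothetical parts for illustration only.
parts = [
    Accelerator("edge_part", {"INT8": 200.0}),
    Accelerator("serving_part", {"FP16": 50.0, "INT8": 80.0}),
]

print([p.name for p in candidates_for(parts, "FP16")])  # prints ['serving_part']
print([p.name for p in candidates_for(parts, "INT8")])  # prints ['edge_part', 'serving_part']
```

The part with the larger headline number never enters the FP16 comparison at all, which is the point of filtering before ranking.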

How to read a TOPS spec without being misled

A short diagnostic checklist for the next time a vendor deck lands on the table:

Name the precision. INT4, INT8, FP16, BF16, FP32 — and refuse to compare across them.
Compute the bandwidth ratio. Divide TOPS by GB/s memory bandwidth. If the ratio is high relative to peers, the part has more compute than its memory can feed (it is bandwidth-starved) and will under-deliver on transformer decode.
Ask for sustained, not peak. Request figures measured under sustained thermal load, on the runtime you will actually deploy (TensorRT, vLLM, ONNX Runtime, custom CUDA), with realistic batch and context.
Measure achieved-over-peak on your workload. This single ratio collapses most of the procurement debate into one defensible number.
Restate the requirement in application terms. Tokens per second at a given P99 latency, frames per second at a given resolution. TOPS is then a sanity-check on the upper bound, not the specification.
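The first two checklist items can be sketched as a small helper that refuses mixed-precision comparisons and ranks parts by TOPS per GB/s. The function name, the dictionary layout, and the three example parts are hypothetical; the first entry reuses the 200 TOPS and 50 GB/s figures from the bandwidth example earlier in the piece.

```python
def rank_by_tops_per_gb_s(parts: list[dict]) -> list[tuple[str, float]]:
    """Return (name, TOPS per GB/s) pairs, most bandwidth-starved first.

    Raises ValueError if the parts are quoted at different precisions,
    since their TOPS figures are then not comparable.
    """
    precisions = {part["precision"] for part in parts}
    if len(precisions) != 1:
        raise ValueError(f"refusing to compare across precisions: {sorted(precisions)}")
    ratios = [(part["name"], part["peak_tops"] / part["bandwidth_gb_s"]) for part in parts]
    return sorted(ratios, key=lambda pair: pair[1], reverse=True)


# Hypothetical INT8 parts for illustration only.
int8_parts = [
    {"name": "part_x", "precision": "INT8", "peak_tops": 200, "bandwidth_gb_s": 50},
    {"name": "part_y", "precision": "INT8", "peak_tops": 100, "bandwidth_gb_s": 100},
    {"name": "part_z", "precision": "INT8", "peak_tops": 120, "bandwidth_gb_s": 400},
]

for name, ratio in rank_by_tops_per_gb_s(int8_parts):
    print(f"{name}: {ratio:.2f} TOPS per GB/s")
```

A part at the top of this ranking is the one most likely to leave its headline TOPS unused on transformer decode. The remaining checklist items still require measurement on the deployed runtime.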

Frequently Asked Questions

How does the choice of software stack change the effective TOPS you get from a fixed GPU?

The same GPU delivers very different achieved throughput depending on the driver, runtime, compiler path, and framework above it. Moving a model from a generic ONNX Runtime path to TensorRT-LLM or vLLM, or enabling torch.compile and kernel fusion, can shift the achieved-over-peak ratio substantially — sometimes by a factor of two — with no change to the silicon. The headline TOPS figure is fixed; the fraction of it you actually reach is set by the stack.

What does a hardware × software stack look like in layers, and where do the interactions decide TOPS throughput?

The layers run from silicon and memory hierarchy, through drivers and the runtime, up through the compiler and framework to the model and its batch/context shape. On transformer inference the interactions that dominate are kernel quality against the memory hierarchy (FlashAttention, fused kernels) and how the runtime streams the KV-cache through HBM at long context. A part with abundant TOPS but starved bandwidth, or with no fused-kernel support in its deployed runtime, under-delivers regardless of its spec sheet.

When does a TOPS specification actually mislead a hardware decision?

TOPS misleads most when comparing chips built for different workload profiles or precisions — an INT8 edge part against an FP16 serving part cannot be ranked by TOPS alone. It also misleads through the gap between peak and achieved: peak assumes perfect utilisation, while real workloads on well-optimised hardware tend to reach only 30–60% of peak (observed pattern, not a published benchmark). A low achieved-over-peak ratio on a maintained system usually signals a bandwidth bottleneck, where adding TOPS would not help.

How should I specify hardware requirements instead of quoting TOPS?

Specify in application-level terms: tokens per second, images per second, or P99 inference latency under realistic batch and context, on the runtime you will actually deploy. Then measure achieved-over-peak on your own workload at your target precision — that single ratio collapses most of the procurement debate into one defensible number. TOPS becomes a sanity-check on the upper bound, not the specification.