TOPS Performance Across the Hardware-Software Stack: Why Identical TOPS Deliver Different Throughput

How the hardware-software stack determines achieved-vs-peak TOPS on real AI workloads, and why identical TOPS scores deliver different deployment throughput.

Written by TechnoLynx Published on 10 May 2026

TOPS is a marketing metric that obscures hardware comparison

TOPS — Tera Operations Per Second — appears on every AI chip spec sheet. It sounds like a direct measure of AI processing capability. It is not. TOPS is a peak throughput figure for integer operations (INT8 or INT4) measured under ideal conditions that no real workload achieves.

This piece focuses on the hardware-software stack as the reason identical TOPS scores deliver different throughput in deployment. Two adjacent questions live in companion articles: what TOPS on the spec sheet measures (and why no transformation of the headline number predicts performance) is covered in AI TOPS on the spec sheet; how TOPS interacts with GPU utilization as a metric is covered in AI TOPS and GPU utilization. This piece treats the stack — kernels, runtime, memory hierarchy, batching — as the dominant explanation for the gap between TOPS and observed tokens-per-second.

The metric is useful for understanding the theoretical ceiling of a chip’s integer compute. It is not a basis for comparing chips, selecting hardware, or predicting inference speed. We see this confusion regularly in procurement conversations: a buyer compares two accelerators by their headline TOPS, picks the larger number, and then discovers months later that the deployed model runs slower than on the rejected option. The number was real. The framing was wrong.

What does TOPS actually measure?

TOPS counts integer operations per second at the chip’s rated precision and power consumption. Vendors typically count each multiply-accumulate as two operations (one multiply, one add). The calculation is:

TOPS = peak_INT8_ops_per_cycle_per_core × clock_frequency_Hz × core_count ÷ 10^12

This measures theoretical peak throughput assuming 100% utilization of all compute units simultaneously, zero memory bandwidth bottleneck, perfect data distribution across compute units, and no overhead for data movement, control flow, or communication. None of those assumptions hold in real workloads — and the gap between them and reality is where the hardware × software stack reasserts itself. Performance is not a property of the silicon alone; it is what emerges when a particular model, on a particular runtime (PyTorch with torch.compile, TensorRT, ONNX Runtime), feeds a particular memory hierarchy under realistic batch and context sizes.
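As a rough illustration of the peak calculation, here is a minimal sketch. The chip parameters (64 cores, 512 INT8 MACs per cycle per core, 1.5 GHz) are hypothetical, and counting each multiply-accumulate as two operations is an assumed convention rather than something every vendor states.

```python
def peak_tops(ops_per_cycle_per_core: float, clock_hz: float, core_count: int) -> float:
    """Theoretical peak in tera-operations per second (1 TOPS = 1e12 ops/s)."""
    return ops_per_cycle_per_core * clock_hz * core_count / 1e12


# Hypothetical accelerator, for illustration only.
MACS_PER_CYCLE_PER_CORE = 512
OPS_PER_MAC = 2  # assumption: one multiply plus one add per MAC
CLOCK_HZ = 1.5e9
CORE_COUNT = 64

tops = peak_tops(MACS_PER_CYCLE_PER_CORE * OPS_PER_MAC, CLOCK_HZ, CORE_COUNT)
print(f"peak: {tops:.1f} TOPS")
```

Nothing in this calculation refers to memory, the runtime, or the model, which is why the resulting number says so little about deployed throughput.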

Why TOPS comparisons mislead

There are four recurring patterns we see when teams reason from TOPS numbers in isolation.

Precision mismatches. 100 TOPS at INT8 is not the same as 100 TFLOPS at FP16 or BF16. Many TOPS figures are quoted at INT4, which provides roughly 2× the TOPS of INT8 but requires more aggressive quantization that may not be acceptable for your model quality requirements. A headline number stripped of its precision regime is decorative, not diagnostic.

Utilization ceilings. Observed real-world utilization typically sits in the 30–70% range of peak TOPS depending on workload structure, memory access patterns, and model architecture (observed pattern across our infrastructure engagements; not a benchmarked rate). The remaining headroom rarely closes without substantial kernel-level work — FlashAttention, kernel fusion in TensorRT or XLA, NCCL topology tuning — and even then it closes asymmetrically across operator types.

Memory bandwidth ignored. A chip with 200 TOPS and 50 GB/s memory bandwidth will be memory-bandwidth-bound for most LLM inference workloads long before it approaches the compute ceiling. This is the most common mismatch in 2025–2026 procurement: transformer decode is dominated by reading model weights and the KV-cache, so additional TOPS on a bandwidth-starved part go unused.

Different operation types. The operations that dominate transformer inference — large matrix multiplications, attention with growing KV-cache, all-reduce across NVLink or PCIe — stress compute and memory bandwidth differently than the small dot products that dominate published INT8 benchmarks.
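The memory-bandwidth pattern can be made concrete with a simple roofline bound. This is a sketch under stated assumptions: the 200 TOPS and 50 GB/s figures come from the example in the text, while the arithmetic intensity of 2 operations per byte is an illustrative value for small-batch INT8 decode, not a measurement.

```python
def ridge_point(peak_tops: float, bandwidth_gbs: float) -> float:
    """Operations per byte a workload needs before the chip becomes compute-bound."""
    return peak_tops * 1e12 / (bandwidth_gbs * 1e9)


def attainable_tops(peak_tops: float, bandwidth_gbs: float, ops_per_byte: float) -> float:
    """Roofline bound: the lower of the compute ceiling and the bandwidth ceiling."""
    bandwidth_ceiling_tops = bandwidth_gbs * 1e9 * ops_per_byte / 1e12
    return min(peak_tops, bandwidth_ceiling_tops)


PEAK_TOPS = 200.0      # from the example in the text
BANDWIDTH_GBS = 50.0   # from the example in the text
OPS_PER_BYTE = 2.0     # assumption: illustrative intensity for small-batch decode

ridge = ridge_point(PEAK_TOPS, BANDWIDTH_GBS)
bound = attainable_tops(PEAK_TOPS, BANDWIDTH_GBS, OPS_PER_BYTE)
print(f"ridge point: {ridge:.0f} ops/byte")
print(f"attainable:  {bound:.2f} TOPS of {PEAK_TOPS:.0f} peak")
```

Under these assumptions the workload would need thousands of operations per byte moved before the compute ceiling mattered, so the bandwidth ceiling is the binding one.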

A worked TOPS comparison

Chip   | Quoted spec                        | Reality on LLM inference at 8k context
Chip A | 100 TOPS INT8, 100 GB/s bandwidth  | Memory-bound; achieved fraction of peak well below 30%
Chip B | 60 TFLOPS FP16, 400 GB/s bandwidth | Far less bandwidth-constrained; materially faster end-to-end

Chip B has the smaller TOPS number and wins. The reason is structural, not anecdotal: at 8k context, decode is dominated by streaming the KV-cache through HBM. Bandwidth, not TOPS, sets the ceiling. The performance-emerges-from-the-stack framing explains why any single-number metric — TOPS, FLOPS, even memory bandwidth in isolation — is an incomplete characterization of AI hardware performance.
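A back-of-envelope sketch shows how bandwidth sets the decode ceiling in this comparison. It assumes batch size 1, that every generated token reads all weights and the full KV-cache once, and that compute time is negligible. The model shape (7 billion parameters, 32 layers, 32 KV heads, head dimension 128) is hypothetical and chosen only for illustration.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int) -> int:
    """Size of the KV-cache: one key and one value tensor per layer."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem


def decode_ceiling_tokens_per_s(bandwidth_gbs: float, n_params: float,
                                bytes_per_elem: int, context_len: int) -> float:
    """Upper bound on single-stream decode speed from memory bandwidth alone."""
    weights = n_params * bytes_per_elem
    # Hypothetical 7B-class shape; not taken from any specific model.
    kv = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128,
                        context_len=context_len, bytes_per_elem=bytes_per_elem)
    return bandwidth_gbs * 1e9 / (weights + kv)


# Chip A runs the model at INT8 (1 byte/element), Chip B at FP16 (2 bytes/element).
chip_a = decode_ceiling_tokens_per_s(100, 7e9, bytes_per_elem=1, context_len=8192)
chip_b = decode_ceiling_tokens_per_s(400, 7e9, bytes_per_elem=2, context_len=8192)
print(f"Chip A ceiling: {chip_a:.1f} tokens/s")
print(f"Chip B ceiling: {chip_b:.1f} tokens/s")
```

Even though Chip B moves twice as many bytes per token at FP16, its 4× bandwidth leaves it with twice the decode ceiling of Chip A in this sketch. The TOPS figures never enter the calculation.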

When does TOPS mislead hardware decisions?

TOPS specifications mislead most frequently when comparing chips designed for different workload profiles. A part optimised for INT8 edge inference and a part optimised for FP16 training cannot be compared by their TOPS numbers alone. The INT8 part may report 4× the TOPS of the FP16 part, but if your workload requires FP16 precision, the INT8 TOPS figure is irrelevant — the comparison is not just unfavourable, it is undefined.

The second common mislead is the gap between peak TOPS and achieved TOPS. Peak assumes full utilisation of all compute units simultaneously, which requires perfectly parallelisable workloads with no memory stalls, no synchronisation overhead, and no idle cycles. In our experience, AI workloads on well-optimised hardware tend to achieve roughly 30–60% of peak TOPS; on poorly optimised deployments the achieved fraction can sit below 15% (observed pattern, not a published benchmark). The ratio itself — achieved divided by peak, on the target workload at the target precision — is the diagnostic signal. A ratio below 0.3 on a well-maintained system usually indicates a memory-bandwidth bottleneck rather than insufficient compute. Adding more TOPS would not help.
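The achieved-over-peak ratio described above can be computed from a benchmark run in a few lines. This is a minimal sketch: the operation count per inference and the measured rate are hypothetical, and the 0.3 threshold is the rule of thumb from the text, not a published benchmark.

```python
def achieved_fraction(ops_per_inference: float, inferences_per_s: float,
                      peak_tops: float) -> float:
    """Achieved operations per second divided by the peak at the same precision."""
    return ops_per_inference * inferences_per_s / (peak_tops * 1e12)


def diagnose(fraction: float, threshold: float = 0.3) -> str:
    """Rule-of-thumb reading of the ratio; the threshold is an observed pattern."""
    if fraction < threshold:
        return "suspect memory bandwidth: profile memory traffic before adding TOPS"
    return "within the commonly observed range for optimised deployments"


# Hypothetical measurement: 8e9 ops per inference, 2500 inferences/s, 100 TOPS part.
fraction = achieved_fraction(8e9, 2500, 100)
print(f"achieved/peak = {fraction:.2f}: {diagnose(fraction)}")
```

The peak used as the denominator has to be the figure for the precision the workload actually runs at, otherwise the ratio is not meaningful.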

For procurement decisions, we recommend specifying required performance in application-level terms — tokens per second, images per second, P99 inference latency under realistic batch and context — rather than TOPS. Two GPUs with identical TOPS specifications but different HBM bandwidth and NVLink topology will deliver materially different performance on memory-bound workloads.
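Application-level figures of this kind can be derived directly from request logs. The sketch below assumes requests were served one at a time, so total time is the sum of latencies, and uses the nearest-rank definition of a percentile. The sample data is invented for illustration.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_s: list[float], tokens_per_request: list[int]) -> dict[str, float]:
    """Throughput and tail latency, assuming sequential (non-overlapping) requests."""
    return {
        "tokens_per_s": sum(tokens_per_request) / sum(latencies_s),
        "p99_latency_s": percentile(latencies_s, 99),
    }


# Invented sample: 98 fast requests and two slow ones, 100 tokens each.
latencies = [0.5] * 98 + [1.0, 2.0]
tokens = [100] * 100
report = summarize(latencies, tokens)
print(report)
```

With concurrent serving, wall-clock duration of the run should replace the sum of latencies in the throughput calculation.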

TOPS at different precisions are not comparable

Hardware specifications often list TOPS at multiple precisions: INT8, FP16, BF16, FP32. These numbers are not interchangeable. A chip rated at 400 TOPS INT8 and 200 TFLOPS FP16 does not offer equivalent performance for INT8 and FP16 workloads — the 2× ratio reflects the arithmetic simplicity of INT8 operations, not a quality–performance tradeoff that you can dial in freely.

The practical question is which precision your workload actually requires. LLM inference at acceptable quality typically requires FP16 or BF16, sometimes with selective INT8 weight quantization through TensorRT-LLM or vLLM. Vision model inference often works well at INT8 after quantisation-aware training. Edge accelerators optimised for INT8 — Google Edge TPU, Intel Movidius, the NPUs in current laptop SoCs — deliver impressive TOPS numbers that are irrelevant if your model requires FP16; they either cannot run FP16 at all, or run it at dramatically lower throughput.

We evaluate hardware at the precision our target workload requires. A chip with 200 TOPS INT8 but no FP16 support is not useful for LLM serving. A chip with 50 TFLOPS FP16 but mediocre INT8 throughput is not ideal for edge vision deployment. Matching precision capability to workload requirement is the first filter before TOPS numbers become comparable at all.
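The precision-first filter can be sketched as follows. The chip names and peak figures are hypothetical, and the `Chip` class and `candidates` function are illustrative helpers, not part of any vendor tool.

```python
from dataclasses import dataclass


@dataclass
class Chip:
    name: str
    # Peak throughput per supported precision, in TOPS (integer) or TFLOPS (float).
    peak_by_precision: dict[str, float]


def candidates(chips: list[Chip], required_precision: str) -> list[Chip]:
    """Drop chips that lack the required precision, then rank by peak at that precision."""
    supported = [c for c in chips if required_precision in c.peak_by_precision]
    return sorted(supported,
                  key=lambda c: c.peak_by_precision[required_precision],
                  reverse=True)


# Hypothetical parts for illustration.
chips = [
    Chip("edge_npu", {"INT8": 200.0}),
    Chip("server_gpu", {"INT8": 100.0, "FP16": 50.0}),
]

fp16_names = [c.name for c in candidates(chips, "FP16")]
int8_names = [c.name for c in candidates(chips, "INT8")]
print("FP16 candidates:", fp16_names)
print("INT8 candidates:", int8_names)
```

The part with the largest headline number drops out entirely once FP16 is the requirement, which is the point of applying this filter first.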

How to read a TOPS spec without being misled

A short diagnostic checklist for the next time a vendor deck lands on the table:

  • Name the precision. INT4, INT8, FP16, BF16, FP32 — and refuse to compare across them.
  • Compute the bandwidth ratio. Divide TOPS by memory bandwidth in GB/s. If the ratio is high relative to peers, the part is compute-heavy and bandwidth-starved, and will under-deliver on transformer decode.
  • Ask for sustained, not peak. Under sustained thermal load, on the runtime you will actually deploy (TensorRT, vLLM, ONNX Runtime, custom CUDA), with realistic batch and context.
  • Measure achieved-over-peak on your workload. This single ratio collapses most of the procurement debate into one defensible number.
  • Restate the requirement in application terms. Tokens per second at a given P99 latency, frames per second at a given resolution. TOPS is then a sanity-check on the upper bound, not the specification.
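The bandwidth-ratio step of the checklist can be sketched as a peer comparison. The three parts below are hypothetical, all quoted at the same precision, and flagging anything above the peer median is an assumed heuristic rather than an established threshold.

```python
from statistics import median


def tops_per_gbs(tops: float, bandwidth_gbs: float) -> float:
    """Headline TOPS divided by memory bandwidth in GB/s."""
    return tops / bandwidth_gbs


# Hypothetical INT8 parts: name -> (peak TOPS, memory bandwidth in GB/s).
parts = {
    "part_1": (200.0, 50.0),
    "part_2": (100.0, 100.0),
    "part_3": (120.0, 400.0),
}

ratios = {name: tops_per_gbs(t, bw) for name, (t, bw) in parts.items()}
peer_median = median(ratios.values())

# Assumed heuristic: a ratio above the peer median suggests a bandwidth-starved part.
flagged = [name for name, ratio in ratios.items() if ratio > peer_median]
print("ratios:", ratios)
print("flagged as bandwidth-starved relative to peers:", flagged)
```

A flag here is a prompt to ask for sustained measurements on the target workload, not a verdict on the part.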

The question to put to any TOPS-based hardware claim is straightforward: is the precision regime named, and is the deployment-relevant fraction of peak measured on the buyer’s workload — or is the headline TOPS number being asked to predict an outcome the methodology has not measured?
