“Can we just switch to FP8?”
The request sounds simple. FP8 offers roughly 2× the throughput of BF16 on supported hardware, halves the memory footprint, and enables larger models on fewer GPUs. The ML engineer or infrastructure planner hears “faster and cheaper” and reasonably asks: why aren’t we already using it?
The answer, more often than not, is hardware. The precision formats a GPU can accelerate natively are determined by its tensor core architecture, and that architecture varies across generations. A format the hardware doesn’t support natively doesn’t just run slower — it may offer no throughput benefit at all, or it may not be supported by the deployment framework for that hardware target.
Precision decisions are hardware-conditional. Understanding the constraints is the prerequisite to making them well.
Tensor core generations and their numerical affordances
NVIDIA’s tensor core architecture has evolved across GPU generations, and each generation added support for new numerical formats while maintaining backward compatibility:
Volta (V100): First-generation tensor cores. Native support for FP16 matrix multiply with FP32 accumulation. No BF16, no INT8 tensor core support, no FP8. The V100 was foundational for mixed-precision training, but its format menu is limited by today’s standards.
Ampere (A100): Third-generation tensor cores. Added native BF16 and TF32 (an internal 19-bit format used transparently for FP32 operations). Added INT8 and INT4 tensor core support for inference quantization. No native FP8. The A100 is still widely deployed and handles BF16 inference and training well, but cannot accelerate FP8 workloads.
Hopper (H100, H200): Fourth-generation tensor cores. Added native FP8 (both E4M3 and E5M2 variants), along with the Transformer Engine that manages dynamic per-tensor scaling for FP8 automatically. BF16 throughput also increased substantially over Ampere.
Blackwell (B100, B200): Fifth-generation tensor cores. Further FP8 optimization and native FP4 support, with a second-generation Transformer Engine.
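The two FP8 variants Hopper introduced trade exponent bits for mantissa bits, and their maximum finite values fall out of the bit layouts directly. A quick sketch, following the common FP8 convention (E4M3 reserves only the all-ones pattern for NaN; E5M2 reserves its top exponent for inf/NaN like IEEE formats):

```python
# E4M3: 1 sign, 4 exponent (bias 7), 3 mantissa bits. The all-ones
# bit pattern is NaN, so the largest finite value has mantissa 110:
# (1 + 6/8) x 2^(15 - 7)
e4m3_max = (1 + 6 / 8) * 2 ** (15 - 7)   # 448.0

# E5M2: 1 sign, 5 exponent (bias 15), 2 mantissa bits. Exponent 11111
# is reserved for inf/NaN, so the largest finite value uses exponent
# 11110, mantissa 11: (1 + 3/4) x 2^(30 - 15)
e5m2_max = (1 + 3 / 4) * 2 ** (30 - 15)  # 57344.0
```

E5M2 covers a far wider dynamic range at coarser resolution, which is why the Transformer Engine can pick different variants for activations versus gradients.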
Each generation defines a different menu of viable precision choices. A deployment targeting V100s is limited to FP16 mixed precision for tensor core acceleration. A deployment targeting A100s can use BF16, INT8, or INT4, but not FP8. A deployment targeting H100s can use any of the above plus FP8.
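This per-generation menu can be made explicit in code. A minimal sketch, keyed by CUDA compute capability (the tuple returned by `torch.cuda.get_device_capability()`); the table encodes this article's summary, and the `NATIVE_FORMATS`/`supports` names are illustrative, not a real API:

```python
# Native tensor core format support by compute capability.
# Illustrative table based on the generation summary above.
NATIVE_FORMATS = {
    (7, 0): {"fp16"},                                   # Volta (V100)
    (8, 0): {"fp16", "bf16", "tf32", "int8", "int4"},   # Ampere (A100)
    (9, 0): {"fp16", "bf16", "tf32", "int8", "fp8"},    # Hopper (H100/H200)
}

def supports(capability, fmt):
    """True if this capability's tensor cores accelerate fmt natively."""
    return fmt in NATIVE_FORMATS.get(capability, set())
```

For example, `supports((8, 0), "fp8")` is `False`: the A100 entry has no FP8, and no software update can add it.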
This isn’t a software limitation that can be patched. It’s a hardware constraint: the silicon either has execution units for a given format or it doesn’t.
The penalty for unsupported formats
Running a precision format that the hardware doesn’t natively accelerate doesn’t cause an error — the framework will typically fall back to a supported format or use software emulation. But the performance implications are serious.
FP8 operations on A100 hardware execute on BF16 or FP16 tensor cores with conversion overhead, producing throughput roughly comparable to BF16 — meaning no FP8 advantage despite the lower precision. The memory savings from FP8 model representation still apply (the model is smaller in HBM), but the compute-throughput doubling that FP8 promises on Hopper hardware simply doesn't materialize on Ampere.
Similarly, INT8 inference on V100 runs without dedicated tensor core support, falling back to CUDA cores with dramatically lower throughput than the INT8 tensor core path available on A100.
The practical implication: a precision strategy developed and benchmarked on one hardware generation cannot be applied to a different generation without re-evaluating whether the target format is natively supported. Benchmark results measured on H100 at FP8 tell you nothing about FP8 performance on A100 — because on A100, “FP8 performance” effectively doesn’t exist at the hardware acceleration level.
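One way to enforce this in a benchmark harness is to fail fast when the target GPU cannot natively accelerate the requested format. A sketch for the FP8 case, taking the capability tuple as an argument so it can be fed from `torch.cuda.get_device_capability()` at runtime; the 8.9 threshold reflects the fact that FP8 tensor cores first appear at compute capability 8.9 (Ada) and 9.0 (Hopper):

```python
def check_fp8_capability(capability):
    """Raise rather than silently benchmark a fallback path.

    capability: (major, minor) tuple, e.g. (8, 0) for A100, (9, 0) for H100.
    """
    if capability < (8, 9):
        raise RuntimeError(
            f"sm_{capability[0]}{capability[1]} has no FP8 tensor cores; "
            "an FP8 benchmark here would measure conversion-overhead fallback, "
            "not FP8 acceleration"
        )
    return True
```

Calling `check_fp8_capability((8, 0))` raises, which is the point: a loud failure at benchmark setup is cheaper than a quietly meaningless result.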
Framework and tooling dependencies
Hardware support is necessary but not sufficient. The deployment framework must also support the target precision format on the target hardware, with optimized kernels and correct numerical handling.
TensorRT, NVIDIA’s inference optimizer, added FP8 support alongside Hopper hardware. Earlier TensorRT versions targeting A100 support INT8 and FP16 but not FP8. PyTorch’s native FP8 dtypes (torch.float8_e4m3fn and torch.float8_e5m2) arrived in version 2.1, and the accelerated FP8 matmul path requires Hopper-class hardware with a compatible CUDA toolkit.
This creates a three-layer compatibility requirement: the hardware must support the format, the framework must support the format on that hardware, and the CUDA/driver stack must be at a version that enables the feature. A mismatch at any layer — current-generation hardware with an older framework version, or a current framework on previous-generation hardware — blocks the precision strategy.
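The three layers can be collapsed into one explicit predicate. A sketch in which every threshold is a placeholder — substitute the real minimum versions your framework and CUDA stack require; only the hardware threshold (compute capability 8.9+) is drawn from the discussion above:

```python
def fp8_deployable(capability, framework_version, cuda_version,
                   min_framework=(2, 1), min_cuda=(12, 1)):
    """All three layers must clear the bar; any mismatch blocks FP8.

    Version tuples compare lexicographically, e.g. (2, 3) >= (2, 1).
    The default minimums are illustrative placeholders.
    """
    return (
        capability >= (8, 9)                    # layer 1: FP8 tensor cores
        and framework_version >= min_framework  # layer 2: framework kernels
        and cuda_version >= min_cuda            # layer 3: CUDA/driver stack
    )
```

An H100 with a current stack passes; the same stack on an A100, or an H100 pinned to an older framework, fails — which matches the failure modes described above.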
As discussed in the companion piece on how FP8, FP16, and BF16 represent different numerical regimes, the format choice encodes assumptions about numerical behavior. The hardware choice determines which of those assumptions can actually be realized efficiently.
Why this matters for hardware selection
Precision support should be an explicit factor in hardware evaluation, not a footnote.
If an organization’s inference workload benefits substantially from FP8 (large language models, high-throughput serving, memory-bandwidth-bound operations), then the hardware evaluation must weigh FP8 tensor core support as a first-class requirement — because the throughput and efficiency gains from FP8 often exceed the gains from raw compute improvement between generations.
Conversely, if the workload is precision-sensitive and will remain at BF16 or FP32, then the absence of FP8 support on the target hardware is irrelevant, and the hardware evaluation should focus on other criteria.
The mistake is evaluating hardware at one precision and deploying at another without accounting for the performance change, or assuming that a format available on the newest generation is available on currently deployed hardware.
An honest hardware evaluation declares: “this is the precision format we intend to deploy, this hardware supports it natively, and our benchmark results were measured at this format.” Anything less creates a gap between the evaluation and the deployment that the economics of precision choice will eventually expose. The hardware doesn’t bend to the precision strategy. The precision strategy must fit the hardware.