Half Precision Explained: What FP16 Means for AI Inference and Training

Half the bits, most of the accuracy

Half precision — formally IEEE 754 binary16, commonly called FP16 — is a 16-bit floating-point number format. It uses 1 sign bit, 5 exponent bits, and 10 significand (mantissa) bits. Compared to single precision (FP32, 32 bits), half precision halves the memory footprint per number and can roughly double throughput on hardware with dedicated FP16 compute units, per NVIDIA’s published Tensor Core specifications.

The format represents numbers in the range ±65,504 with approximately 3.3 decimal digits of precision. This is substantially less range and precision than FP32 (±3.4×10³⁸, ~7.2 decimal digits), but for many AI workloads — particularly inference and the forward pass of training — that reduced precision is sufficient. The interesting question is not whether FP16 is “worse” than FP32. It is whether the accuracy your task actually requires fits inside FP16’s representable envelope, and whether the throughput and memory savings are worth the engineering work to keep it there.
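The range and precision figures above follow directly from the bit layout. As a minimal sketch (the function name `format_limits` is ours, and it assumes the IEEE 754 convention that the all-ones exponent code is reserved for infinity and NaN, which holds for FP16, BF16, and FP32 but not for every FP8 variant), the limits can be derived like this:

```python
import math

def format_limits(exponent_bits, mantissa_bits):
    """Derive range and precision of an IEEE-754-style format from its bit layout."""
    bias = 2 ** (exponent_bits - 1) - 1
    # The all-ones exponent code is reserved for inf/NaN, so the largest
    # usable exponent code is 2**exponent_bits - 2.
    max_exponent = (2 ** exponent_bits - 2) - bias
    return {
        "largest": (2 - 2.0 ** -mantissa_bits) * 2.0 ** max_exponent,
        "smallest_normal": 2.0 ** (1 - bias),
        "smallest_subnormal": 2.0 ** (1 - bias - mantissa_bits),
        # The implicit leading bit adds one bit to the stored mantissa.
        "decimal_digits": (mantissa_bits + 1) * math.log10(2),
    }

for name, (exp_bits, man_bits) in {"FP16": (5, 10), "BF16": (8, 7), "FP32": (8, 23)}.items():
    limits = format_limits(exp_bits, man_bits)
    print(name, limits["largest"], limits["smallest_subnormal"],
          round(limits["decimal_digits"], 1))
```

For FP16 this gives a largest finite value of 65,504 and about 3.3 decimal digits. It also shows the other end of the range: the smallest positive FP16 value is 2⁻²⁴ (about 6×10⁻⁸), and smaller values round to zero.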

That reframing matters because most teams still describe precision the way they describe image compression: as degradation. It is not. Precision is a design parameter you choose against a declared accuracy criterion, and the FP16 question is a worked example of that choice. We treat the broader framing in “Precision as a design parameter, not a quality compromise”; this article stays close to the numeric specifics of FP16.

FP16 vs FP32 vs BF16: the precision landscape

Format	Bits	Exponent	Mantissa	Range	Decimal precision	Primary use
FP32	32	8	23	±3.4×10³⁸	~7.2 digits	Traditional training, scientific computing
FP16	16	5	10	±65,504	~3.3 digits	Inference, mixed-precision training
BF16	16	8	7	±3.4×10³⁸	~2.4 digits	Training on Tensor Cores / TPUs
FP8	8	4 (E4M3) or 5 (E5M2)	3 or 2	±448 (E4M3) or ±57,344 (E5M2)	~1 digit	Quantised inference, edge deployment

The critical difference between FP16 and BF16 is the exponent-mantissa trade-off. FP16 has more mantissa bits (better precision) but fewer exponent bits (smaller dynamic range). BF16 keeps FP32’s exponent range at the cost of mantissa precision. For training, BF16’s wider dynamic range avoids gradient underflow, a common failure mode when training large models in pure FP16, and is why current default mixed-precision recipes on A100, H100, and TPU hardware lean BF16. For pure inference, FP16’s extra mantissa bits are often the better fit, because activations during a forward pass rarely span the dynamic range that gradients do.

This is the first place the “higher precision = higher quality” intuition breaks. BF16 is less precise than FP16 in the strict numerical-resolution sense, and yet it is the better choice for many training workloads. The pattern is observed across enough projects that we treat it as a planning heuristic, not a benchmarked rate: default to BF16 for training, FP16 for inference serving, and only revisit when the evaluation set says otherwise.
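The exponent-mantissa trade-off is easy to see by rounding the same two values into each format. The sketch below uses only Python's standard library: `struct` format `e` is IEEE 754 binary16, and BF16 is emulated by keeping the top 16 bits of the FP32 encoding with round-to-nearest-even. The helper names `to_fp16` and `to_bf16` are ours, and the BF16 helper assumes finite inputs well inside the range (it does not handle NaN or overflow).

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest FP16 value (small values underflow to 0)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x):
    """Round a Python float to BF16 by keeping the top 16 bits of its FP32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest even on the 16 bits that are about to be discarded.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# A value close to 1: FP16's 10 mantissa bits resolve it, BF16's 7 do not.
print(to_fp16(1.001), to_bf16(1.001))

# A gradient-sized value: FP16's 5 exponent bits lose it, BF16's 8 keep it.
print(to_fp16(1e-8), to_bf16(1e-8))
```

The first pair shows FP16 preserving a difference that BF16 rounds away; the second shows BF16 preserving a magnitude that FP16 flushes to zero. Which of the two losses matters depends on the operation, which is the point of the heuristic above.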

Where half precision works, and where it doesn’t

Half precision is effective when the computation’s accuracy requirements are satisfied by ~3.3 decimal digits and when the values being represented stay within ±65,504. In practice, this covers:

Neural network inference. Model weights and activations typically fall within FP16’s representable range. Rounding errors are usually smaller than the model’s own prediction uncertainty.
Mixed-precision training. Forward-pass compute and the working copy of the weights in FP16; gradient accumulation and master weights in FP32. NVIDIA’s published mixed-precision recipe (introduced with Volta) reports near-FP32 accuracy at roughly 2× training speed. That benchmark figure holds up in configurations we’ve replicated, with the caveat that “near-FP32” must be verified on the actual task, not assumed.
Image processing pipelines. Pixel values normalised to [0, 1] or [-1, 1] sit comfortably inside FP16’s range and precision.

Half precision struggles or fails when:

Gradients underflow. Small gradient values quietly become zero in FP16. Loss scaling — multiplying the loss before backpropagation, then dividing gradients after — prevents this but adds an implementation surface (and, occasionally, a debugging surface).
The dynamic range is wide. Financial computations, physics simulations, or anything spanning many orders of magnitude will overflow FP16’s ±65,504 ceiling. Overflow becomes inf, and inf poisons everything downstream.
A single operation is precision-sensitive. Attention score computation in transformers can suffer from FP16 precision loss, which is why many production implementations keep the softmax in FP32 even when the rest of the forward pass runs in FP16. This is a per-operator decision, not a per-model one.

Notice that none of these are inherent failures of FP16. They are mismatches between the format and a specific operation. The engineering task is to identify which operations need the extra range or precision and surgically promote those, not to abandon FP16 wholesale.
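As an illustration of promoting a single operator, the sketch below simulates a softmax with every intermediate rounded to FP16 and compares it with one whose intermediates stay in higher precision. It is a simulation under stated assumptions, not a production kernel: Python floats stand in for FP32, the `fp16` helper (our name) rounds through the standard library's binary16 packing, and overflow is mapped to infinity by hand because `struct` raises an error where IEEE 754 arithmetic would produce inf.

```python
import math
import struct

def fp16(x):
    """Round a Python float to the nearest FP16 value; overflow becomes +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct raises instead of saturating; IEEE 754 arithmetic would give inf.
        return math.copysign(math.inf, x)

def softmax_all_fp16(logits):
    # Every intermediate is rounded to FP16, mimicking a pure-FP16 operator.
    exps = [fp16(math.exp(v)) for v in logits]
    total = 0.0
    for e in exps:
        total = fp16(total + e)
    return [fp16(e / total) for e in exps]

def softmax_promoted(logits):
    # Intermediates stay in higher precision (Python floats stand in for FP32);
    # only the final probabilities are cast back to FP16.
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return [fp16(e / total) for e in exps]

logits = [12.0, 11.0, 0.0]  # exp(12) is about 162,755, above FP16's 65,504 ceiling
print(softmax_all_fp16(logits))
print(softmax_promoted(logits))
```

In the all-FP16 version the largest exponential overflows to inf, the sum becomes inf, and inf divided by inf yields NaN. The promoted version returns ordinary probabilities from the same FP16 inputs. Production implementations usually also subtract the maximum logit before exponentiating; that is a separate stabilisation step and is left out here so the effect of promotion alone is visible.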

How does mixed-precision training actually keep accuracy?

The naive question is whether FP16 training “loses accuracy”. The more useful question is what mixed-precision recipes do to prevent that loss in the first place.

Three mechanisms do most of the work. First, master weights in FP32: the authoritative copy of the model’s parameters lives in FP32, and the FP16 weights used in the forward and backward pass are a cast-down view. Updates accumulate at full precision, avoiding the stalled or drifting updates that pure-FP16 weights produce when an update is smaller than the spacing between adjacent FP16 values. Second, loss scaling, which keeps gradient magnitudes inside FP16’s representable range; torch.cuda.amp.GradScaler (torch.amp.GradScaler in recent PyTorch releases) and equivalents in TensorFlow and JAX automate the dynamic adjustment. Third, selective FP32 for sensitive operators: softmax, layer norm, and reductions over many elements are commonly executed in FP32 even inside an otherwise FP16 forward pass.
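The first two mechanisms can be shown with plain arithmetic. The following is a toy simulation, not a training loop: Python floats stand in for FP32, the `fp16` helper is our own rounding function built on the standard library, and the update size, gradient value, and static scale of 1024 are illustrative numbers we picked (a real scaler adjusts the factor dynamically).

```python
import struct

def fp16(x):
    """Round a Python float to the nearest FP16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# 1) Master weights: apply 1,000 updates of 1e-4 to a weight of 1.0.
update = 1e-4
w_pure_fp16 = fp16(1.0)
w_master = 1.0  # higher-precision master copy
for _ in range(1000):
    # Near 1.0, adjacent FP16 values are about 0.00098 apart, so adding
    # 1e-4 rounds straight back to the same value.
    w_pure_fp16 = fp16(w_pure_fp16 + fp16(update))
    w_master = w_master + update
w_fp16_view = fp16(w_master)  # the cast-down copy used for compute
print(w_pure_fp16, w_fp16_view)

# 2) Loss scaling: a gradient of 1e-8 is below FP16's smallest positive value.
true_grad = 1e-8
scale = 1024.0  # illustrative static scale
lost = fp16(true_grad)               # underflows to zero
scaled = fp16(true_grad * scale)     # representable once scaled up
recovered = scaled / scale           # unscale in higher precision
print(lost, recovered)
```

The pure-FP16 weight never moves from 1.0, while the cast-down view of the master weight lands near 1.1. The unscaled gradient is lost entirely, while the scaled one is recovered to within a fraction of a percent.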

The combined recipe is why “FP16 training” in practice almost always means “mixed-precision training with FP16 compute”. Pure-FP16 end-to-end training exists but is rare in production. Without loss scaling, FP16 training diverges on a meaningful share of model configurations we have tried — an observed pattern across our engagements rather than a benchmarked rate, but the failure mode is structural enough that we treat it as a default-on safety mechanism rather than an optimisation.

When does FP16 precision loss become visible in practice?

Two scenarios make FP16 precision loss observable. The first is loss of magnitude range: values outside ±65,504 overflow to infinity. The second is loss of precision in accumulation: summing many small FP16 values loses significant digits to rounding.

The first scenario affects training more than inference, because gradients span a wider dynamic range than activations. Loss scaling addresses it. The second affects inference quality for models that perform long sequential accumulations — autoregressive language models generating hundreds of tokens, where each token generation involves many small additions that accumulate rounding error. For sequences under roughly 512 tokens, we typically observe no measurable quality difference between FP16 and FP32 inference on current transformer models (observed pattern, not a benchmarked rate, on the workloads we’ve measured). For sequences in the 2,048-token range and beyond, small differences begin to surface — on the order of 0.5–1.5% perplexity increase in configurations we’ve tested, rarely impactful for practical applications but worth measuring if the deployment has tight quality budgets.
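The accumulation effect is easiest to see in an exaggerated case. The sketch below sums 10,000 copies of 0.01 twice: once with the running total rounded to FP16 after every addition, and once with FP16 inputs but a wider accumulator (a Python float standing in for FP32). The helper and function names are ours, and the setup is deliberately worst-case; matrix kernels on current accelerators typically accumulate in FP32, so this is not a model of what a real inference stack does.

```python
import struct

def fp16(x):
    """Round a Python float to the nearest FP16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def sum_fp16_accumulator(values):
    # The running total is rounded to FP16 after every addition.
    total = 0.0
    for v in values:
        total = fp16(total + fp16(v))
    return total

def sum_wide_accumulator(values):
    # Inputs are FP16, but the running total is kept in higher precision.
    total = 0.0
    for v in values:
        total += fp16(v)
    return total

values = [0.01] * 10_000  # exact sum is 100
print(sum_fp16_accumulator(values))
print(sum_wide_accumulator(values))
```

The FP16 accumulator stalls once the running total reaches 32: from there, adjacent FP16 values are 0.03125 apart, so adding 0.01 rounds back to the same number. The wide accumulator lands close to 100, with the small residual coming from rounding 0.01 itself to FP16.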

BF16 sidesteps the magnitude problem by using FP32’s exponent range, at the cost of mantissa precision. This makes BF16 more robust for training (loss scaling is typically unnecessary) but slightly less precise for inference. Our default is BF16 for training and FP16 for inference serving, revisited only when the model shows quantifiable degradation at FP16 on the actual evaluation set.

Hardware support is what makes the trade real

Half precision’s practical value is gated by hardware support. The compute units have to exist:

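NVIDIA Tensor Cores (Volta and later): roughly 2× throughput for FP16 vs FP32 matrix operations, per NVIDIA’s published specifications. On A100 and H100 that ratio is measured against TF32 Tensor Core math; against non-Tensor-Core FP32 the gap is considerably larger.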
AMD CDNA: MI250X and MI300X provide FP16 matrix units whose published peak throughput is a multiple of the FP32 matrix rate (roughly 4× on MI250X, per AMD’s specifications).
Apple Neural Engine: native FP16 support, tuned for inference on Apple Silicon.
Google TPUs: BF16 is the default, but FP16 is supported.

When benchmarking AI hardware, precision format is not optional metadata — it is a primary variable. A GPU benchmarked at FP16 will report different throughput numbers than the same GPU benchmarked at FP32, and the two figures are not directly comparable without naming the format. Comparing hardware performance across different precisions without normalisation produces misleading conclusions, which is why every credible benchmark publishes the precision alongside the throughput.

LynxBench AI treats per-precision performance and a declared accuracy criterion as a paired output of the AI Executor specification, because an FP16 throughput claim that omits the accuracy regression on the workload describes the silicon’s peak rather than the deployment’s outcome. The question to put to any FP16 inference or training claim is whether throughput and an accuracy criterion on the same workload are reported jointly, or whether the precision change is being sold as a free win that the operator’s evaluation set has not yet measured. The framing carries through to every adjacent decision in an AI Executor specification: the format you pick for the forward pass, the master weights, the gradient accumulation, and the sensitive operators are four separate choices, not one. Treating them as one is where most “FP16 is too lossy” arguments come from — and where most “FP16 is a free win” arguments fall apart. Which precision regime — FP16, BF16, or a mixed schedule — does each path in your training or inference run actually require, and is that the regime the benchmark you cite held constant?
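One way to keep those four choices separate is to record them as distinct fields and attach them to every throughput claim. The sketch below is purely illustrative: the class names, field names, and values are hypothetical, are not part of any LynxBench AI schema, and the numbers are placeholders rather than measurements.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class PrecisionPlan:
    """The four precision choices, recorded separately (hypothetical schema)."""
    forward: str
    master_weights: str
    gradient_accumulation: str
    sensitive_ops: str

@dataclass(frozen=True)
class BenchmarkClaim:
    workload: str
    plan: PrecisionPlan
    throughput: float                      # in whatever unit the benchmark reports
    accuracy_metric: Optional[str] = None  # e.g. "perplexity"
    accuracy_value: Optional[float] = None

    def reports_accuracy(self) -> bool:
        # A throughput figure without an accuracy criterion is only half a claim.
        return self.accuracy_metric is not None and self.accuracy_value is not None

    def comparable_to(self, other: "BenchmarkClaim") -> bool:
        # Two claims compare directly only on the same workload and precision plan.
        return self.workload == other.workload and self.plan == other.plan

mixed = PrecisionPlan("fp16", "fp32", "fp32", "fp32")
pure = PrecisionPlan("fp16", "fp16", "fp16", "fp16")

# Placeholder values, not measurements.
claim_a = BenchmarkClaim("example-workload", mixed, throughput=1.0,
                         accuracy_metric="perplexity", accuracy_value=1.0)
claim_b = BenchmarkClaim("example-workload", pure, throughput=2.0)

print(claim_a.reports_accuracy(), claim_b.reports_accuracy())
print(claim_a.comparable_to(claim_b))
```

Here the second claim reports a higher throughput but no accuracy criterion, and it was taken under a different precision plan, so the two figures cannot be compared directly.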

Frequently Asked Questions

When should I pick FP16 over BF16 for a given workload?

For pure inference serving, FP16’s extra mantissa bits usually make it the better fit, since activations rarely span the dynamic range gradients do. For training, BF16’s wider exponent range avoids gradient underflow and is the safer default on A100, H100, and TPU hardware. We default to BF16 for training and FP16 for inference, and only revisit when the evaluation set shows quantifiable degradation.

How does loss scaling stop FP16 training from diverging?

Loss scaling multiplies the loss before backpropagation and divides the gradients afterward, keeping small gradient magnitudes inside FP16’s representable range so they do not silently underflow to zero. Tools like torch.cuda.amp.GradScaler automate the dynamic adjustment. Without it, FP16 training diverges on a meaningful share of model configurations, which is why we treat it as a default-on safety mechanism rather than an optimisation.

At what sequence length does FP16 inference start losing measurable quality?

For sequences under roughly 512 tokens, we typically observe no measurable quality difference between FP16 and FP32 inference on current transformer models. Past the 2,048-token range, accumulated rounding error in long sequential additions begins to surface — on the order of a 0.5–1.5% perplexity increase in configurations we’ve tested. That is rarely impactful in practice, but worth measuring when the deployment has tight quality budgets.

Why must a benchmark publish its precision format alongside throughput?

A GPU benchmarked at FP16 reports different throughput than the same GPU at FP32, and the two figures are not directly comparable without naming the format. Comparing hardware across different precisions without normalisation produces misleading conclusions. Precision format is a primary benchmark variable, not optional metadata, which is why every credible benchmark states it alongside the throughput number.