Quantization in Machine Learning: A Family of Calibrated Trade-Offs

“Quantization in machine learning” is not one technique

The phrase “quantization in machine learning” routinely gets used as if it referred to a single, well-defined transformation — flip a switch, the model becomes faster, occasionally something breaks. It does not. Quantization in ML is a family of transformations, parameterized by what is being quantized, by which calibration data, by which method, and by which target format. Two models described as “quantized to INT8” can have substantially different accuracy and runtime behavior depending on which member of the family was applied and how.

Treating quantization as one thing produces a particular kind of mistake: generalizing a result obtained on one model family to another. The accuracy regression observed for INT8 on a convolutional vision model is not predictive of the regression for INT8 on a transformer language model, because the two have different activation-distribution properties that interact differently with low-precision representations. The generalization is not slightly off; it is structurally wrong. We see this pattern often enough that the first question we ask of any quantization result is which member of the family was actually applied.

What does the technique actually do?

Quantization in ML is the practice of replacing higher-precision numerical representations — typically FP32 or FP16 weights and activations — with lower-precision ones, most commonly INT8 or INT4, under a calibration procedure that minimizes the resulting numerical error on a representative input distribution.

Three things in that sentence carry weight.

The first is replacement. The lower-precision values stand in for the original values during inference. They are not approximations layered on top of the originals; they are the numerical representation the runtime — TensorRT, ONNX Runtime’s quantization toolkit, llama.cpp’s GGUF kernels, PyTorch’s torch.ao.quantization, or a vendor-specific compiler — actually uses end-to-end.

The second is calibration. The mapping from the original value range to the discrete set of representable values in the lower-precision format is not arbitrary. It is chosen to minimize the error introduced into the model’s behavior on inputs drawn from a calibration set. The calibration set is the implicit assumption of the quantized model — it is the workload distribution the quantization is optimized for. Change the workload, and the calibration’s relevance shifts with it.
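As a minimal sketch of what calibration produces, the snippet below derives an asymmetric 8-bit scale and zero-point from the minimum and maximum of a small sample. The function names, the min/max rule, and the numbers are illustrative assumptions, not the procedure of any particular runtime; real toolkits often offer percentile or entropy-based range selection instead.

```python
def calibrate_asymmetric(calibration_values, num_bits=8):
    """Derive (scale, zero_point) from the min/max of a calibration sample."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo = min(min(calibration_values), 0.0)  # keep zero exactly representable
    hi = max(max(calibration_values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for an all-zero sample
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, num_bits=8):
    """Map a real value to an integer code, saturating at the format limits."""
    q = round(x / scale) + zero_point
    return max(0, min(2 ** num_bits - 1, q))


def dequantize(q, scale, zero_point):
    """Map an integer code back to the real value it stands for."""
    return (q - zero_point) * scale


# Hypothetical calibration sample: the only evidence the mapping has about the workload.
calibration = [-0.8, -0.3, 0.0, 0.4, 1.2]
scale, zero_point = calibrate_asymmetric(calibration)

in_range = dequantize(quantize(0.5, scale, zero_point), scale, zero_point)
out_of_range = dequantize(quantize(4.0, scale, zero_point), scale, zero_point)
```

A value inside the calibrated range comes back within half a step of the original, while 4.0, which the calibration sample never covered, saturates at the top of the range. That saturation is the "change the workload" failure in miniature.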

The third is family. There is no single quantization scheme. There is symmetric versus asymmetric quantization, per-tensor versus per-channel scale factors, post-training quantization versus quantization-aware training, weight-only versus weight-and-activation quantization, and outlier-aware variants like SmoothQuant, GPTQ, and AWQ that handle specific distributional pathologies. Each combination produces a different accuracy/throughput trade-off and a different runtime kernel requirement.
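To make one axis of that family concrete, here is a small sketch comparing per-tensor and per-channel symmetric scales on a made-up weight matrix whose two output channels differ greatly in magnitude. The helper names and the weight values are assumptions chosen for illustration.

```python
def symmetric_scale(values, qmax=127):
    """Scale that maps the largest magnitude in `values` to the code qmax."""
    return max(abs(v) for v in values) / qmax


def roundtrip(values, scale, qmax=127):
    """Quantize to signed codes in [-qmax, qmax] and map back to real values."""
    restored = []
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))
        restored.append(q * scale)
    return restored


def max_abs_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


# Two output channels with very different magnitudes (made-up numbers).
weights = [
    [0.011, -0.023, 0.017, 0.004],  # small-magnitude channel
    [3.1, -2.7, 1.9, -0.6],         # large-magnitude channel
]

# Per-tensor: one scale shared by every channel.
tensor_scale = symmetric_scale([v for row in weights for v in row])
per_tensor_err = [max_abs_error(row, roundtrip(row, tensor_scale)) for row in weights]

# Per-channel: each row gets its own scale.
per_channel_err = [
    max_abs_error(row, roundtrip(row, symmetric_scale(row))) for row in weights
]
```

With a shared scale, the large channel sets the step size and the small channel loses most of its resolution. With per-channel scales, the small channel's error drops by orders of magnitude while the large channel is unchanged.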

Bounded numerical error, not model damage

It helps to be precise about what quantization introduces. Quantization introduces a bounded numerical error per tensor element: the difference between the original FP32 value and the nearest representable value in the target format, which for in-range values is at most half a quantization step. The bound is not folklore; it is a property of the chosen scale and zero-point. For symmetric per-tensor INT8 on a range [-r, r], the step is r / 127, so round-to-nearest keeps the per-element error within r / 254 in magnitude, and values outside the range incur an additional clipping error. This follows directly from the format definition rather than from observation.
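The bound can be checked numerically. The sketch below assumes round-to-nearest and an arbitrary range r = 6.0 (both are illustrative choices), quantizes random in-range values with a symmetric INT8 mapping, and records the worst error seen.

```python
import random


def symmetric_int8_roundtrip(x, r):
    """Quantize x on [-r, r] to signed codes in [-127, 127] and map back."""
    step = r / 127
    q = max(-127, min(127, round(x / step)))
    return q * step


random.seed(0)
r = 6.0  # hypothetical clipping range
step = r / 127
samples = [random.uniform(-r, r) for _ in range(10_000)]
worst = max(abs(x - symmetric_int8_roundtrip(x, r)) for x in samples)
```

For in-range inputs, `worst` never exceeds `step / 2`, that is, half of the r / 127 step. Inputs outside [-r, r] would saturate and could exceed that bound by any amount.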

This is a different kind of error from training error. Training error reflects the model’s inability to fit the data-generating distribution; it is statistical in origin and improves with more data or better optimization. Quantization error is deterministic given the scheme and calibration — the same input, twice, produces the same quantized activations and the same downstream deviation. That difference matters for evaluation: training-time noise can be averaged out, but quantization error compounds along whichever computational path the model uses, in a reproducible way.

The error is bounded, but the effect of that error on model outputs is not bounded by the same constant. A small numerical perturbation to an attention logit can change which token wins an argmax; a small perturbation to a softmax probability can shift a generation trajectory. The bound on element-wise error tells you what the format can do; the model’s structure tells you what that does to the output.
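A toy example shows how an element-wise bound fails to bound the output. The numbers, the deliberately coarse step of 0.1, and the helper names are assumptions picked so the effect is visible with two keys; real INT8 steps are much finer, but the mechanism is the same.

```python
def fake_quantize(x, step):
    """Round x to the nearest multiple of step (no clipping in this toy)."""
    return round(x / step) * step


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


query = [1.0, 1.0]
keys = [[0.54, 0.54], [0.46, 0.56]]  # made-up key vectors
step = 0.1                           # deliberately coarse grid

exact_logits = [dot(query, k) for k in keys]
quant_logits = [dot(query, [fake_quantize(v, step) for v in k]) for k in keys]

exact_winner = argmax(exact_logits)
quant_winner = argmax(quant_logits)
```

Every key element moves by no more than half a step, yet the errors add up differently in each dot product: the first key wins at full precision and the second wins after quantization.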

Why model family changes the risk

The accuracy risk of a given quantization scheme is conditional on the activation distributions of the model it is applied to. Different model families have systematically different activation behavior, which is why a “this works at INT8” generalization rarely transfers cleanly across families.

Convolutional models with bounded-range activations — for example, vision backbones with batch normalization and ReLU activations — typically have well-behaved activation distributions: most values cluster near zero, the range is bounded, and outliers are rare. INT8 quantization on this regime tends to produce negligible accuracy regression with off-the-shelf calibration, because the format’s representable range covers the activation distribution comfortably.

Attention-based models — transformers, including but not limited to LLMs — have activation distributions that include occasional large outliers, particularly in the input embeddings and attention scores. The same INT8 scheme that works cleanly on a convolutional backbone can produce substantial accuracy regression on a transformer when applied without an outlier-aware calibration scheme, because the format’s representable range either has to be set wide enough to cover the outliers — leaving the typical values represented at coarse granularity — or has to clip the outliers, which discards information the model uses. This is the distributional asymmetry that motivated work like SmoothQuant and the activation-aware variants now standard in LLM serving stacks.
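The two unattractive options described above can be put side by side in a few lines. This is a sketch with made-up activations: 80 typical values roughly in [-0.5, 0.53] and a single hypothetical outlier at 60.0. The helper names and clip values are illustrative.

```python
def roundtrip(x, clip, qmax=127):
    """Symmetric quantization on [-clip, clip] with saturation, then dequantize."""
    step = clip / qmax
    q = max(-qmax, min(qmax, round(x / step)))
    return q * step


def mean_abs_error(values, clip):
    return sum(abs(v - roundtrip(v, clip)) for v in values) / len(values)


typical = [0.013 * i - 0.5 for i in range(80)]  # made-up "ordinary" activations
outlier = 60.0                                  # hypothetical rare large activation

# Option A: set the range wide enough to cover the outlier.
wide_clip = 60.0
typical_err_wide = mean_abs_error(typical, wide_clip)
outlier_err_wide = abs(outlier - roundtrip(outlier, wide_clip))

# Option B: fit the range to the typical values and clip the outlier.
tight_clip = 0.53
typical_err_tight = mean_abs_error(typical, tight_clip)
outlier_err_tight = abs(outlier - roundtrip(outlier, tight_clip))
```

Option A represents the outlier exactly but collapses most typical values onto three codes. Option B keeps the typical values fine-grained but reports the outlier as 0.53. Outlier-aware schemes exist to avoid having to choose between the two.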

LLMs amplify the transformer pattern further: their long autoregressive generation paths compound small per-token probability shifts into qualitatively different outputs many tokens downstream. A quantization scheme that produces a small per-token accuracy regression on a single-shot benchmark can produce a large generation-quality regression on long outputs. We have observed this gap in multiple engagements — the single-shot perplexity number understates what shows up at 2k+ generated tokens.
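A back-of-the-envelope model illustrates why a small per-token effect grows with output length. It rests on two loud assumptions that are not from any measurement: that the quantized model picks the same token as the full-precision model with a fixed probability at each step (0.999 here, a hypothetical figure), and that steps are independent.

```python
def sequence_agreement(per_token_agreement, num_tokens):
    """Probability that every one of num_tokens choices matches the reference,
    assuming independent steps with a fixed per-token agreement rate."""
    return per_token_agreement ** num_tokens


# Hypothetical per-token agreement rate; not a measured value.
agreement = {n: sequence_agreement(0.999, n) for n in (1, 128, 2048)}
```

Under these assumptions the chance of an identical continuation falls from 99.9% for one token to well under a fifth at 2048 tokens. A diverged continuation is not necessarily a worse one, so this explains why single-shot metrics can understate the change without measuring quality itself.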

The implication is not that quantization “doesn’t work” on transformers or LLMs. It is that the quantization scheme that works on them is different from the one that works on convolutional models, and the calibration requirements are stricter.

Comparing quantization risk across model families

Model family	Activation distribution	Typical INT8 risk	What calibration must capture
Convolutional vision models	Bounded, near-zero-centered, few outliers	Small; off-the-shelf calibration usually sufficient	Representative image distribution
Transformer encoders (e.g. classification)	Includes occasional outliers in embeddings and attention	Moderate; outlier-aware schemes helpful	Representative input distribution including edge cases
Transformer decoders (LLMs)	Outliers compound through autoregressive generation	Substantial without scheme adjustment; weight-and-activation INT8 is risky	Workload-shaped prompts including long generations
Recurrent models	Activation magnitudes can drift along long sequences	Variable; sequence-length-conditional	Calibration over deployment-representative sequence lengths

The pattern is consistent across our engagements (an observed pattern, not a benchmarked rate): the more the model’s activation distribution can produce values that strain the representable range of the low-precision format, the stricter the calibration requirements become, and the smaller the universe of off-the-shelf quantization schemes that produce acceptable accuracy.

What this means for evaluating quantization claims

A quantization claim that omits the model family it was demonstrated on is structurally incomplete. “Quantization works at INT8 with negligible accuracy loss” is a true statement about some models and a false one about others. The same is true of claims framed at the format level — “INT4 quantization preserves accuracy” — without naming the calibration scheme and the model family.

The right question for any quantization claim is not “is the accuracy loss small?” but “on which model family, with which scheme, with which calibration, evaluated on which workload, was the accuracy loss small?” Removing any of those four dimensions removes information the result depends on. A quantization-aware evaluation — the kind a serving team actually needs — exposes each of those dimensions and reports the workload-shaped accuracy delta, not just a headline number.
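One lightweight way to enforce that question is to record each claim in a structure that makes the missing dimensions visible. The class and field names below are hypothetical, a sketch of the idea rather than an established schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QuantizationClaim:
    """A reported accuracy delta plus the four dimensions it depends on."""
    accuracy_delta: float
    model_family: Optional[str] = None
    scheme: Optional[str] = None
    calibration: Optional[str] = None
    workload: Optional[str] = None

    def missing_dimensions(self) -> List[str]:
        """Names of the dimensions the claim does not state."""
        required = ("model_family", "scheme", "calibration", "workload")
        return [name for name in required if not getattr(self, name)]


# A format-level claim: a number and a scheme, nothing else.
headline = QuantizationClaim(accuracy_delta=-0.2, scheme="INT8 PTQ")
gaps = headline.missing_dimensions()
```

Here `gaps` lists the model family, calibration, and workload as unstated, which is the signal that the headline number cannot yet be transferred to another deployment.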

The framing that actually helps

Quantization in ML is a calibrated approximation discipline. Its results are conditional on the calibration data, the scheme parameters, the model family’s activation distribution, and the evaluation workload. The general principle that quantization is controlled approximation rather than damage holds across the whole family — but the constants in the equation differ across model families, and reporting a result without those constants is reporting a number without its units.

LynxBench AI treats quantization as a per-model-family, per-scheme, per-calibration evaluation regime, not as a single binary “quantized or not” axis, because the conditions of the result are what determine whether the result transfers to the workload that needs it. Before citing a quantization result, strip away the format name and ask two things: on which model family, scheme, and calibration corpus did the result establish achievable quality at this precision, and which of those constants is the deployment workload silently being asked to extrapolate across rather than measure against?

Frequently Asked Questions

How does post-training quantization (PTQ) differ from quantization-aware training (QAT)?

PTQ applies the lower-precision mapping after the model is trained, choosing scales and zero-points to minimize error on a calibration set without changing the weights. QAT instead simulates quantization during training so the optimizer can adapt the weights to the coarser representation, which usually recovers more accuracy at the cost of a training run. PTQ is faster and needs no labelled training loop; QAT is the lever you reach for when PTQ leaves accuracy regression you cannot accept on a sensitive model family.

Why does symmetric versus asymmetric quantization change the error characteristics at the same bit width?

Symmetric quantization centres the representable range on zero with a single scale, while asymmetric quantization adds a zero-point offset so the range can track a distribution that is not centred on zero. For activations like post-ReLU values that are entirely non-negative, a symmetric scheme wastes half its codes on a range the data never visits, coarsening the granularity where the values actually live. Asymmetric mapping reclaims that range, which is why two INT8 models can carry different error even though both are eight-bit.
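The wasted-codes effect is easy to reproduce. The sketch below assumes evenly spaced non-negative activations in [0, 6] as a stand-in for post-ReLU values and compares the mean round-trip error of the two 8-bit mappings. Function names and the range are illustrative.

```python
def symmetric_roundtrip(x, max_abs):
    """Signed codes -127..127 spread over [-max_abs, max_abs]."""
    scale = max_abs / 127
    q = max(-127, min(127, round(x / scale)))
    return q * scale


def asymmetric_roundtrip(x, lo, hi):
    """Unsigned codes 0..255 spread over [lo, hi] using a zero-point offset."""
    scale = (hi - lo) / 255
    zero_point = round(-lo / scale)
    q = max(0, min(255, round(x / scale) + zero_point))
    return (q - zero_point) * scale


# Stand-in for post-ReLU activations: 1000 evenly spaced values in [0, 6].
post_relu = [6.0 * i / 999 for i in range(1000)]

sym_err = sum(abs(x - symmetric_roundtrip(x, 6.0)) for x in post_relu) / len(post_relu)
asym_err = sum(abs(x - asymmetric_roundtrip(x, 0.0, 6.0)) for x in post_relu) / len(post_relu)
```

The symmetric mapping spends its negative codes on values that never occur, so its step over [0, 6] is about twice as large and its mean error roughly doubles compared with the asymmetric mapping at the same bit width.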

When quantization accuracy loss looks unacceptable, how do I isolate the cause?

Start by separating the three candidate causes the article names: calibration, scheme choice, and genuine model sensitivity. Re-run with a workload-shaped calibration set first — if the regression shrinks, the calibration corpus was the problem. If it persists, swap to an outlier-aware scheme such as SmoothQuant or an asymmetric per-channel variant; only after both of those fail to recover accuracy should you conclude the model family is genuinely sensitive at that precision and consider QAT or a higher bit width.
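The triage order above can be written down as a small decision helper. This is a sketch under stated assumptions: the function name is hypothetical, the deltas are accuracy losses you have already measured in separate runs, and "recovered" is defined as falling within a tolerance you choose, which is a stricter reading than "the regression shrinks".

```python
def isolate_regression_cause(recalibrated_delta, outlier_aware_delta, tolerance):
    """Attribute an accuracy regression to its most likely cause.

    recalibrated_delta: accuracy loss after re-running with a workload-shaped
        calibration set (positive means worse than the unquantized model).
    outlier_aware_delta: accuracy loss after additionally switching to an
        outlier-aware or asymmetric per-channel scheme.
    tolerance: largest accuracy loss considered acceptable.
    """
    if recalibrated_delta <= tolerance:
        return "calibration corpus"
    if outlier_aware_delta <= tolerance:
        return "scheme choice"
    return "model sensitivity at this precision"


# Hypothetical measurements, in accuracy points lost.
verdict = isolate_regression_cause(
    recalibrated_delta=1.5, outlier_aware_delta=0.3, tolerance=0.5
)
```

Only the last outcome justifies the more expensive remedies, QAT or a higher bit width. The first two are fixed by changing the calibration set or the scheme.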