Latency Definition for AI Inference: A Domain-Specific Anchor

“Latency” without a domain isn’t a measurement

The word “latency” appears in performance reports across networking, storage, databases, web services, and AI inference, and the assumption that it means the same thing everywhere is the source of a surprising amount of cross-team miscommunication. In each domain, latency is the elapsed time between two events — but which two events, and what the workload that produces them looks like, differs enough that the numbers are not comparable across domains and not interchangeable within a single benchmark report.

For AI inference, latency has a specific operational meaning. Pinning it down — and distinguishing it from the latency definitions used in adjacent domains — is the prerequisite for reading or producing useful inference benchmark results. In our experience, this is also where most cross-functional disagreements about “performance” originate: two teams use the same word for two different physical quantities and reach different conclusions from the same number.

What is latency in AI inference, precisely?

Latency in AI inference is the elapsed wall-clock time from the arrival of an input request at the inference service to the completion of the corresponding output, measured per request, under a declared batch size, concurrency level, and request arrival pattern.

Three things are notable about that definition.

It is per-request. A single inference latency number describes one request. A batch of requests has many latencies, not one. Reporting “the latency” of a batched system without specifying which request in the batch (or which percentile across requests) is under-specified.

It includes everything between the two events. Queue time, model-load time (if not amortized), framework dispatch through PyTorch or TensorRT, kernel execution on the accelerator, post-processing, and serialization back to the client are all inside the latency envelope. Reports that name only the kernel-execution component as “latency” are reporting a model-execution time, not an inference latency. In production, the two can differ by an order of magnitude.

It depends on conditions, not just the model and hardware. Batch size, concurrency, and request arrival distribution change the latency the same model produces on the same accelerator. Changing any of these without re-stating the configuration changes the number being reported. This is an observed pattern across the inference deployments we look at: the model and the GPU stay constant, the batch policy moves, and the “latency” number triples or halves with no other change.
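The per-request, everything-included reading of the definition can be sketched as a small record of timestamps. This is a minimal illustration, not a real tracing API: the `RequestTrace` class, its field names, and the example timings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestTrace:
    """Timestamps for ONE request, in milliseconds on a monotonic clock.

    Hypothetical structure for illustration; the field names are assumptions.
    """
    arrived_ms: float      # request reaches the inference service
    exec_start_ms: float   # first kernel starts on the accelerator
    exec_end_ms: float     # last kernel finishes
    completed_ms: float    # output fully serialized back to the client

    @property
    def inference_latency_ms(self) -> float:
        # Arrival to completion: queueing, dispatch, execution, post-processing.
        return self.completed_ms - self.arrived_ms

    @property
    def model_execution_ms(self) -> float:
        # Only the accelerator-side forward pass.
        return self.exec_end_ms - self.exec_start_ms


# Made-up timings: the request queues for most of its lifetime.
trace = RequestTrace(arrived_ms=0.0, exec_start_ms=38.0,
                     exec_end_ms=42.0, completed_ms=45.0)
print("inference latency (ms):", trace.inference_latency_ms)
print("model execution (ms):  ", trace.model_execution_ms)
```

With these invented timestamps the inference latency is 45 ms while the model-execution time is 4 ms, which is the kind of gap a kernel-only “latency” figure conceals.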

How AI inference latency differs from latency in other domains

The numerical units — typically milliseconds — are shared across domains. The physical quantities those units describe are not. The table below names the two events that bound the elapsed-time measurement in each domain, and the conditions that govern it.

Domain	What latency is	What it depends on
Networking	Round-trip transit time of a packet between endpoints	Distance, link bandwidth, queueing, protocol overhead
Storage	Time from I/O request to I/O completion	Queue depth, service time at the storage device, caching layer
Database query	Time from query submission to result return	Query plan, index hit/miss, lock contention, IO subsystem
Web service	Time from HTTP request to response received	Application processing + downstream calls + network legs
AI inference	Time from request arrival to inference output completion	Batch size, concurrency, model size, precision, executor saturation

A networking latency of 5 ms and an inference latency of 5 ms are not comparable as “system performance”; they are reporting on different operations against different infrastructure with different governing dynamics. The networking number describes packet transit; the inference number describes computation that includes everything from queue admission through CUDA kernel launches on the accelerator to result serialization.

A benchmark report that mixes these without scoping each — for example, claiming an “end-to-end latency” of N ms without separating the network leg from the inference leg — is folding incommensurable quantities into a single number that no reader can decompose. The reader cannot tell whether tightening the network would move the headline number by 10% or by nothing at all.

This is also where the distinction between model latency and end-to-end system latency matters operationally. Model latency is the time a forward pass takes once execution has begun on the accelerator; end-to-end system latency is the per-request number defined above. A model-latency improvement of 2 ms is invisible to a user whose end-to-end latency is dominated by queueing under load. Conflating the two is one of the more common reasons inference benchmark results fail to predict production behaviour.
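A short arithmetic sketch makes the point concrete. The component names and timings below are invented, and the sketch assumes the queue wait is fixed (for example, set by a batch timeout) rather than shrinking when the forward pass gets faster.

```python
def end_to_end_ms(queue_ms, dispatch_ms, model_ms, post_ms):
    """End-to-end latency as the sum of its components (illustrative split)."""
    return queue_ms + dispatch_ms + model_ms + post_ms


# Hypothetical request under load: queueing dominates.
before = end_to_end_ms(queue_ms=180.0, dispatch_ms=3.0, model_ms=12.0, post_ms=5.0)
# Same request after a 2 ms model-latency improvement (12 ms -> 10 ms).
after = end_to_end_ms(queue_ms=180.0, dispatch_ms=3.0, model_ms=10.0, post_ms=5.0)

model_gain = (12.0 - 10.0) / 12.0          # relative gain in model latency
system_gain = (before - after) / before    # relative gain seen by the user

print(f"model latency improved by {model_gain:.1%}")
print(f"end-to-end latency improved by {system_gain:.1%}")
```

Under these assumed numbers, a model-latency gain of roughly 17% shows up as a 1% change in what the user experiences.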

Why a single average latency under-specifies AI inference

For AI inference specifically, the request-to-request variation in latency under load is large enough that a single mean or median number is inadequate as an operational measurement. The reasons are mechanical; they are not a matter of statistical sophistication for its own sake.

Batch effects. When the inference server batches requests, the latency a request experiences depends on where in the batch window it arrived. The first request in a forming batch waits for the batch to fill or the timeout to fire; the last request in a forming batch experiences near-zero queue time but the same kernel execution time.
Concurrency effects. Under sustained concurrent load, queue depth fluctuates, and request latencies spread accordingly. Average latency under a load pattern hides the worst-case behaviour the system is actually exposed to.
Saturation effects. As load approaches the AI Executor’s saturation point, latency distributions become heavy-tailed: a small fraction of requests experience much larger latencies than the median while the median moves only slightly. The mean drifts up; the tail explodes.
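The batch effect in particular is easy to reproduce in a few lines. The sketch below assumes a simplified dynamic-batching policy: a window opens at the first queued arrival and closes when the batch is full or the timeout fires, and the accelerator is idle when the batch closes. The function name and parameters are hypothetical and do not correspond to any specific inference server.

```python
def batch_latencies(arrivals_ms, max_batch, timeout_ms, exec_ms):
    """Per-request latency under a simplified dynamic-batching policy.

    arrivals_ms must be sorted ascending. Assumes the accelerator is idle
    whenever a batch closes, so execution starts immediately at dispatch.
    """
    latencies = []
    i, n = 0, len(arrivals_ms)
    while i < n:
        deadline = arrivals_ms[i] + timeout_ms   # window opens at first arrival
        j = i
        # Admit requests until the batch is full or the window has closed.
        while j < n and j - i < max_batch and arrivals_ms[j] <= deadline:
            j += 1
        is_full = (j - i) == max_batch
        # A full batch dispatches when its last member arrives;
        # a partial batch waits for the timeout.
        dispatch = arrivals_ms[j - 1] if is_full else deadline
        done = dispatch + exec_ms
        latencies.extend(done - a for a in arrivals_ms[i:j])
        i = j
    return latencies


full_batch = batch_latencies([0, 2, 4, 6], max_batch=4, timeout_ms=10, exec_ms=5)
partial_batch = batch_latencies([0, 3], max_batch=4, timeout_ms=10, exec_ms=5)
print("full batch:   ", full_batch)
print("partial batch:", partial_batch)
```

In the full-batch case the four requests see 11, 9, 7, and 5 ms: identical execution time, but the first arrival pays 6 ms of queue time and the last pays none. In the partial-batch case both requests wait for the timeout, which is why the batch policy has to be declared alongside the number.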

The minimum useful reporting unit for AI inference latency is therefore a percentile distribution under declared load conditions: p50, p95, p99 — and frequently p99.9 for latency-sensitive systems — at a stated batch size, concurrency, and arrival distribution. A single average number under load is structurally incapable of expressing what a latency-sensitive deployment needs to know about the system. We treat this as an observed pattern across the latency-sensitive deployments we audit, not as a benchmarked rate from a single named test.
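As a minimal sketch of such a report, the helper below computes nearest-rank percentiles alongside the mean. The function names are hypothetical, the nearest-rank method is one common convention among several, and the sample data is invented to show a heavy tail.

```python
import math


def percentile(sorted_samples, p):
    """Nearest-rank percentile of an ascending list (one common convention)."""
    rank = max(1, math.ceil(p * len(sorted_samples) / 100))
    return sorted_samples[rank - 1]


def latency_report(latencies_ms, percentiles=(50, 95, 99, 99.9)):
    """Summarize per-request latencies as percentiles plus the mean."""
    ordered = sorted(latencies_ms)
    report = {f"p{p}": percentile(ordered, p) for p in percentiles}
    report["mean"] = sum(ordered) / len(ordered)
    return report


# Invented sample: 98% of requests take 10 ms, 2% take 200 ms.
samples = [10] * 980 + [200] * 20
report = latency_report(samples)
for name, value in report.items():
    print(f"{name:>6}: {value} ms")
```

For this made-up sample the mean is 13.8 ms and p95 is 10 ms, while p99 and p99.9 are both 200 ms. A report carrying only the mean would give no hint that one request in fifty takes twenty times longer than the median.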

The strategic argument (when latency is the right target at all, versus when throughput should be) belongs to the discussion of throughput vs latency trade-offs. Operationally, the trade-off between the two metrics is governed by the latency distribution, not by a point estimate of latency, and benchmarks that report point estimates make the trade-off impossible to evaluate.
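How the distribution, rather than a point estimate, shifts with offered load can be illustrated with a toy load sweep. The sketch assumes a single-server FIFO queue with exponentially distributed inter-arrival and service times, a 10 ms mean service time, and no batching. It is a textbook queueing model chosen for brevity, not a model of any particular inference server, and all names in it are hypothetical.

```python
import math
import random


def simulate_latencies(load, n=20000, service_ms=10.0, seed=0):
    """Per-request latency (queue wait + service) for a single-server FIFO queue.

    load is the offered utilization (arrival rate x mean service time), 0 < load < 1.
    Inter-arrival and service times are exponential (an assumed toy model).
    """
    rng = random.Random(seed)
    wait = 0.0
    latencies = []
    for _ in range(n):
        service = rng.expovariate(1.0 / service_ms)
        latencies.append(wait + service)
        gap = rng.expovariate(1.0) * service_ms / load   # mean gap = service_ms / load
        # Lindley recursion: the next request waits for whatever work is left over.
        wait = max(0.0, wait + service - gap)
    return latencies


def percentile(sorted_samples, p):
    """Nearest-rank percentile of an ascending list."""
    rank = max(1, math.ceil(p * len(sorted_samples) / 100))
    return sorted_samples[rank - 1]


for load in (0.5, 0.8, 0.95):
    ordered = sorted(simulate_latencies(load))
    p50, p95, p99 = (percentile(ordered, p) for p in (50, 95, 99))
    print(f"load={load:.2f}  p50={p50:7.1f} ms  p95={p95:7.1f} ms  p99={p99:7.1f} ms")
```

Because the same seed is reused at every load level, each run sees the same service times and only the arrival gaps shrink, so the differences between rows come from load alone. Comparing percentiles across rows, instead of a single average, is what makes the throughput side of the trade-off readable.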

What disclosure makes an AI latency number meaningful?

A latency number for AI inference becomes interpretable when the report names:

The model and its size.
The precision regime of the inference (FP32 / FP16 / BF16 / INT8 / FP8 / quantization scheme).
The AI Executor — accelerator, driver, runtime, framework (PyTorch, TensorRT, ONNX Runtime), and inference server.
The batch size policy: static, dynamic with timeout, or continuous batching.
The concurrency level under which latency was measured.
The request arrival distribution: closed-loop, open-loop, or a specific load shape.
Which percentiles are reported (mean alone is insufficient).
Whether warm-up was excluded and how long the measurement window was.

A latency report that satisfies this list is informative. A latency report that names a number without these dimensions is reporting on an unspecified executor under unspecified conditions, and any reader who tries to compare it to their own deployment is comparing a known thing against an unknown thing. Two reports may print the same number; only the one with full disclosure tells the reader what that number measures.
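One way to make the checklist enforceable is to carry it as a structured record next to every reported number. The sketch below is a hypothetical schema: the class, the field names, and the example values are illustrative assumptions, not a LynxBench AI format.

```python
from dataclasses import dataclass, fields
from typing import Optional, Tuple


@dataclass
class LatencyDisclosure:
    """Conditions that should accompany a reported latency (hypothetical schema)."""
    model: Optional[str] = None                 # model name and size
    precision: Optional[str] = None             # e.g. FP16, INT8
    executor: Optional[str] = None              # accelerator, driver, runtime, framework, server
    batch_policy: Optional[str] = None          # static, dynamic with timeout, continuous
    concurrency: Optional[int] = None           # concurrent requests during measurement
    arrival_pattern: Optional[str] = None       # closed-loop, open-loop, or a load shape
    percentiles: Optional[Tuple[float, ...]] = None
    warmup_excluded: Optional[bool] = None
    window_seconds: Optional[float] = None      # length of the measurement window


def missing_dimensions(disclosure):
    """Names of the dimensions the report left unspecified."""
    return [f.name for f in fields(disclosure) if getattr(disclosure, f.name) is None]


# A report that names only the model and precision (placeholder values).
partial = LatencyDisclosure(model="example-model-7b", precision="FP16")
print("missing:", missing_dimensions(partial))
```

A number published with a non-empty `missing` list is exactly the case described above: a figure from an unspecified executor under unspecified conditions.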

The framing that helps

Latency for AI inference is the per-request, end-to-end elapsed time from request arrival to output completion under a declared batch, concurrency, and load configuration, and it is a different physical quantity from network, storage, database, or web latency. A useful AI latency report names percentiles under declared conditions, not an average without context. LynxBench AI treats latency as a distribution measured under disclosed batch, concurrency, and arrival conditions on a fully-specified AI Executor, because point-estimate latency under unspecified load is structurally incapable of informing the deployment decisions that latency-sensitive systems require.

Before the next AI latency number anchors a deployment decision, ask two questions. Which percentile, under what batch policy, concurrency, and arrival distribution, on which AI Executor, produced the figure? And is that the right metric for this workload at the operating point the production load profile will actually occupy?

Frequently Asked Questions

Does AI inference latency mean the same thing as networking or storage latency?

No. The units are shared — usually milliseconds — but the physical quantities are not. Networking latency measures packet round-trip transit; AI inference latency measures the per-request elapsed time from request arrival to output completion, including queue time, framework dispatch, kernel execution, and serialization. A 5 ms networking number and a 5 ms inference number describe different operations on different infrastructure and are not comparable as “system performance.”

What disclosures make a single AI inference latency number actually interpretable?

A latency figure becomes meaningful only when the report names the model and its size, the precision regime, the full AI Executor (accelerator, driver, runtime, framework, inference server), the batch-size policy, the concurrency level, the request arrival distribution, which percentiles are reported, and whether warm-up was excluded. Without those dimensions, the number describes an unspecified executor under unspecified conditions and cannot be compared against another deployment.

Why does the same model on the same GPU report wildly different latency numbers?

Because latency depends on operating conditions, not just the model and hardware. Batch size, concurrency, and request arrival distribution all change the latency a fixed model produces on a fixed accelerator. In the inference deployments we look at, the model and GPU stay constant while the batch policy moves, and the headline latency number can triple or halve with no other change — which is why a number quoted without its configuration is under-specified.

Which percentiles should a latency-sensitive AI deployment actually track?

The minimum useful reporting unit is a percentile distribution under declared load: p50, p95, and p99, with p99.9 frequently needed for latency-sensitive systems, each tied to a stated batch size, concurrency level, and arrival distribution. As load approaches the executor’s saturation point the distribution becomes heavy-tailed, so a small fraction of requests experience much larger latencies while the median barely moves — behaviour a single average is structurally incapable of revealing.