Latency

Quick Answer

The time delay from sending input to receiving the first output token (time-to-first-token) or the complete output (end-to-end latency).

Latency is the time from sending input to receiving output. Two metrics matter most: time-to-first-token (TTFT), which drives perceived responsiveness, and end-to-end latency, the total time until the full response is complete. Low latency is critical for interactive applications such as chat and real-time translation. Latency depends on model size, hardware, and implementation, and inference optimization techniques reduce it. Streaming outputs can make a response feel faster even when total latency is unchanged, because the user starts reading as soon as the first token arrives. Note that latency and throughput are often in tension: batching requests improves throughput but can increase per-request latency.
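The two metrics above can be measured directly from a token stream: TTFT is the time until the first token arrives, and end-to-end latency is the time until the stream is exhausted. A minimal sketch, using a hypothetical `fake_stream` generator in place of a real streaming LLM API:

```python
import time

def fake_stream(n_tokens=5, delay=0.01):
    """Stand-in for a streaming LLM API: yields tokens with a per-token delay."""
    for i in range(n_tokens):
        time.sleep(delay)
        yield f"tok{i}"

def measure_latency(stream):
    """Return (time-to-first-token, end-to-end latency) in seconds."""
    start = time.perf_counter()
    ttft = None
    for _ in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first token arrived
    total = time.perf_counter() - start
    return ttft, total

ttft, total = measure_latency(fake_stream())
print(f"TTFT: {ttft:.3f}s, end-to-end: {total:.3f}s")
```

With a real API the same loop applies to the provider's streaming iterator; TTFT will always be at most the end-to-end latency, and the gap between them grows with output length.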

Last verified: 2026-04-08
