Inference
Throughput
Quick Answer
The number of tokens generated per unit time, measuring inference speed at scale.
Throughput measures how many tokens a system produces per second, making it the key metric for batch processing and high-volume serving. Higher throughput lets the same hardware handle more concurrent requests, and it depends primarily on hardware utilization and batch size. Common techniques for improving throughput include batching, quantization, and efficient attention implementations. Latency and throughput often trade off against each other: optimizing one can hurt the other, so different applications prioritize them differently (interactive chat favors low latency, offline batch jobs favor high throughput).
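A minimal sketch of how throughput is typically measured: total tokens produced divided by wall-clock time, with prompts processed in batches. The `generate` function and token counts here are toy stand-ins, not any specific library's API.

```python
import time

def measure_throughput(generate, prompts, batch_size):
    """Return tokens/sec for a batched generation function (hypothetical API)."""
    start = time.perf_counter()
    total_tokens = 0
    for i in range(0, len(prompts), batch_size):
        batch = prompts[i:i + batch_size]
        outputs = generate(batch)  # one batched call; returns token lists
        total_tokens += sum(len(tokens) for tokens in outputs)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed

# Toy generator that emits 8 tokens per prompt, standing in for a real model.
def fake_generate(batch):
    return [["tok"] * 8 for _ in batch]

tps = measure_throughput(fake_generate, ["prompt"] * 32, batch_size=8)
print(f"{tps:.0f} tokens/sec")
```

Increasing `batch_size` usually raises throughput on real accelerators by improving hardware utilization, at the cost of higher per-request latency.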
Last verified: 2026-04-08