Inference
Quick Answer
The process of running a trained model to generate outputs given inputs.
Inference is the deployment of a trained model in production. Unlike training, which updates the model's weights, inference generates outputs without any learning. Because it runs continuously and at scale, inference must be fast and efficient. Common optimization techniques include quantization, KV caching, batching, and speculative decoding. In real-world AI systems, inference dominates compute: most of the ML compute budget goes to serving models, not training them. Understanding inference bottlenecks is therefore crucial for deployment, and different applications impose different latency and throughput requirements.
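To make one of these techniques concrete, here is a minimal sketch of symmetric int8 weight quantization, a common inference optimization. This is a toy, pure-Python illustration (the function names and the per-tensor scaling scheme are assumptions for the example, not any particular library's API), not a production kernel:

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization.

    Maps each float weight to an integer in [-127, 127] using a
    single scale factor: scale = max(|w|) / 127.
    """
    scale = max(abs(w) for w in weights) / 127.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize_int8(quantized, scale):
    """Recover approximate float weights from int8 values."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.003, 0.9]
q, scale = quantize_int8(weights)
recovered = dequantize_int8(q, scale)
# Each recovered value differs from the original by at most
# half a quantization step (scale / 2), illustrating the
# accuracy/size trade-off quantization makes.
```

Storing the int8 values plus one scale factor cuts memory (and memory bandwidth, the usual inference bottleneck) roughly 4x versus float32, at the cost of small rounding error per weight.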
Last verified: 2026-04-08