Inference
Quick Answer
The process of running a trained model to generate outputs given inputs.
Inference is the deployment of a trained model in production. Unlike training, which updates the model's weights, inference generates outputs without any learning. Because it runs continuously and at scale, inference must be fast and efficient. Common optimization techniques include quantization, KV caching, batching, and speculative decoding. In real-world AI systems, inference dominates compute: most of the ML compute budget goes to serving models, not training them. Understanding inference bottlenecks is therefore crucial for deployment, and different applications impose different latency and throughput requirements.
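To make one of these techniques concrete, here is a minimal sketch of symmetric int8 weight quantization, a common inference optimization. This is a toy, pure-Python illustration (the function names and the per-tensor scaling scheme are assumptions for the example, not any particular library's API), not a production kernel:

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization.

    Maps each float weight to an integer in [-127, 127] using a
    single scale factor: scale = max(|w|) / 127.
    """
    scale = max(abs(w) for w in weights) / 127.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize_int8(quantized, scale):
    """Recover approximate float weights from int8 values."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.003, 0.9]
q, scale = quantize_int8(weights)
recovered = dequantize_int8(q, scale)
# Each recovered value differs from the original by at most
# half a quantization step (scale / 2), illustrating the
# accuracy/size trade-off quantization makes.
```

Storing the int8 values plus one scale factor cuts memory (and memory bandwidth, the usual inference bottleneck) roughly 4x versus float32, at the cost of small rounding error per weight.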
Last verified: 2026-04-08