Inference
vLLM
Quick Answer
A high-throughput, memory-efficient LLM inference engine with optimized batching and caching.
vLLM is an open-source inference engine designed for high-throughput LLM serving. Its core technique, PagedAttention, manages the KV cache the way an operating system manages virtual memory: the cache is split into fixed-size blocks that are allocated on demand rather than reserved contiguously per request. This largely eliminates memory fragmentation, so far more concurrent requests can be batched together, yielding large throughput gains over naive implementations. vLLM also supports continuous batching, tensor parallelism across multiple GPUs, and quantization; with quantization, even 70B-class models can be served on a single high-memory GPU. It is widely used and practical for deploying LLMs efficiently.
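The on-demand block allocation behind PagedAttention can be illustrated with a toy sketch. This is not vLLM's actual implementation; the class and method names here are hypothetical, and only the idea (a per-sequence block table mapping logical token positions to physical cache blocks allocated on demand) reflects the real design:

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM's default block size is also 16)

class PagedKVCache:
    """Toy allocator: sequences get cache blocks lazily, one block at a time."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}                      # seq_id -> list of block ids
        self.seq_lens = {}                          # seq_id -> tokens stored

    def append_token(self, seq_id: int) -> None:
        """Reserve cache space for one new token of a sequence."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:                # current block is full (or none yet)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; a real engine would preempt")
            table.append(self.free_blocks.pop())    # allocate a block on demand
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4)
for _ in range(20):                      # 20 tokens need ceil(20/16) = 2 blocks
    cache.append_token(seq_id=0)
print(len(cache.block_tables[0]))        # 2
cache.free(0)
print(len(cache.free_blocks))            # 4 (all blocks recycled)
```

Because no sequence reserves more than it uses (plus at most one partially filled block), memory that a contiguous-allocation scheme would waste stays available for batching additional requests.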
Last verified: 2026-04-08