Inference
vLLM
Quick Answer
A high-throughput, memory-efficient LLM inference engine with optimized batching and caching.
vLLM is an open-source inference engine designed for high-throughput LLM serving. Its core technique, PagedAttention, manages the KV cache the way an operating system manages virtual memory: the cache is split into fixed-size blocks that are allocated on demand rather than reserved contiguously per request. This largely eliminates memory fragmentation, so far more concurrent requests can be batched together, yielding large throughput gains over naive implementations. vLLM also supports continuous batching, tensor parallelism across multiple GPUs, and quantization; with quantization, even 70B-class models can be served on a single high-memory GPU. It is widely used and practical for deploying LLMs efficiently.
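The on-demand block allocation behind PagedAttention can be illustrated with a toy sketch. This is not vLLM's actual implementation; the class and method names here are hypothetical, and only the idea (a per-sequence block table mapping logical token positions to physical cache blocks allocated on demand) reflects the real design:

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM's default block size is also 16)

class PagedKVCache:
    """Toy allocator: sequences get cache blocks lazily, one block at a time."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}                      # seq_id -> list of block ids
        self.seq_lens = {}                          # seq_id -> tokens stored

    def append_token(self, seq_id: int) -> None:
        """Reserve cache space for one new token of a sequence."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:                # current block is full (or none yet)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; a real engine would preempt")
            table.append(self.free_blocks.pop())    # allocate a block on demand
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4)
for _ in range(20):                      # 20 tokens need ceil(20/16) = 2 blocks
    cache.append_token(seq_id=0)
print(len(cache.block_tables[0]))        # 2
cache.free(0)
print(len(cache.free_blocks))            # 4 (all blocks recycled)
```

Because no sequence reserves more than it uses (plus at most one partially filled block), memory that a contiguous-allocation scheme would waste stays available for batching additional requests.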
Last verified: 2026-04-08