Architecture

Speculative Decoding

Quick Answer

A technique in which a small, fast draft model predicts several tokens ahead and a larger target model verifies them, speeding up generation without changing the output.

Speculative decoding pairs a small, fast draft model with the large target model. The draft proposes several future tokens autoregressively; the target then verifies the whole proposal in a single forward pass. Tokens the target agrees with are accepted; at the first disagreement, the target's own token is substituted and the rest of the proposal is discarded. Because the target scores many positions per pass instead of one, wall-clock latency drops. With a proper acceptance rule there is no accuracy loss: the output is identical to serially decoding from the target alone. The technique is a practical inference optimization for deployed systems; it works best when the draft model's predictions correlate well with the target's, and it trades higher GPU memory and compute for the latency win.
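The verify-and-accept loop above can be sketched in a few lines. This is a minimal greedy-decoding sketch, not a production implementation: the models are stand-in callables that map a token prefix to the next token, and the names `speculative_decode`, `target`, and `draft` are illustrative, not from any particular library. In a real system the target would score all proposed positions in one batched forward pass; here it is called per position for clarity.

```python
from typing import Callable, List

# A "model" here is any function from a token prefix to the next token.
NextToken = Callable[[List[int]], int]

def speculative_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[int],
    max_new: int = 16,
    k: int = 4,
) -> List[int]:
    """Greedy speculative decoding: the draft proposes k tokens, the
    target verifies them, and the accepted prefix (plus the target's
    correction or a bonus token) is appended. For greedy decoding the
    result matches decoding with the target alone."""
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # 1) Draft proposes k tokens autoregressively (the cheap step).
        proposal, ctx = [], list(seq)
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2) Target verifies the proposal position by position.
        accepted, ctx = [], list(seq)
        for t in proposal:
            want = target(ctx)
            if want == t:
                accepted.append(t)      # draft and target agree: accept
                ctx.append(t)
            else:
                accepted.append(want)   # first mismatch: take target's token
                break
        else:
            # All k proposals accepted: the verify pass yields one bonus token.
            accepted.append(target(ctx))
        seq.extend(accepted)
    return seq[: len(prompt) + max_new]
```

Note the invariant that makes the speedup "free": every accepted token is exactly what the target would have emitted at that position, so the output sequence is unchanged; only the number of (expensive) target passes per generated token shrinks when the draft guesses well.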

Last verified: 2026-04-08
