Architecture

Vocabulary

Quick Answer

The set of all tokens a model can output, learned during training.

The vocabulary is the complete set of tokens the tokenizer can produce, learned during tokenizer training on the pretraining corpus. Vocabulary sizes historically ranged from roughly 30K to 50K tokens; many recent LLMs use 100K or more. A larger vocabulary can represent rare words as single tokens but enlarges the embedding table and output layer; a smaller vocabulary keeps those layers compact but needs more tokens to represent the same text. The vocabulary is fixed at inference—the model cannot generate tokens outside its vocabulary. Vocabulary misalignment (pairing a model with the wrong tokenizer) is a common source of errors.
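A minimal sketch of what a fixed vocabulary implies, using a toy greedy longest-match tokenizer. The vocabulary entries here are hypothetical, not taken from any real model; real tokenizers (e.g. BPE) learn their vocabulary from data, but the lookup-and-fallback behavior is the same idea:

```python
# Toy fixed vocabulary: token string -> integer ID.
# These entries are illustrative only, not from a real tokenizer.
VOCAB = {"<unk>": 0, "low": 1, "er": 2, "new": 3, "est": 4}
ID_TO_TOKEN = {i: t for t, i in VOCAB.items()}

def encode(text: str) -> list[int]:
    """Greedy longest-match: take the longest vocab entry matching the remaining text."""
    ids = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                ids.append(VOCAB[text[i:j]])
                i = j
                break
        else:
            # Character not covered by any vocab entry: fall back to <unk>.
            # The model can never emit anything outside VOCAB.
            ids.append(VOCAB["<unk>"])
            i += 1
    return ids

def decode(ids: list[int]) -> str:
    return "".join(ID_TO_TOKEN[i] for i in ids)

print(encode("lower"))   # -> [1, 2]   ("low" + "er")
print(encode("newest"))  # -> [3, 4]   ("new" + "est")
print(encode("z"))       # -> [0]      (out of vocabulary)
print(decode([1, 2]))    # -> "lower"
```

This also makes the size trade-off concrete: the embedding table has one row per vocabulary entry (roughly `vocab_size × d_model` parameters), so doubling the vocabulary doubles that table, while a richer vocabulary lets `encode` cover the same text in fewer tokens.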

Last verified: 2026-04-08
