Architecture
Byte-Pair Encoding
Quick Answer
A tokenization algorithm that iteratively merges the most frequent adjacent pair of bytes or characters into a new vocabulary token.
Byte-pair encoding (BPE) starts from a base vocabulary of individual bytes or characters, then repeatedly finds the most frequent adjacent pair of tokens in the training corpus and merges it into a new token. The process repeats until the vocabulary reaches a target size. The learned merges act as subword units, so the tokenizer can handle unseen words by decomposing them into known subwords. BPE is widely used in LLMs and offers a good balance between vocabulary size and compression (the number of tokens needed to represent a text). A BPE vocabulary learned on one corpus may not generalize well to very different text distributions.
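The merge loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: it works on characters rather than bytes, treats each word independently, and stops after a fixed number of merges instead of a target vocabulary size. The names train_bpe, merge_pair, and encode, and the toy corpus, are hypothetical and chosen only for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy setup)."""
    # Each distinct word starts as a tuple of single characters, weighted by count.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; on ties, the pair counted first wins.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge to every word before the next round.
        merged = Counter()
        for symbols, freq in words.items():
            merged[merge_pair(symbols, best)] += freq
        words = merged
    return merges


def encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus (assumed for illustration): word frequencies are set by repetition.
corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = train_bpe(corpus, num_merges=4)
print(merges)                    # [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
print(encode("lowest", merges))  # ['low', 'est']
```

The word "lowest" never appears in the toy corpus, yet it is encoded as the two learned subwords "low" and "est", which illustrates how BPE handles unseen words. Real tokenizers typically operate on bytes, apply a pre-tokenization step, and use more efficient data structures than this rescan-everything loop.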
Last verified: 2026-04-08