Pretraining

Quick Answer

The initial training phase where models learn language from vast unlabeled text corpora.

Pretraining is the first phase of LLM training, in which a model learns from an enormous corpus of diverse text. The objective is typically next-token prediction (language modeling): given a sequence of tokens, the model learns to predict the token that comes next. Through this single objective, the model picks up grammar, factual knowledge, reasoning patterns, and code. Pretraining is extremely computationally expensive, with training runs for state-of-the-art models estimated at hundreds of millions to billions of dollars. The quality and diversity of the pretraining data heavily influence the quality of the resulting model. Many open-source model families release a pretrained base version that is then further fine-tuned. Pretraining produces general-purpose models; specialization comes later through fine-tuning.
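To make the next-token prediction objective concrete, here is a minimal PyTorch sketch. The vocabulary size, dimensions, and random token data are illustrative assumptions, and the embedding-plus-linear "model" is a stand-in for a real transformer; the point is the loss setup: targets are the input sequence shifted one position left, and the loss is cross-entropy between the predicted next-token distribution and the actual next token.

```python
import torch
import torch.nn.functional as F

# Toy setup standing in for a real LLM: tiny vocabulary and embedding size.
vocab_size = 100
embed_dim = 32

torch.manual_seed(0)
embedding = torch.nn.Embedding(vocab_size, embed_dim)
lm_head = torch.nn.Linear(embed_dim, vocab_size)

# A batch of random token-ID sequences standing in for tokenized text.
tokens = torch.randint(0, vocab_size, (4, 16))  # (batch, seq_len)

# Inputs are every token except the last; targets are the same sequence
# shifted left by one: the model predicts token t+1 from token t.
inputs = tokens[:, :-1]
targets = tokens[:, 1:]

logits = lm_head(embedding(inputs))  # (batch, seq_len - 1, vocab_size)

# Standard language-modeling loss: cross-entropy between the predicted
# next-token distributions and the actual next tokens.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(loss.item())
```

A real pretraining run applies this same loss over trillions of tokens, with a causal transformer in place of the toy embedding-plus-linear model.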

Last verified: 2026-04-08
