
Direct Preference Optimization (DPO)

Quick Answer

A training method that optimizes a language model directly on human preference data, without training a separate reward model.

DPO is a simpler, more efficient alternative to RLHF. Instead of fitting a separate reward model and then optimizing against it with reinforcement learning, DPO trains the policy directly on preference data: a classification-style loss pushes up the likelihood of the preferred output relative to the dispreferred one, while a frozen reference model keeps the policy from drifting too far. This gives DPO several practical advantages: it is simpler to implement, more stable to train, and cheaper to run, and empirically it achieves results comparable to RLHF with less computation. It requires preference pairs (chosen/rejected) rather than scalar rewards. These properties have made DPO an increasingly popular alternative to RLHF and have helped make alignment training more accessible.
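The per-pair objective described above can be sketched as follows. This is a minimal illustration, not a production implementation: it assumes you already have summed log-probabilities of the chosen and rejected responses under both the policy and the frozen reference model (the function name and the example values are hypothetical).

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair: -log sigmoid(beta * margin).

    The margin compares how much more the policy favors the chosen
    response over the rejected one, relative to the reference model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log sigmoid(margin); minimized by widening the preference margin
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical log-probs: the policy favors the chosen response more
# than the reference does, so the loss drops below -log(0.5) ~= 0.693.
loss = dpo_loss(-10.0, -14.0, -11.0, -12.0)
```

A detail worth noting: `beta` controls how strongly the policy is allowed to deviate from the reference model; small values (e.g. 0.1) keep training close to the reference.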

Last verified: 2026-04-08
