Training
Reward Model
Quick Answer
A model trained to predict human preferences, used to provide the reward signal that guides policy optimization in RLHF.
A reward model predicts how much humans prefer one output over another. It is trained on human preference judgments, typically pairs of outputs for the same prompt labeled with which one was preferred, and it then supplies the scalar reward signal that drives RL optimization of the policy. Reward model quality heavily influences RLHF outcomes: training requires diverse, high-quality preference data, and reward models can overfit to the distribution of that data. Recent work explores alternatives to explicit reward models, such as optimizing the policy directly on preference data.
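To make the training objective concrete, here is a minimal sketch of pairwise reward-model training with a Bradley-Terry style loss (maximize the margin between the score of the preferred and dispreferred output). The `RewardModel` class, the pooled-feature inputs, and all hyperparameters are illustrative assumptions, not a reference implementation; in practice the backbone would be a pretrained language model scoring full prompt-response pairs.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Illustrative reward model: a small backbone plus a scalar value head."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        # Stand-in for a pretrained transformer backbone (assumption for the sketch).
        self.backbone = nn.Linear(hidden_size, hidden_size)
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, hidden_size) pooled representation of a prompt + response.
        return self.value_head(torch.tanh(self.backbone(features))).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected)."""
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# Toy training step; random features stand in for encoded (prompt, response) pairs.
model = RewardModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

chosen_feats = torch.randn(8, 768)    # representations of preferred responses
rejected_feats = torch.randn(8, 768)  # representations of dispreferred responses

optimizer.zero_grad()
loss = preference_loss(model(chosen_feats), model(rejected_feats))
loss.backward()
optimizer.step()
```

The trained model's scalar outputs can then be used as the per-sample reward when optimizing the policy with an RL algorithm such as PPO.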
Last verified: 2026-04-08