Evaluation

Calibration

Quick Answer

How well a model's confidence scores match its actual probability of being correct.

Calibration measures whether a model's stated confidence matches its actual accuracy. When a perfectly calibrated model reports 80% confidence, it is correct 80% of the time. Many models are overconfident: they report 90% confidence but are correct only about 70% of the time. Poor calibration is especially problematic in risk-sensitive applications, where downstream decisions rely on trustworthy confidence estimates. Sampling temperature affects calibration, and calibration can be improved through fine-tuning. Measuring it requires data annotated with both model confidences and ground-truth correctness.
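One common way to quantify the gap described above is expected calibration error (ECE): predictions are grouped into confidence bins, and the weighted average gap between mean confidence and accuracy is computed per bin. The sketch below is illustrative; the bin count and the example data are assumptions, not values from this page.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected Calibration Error: the weighted average gap between
    mean confidence and observed accuracy within each confidence bin."""
    buckets = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        # Place confidence c in bin [i/n_bins, (i+1)/n_bins);
        # clamp c == 1.0 into the top bin.
        i = min(int(c * n_bins), n_bins - 1)
        buckets[i].append((c, y))
    n = len(confidences)
    ece = 0.0
    for bucket in buckets:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(y for _, y in bucket) / len(bucket)
        # Weight each bin's gap by the fraction of samples it holds.
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece

# An overconfident model: reports 90% confidence but is right 70% of the time.
confs = [0.9] * 10
labels = [1] * 7 + [0] * 3   # 1 = correct, 0 = incorrect
print(round(expected_calibration_error(confs, labels), 2))  # 0.2
```

A perfectly calibrated model yields an ECE of 0; the 0.2 here is exactly the 90%-confidence / 70%-accuracy gap from the example above.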

Last verified: 2026-04-08
