How to Evaluate LLM Output Quality: A Practical Guide
Quick answer: Build a small but representative evaluation set (50-200 examples) for your specific task, define a clear scoring rubric, and use LLM-as-judge to score outputs. Run evals before every model change and prompt change. Without evals, every production change is a gamble.
Why evaluation matters
LLM outputs are non-deterministic and hard to unit-test. Without systematic evaluation:
- You don't know if a model upgrade improved or hurt your product
- Prompt changes can silently break edge cases
- Different model tiers may have invisible quality tradeoffs
- You can't benchmark across providers objectively
Evaluation turns LLM development from vibes-driven guesswork into engineering.
Step 1: Build your evaluation dataset
A good eval set has:
- Representative examples: Covers the realistic distribution of your use case
- Edge cases: Includes difficult, ambiguous, or rare inputs
- Ground truth: Clear expected outputs for each input
- Scale: 50-200 examples for most tasks; 500+ for complex classification
# Example eval dataset structure
eval_dataset = [
    {
        "id": "01",
        "input": "Summarize: The Federal Reserve raised interest rates...",
        "expected_output": "[Ground truth summary]",
        "criteria": ["accuracy", "brevity", "key_facts_included"],
    },
    # ...
]
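Before spending judge-model tokens on a dataset, it is worth checking that every example is well-formed. A minimal validation sketch (the required keys mirror the structure above; the function name is illustrative):

```python
REQUIRED_KEYS = {"id", "input", "expected_output", "criteria"}

def validate_dataset(examples: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the dataset is usable."""
    problems = []
    seen_ids = set()
    for i, ex in enumerate(examples):
        missing = REQUIRED_KEYS - ex.keys()
        if missing:
            problems.append(f"example {i}: missing keys {sorted(missing)}")
        if ex.get("id") in seen_ids:
            problems.append(f"example {i}: duplicate id {ex['id']!r}")
        seen_ids.add(ex.get("id"))
    return problems
```

Running this in CI keeps a malformed example from silently skewing every eval that follows.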
Step 2: Define scoring criteria
For each task, define 3-5 specific criteria to score:
For summarization:
- Factual accuracy (0-2): No hallucinations, key facts correct
- Completeness (0-2): Covers main points
- Conciseness (0-2): No padding, appropriate length
For code generation:
- Correctness (0-3): Code runs and produces correct output
- Code quality (0-2): Readable, idiomatic, handles edge cases
For customer support:
- Helpfulness (0-3): Resolves or addresses the issue
- Tone (0-2): Professional, empathetic
- Accuracy (0-2): Factually correct
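Rubrics like the ones above are easier to keep consistent across judge prompts and dashboards if they live in code. A sketch, using the summarization criteria as data (the structure and helper names are illustrative, not a standard API):

```python
# Criterion name -> (max score, description), mirroring the summarization rubric above
SUMMARIZATION_RUBRIC = {
    "factual_accuracy": (2, "No hallucinations; key facts correct"),
    "completeness": (2, "Covers the main points of the source"),
    "conciseness": (2, "No padding; appropriate length"),
}

def max_total(rubric: dict) -> int:
    """Maximum achievable total score for a rubric."""
    return sum(max_score for max_score, _ in rubric.values())

def normalize(scores: dict, rubric: dict) -> float:
    """Convert raw per-criterion scores to a 0-1 fraction of the maximum."""
    return sum(scores.values()) / max_total(rubric)
```

Normalizing to a 0-1 scale makes scores comparable across tasks whose rubrics have different maximums.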
Step 3: LLM-as-judge
For tasks where automated metrics don't apply (open-ended generation, nuanced quality), use a powerful LLM to score outputs:
import json

import anthropic

client = anthropic.Anthropic()

def llm_judge(task_input: str, model_output: str, criteria: list[str]) -> dict:
    judge_prompt = f"""
You are an expert evaluator. Score the following LLM output on each criterion.

Task input: {task_input}
Model output: {model_output}

Score each criterion from 0-3:
{chr(10).join(f'- {c}' for c in criteria)}

Return JSON: {{"scores": {{"criterion": score}}, "reasoning": "...", "total": N}}
"""
    response = client.messages.create(
        model="claude-sonnet-4",  # use a strong judge model
        max_tokens=512,
        messages=[{"role": "user", "content": judge_prompt}],
    )
    return json.loads(response.content[0].text)

# Run evaluation
def run_eval(test_fn, dataset: list) -> dict:
    scores = []
    for example in dataset:
        output = test_fn(example["input"])
        score = llm_judge(example["input"], output, example["criteria"])
        scores.append(score["total"])
    return {
        "mean_score": sum(scores) / len(scores),
        "min_score": min(scores),
        # 6 assumes three criteria scored 0-3; adjust the threshold to your rubric
        "pass_rate": sum(1 for s in scores if s >= 6) / len(scores),
    }
Step 4: Automated metrics where applicable
For tasks with measurable outputs:
- Classification: accuracy, precision, recall, F1
- Code generation: unit test pass rate, execution success rate
- Extraction: exact match, F1 on extracted fields
- Translation: BLEU score (a rough proxy, not perfect)
- Summarization: ROUGE score plus LLM-as-judge
from sklearn.metrics import classification_report

# classify() is your model-call wrapper returning a predicted label;
# classification eval sets store the gold label under "expected"
y_true = [ex["expected"] for ex in eval_dataset]
y_pred = [classify(ex["input"]) for ex in eval_dataset]
print(classification_report(y_true, y_pred))
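For extraction tasks, exact match and field-level F1 need no external library. A minimal sketch over extracted key-value pairs (the scoring function is illustrative):

```python
def field_scores(expected: dict, predicted: dict) -> dict:
    """Exact match plus precision/recall/F1 over extracted key-value pairs."""
    expected_items = set(expected.items())
    predicted_items = set(predicted.items())
    correct = len(expected_items & predicted_items)
    precision = correct / len(predicted_items) if predicted_items else 0.0
    recall = correct / len(expected_items) if expected_items else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "exact_match": expected == predicted,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Field-level F1 gives partial credit when most fields are right, which exact match alone would miss.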
Step 5: Regression testing pipeline
# Run evals before and after any change
import json
from datetime import datetime

def save_eval_results(model: str, results: dict, filename: str):
    with open(filename, "w") as f:
        json.dump({"model": model, "timestamp": datetime.now().isoformat(), **results}, f)

# Before change
before = run_eval(lambda x: call_model(x, "claude-haiku-4"), eval_dataset)
save_eval_results("claude-haiku-4", before, "eval_before.json")

# After change
after = run_eval(lambda x: call_model(x, "claude-sonnet-4"), eval_dataset)
save_eval_results("claude-sonnet-4", after, "eval_after.json")

print(f"Score change: {after['mean_score'] - before['mean_score']:+.2f}")
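The before/after comparison can be turned into a CI gate. A sketch, assuming the result dicts produced by run_eval above (the function name and thresholds are illustrative; tune them to your task):

```python
def check_regression(before: dict, after: dict,
                     max_score_drop: float = 0.1,
                     max_pass_rate_drop: float = 0.02) -> list[str]:
    """Return a list of failures; an empty list means the change is safe to ship."""
    failures = []
    score_drop = before["mean_score"] - after["mean_score"]
    if score_drop > max_score_drop:
        failures.append(f"mean_score dropped {score_drop:.2f} (allowed {max_score_drop})")
    pass_rate_drop = before["pass_rate"] - after["pass_rate"]
    if pass_rate_drop > max_pass_rate_drop:
        failures.append(f"pass_rate dropped {pass_rate_drop:.2%} (allowed {max_pass_rate_drop:.0%})")
    return failures

# In CI: exit nonzero if check_regression(before, after) returns any failures
```

Gating on pass rate as well as mean score catches changes that improve the average while breaking a tail of edge cases.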
Evaluation frameworks
- LangSmith (LangChain): full eval pipeline, good for LangChain-based apps
- Braintrust: clean UI, good for fast iteration, production-grade
- Langfuse: open source, good for teams wanting self-hosted evals
- Promptfoo: CLI-focused, good for prompt A/B testing
- OpenAI Evals: good for OpenAI model comparisons
For benchmarking which models perform best on your use case, start with the best LLM APIs ranking and validate against your own eval set.