
How to Evaluate LLM Output Quality: A Practical Guide

Quick answer: Build a small but representative evaluation set (50-200 examples) for your specific task, define a clear scoring rubric, and use LLM-as-judge to score outputs. Run evals before shipping any model or prompt change. Without evals, every production change is a gamble.


Why evaluation matters

LLM outputs are non-deterministic and hard to unit-test. Without systematic evaluation:

  • You don't know if a model upgrade improved or hurt your product
  • Prompt changes can silently break edge cases
  • Different model tiers may have invisible quality tradeoffs
  • You can't benchmark across providers objectively

Evaluation turns LLM development from "vibes-driven" to engineering.


Step 1: Build your evaluation dataset

A good eval set has:

  • Representative examples: Covers the realistic distribution of your use case
  • Edge cases: Includes difficult, ambiguous, or rare inputs
  • Ground truth: Clear expected outputs for each input
  • Scale: 50-200 examples for most tasks; 500+ for complex classification

# Example eval dataset structure
eval_dataset = [
    {
        "id": "01",
        "input": "Summarize: The Federal Reserve raised interest rates...",
        "expected_output": "[Ground truth summary]",
        "criteria": ["accuracy", "brevity", "key_facts_included"]
    },
    # ...
]
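In practice the eval set usually lives in a file rather than inline code. A minimal sketch of a JSONL loader that validates each example against the structure above (`load_eval_dataset` and `REQUIRED_KEYS` are illustrative names, not from any framework):

```python
import json

REQUIRED_KEYS = {"id", "input", "expected_output", "criteria"}

def load_eval_dataset(path: str) -> list[dict]:
    """Load a JSONL eval set, failing fast on malformed examples."""
    dataset = []
    with open(path) as f:
        for line_no, line in enumerate(f, start=1):
            example = json.loads(line)
            missing = REQUIRED_KEYS - example.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
            dataset.append(example)
    return dataset
```

Failing fast on schema problems here is cheaper than debugging a half-scored eval run later.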


Step 2: Define scoring criteria

For each task, define 3-5 specific criteria to score:

For summarization:

  • Factual accuracy (0-2): No hallucinations, key facts correct
  • Completeness (0-2): Covers main points
  • Conciseness (0-2): No padding, appropriate length

For code generation:

  • Correctness (0-3): Code runs and produces correct output
  • Code quality (0-2): Readable, idiomatic, handles edge cases

For customer support:

  • Helpfulness (0-3): Resolves or addresses the issue
  • Tone (0-2): Professional, empathetic
  • Accuracy (0-2): Factually correct
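Because the rubrics above have different point maxima (6, 5, and 7), raw totals aren't comparable across tasks. Encoding the rubrics as data and normalizing fixes that; a sketch, where `RUBRICS` and `normalized_score` are illustrative names:

```python
# Rubrics as data: max points per criterion, mirroring the lists above
RUBRICS = {
    "summarization": {"factual_accuracy": 2, "completeness": 2, "conciseness": 2},
    "code_generation": {"correctness": 3, "code_quality": 2},
    "customer_support": {"helpfulness": 3, "tone": 2, "accuracy": 2},
}

def normalized_score(task: str, scores: dict[str, int]) -> float:
    """Validate per-criterion scores against the rubric, return a 0-1 total."""
    rubric = RUBRICS[task]
    for criterion, value in scores.items():
        if not 0 <= value <= rubric[criterion]:
            raise ValueError(f"{criterion} score {value} outside 0-{rubric[criterion]}")
    return sum(scores.values()) / sum(rubric.values())
```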


Step 3: LLM-as-judge

For tasks where automated metrics don't apply (open-ended generation, nuanced quality), use a powerful LLM to score outputs:

import json

import anthropic

client = anthropic.Anthropic()

def llm_judge(task_input: str, model_output: str, criteria: list[str]) -> dict:
    criteria_list = "\n".join(f"- {c}" for c in criteria)
    judge_prompt = f"""
You are an expert evaluator. Score the following LLM output on each criterion.

Task input: {task_input}

Model output: {model_output}

Score each criterion from 0-3:
{criteria_list}

Return JSON: {{"scores": {{"criterion": score}}, "reasoning": "...", "total": N}}
"""
    response = client.messages.create(
        model="claude-sonnet-4",  # Use a strong judge model
        max_tokens=512,
        messages=[{"role": "user", "content": judge_prompt}]
    )
    return json.loads(response.content[0].text)

# Run evaluation
def run_eval(test_fn, dataset: list) -> dict:
    scores = []
    for example in dataset:
        output = test_fn(example["input"])
        score = llm_judge(example["input"], output, example["criteria"])
        scores.append(score["total"])
    
    return {
        "mean_score": sum(scores) / len(scores),
        "min_score": min(scores),
        "pass_rate": sum(1 for s in scores if s >= 6) / len(scores)
    }


Step 4: Automated metrics where applicable

For tasks with measurable outputs:

  • Classification: Accuracy, precision, recall, F1
  • Code generation: Unit test pass rate, execution success rate
  • Extraction: Exact match, F1 on extracted fields
  • Translation: BLEU score (a rough proxy, not perfect)
  • Summarization: ROUGE score + LLM-as-judge

from sklearn.metrics import classification_report

y_true = [ex["expected"] for ex in eval_dataset]
y_pred = [classify(ex["input"]) for ex in eval_dataset]
print(classification_report(y_true, y_pred))
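For extraction, exact match and per-field F1 can be computed directly, no sklearn needed, assuming gold and predicted outputs are flat dicts of field names to values (`field_metrics` is an illustrative helper):

```python
def field_metrics(gold: dict, pred: dict) -> dict:
    """Exact match plus precision/recall/F1 over extracted (field, value) pairs."""
    gold_items = set(gold.items())
    pred_items = set(pred.items())
    tp = len(gold_items & pred_items)  # fields extracted with the correct value
    precision = tp / len(pred_items) if pred_items else 0.0
    recall = tp / len(gold_items) if gold_items else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"exact_match": gold == pred, "precision": precision, "recall": recall, "f1": f1}
```

Averaging `f1` over the eval set gives partial credit for near-misses, which exact match alone hides.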


Step 5: Regression testing pipeline

# Run evals before and after any change
import json
from datetime import datetime

def save_eval_results(model: str, results: dict, filename: str):
    with open(filename, "w") as f:
        json.dump({"model": model, "timestamp": datetime.now().isoformat(), **results}, f)

# Before change
before = run_eval(lambda x: call_model(x, "claude-haiku-4"), eval_dataset)
save_eval_results("claude-haiku-4", before, "eval_before.json")

# After change
after = run_eval(lambda x: call_model(x, "claude-sonnet-4"), eval_dataset)
save_eval_results("claude-sonnet-4", after, "eval_after.json")

print(f"Score change: {after['mean_score'] - before['mean_score']:+.2f}")
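To make this a true regression test, the comparison can fail the build when quality drops. A minimal CI-gate sketch; `regression_gate` and the `max_drop` threshold are assumptions to tune for your rubric:

```python
import sys

def regression_gate(before: dict, after: dict, max_drop: float = 0.25) -> None:
    """Exit non-zero when mean eval score regresses past the tolerance."""
    delta = after["mean_score"] - before["mean_score"]
    if delta < -max_drop:
        print(f"FAIL: mean score regressed by {-delta:.2f} (> {max_drop})")
        sys.exit(1)  # non-zero exit fails the CI job
    print(f"OK: score change {delta:+.2f}")
```

Wired into CI, this blocks merges the same way a failing unit test would.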


Evaluation frameworks

  • LangSmith (LangChain): Full eval pipeline, good for LangChain-based apps
  • Braintrust: Clean UI, good for fast iteration, production-grade
  • Langfuse: Open source, good for teams wanting self-hosted evals
  • Promptfoo: CLI-focused, good for prompt A/B testing
  • OpenAI Evals: Good for OpenAI model comparisons

For benchmarking which models perform best on your use case, start with the best LLM APIs ranking and validate against your own eval set.
