Tags: fine-tuning, lora, openai, llm-training, model-customization

How to Fine-Tune an LLM in 2026: When to Do It and How


Quick answer: Fine-tuning is rarely the first thing you should try. Exhaust prompt engineering and RAG first — they're cheaper, faster to iterate, and often sufficient. Fine-tuning makes sense when: you need consistent style/format that prompting can't reliably achieve, you have 1,000+ high-quality labeled examples, and you need to reduce inference costs via a smaller fine-tuned model.


When fine-tuning is worth it

Good reasons to fine-tune:

  • Consistent brand voice or a proprietary output format that prompting gets right only ~80% of the time
  • Reducing token usage by training a small model to match a large model's output on a narrow task
  • Classifying or extracting from proprietary domain data where general models underperform
  • Meeting latency requirements by using a smaller fine-tuned model instead of a larger prompted one

Bad reasons to fine-tune:

  • "We want the model to know our company facts" → Use RAG instead
  • "We want better reasoning" → Fine-tuning doesn't improve fundamental reasoning; better base models do
  • "Prompting is inconsistent on a few examples" → More examples and better prompts first
  • You have fewer than 200 labeled examples → Not enough data


OpenAI fine-tuning (managed)

OpenAI offers the easiest fine-tuning path for GPT-4o Mini and GPT-4.1 Mini.

Step 1: Prepare training data

{"messages": [{"role": "system", "content": "Classify support tickets."}, {"role": "user", "content": "I can't log in"}, {"role": "assistant", "content": "account_access"}]}
{"messages": [{"role": "system", "content": "Classify support tickets."}, {"role": "user", "content": "Billing charged twice"}, {"role": "assistant", "content": "billing"}]}

Target: at least 100 examples to start; 1,000+ gives the best results.
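Before uploading, it's worth sanity-checking the JSONL locally — one malformed line can fail the whole job. A minimal sketch using only the standard library (`validate_example` and `validate_file` are illustrative names, not part of the OpenAI SDK):

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def validate_example(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty = valid)."""
    problems = []
    try:
        record = json.loads(line)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    for i, msg in enumerate(messages):
        if msg.get("role") not in VALID_ROLES:
            problems.append(f"message {i}: unknown role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str) or not msg["content"].strip():
            problems.append(f"message {i}: empty content")
    if messages[-1].get("role") != "assistant":
        problems.append("last message must be the assistant's target output")
    return problems

def validate_file(path: str) -> int:
    """Print problems per line and return the count of bad examples."""
    bad = 0
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f, 1):
            problems = validate_example(line)
            if problems:
                bad += 1
                print(f"line {n}: " + "; ".join(problems))
    return bad
```

Run `validate_file("training.jsonl")` before Step 2; a nonzero return means the file needs fixing.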

Step 2: Upload and train

import openai

client = openai.OpenAI()

# Upload training file
with open("training.jsonl", "rb") as f:
    training_file = client.files.create(file=f, purpose="fine-tune")

# Create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini",
)

print(f"Job ID: {job.id}")

Step 3: Monitor and use

# Check status
job = client.fine_tuning.jobs.retrieve(job.id)
print(job.status)  # "running", "succeeded", "failed"

# Use the fine-tuned model
response = client.chat.completions.create(
    model=job.fine_tuned_model,  # e.g. "ft:gpt-4o-mini:org:name:id"
    messages=[{"role": "user", "content": "I was charged twice"}]
)
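Jobs can take minutes to hours, so a simple polling loop is handy. A sketch assuming the `client` from Step 2 (`TERMINAL_STATES`, `is_done`, and `wait_for_job` are my own names, not SDK helpers):

```python
import time

# Statuses after which a fine-tuning job will not change again.
TERMINAL_STATES = {"succeeded", "failed", "cancelled"}

def is_done(status: str) -> bool:
    """True once a job has reached a terminal status."""
    return status in TERMINAL_STATES

def wait_for_job(client, job_id: str, poll_seconds: int = 60) -> str:
    """Poll until the fine-tuning job finishes; return the final status."""
    while True:
        job = client.fine_tuning.jobs.retrieve(job_id)
        print(f"{job_id}: {job.status}")
        if is_done(job.status):
            return job.status
        time.sleep(poll_seconds)
```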

Pricing: Training costs $0.025/1K tokens. Inference on fine-tuned GPT-4o Mini costs $0.30/1M input and $1.20/1M output — double the base model's rates.


LoRA fine-tuning for open-source models

For open-source models (Llama 4, Mistral, Phi-4), LoRA (Low-Rank Adaptation) is the standard approach — it trains a small adapter rather than all model weights, making fine-tuning feasible on a single GPU.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType

# Load base model in 4-bit (QLoRA) for memory efficiency
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-4-Scout",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto"
)

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,           # Rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    bias="none"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # Typically <1% of total params
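The adapter still needs training text. A minimal sketch that flattens the chat-format JSONL from earlier into plain strings — the `<|user|>`/`<|assistant|>` markers below are placeholders of my own; a real run should apply the base model's actual chat template (e.g. via `tokenizer.apply_chat_template`):

```python
import json

def format_example(record: dict) -> str:
    """Flatten one chat-format record into a single training string."""
    parts = []
    for msg in record["messages"]:
        # Placeholder markers; swap in the model's real chat template.
        parts.append(f"<|{msg['role']}|>\n{msg['content']}")
    return "\n".join(parts)

def load_training_texts(path: str) -> list[str]:
    """Read a JSONL file and return one flattened string per example."""
    with open(path, encoding="utf-8") as f:
        return [format_example(json.loads(line)) for line in f if line.strip()]
```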

Tools: Unsloth (fast LoRA), Axolotl (production training), LLaMA-Factory (GUI).


Cost comparison: fine-tuned small vs prompted large

Scenario: 10B token/month classification task

Option A: Prompted GPT-4o ($2.50/1M input)

  • Monthly cost: $25,000

Option B: Fine-tuned GPT-4.1 Nano ($0.10/1M input)

  • Training: ~$25 one-time
  • Monthly cost: $1,000
  • Monthly savings: $24,000

At this volume, fine-tuning a small model pays back its training cost in the first hour of production use.
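The arithmetic is easy to sanity-check in code. A sketch assuming 10 billion input tokens per month and a 730-hour month (output-token costs ignored for simplicity):

```python
def monthly_cost(tokens_per_month: int, price_per_m_tokens: float) -> float:
    """Inference cost in dollars for a given monthly token volume."""
    return tokens_per_month / 1_000_000 * price_per_m_tokens

def payback_hours(training_cost: float, tokens_per_month: int,
                  old_price: float, new_price: float,
                  hours_per_month: float = 730.0) -> float:
    """Hours of production traffic needed to recoup the training cost."""
    monthly_savings = monthly_cost(tokens_per_month, old_price - new_price)
    return training_cost / (monthly_savings / hours_per_month)

volume = 10_000_000_000  # 10B tokens/month
print(monthly_cost(volume, 2.50))  # 25000.0  (prompted GPT-4o)
print(monthly_cost(volume, 0.10))  # 1000.0   (fine-tuned Nano)
print(payback_hours(25.0, volume, 2.50, 0.10))  # well under one hour
```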


The decision framework

  1. Try prompting → if quality <85%, try with 10-shot examples → if still <85%, try RAG → if still <85% and you have 1,000+ examples, fine-tune
  2. Calculate the inference cost delta between a fine-tuned small model and the current model — if it exceeds fine-tuning cost within 3 months, fine-tune
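The two steps above can be sketched as two small checks (thresholds taken from the framework; the function names are illustrative):

```python
def next_step(prompt_q: float, few_shot_q: float, rag_q: float,
              labeled_examples: int) -> str:
    """Step 1: escalate only while quality stays below 85%."""
    if prompt_q >= 0.85:
        return "ship with prompting"
    if few_shot_q >= 0.85:
        return "ship with 10-shot prompting"
    if rag_q >= 0.85:
        return "ship with RAG"
    if labeled_examples >= 1000:
        return "fine-tune"
    return "collect more labeled data"

def worth_fine_tuning_for_cost(monthly_savings: float, training_cost: float) -> bool:
    """Step 2: fine-tune if savings repay the training cost within 3 months."""
    return monthly_savings * 3 >= training_cost
```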

See best LLMs for developers for the current top base models available for fine-tuning.
