Evaluating Fine-Tuned Models: Build an Eval Suite Before You Trust the Demo

The demo is not evidence

Every fine-tune looks brilliant on the five prompts you tried during training. Whether it should touch production is a different question, answered by a harness, not a vibe. We don't ship a model without one — and the harness outlives the model: it's how you'll judge every future candidate too.

1. The golden set

150–500 real cases with expert-approved expected outputs, frozen, version-controlled, and never seen in training. Stratify it: routine cases, hard cases, edge cases, and — critically — refusal cases, inputs where the correct behavior is "that's not documented." A model that never declines is a liability with good manners.
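
The properties listed above can be checked mechanically before the set is frozen. The following is a minimal sketch, not a definitive implementation: it assumes each golden case is a dict with a `prompt` and a `stratum` field, and that the stratum labels are `routine`, `hard`, `edge`, and `refusal`. The field name, the labels, and the `check_golden` helper are all hypothetical. The leak check only catches prompts that match after whitespace and case normalization, so it does not detect paraphrases.

```python
import hashlib
from collections import Counter

# Assumed stratum labels; rename to match your own golden set.
REQUIRED_STRATA = {"routine", "hard", "edge", "refusal"}

def fingerprint(prompt):
    """Hash a whitespace- and case-normalized prompt so trivial variants collide."""
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def check_golden(golden, training_prompts, min_size=150, max_size=500):
    """Return a list of problems; an empty list means the set passes."""
    problems = []
    if not min_size <= len(golden) <= max_size:
        problems.append(f"size {len(golden)} outside {min_size}-{max_size}")
    # Every stratum must be represented at least once.
    counts = Counter(case["stratum"] for case in golden)
    for stratum in sorted(REQUIRED_STRATA - set(counts)):
        problems.append(f"missing stratum: {stratum}")
    # Golden prompts must never have been seen in training.
    seen = {fingerprint(p) for p in training_prompts}
    leaked = [c["prompt"] for c in golden if fingerprint(c["prompt"]) in seen]
    if leaked:
        problems.append(f"{len(leaked)} golden prompts also appear in training data")
    return problems
```

Running this check in the same CI job that loads the golden file keeps a later edit from quietly dropping the refusal cases.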

2. Score what can be scored mechanically

Extraction tasks: exact/fuzzy field match. Classification: F1 per class. RAG: retrieval recall and citation validity (does the source cited as [2] actually contain the claim?). Mechanical metrics are cheap and deterministic, so they can run on every commit.
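
Two of these metrics fit in a few lines of plain Python. This is a sketch under simple assumptions: single-label classification, and retrieval judged by document IDs. The function names `f1_per_class` and `retrieval_recall` are illustrative, not taken from any library.

```python
from collections import Counter

def f1_per_class(gold, pred):
    """Return {label: F1} for every label that appears in gold or pred."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fn[g] += 1      # the true label was missed
            fp[p] += 1      # the predicted label was wrong
    scores = {}
    for label in set(gold) | set(pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores

def retrieval_recall(retrieved_ids, relevant_ids):
    """Fraction of the relevant documents that the retriever returned."""
    relevant = set(relevant_ids)
    if not relevant:
        return 1.0          # nothing to find, so nothing was missed
    return len(relevant & set(retrieved_ids)) / len(relevant)
```

Reporting F1 per class rather than one averaged number keeps a rare but important class from being hidden by the common ones.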

3. LLM-as-judge — carefully

For free-text quality (tone, completeness), use a stronger model as judge, but treat the judge as an instrument that needs calibration: score against a rubric, not "rate 1–10"; randomize answer order (judges tend to favor whichever answer they see first); and validate the judge against ~50 human-labeled pairs before trusting it. A judge that agrees with your experts 90% of the time is a tool; an unvalidated one is a random number generator with confidence.
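
The order randomization and the validation step can be sketched as follows. The `judge` argument is a hypothetical callable that takes `(prompt, first, second)` and returns the string `"first"` or `"second"`; how it calls the judge model and applies the rubric is left out. The pair format, with keys `prompt`, `a`, `b`, and `human`, is likewise an assumption.

```python
import random

def judged_preference(judge, prompt, answer_a, answer_b, rng):
    """Show A and B to the judge in random order; return "a" or "b"."""
    swap = rng.random() < 0.5
    first, second = (answer_b, answer_a) if swap else (answer_a, answer_b)
    verdict = judge(prompt, first, second)
    if verdict not in ("first", "second"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    picked_first = verdict == "first"
    # Undo the shuffle so the result refers to the original labels.
    if swap:
        return "b" if picked_first else "a"
    return "a" if picked_first else "b"

def judge_agreement(judge, labeled_pairs, seed=0):
    """Share of human-labeled pairs where the judge matches the human choice."""
    if not labeled_pairs:
        raise ValueError("need at least one labeled pair")
    rng = random.Random(seed)   # fixed seed keeps the run reproducible
    agree = sum(
        judged_preference(judge, p["prompt"], p["a"], p["b"], rng) == p["human"]
        for p in labeled_pairs
    )
    return agree / len(labeled_pairs)
```

An agreement rate computed this way is what gets compared against the 90% figure above; on roughly 50 pairs it is a coarse estimate, so a borderline result is a reason to label more pairs before trusting the judge.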

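A minimal harness covering the mechanical scoring from step 2 and the gate from step 4 looks like this:

import json, statistics, requests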

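# load the frozen golden set, skipping blank lines; the file is closed afterwards
with open("golden_v3.jsonl", encoding="utf-8") as f:
    GOLDEN = [json.loads(line) for line in f if line.strip()]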

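def model(prompt, name):
    # one non-streaming chat call to the local inference server
    r = requests.post("http://127.0.0.1:11434/api/chat", json={
        "model": name, "stream": False,
        "messages": [{"role": "user", "content": prompt}]}, timeout=120)
    r.raise_for_status()                  # fail loudly on HTTP errors
    return r.json()["message"]["content"]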

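def field_score(out, expected):           # mechanical metric
    # fraction of expected fields the model reproduced exactly
    try:
        got = json.loads(out)
    except json.JSONDecodeError:
        return 0.0                        # unparseable output scores zero
    if not isinstance(got, dict) or not expected:
        return 0.0                        # valid JSON, but not a usable object
    hits = sum(got.get(k) == v for k, v in expected.items())
    return hits / len(expected)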

def run(name):
    scores = [field_score(model(c["prompt"], name), c["expected"])
              for c in GOLDEN]
    return statistics.mean(scores), min(scores)

base = run("support-v2")                  # current production
cand = run("support-v3-candidate")
print(f"v2 {base[0]:.3f}  v3 {cand[0]:.3f}")

# regression gate: mean may not fall more than 0.01 below production,
# and no single case may score under 0.4
if cand[0] < base[0] - 0.01 or cand[1] < 0.4:
    raise SystemExit("BLOCKED")           # non-zero exit fails the CI job

4. The regression gate

The candidate must beat or match production on the mean and not crater any stratum — a +3% average that breaks refusal behavior is a regression wearing a medal. Wire the gate into CI: training produces an adapter, the harness scores it, only a green run can be promoted. Boring, mechanical, and the single biggest difference between teams that trust their models and teams that hope.
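
A per-stratum version of the gate can be sketched like this. It assumes each golden case carries a `stratum` label (a hypothetical field) and that per-case scores for production and candidate have already been computed in the same order as the cases. The 0.01 tolerance is an illustrative default, and `gate` is not an existing API.

```python
from collections import defaultdict
from statistics import mean

def stratum_means(cases, scores):
    """Average score per stratum; cases[i]["stratum"] pairs with scores[i]."""
    buckets = defaultdict(list)
    for case, score in zip(cases, scores):
        buckets[case["stratum"]].append(score)
    return {name: mean(vals) for name, vals in buckets.items()}

def gate(cases, base_scores, cand_scores, tolerance=0.01):
    """Return reasons to block; an empty list means the candidate may be promoted."""
    reasons = []
    if mean(cand_scores) < mean(base_scores) - tolerance:
        reasons.append("overall mean regressed")
    base = stratum_means(cases, base_scores)
    cand = stratum_means(cases, cand_scores)
    # A better overall average does not excuse a drop inside any one stratum.
    for name in sorted(base):
        if cand[name] < base[name] - tolerance:
            reasons.append(
                f"stratum '{name}' regressed: {base[name]:.3f} -> {cand[name]:.3f}")
    return reasons
```

In CI, a non-empty result would translate into a non-zero exit code, and the list of reasons doubles as the failure message.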
