The demo is not evidence
Every fine-tune looks brilliant on the five prompts you tried during training. Whether it should touch production is a different question, answered by a harness, not a vibe. We don't ship a model without one — and the harness outlives the model: it's how you'll judge every future candidate too.
1. The golden set
150–500 real cases with expert-approved expected outputs, frozen, version-controlled, and never seen in training. Stratify it: routine cases, hard cases, edge cases, and — critically — refusal cases, inputs where the correct behavior is "that's not documented." A model that never declines is a liability with good manners.
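A minimal sketch of how the "frozen, stratified, never seen in training" requirements could be checked mechanically. The `stratum` field, the stratum names, and the helper names `fingerprint` and `check_golden` are assumptions made for illustration; they are not part of any existing tool.

```python
import hashlib
import json
from collections import Counter

# Hypothetical stratum labels, mirroring the four categories named above.
REQUIRED_STRATA = {"routine", "hard", "edge", "refusal"}


def fingerprint(cases):
    """Stable SHA-256 of the whole set, so an edit to a 'frozen' file shows up."""
    canonical = json.dumps(cases, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_golden(cases, training_prompts, min_per_stratum=1):
    """Return a list of problems; an empty list means the set passes."""
    problems = []
    counts = Counter(c.get("stratum") for c in cases)
    for stratum in sorted(REQUIRED_STRATA):
        if counts[stratum] < min_per_stratum:
            problems.append(f"stratum '{stratum}' has only {counts[stratum]} cases")
    # Exact-match leak check; near-duplicates would need fuzzy matching.
    leaked = sum(c["prompt"] in training_prompts for c in cases)
    if leaked:
        problems.append(f"{leaked} golden prompts also appear in training data")
    return problems


# Tiny illustrative set; a real one would hold 150-500 expert-approved cases.
GOLDEN_DEMO = [
    {"prompt": "Reset my password", "expected": {"intent": "reset"}, "stratum": "routine"},
    {"prompt": "Refund for a gifted plan", "expected": {"intent": "refund"}, "stratum": "hard"},
    {"prompt": "", "expected": {"intent": "unknown"}, "stratum": "edge"},
    {"prompt": "What is the CEO's salary?", "expected": {"intent": "decline"}, "stratum": "refusal"},
]

print(fingerprint(GOLDEN_DEMO)[:12], check_golden(GOLDEN_DEMO, training_prompts=set()))
```

Committing the fingerprint next to the file is one way to make "frozen" enforceable: a changed hash without a version bump is a review failure.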
2. Score what can be scored mechanically
Extraction tasks: exact/fuzzy field match. Classification: F1 per class. RAG: retrieval recall and citation validity (does [2] actually contain the claim?). Mechanical metrics are cheap and deterministic, so they can run on every commit.
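The metrics above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a reference implementation: the function names are hypothetical, and the citation check assumes each citation comes with a supporting quote, which is then matched as a normalized substring of the cited passage. That is a crude stand-in for "does [2] actually contain the claim?".

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Per-class F1 = 2*TP / (2*TP + FP + FN); 0.0 when a class has no counts."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def retrieval_recall(retrieved_ids, relevant_ids):
    """Share of the relevant documents that the retriever actually returned."""
    relevant = set(relevant_ids)
    if not relevant:
        return 1.0  # nothing to find, so nothing was missed
    return len(relevant & set(retrieved_ids)) / len(relevant)


def _norm(text):
    return " ".join(text.lower().split())


def citation_validity(citations, passages):
    """citations: list of (passage_id, quote). Share whose quote is in the passage."""
    if not citations:
        return 1.0
    valid = sum(_norm(quote) in _norm(passages.get(pid, "")) for pid, quote in citations)
    return valid / len(citations)


PASSAGES = {1: "Refunds are issued within 14 days of purchase.",
            2: "Passwords expire after 90 days."}
print(f1_per_class(["billing", "billing", "tech"], ["billing", "tech", "tech"]))
print(retrieval_recall(["d1", "d3", "d9"], ["d1", "d2"]))
print(citation_validity([(1, "within 14 days"), (2, "within 30 days")], PASSAGES))
```

None of these functions call a model, which is what keeps them cheap enough to run on every commit.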
3. LLM-as-judge — carefully
For free-text quality (tone, completeness) use a stronger model as judge, but treat the judge as an instrument that needs calibration: score against a rubric, not "rate 1–10"; randomize the order in which answers are shown (judges tend to favor whichever answer they see first); and validate the judge against ~50 human-labeled pairs before trusting it. A judge that agrees with your experts 90% of the time is a tool; an unvalidated one is a random number generator with confidence.
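The calibration step could look roughly like the sketch below: show each pair in random order, map the verdict back to the original labels, and measure agreement with human labels. Everything here is an assumption made for illustration: `judge_pair`, `agreement`, the `"first"`/`"second"` verdict format, and especially `length_judge`, a deliberately naive stand-in for a real rubric-prompted model call.

```python
import random


def judge_pair(judge, prompt, answer_a, answer_b, rng):
    """Show the two answers in random order; return 'a' or 'b' for the winner."""
    swapped = rng.random() < 0.5
    first, second = (answer_b, answer_a) if swapped else (answer_a, answer_b)
    verdict = judge(prompt, first, second)  # expected to be "first" or "second"
    if verdict not in ("first", "second"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    picked_first = verdict == "first"
    # Undo the shuffle so the result refers to the original a/b labels.
    if picked_first:
        return "b" if swapped else "a"
    return "a" if swapped else "b"


def agreement(judge, labeled_pairs, seed=0):
    """Fraction of human-labeled pairs on which the judge picks the same winner."""
    rng = random.Random(seed)
    hits = sum(
        judge_pair(judge, p["prompt"], p["a"], p["b"], rng) == p["human"]
        for p in labeled_pairs
    )
    return hits / len(labeled_pairs)


def length_judge(prompt, first, second):
    """Stand-in judge that prefers the longer answer; replace with a model call."""
    return "first" if len(first) >= len(second) else "second"


# Four illustrative pairs; the text above suggests about 50 for a real check.
PAIRS = [
    {"prompt": "Refund time?", "a": "Refunds take 14 days; see policy section 3.",
     "b": "Soon.", "human": "a"},
    {"prompt": "CEO salary?", "a": "No.",
     "b": "That is not documented; please contact support.", "human": "b"},
    {"prompt": "Export data?", "a": "Open Settings, choose Export, then pick CSV.",
     "b": "Use export.", "human": "a"},
    {"prompt": "SSO available?", "a": "It might be possible, perhaps, depending on many factors.",
     "b": "Yes, on the Pro plan.", "human": "b"},
]

print(f"judge/human agreement: {agreement(length_judge, PAIRS):.2f}")
```

The stand-in judge disagrees with the human label on the last pair, where the shorter answer is the better one. That kind of systematic preference is exactly what the validation step is meant to surface before the judge is trusted.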
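import json, statistics, sys
import requests

# Frozen golden set: one JSON object per line with "prompt" and "expected".
with open("golden_v3.jsonl", encoding="utf-8") as f:
    GOLDEN = [json.loads(line) for line in f if line.strip()]

def model(prompt, name):
    # Non-streaming chat request to the local model server.
    r = requests.post("http://127.0.0.1:11434/api/chat", json={
        "model": name, "stream": False,
        "messages": [{"role": "user", "content": prompt}]},
        timeout=120)  # timeout value is an assumption; tune for your hardware
    r.raise_for_status()
    return r.json()["message"]["content"]

def field_score(out, expected):  # mechanical metric: share of fields matched exactly
    try:
        got = json.loads(out)
    except json.JSONDecodeError:
        return 0.0
    # Valid JSON that is not an object (a list, a bare string) also scores zero.
    if not isinstance(got, dict) or not expected:
        return 0.0
    hits = sum(got.get(k) == v for k, v in expected.items())
    return hits / len(expected)

def run(name):
    scores = [field_score(model(c["prompt"], name), c["expected"])
              for c in GOLDEN]
    return statistics.mean(scores), min(scores)

base = run("support-v2")  # current production
cand = run("support-v3-candidate")
print(f"v2 {base[0]:.3f}  v3 {cand[0]:.3f}")
# regression gate: mean may not drop by more than 0.01 AND no catastrophic case.
# sys.exit gives CI a non-zero exit code; a bare assert is stripped under python -O.
if cand[0] < base[0] - 0.01 or cand[1] < 0.4:
    sys.exit("BLOCKED")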
4. The regression gate
The candidate must beat or match production on the mean and not crater any stratum — a +3% average that breaks refusal behavior is a regression wearing a medal. Wire the gate into CI: training produces an adapter, the harness scores it, only a green run can be promoted. Boring, mechanical, and the single biggest difference between teams that trust their models and teams that hope.
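A per-stratum version of the gate might look like the sketch below. It assumes each scored case carries a stratum label and that results arrive as `(stratum, score)` pairs. The function names and the `stratum_tol` threshold are hypothetical; the 0.01 tolerance on the mean matches the script above.

```python
from collections import defaultdict
from statistics import mean


def by_stratum(results):
    """results: list of (stratum, score) pairs -> {stratum: mean score}."""
    buckets = defaultdict(list)
    for stratum, score in results:
        buckets[stratum].append(score)
    return {stratum: mean(scores) for stratum, scores in buckets.items()}


def gate(base, cand, mean_tol=0.01, stratum_tol=0.02):
    """Return the reasons to block promotion; an empty list means green."""
    reasons = []
    base_mean = mean(score for _, score in base)
    cand_mean = mean(score for _, score in cand)
    if cand_mean < base_mean - mean_tol:
        reasons.append(f"overall mean fell {base_mean:.3f} -> {cand_mean:.3f}")
    base_s, cand_s = by_stratum(base), by_stratum(cand)
    for stratum in sorted(base_s):
        if stratum not in cand_s:
            reasons.append(f"stratum '{stratum}' missing from candidate run")
        elif cand_s[stratum] < base_s[stratum] - stratum_tol:
            reasons.append(
                f"stratum '{stratum}' fell {base_s[stratum]:.3f} -> {cand_s[stratum]:.3f}")
    return reasons


# Illustrative numbers only: the candidate wins on average but loses on refusals.
BASE = [("routine", 0.90), ("routine", 0.88), ("hard", 0.70),
        ("refusal", 0.95), ("refusal", 0.90)]
CAND = [("routine", 0.99), ("routine", 0.98), ("hard", 0.90),
        ("refusal", 0.80), ("refusal", 0.78)]

for reason in gate(BASE, CAND):
    print("BLOCKED:", reason)
```

In CI, the calling script would exit non-zero whenever `gate` returns any reasons, so that only a green run can be promoted.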