The problem: prose is not an interface
"Please respond in JSON" works 94% of the time, and the other 6% takes your parser down at 2 a.m. For systems integration you need guaranteed structure — and locally, you can get a stronger guarantee than any cloud API offers, because you control the decoder itself.
Grammar-constrained decoding
llama.cpp (and therefore Ollama) can constrain sampling so that, at every step, only tokens that keep the output valid against a grammar are considered. Invalid JSON isn't filtered out afterwards; it is unrepresentable, as long as generation runs to completion rather than being cut off by a token limit. Ollama exposes this as the format parameter of its chat API, which accepts a JSON Schema directly:
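import requests
from pydantic import BaseModel, Field


class Invoice(BaseModel):
    vendor: str
    invoice_number: str
    # Anchored so the whole string must be an ISO date, not merely contain one
    date: str = Field(pattern=r"^\d{4}-\d{2}-\d{2}$")
    net_amount: float
    vat_amount: float
    iban: str | None  # the key is required, but its value may be null


# Read the invoice text up front (assumes the file is UTF-8)
with open("rechnung_0142.txt", encoding="utf-8") as f:
    invoice_text = f.read()

r = requests.post("http://127.0.0.1:11434/api/chat", json={
    "model": "qwen2.5:7b-instruct-q4_K_M",
    "stream": False,
    "format": Invoice.model_json_schema(),  # the contract
    "messages": [
        # German system prompt: "Extract the invoice data. Missing fields: null."
        {"role": "system",
         "content": "Extrahiere die Rechnungsdaten. Fehlende Felder: null."},
        {"role": "user", "content": invoice_text},
    ],
}, timeout=120)
r.raise_for_status()  # fail loudly on HTTP errors before parsing the body

inv = Invoice.model_validate_json(r.json()["message"]["content"])
print(inv.net_amount + inv.vat_amount)  # typed, validated, done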
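To make "unrepresentable" concrete, here is a toy sketch of the masking idea. It is not how llama.cpp implements grammars: the "grammar" is just a finite set of valid outputs, and fake_logits is a made-up stand-in for a model that would rather chat or invent a value. All names in it are hypothetical.

```python
# Toy illustration of grammar-constrained decoding (not llama.cpp's code).
# The "grammar" is simply the set of outputs we are willing to accept.
VALID_OUTPUTS = ['{"status":"paid"}', '{"status":"open"}']

VOCAB = ["Sure", "!", "{", '"status"', ":", '"paid"', '"open"', '"maybe"', "}"]


def fake_logits(prefix: str) -> dict[str, float]:
    """Stand-in for a model: it prefers prose and an out-of-schema value."""
    scores = {tok: 0.0 for tok in VOCAB}
    scores["Sure"] = 5.0      # wants to open with chatty prose
    scores['"maybe"'] = 4.0   # wants a value the schema does not allow
    scores['"open"'] = 1.0
    return scores


def allowed(prefix: str, token: str) -> bool:
    """A token is legal if prefix + token can still grow into a valid output."""
    candidate = prefix + token
    return any(out.startswith(candidate) for out in VALID_OUTPUTS)


def constrained_decode() -> str:
    text = ""
    while text not in VALID_OUTPUTS:
        scores = fake_logits(text)
        # Mask: discard every token that would leave the grammar
        legal = {tok: s for tok, s in scores.items() if allowed(text, tok)}
        # Greedy pick among the survivors only
        text += max(legal, key=legal.get)
    return text


result = constrained_decode()
print(result)
```

The model's favourite tokens ("Sure", "maybe") never appear in the result, because they are masked before the pick happens rather than rejected after the fact.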
Function calling on a 7B
Tool use is structured output wearing a trench coat: the model emits {"name": "get_order_status", "arguments": {...}}, your code executes, the result goes back as a message, the model continues. Small models handle this well if you keep the tool list short (≤6), write argument descriptions like API docs, and validate every call against the schema before executing — the model proposes, your code disposes.
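The "model proposes, your code disposes" gate can be sketched in a few lines of standard-library Python. This is a minimal illustration under assumptions, not a complete tool runtime: the TOOLS registry, the dispatch helper and the stubbed get_order_status body are hypothetical, and a real deployment would likely validate against the full JSON Schema of each tool.

```python
import json


def get_order_status(order_id: str) -> dict:
    """Hypothetical tool; a real one would query your order system."""
    return {"order_id": order_id, "status": "shipped"}


# Registry: tool name -> (callable, required argument names and their types)
TOOLS = {
    "get_order_status": (get_order_status, {"order_id": str}),
}


def dispatch(raw_call: str) -> dict:
    """Validate a model-proposed tool call, then execute it.

    Problems are returned as {"error": ...} instead of raised, so they can
    be sent back to the model as the tool result.
    """
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError as exc:
        return {"error": f"invalid JSON: {exc.msg}"}
    if not isinstance(call, dict):
        return {"error": "tool call must be a JSON object"}

    name = call.get("name")
    if not isinstance(name, str) or name not in TOOLS:
        return {"error": f"unknown tool: {name!r}"}

    func, spec = TOOLS[name]
    args = call.get("arguments")
    # Reject missing and unexpected arguments alike
    if not isinstance(args, dict) or set(args) != set(spec):
        return {"error": f"expected arguments: {sorted(spec)}"}
    for key, expected in spec.items():
        if not isinstance(args[key], expected):
            return {"error": f"argument {key!r} must be {expected.__name__}"}

    # Only a fully validated call ever reaches real code
    return {"result": func(**args)}


print(dispatch('{"name": "get_order_status", "arguments": {"order_id": "A-1001"}}'))
print(dispatch('{"name": "delete_everything", "arguments": {}}'))
```

Returning errors as data rather than raising lets the model see what went wrong and propose a corrected call on the next turn.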
The three failure modes that remain
First, output can be schema-valid but wrong: the constraint guarantees shape, not truth, so keep an eval set. Second, models return over-eager nulls when documents are messy; fine-tuning on your own formats helps (see our LoRA guide). Third, enums create hallucination pressure: a forced choice among five categories makes the model pick one even when none fits, so give it an explicit "other" escape value. Constrained decoding plus a validation gate plus evals: that's the stack that lets a local model write directly into your ERP, and it is the pattern in most of our extraction deployments.
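A small eval harness is enough to make the first two failure modes measurable. The sketch below assumes predictions and gold labels are plain dicts with the same keys; the field names, the category list and the four-document eval set are made up for illustration, and CATEGORY_SCHEMA shows one way an "other" escape value might look in a JSON Schema fragment.

```python
# Hypothetical category enum with an explicit escape value, so the model is
# never forced to pick a wrong label when nothing fits.
CATEGORY_SCHEMA = {
    "type": "string",
    "enum": ["invoice", "credit_note", "reminder", "other"],
}


def evaluate(predictions: list[dict], gold: list[dict]) -> dict:
    """Per field: exact-match accuracy and the number of spurious nulls."""
    report = {}
    for field in gold[0]:
        pairs = [(p.get(field), g[field]) for p, g in zip(predictions, gold)]
        report[field] = {
            "accuracy": sum(p == g for p, g in pairs) / len(pairs),
            # The model answered null although the document has a value
            "spurious_nulls": sum(p is None and g is not None for p, g in pairs),
        }
    return report


# Made-up eval set, for illustration only
gold = [
    {"vendor": "Acme GmbH", "iban": "DE00TEST0001", "category": "invoice"},
    {"vendor": "Beta AG", "iban": None, "category": "credit_note"},
    {"vendor": "Gamma KG", "iban": "DE00TEST0003", "category": "other"},
    {"vendor": "Delta OHG", "iban": "DE00TEST0004", "category": "reminder"},
]
predictions = [
    {"vendor": "Acme GmbH", "iban": "DE00TEST0001", "category": "invoice"},
    {"vendor": "Beta AG", "iban": None, "category": "credit_note"},
    {"vendor": "Gamma KG", "iban": None, "category": "other"},        # over-eager null
    {"vendor": "Delta", "iban": "DE00TEST0004", "category": "invoice"},  # valid shape, wrong content
]

report = evaluate(predictions, gold)
for field, stats in report.items():
    print(field, stats)
```

Every prediction here would pass schema validation; only the comparison against labelled documents reveals the wrong vendor, the wrong category and the null that should have been an IBAN.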
Localized AI fine-tunes small open models on your data and deploys them on your hardware — GDPR by architecture, zero per-token costs. Average setup: 72 hours.