The problem: prose is not an interface

"Please respond in JSON" works 94% of the time, and the other 6% takes your parser down at 2 a.m. For systems integration you need guaranteed structure — and locally, you can get a stronger guarantee than any cloud API offers, because you control the decoder itself.

Grammar-constrained decoding

llama.cpp (and therefore Ollama) can constrain sampling so that only tokens that keep the output valid against a grammar are ever considered. Invalid JSON isn't filtered out afterwards; it is unrepresentable, provided generation isn't cut off by a token limit before the object is closed. Ollama exposes this as the format parameter, taking a JSON Schema directly:

from pathlib import Path

import requests
from pydantic import BaseModel, Field

class Invoice(BaseModel):
    vendor: str
    invoice_number: str
    # Anchored so the whole string must be an ISO date (YYYY-MM-DD)
    date: str = Field(pattern=r"^\d{4}-\d{2}-\d{2}$")
    net_amount: float
    vat_amount: float
    iban: str | None      # key is required, value may be null

r = requests.post("http://127.0.0.1:11434/api/chat", json={
    "model": "qwen2.5:7b-instruct-q4_K_M",
    "stream": False,
    "format": Invoice.model_json_schema(),      # the contract
    "messages": [
        {"role": "system",
         "content": "Extract the invoice data. Missing fields: null."},
        {"role": "user",
         "content": Path("rechnung_0142.txt").read_text(encoding="utf-8")},
    ],
}, timeout=120)
r.raise_for_status()   # fail on HTTP errors instead of parsing an error body

# Second gate: parse and validate the reply against the same model
inv = Invoice.model_validate_json(r.json()["message"]["content"])
print(inv.net_amount + inv.vat_amount)   # typed, validated, done
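
To make "unrepresentable" concrete, here is a toy sketch of the masking step. It is not llama.cpp's implementation, which works from a real grammar rather than an enumerated set of outputs. The vocabulary, the two valid outputs, and the chatty_model scoring function are all made up for illustration.

```python
import math

# Toy vocabulary; a real tokenizer has tens of thousands of entries.
VOCAB = ['{', '}', '"ok"', ':', 'true', 'false', 'Sure', '!']

# The only complete outputs this toy "grammar" accepts.
VALID = {'{"ok":true}', '{"ok":false}'}

def allowed(prefix: str) -> list[bool]:
    """Mark each token that keeps prefix + token a prefix of some valid output."""
    return [any(v.startswith(prefix + tok) for v in VALID) for tok in VOCAB]

def constrained_greedy(logits_fn) -> str:
    """Greedy decoding where disallowed tokens are masked before the argmax."""
    out = ""
    while out not in VALID:
        logits = logits_fn(out)
        mask = allowed(out)
        # Disallowed tokens get -inf, so they can never be selected.
        masked = [l if ok else -math.inf for l, ok in zip(logits, mask)]
        out += VOCAB[masked.index(max(masked))]
    return out

def chatty_model(prefix: str) -> list[float]:
    """Hypothetical model that always wants to open with prose ('Sure', '!')."""
    return [1.0, 0.5, 0.8, 0.2, 0.9, 0.1, 5.0, 4.0]

result = constrained_greedy(chatty_model)
print(result)
```

Even though the hypothetical model scores "Sure" highest at every step, that token is masked out before selection, so the decoder can only ever walk a path that ends in one of the valid outputs.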

Function calling on a 7B

Tool use is structured output wearing a trench coat: the model emits {"name": "get_order_status", "arguments": {...}}, your code executes, the result goes back as a message, the model continues. Small models handle this well if you keep the tool list short (≤6), write argument descriptions like API docs, and validate every call against the schema before executing — the model proposes, your code disposes.
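
The "validate before executing" step can be sketched as a small dispatch gate. This is a minimal standard-library illustration, not a specific framework's API: the TOOLS registry, ToolCallError, dispatch, and the stubbed get_order_status body are assumptions made for the example.

```python
import json
from typing import Any, Callable

def get_order_status(order_id: str) -> dict:
    # Hypothetical tool; a real one would query your order system.
    return {"order_id": order_id, "status": "shipped"}

# Registry: tool name -> (callable, required argument names and their types).
TOOLS: dict[str, tuple[Callable[..., Any], dict[str, type]]] = {
    "get_order_status": (get_order_status, {"order_id": str}),
}

class ToolCallError(ValueError):
    """Raised when a proposed call does not match the registry."""

def dispatch(raw: str) -> Any:
    """Validate a model-proposed tool call, then execute it."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ToolCallError(f"not JSON: {exc}") from exc
    if not isinstance(call, dict):
        raise ToolCallError("tool call must be a JSON object")

    name = call.get("name")
    if not isinstance(name, str) or name not in TOOLS:
        raise ToolCallError(f"unknown tool: {name!r}")

    func, spec = TOOLS[name]
    args = call.get("arguments")
    # Reject missing and unexpected arguments alike.
    if not isinstance(args, dict) or set(args) != set(spec):
        raise ToolCallError(f"expected arguments {sorted(spec)}, got {args!r}")
    for key, expected in spec.items():
        if not isinstance(args[key], expected):
            raise ToolCallError(f"{key} must be {expected.__name__}")

    return func(**args)  # only reached once every check has passed

ok = dispatch('{"name": "get_order_status", "arguments": {"order_id": "A-1001"}}')
print(ok)
```

A rejected call raises ToolCallError instead of running anything. The error text can be sent back to the model as a message so it can propose a corrected call.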

The three failure modes that remain

Three failure modes remain. First, output that is schema-valid but wrong: the constraint guarantees shape, not truth, so keep an eval set. Second, over-eager nulls when documents are messy: fine-tune on your formats (see our LoRA guide). Third, enum hallucination pressure: a forced choice among five categories needs an explicit "other" escape value, otherwise the model has to pick a wrong label when none fits. Constrained decoding plus a validation gate plus evals: that's the stack that lets a local model write directly into your ERP, which is exactly the pattern in most of our extraction deployments.
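
A minimal sketch of what an eval set and an "other" escape value might look like in practice. The category field, the label names, and the sample records are hypothetical and exist only to exercise the functions. Real eval sets would be hand-labelled documents from your own pipeline.

```python
from collections import Counter

# Closed label set with an explicit escape value, in JSON Schema enum form.
CATEGORIES = ["hardware", "software", "travel", "services", "office", "other"]
CATEGORY_SCHEMA = {"type": "string", "enum": CATEGORIES}

def field_accuracy(predictions: list[dict], gold: list[dict]) -> dict[str, float]:
    """Per-field share of exact matches across the eval set."""
    hits, totals = Counter(), Counter()
    for pred, truth in zip(predictions, gold):
        for field, expected in truth.items():
            totals[field] += 1
            hits[field] += pred.get(field) == expected  # True counts as 1
    return {field: hits[field] / totals[field] for field in totals}

def null_rate(predictions: list[dict], field: str) -> float:
    """Share of predictions where the field is null (or absent)."""
    return sum(p.get(field) is None for p in predictions) / len(predictions)

# Hypothetical eval records: hand-labelled truth vs. model output.
gold = [
    {"vendor": "Acme GmbH", "category": "hardware", "iban": "DE00EXAMPLE1"},
    {"vendor": "Beta AG", "category": "other", "iban": None},
]
preds = [
    {"vendor": "Acme GmbH", "category": "hardware", "iban": None},  # over-eager null
    {"vendor": "Beta AG", "category": "other", "iban": None},
]

scores = field_accuracy(preds, gold)
iban_nulls = null_rate(preds, "iban")
print(scores, iban_nulls)
```

Comparing the null rate in predictions against the null rate in the labelled data is one cheap way to spot over-eager nulls. A sudden rise in "other" after a schema change is a similar signal for the enum case.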

Want this running inside your own VPN?

Localized AI fine-tunes small open models on your data and deploys them on your hardware — GDPR by architecture, zero per-token costs. Average setup: 72 hours.
