Information Extraction with Azure Document Intelligence: Invoices to JSON

Document Intelligence: the extraction workhorse

Azure AI Document Intelligence turns PDFs and scans into structured data. The prebuilt invoice model already handles European invoice layouts, including VAT IDs, IBANs, line items, and German field labels, and custom extraction models can learn your own forms from as few as five labeled samples.

Invoices → ERP-ready JSON in C#

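using System;
using System.IO;
using Azure;
using Azure.AI.DocumentIntelligence;

// cfg is your configuration source (for example IConfiguration), assumed to be in scope.
var client = new DocumentIntelligenceClient(
    new Uri(cfg["DI_ENDPOINT"]), new AzureKeyCredential(cfg["DI_KEY"]));

// Send the PDF to the prebuilt invoice model and wait for the finished result.
var op = await client.AnalyzeDocumentAsync(
    WaitUntil.Completed, "prebuilt-invoice",
    BinaryData.FromBytes(await File.ReadAllBytesAsync("eingang_2026_0142.pdf")));

// The model may find no invoice at all; fail loudly instead of indexing blindly.
if (op.Value.Documents.Count == 0)
    throw new InvalidOperationException("No invoice found in document.");
var doc = op.Value.Documents[0];

// Confidence gate: return a field's text only if the field exists and scores above 0.85.
// Confidence is a nullable float, so a missing score fails the gate as well.
string F(string name) =>
    doc.Fields.TryGetValue(name, out var f) && f.Confidence > 0.85f
        ? f.Content : throw new LowConfidenceException(name);

var record = new {
    Vendor    = F("VendorName"),
    VendorVat = F("VendorTaxId"),
    Number    = F("InvoiceId"),
    Date      = F("InvoiceDate"),
    Net       = F("SubTotal"),
    Vat       = F("TotalTax"),
    Gross     = F("InvoiceTotal"),
    // PaymentDetails is a list field (IBAN, SWIFT, ...). It passes through the same
    // gate as every other field; parse the IBAN from it before write-back.
    Iban      = F("PaymentDetails")
};
// LowConfidenceException is your own exception type. The caller catches it and routes
// the invoice to a human review queue. Extraction without a confidence gate is how
// wrong IBANs get paid.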
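
The gate above stops at the first weak field. For a review UI it is often more useful to know every field that needs eyes in one pass. The following is a minimal sketch of that idea, not part of the Azure SDK: the helper name ApplyGate, the tuple shape, and the sample values are all hypothetical, and it works on plain dictionaries so it runs without a service call.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: splits the required fields into accepted values and
// names that need human review. Nothing here calls the Azure SDK.
static (Dictionary<string, string> Accepted, List<string> NeedsReview) ApplyGate(
    IReadOnlyDictionary<string, (string Content, float? Confidence)> fields,
    IEnumerable<string> required,
    float threshold = 0.85f)
{
    var accepted = new Dictionary<string, string>();
    var needsReview = new List<string>();
    foreach (var name in required)
    {
        // A missing field, empty content, or a null confidence all fail the gate.
        if (fields.TryGetValue(name, out var f)
            && !string.IsNullOrWhiteSpace(f.Content)
            && f.Confidence > threshold)
            accepted[name] = f.Content;
        else
            needsReview.Add(name);
    }
    return (accepted, needsReview);
}

// Made-up sample values, for illustration only.
var sample = new Dictionary<string, (string Content, float? Confidence)>
{
    ["VendorName"]     = ("Muster GmbH", 0.97f),
    ["InvoiceId"]      = ("RE-0142", 0.91f),
    ["PaymentDetails"] = ("DE00 0000 0000 0000 0000 00", 0.62f),
};

var gate = ApplyGate(
    sample,
    new[] { "VendorName", "InvoiceId", "InvoiceTotal", "PaymentDetails" });

Console.WriteLine($"accepted: {string.Join(", ", gate.Accepted.Keys)}");
Console.WriteLine($"review:   {string.Join(", ", gate.NeedsReview)}");
```

In this sample, InvoiceTotal lands in the review list because it is missing and PaymentDetails because its score is below the threshold. In a real pipeline the dictionary would be filled from doc.Fields, and the review list would become one work item per invoice rather than one exception per field.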

The pipeline around the API

Production extraction is 20% model, 80% process: a watch folder or mail ingestion, deduplication by content hash, the confidence gate above, a review UI for the 5–10% of fields that need eyes, and write-back to the ERP with a full audit trail. Build the review loop first; it's also your future training data.
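
Deduplication by content hash can be sketched in a few lines. This is an illustrative example under stated assumptions: the in-memory HashSet stands in for whatever persistent store you use, and the helper name TryRegister and the sample byte strings are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// In-memory stand-in for a persistent store of hashes already processed (assumption).
var seen = new HashSet<string>();

// Returns true the first time a given byte content is offered, false for repeats.
bool TryRegister(byte[] content, out string hash)
{
    hash = Convert.ToHexString(SHA256.HashData(content));
    return seen.Add(hash);
}

// Placeholder bytes; in the pipeline these would be the bytes of the incoming file.
var first  = Encoding.UTF8.GetBytes("invoice bytes");
var resend = Encoding.UTF8.GetBytes("invoice bytes");            // same content, e.g. a re-sent mail
var other  = Encoding.UTF8.GetBytes("different invoice bytes");

bool isNew1 = TryRegister(first, out var h1);
bool isNew2 = TryRegister(resend, out var h2);
bool isNew3 = TryRegister(other, out var h3);

Console.WriteLine(isNew1); // True
Console.WriteLine(isNew2); // False
Console.WriteLine(isNew3); // True
```

A byte-level hash only catches exact duplicates, such as the same PDF arriving twice by mail. A rescan of the same paper invoice produces different bytes, so a second check on extracted values (for example vendor VAT ID plus invoice number) is a sensible complement. Storing the hash with each record also gives the audit trail a stable reference to the original file.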

The on-prem variant

Document Intelligence also ships as Docker containers. The model runs on your own Docker host and documents never leave your network. In the standard connected mode, only billing metering goes out to Azure; a fully disconnected mode, which Microsoft offers on request under commitment-tier pricing, sends nothing. For customers whose invoices contain trade-secret pricing, that container plus a local LLM for the unstructured remainder (delivery conditions, free-text clauses) is our standard recommendation.

Want this running inside your own VPN?

Localized AI fine-tunes small open models on your data and deploys them on your hardware — GDPR by architecture, zero per-token costs. Average setup: 72 hours.
