The abstraction that finally settled it

Microsoft.Extensions.AI gives .NET one interface — IChatClient — over any backend: Azure OpenAI today, an Ollama endpoint in your rack tomorrow, with zero changes to business code. For enterprises this is the unlock: the on-prem decision becomes a configuration line, not a rewrite.

Wiring a local model into DI

// Program.cs — .NET 8 minimal API
using Microsoft.Extensions.AI;

var builder = WebApplication.CreateBuilder(args);

// Ollama speaks the OpenAI-compatible API on /v1
builder.Services.AddChatClient(sp =>
    new OpenAI.Chat.ChatClient(
            model: "qwen2.5:7b-instruct-q4_K_M",
            credential: new("ollama"),          // ignored locally
            options: new() { Endpoint =
                new Uri("http://ai-server.intern:11434/v1") })
        .AsIChatClient()
        .AsBuilder()
        .UseLogging()
        .UseFunctionInvocation()                // tool calling, see below
        .Build(sp));

var app = builder.Build();

app.MapPost("/api/draft-reply", async (
    IChatClient chat, TicketDto ticket) =>
{
    var messages = new List<ChatMessage> {
        new(ChatRole.System,
            "Du bist der Support-Assistent der Muster GmbH. " +
            "Antworte höflich, präzise, auf Deutsch."),
        new(ChatRole.User, $"Ticket: {ticket.Subject}\n{ticket.Body}")
    };
    var response = await chat.GetResponseAsync(messages);
    return Results.Ok(new { draft = response.Text });
});

app.Run();

Streaming into Blazor or SignalR

Swap GetResponseAsync for GetStreamingResponseAsync and you get an IAsyncEnumerable of updates — await foreach pushes tokens to the UI as they arrive. Time-to-first-token from a local 7B over LAN is typically under 200 ms; users perceive it as instant in a way no cloud round-trip matches.

Typed structured output

The killer enterprise feature: GetResponseAsync<InvoiceData>(...) serializes your C# record into a JSON schema, constrains the model to it, and deserializes the result — a strongly-typed object or an exception, never a string you regex and pray over. Combined with UseFunctionInvocation() (annotate methods with [Description], the middleware handles the tool-call loop), a local model becomes a first-class citizen of a normal line-of-business architecture: DI, logging, retry policies, unit tests with a fake IChatClient.

Deployment shape we use

App servers stay where they are; one GPU host runs Ollama behind our Go gateway; appsettings.Production.json points at it. The model is infrastructure, like the database — versioned, monitored, and never leaving the building.

Want this running inside your own VPN?

Localized AI fine-tunes small open models on your data and deploys them on your hardware — GDPR by architecture, zero per-token costs. Average setup: 72 hours.

Plan my deployment