The abstraction that finally settled it
Microsoft.Extensions.AI gives .NET one interface, IChatClient, over any backend: Azure OpenAI today, an Ollama endpoint in your rack tomorrow, with no changes to the business code that depends only on that interface. For enterprises this is the unlock: the on-prem decision becomes a configuration line, not a rewrite.
Wiring a local model into DI
// Program.cs (.NET 8 minimal API)
using Microsoft.Extensions.AI;

var builder = WebApplication.CreateBuilder(args);

// Ollama exposes an OpenAI-compatible API under /v1
builder.Services.AddChatClient(sp =>
    new OpenAI.Chat.ChatClient(
            model: "qwen2.5:7b-instruct-q4_K_M",
            credential: new("ollama"), // Ollama ignores the key; the client needs a non-empty value
            options: new() { Endpoint =
                new Uri("http://ai-server.intern:11434/v1") })
        .AsIChatClient()
        .AsBuilder()
        .UseLogging()
        .UseFunctionInvocation() // tool calling, see below
        .Build(sp));

var app = builder.Build();

app.MapPost("/api/draft-reply", async (
    IChatClient chat, TicketDto ticket) =>
{
    var messages = new List<ChatMessage> {
        // System prompt (German): "You are the support assistant of Muster GmbH.
        // Reply politely, precisely, in German."
        new(ChatRole.System,
            "Du bist der Support-Assistent der Muster GmbH. " +
            "Antworte höflich, präzise, auf Deutsch."),
        new(ChatRole.User, $"Ticket: {ticket.Subject}\n{ticket.Body}")
    };
    var response = await chat.GetResponseAsync(messages);
    return Results.Ok(new { draft = response.Text });
});

app.Run();

// Request body of the endpoint; with top-level statements, types go after the code.
record TicketDto(string Subject, string Body);
Streaming into Blazor or SignalR
Swap GetResponseAsync for GetStreamingResponseAsync and you get an IAsyncEnumerable<ChatResponseUpdate>; await foreach pushes tokens to the UI as they arrive. Time-to-first-token from a local 7B over LAN is typically under 200 ms once the model is loaded, though the exact figure depends on hardware and prompt length. Users perceive that as instant, which is hard to match with a cloud round-trip.
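A minimal sketch of that streaming loop, assuming an IChatClient is already available from DI. The ChatStreaming class and the StreamReplyAsync method are hypothetical names used only for illustration; the TextWriter stands in for whatever pushes text to the UI, such as a SignalR hub call or a Blazor state update.

```csharp
using System.IO;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;

// Hypothetical helper, not part of Microsoft.Extensions.AI.
public static class ChatStreaming
{
    public static async Task StreamReplyAsync(
        IChatClient chat, string prompt, TextWriter output,
        CancellationToken ct = default)
    {
        // Each update carries the text fragment produced since the previous one.
        await foreach (ChatResponseUpdate update in
            chat.GetStreamingResponseAsync(prompt, cancellationToken: ct))
        {
            await output.WriteAsync(update.Text);
            await output.FlushAsync(); // hand the fragment on immediately
        }
    }
}
```

Calling it with Console.Out is enough to watch tokens arrive while testing; in a real UI you would replace the writer with your own push mechanism.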
Typed structured output
The killer enterprise feature: GetResponseAsync<InvoiceData>(...) generates a JSON schema from your C# record, asks the model to conform to it, and deserializes the reply. Reading the result gives you a strongly-typed object or an exception, never a string you regex and pray over. How strictly the schema is enforced depends on the backend's structured-output support. Combine this with UseFunctionInvocation(): wrap methods with AIFunctionFactory.Create, describe them with [Description], pass them in ChatOptions.Tools, and the middleware handles the tool-call loop. A local model then becomes a first-class citizen of a normal line-of-business architecture: DI, logging, retry policies, unit tests with a fake IChatClient.
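The following sketch shows how typed output and a tool could fit together in one call. The fields of InvoiceData, the InvoiceExtraction class, and the LookupSupplier tool with its sample VAT ID are assumptions made for illustration; the article names InvoiceData but does not define it. The tool is only invoked automatically when the client pipeline includes UseFunctionInvocation().

```csharp
using System.ComponentModel;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;

// Hypothetical record; the field names are illustrative only.
public record InvoiceData(string InvoiceNumber, decimal TotalAmount, string Currency);

public static class InvoiceExtraction
{
    // Hypothetical tool the model may call while answering.
    [Description("Returns the name of the supplier registered under a VAT ID.")]
    public static string LookupSupplier(
        [Description("The supplier's VAT ID")] string vatId)
        => vatId == "DE123456789" ? "Muster GmbH" : "unknown";

    public static async Task<InvoiceData?> ExtractAsync(
        IChatClient chat, string invoiceText)
    {
        var options = new ChatOptions
        {
            // Wrap the method so the model can see its name, description and parameters.
            Tools = [AIFunctionFactory.Create(LookupSupplier)]
        };

        // The schema for InvoiceData is derived from the record and sent with the request.
        ChatResponse<InvoiceData> response =
            await chat.GetResponseAsync<InvoiceData>(
                $"Extract the invoice fields from this text:\n{invoiceText}",
                options: options);

        // TryGetResult returns false instead of throwing when the reply
        // cannot be deserialized into InvoiceData.
        return response.TryGetResult(out InvoiceData? invoice) ? invoice : null;
    }
}
```

Whether to return null or let response.Result throw is a design choice: throwing suits pipelines with retry policies, while the null path is easier to handle in interactive code.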
Deployment shape we use
App servers stay where they are; one GPU host runs Ollama behind our Go gateway; appsettings.Production.json points at it. The model is infrastructure, like the database — versioned, monitored, and never leaving the building.
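One way to make the endpoint a pure configuration concern is to read it from IConfiguration at startup. This is a sketch under assumptions: the keys Ai:Endpoint and Ai:Model are hypothetical and not taken from our actual appsettings.Production.json, and the URL would be whatever address your gateway exposes.

```csharp
using System;
using Microsoft.Extensions.AI;

var builder = WebApplication.CreateBuilder(args);

// Hypothetical keys; appsettings.Production.json could contain for example:
// { "Ai": { "Endpoint": "http://ai-server.intern:11434/v1",
//           "Model": "qwen2.5:7b-instruct-q4_K_M" } }
string endpoint = builder.Configuration["Ai:Endpoint"]
    ?? throw new InvalidOperationException("Ai:Endpoint is not configured.");
string model = builder.Configuration["Ai:Model"]
    ?? throw new InvalidOperationException("Ai:Model is not configured.");

builder.Services.AddChatClient(sp =>
    new OpenAI.Chat.ChatClient(
            model: model,
            credential: new("ollama"), // placeholder key, as in the wiring example
            options: new() { Endpoint = new Uri(endpoint) })
        .AsIChatClient());

var app = builder.Build();
app.Run();
```

Failing fast on a missing key keeps a misconfigured deployment from silently pointing at the wrong backend.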
Localized AI fine-tunes small open models on your data and deploys them on your hardware — GDPR by architecture, zero per-token costs. Average setup: 72 hours.