Architecture
Browser → your Node.js service → Ollama on localhost. The Node layer owns authentication, the system prompt, retrieval over your ticket history, and streaming. Nothing in this chain resolves a public hostname.
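One way to keep the "no public hostname" property from drifting is a check at startup. The sketch below is an assumption, not part of the original service: `UPSTREAMS`, `isLoopback`, and `assertLocalOnly` are hypothetical names, and the two URLs are the ones used by the backend in this article.

```javascript
// Hypothetical startup guard: refuse to boot if any upstream is not loopback.
const UPSTREAMS = [
  "http://127.0.0.1:7700/search",    // retrieval service used in this article
  "http://127.0.0.1:11434/api/chat", // Ollama chat endpoint used in this article
];

// True for localhost, 127.0.0.0/8 literals, and the IPv6 loopback address.
function isLoopback(url) {
  const host = new URL(url).hostname;
  return host === "localhost" || host === "[::1]" || /^127(\.\d{1,3}){3}$/.test(host);
}

// Throws with the offending URLs so a bad config fails loudly at startup.
function assertLocalOnly(urls) {
  const offenders = urls.filter((u) => !isLoopback(u));
  if (offenders.length > 0) {
    throw new Error(`Non-loopback upstream configured: ${offenders.join(", ")}`);
  }
}

assertLocalOnly(UPSTREAMS);
```

Calling `assertLocalOnly` before the server starts listening turns a misconfigured endpoint into a failed deploy instead of a silent data leak.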
A streaming chat backend with SSE
import http from "node:http";

const SYSTEM = `You are the support assistant of Muster GmbH.
Answer in German, cite ticket IDs from CONTEXT, and say
"Das ist nicht dokumentiert" when the context lacks the answer.`;

// Fetch the top 5 matching ticket snippets (your vector store here).
async function retrieve(q) {
  const r = await fetch("http://127.0.0.1:7700/search", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ q, k: 5 }),
  });
  return (await r.json()).hits.map(h => `[${h.id}] ${h.text}`).join("\n");
}

http.createServer(async (req, res) => {
  if (req.method !== "POST" || req.url !== "/chat") {
    res.writeHead(404).end(); return;
  }
  // Decode as UTF-8 up front so multi-byte characters split across chunks survive.
  req.setEncoding("utf8");
  let body = ""; for await (const c of req) body += c;
  let question;
  try { ({ question } = JSON.parse(body)); } catch { /* rejected below */ }
  if (typeof question !== "string") { res.writeHead(400).end(); return; }

  res.writeHead(200, { "Content-Type": "text/event-stream",
                       "Cache-Control": "no-cache" });
  const context = await retrieve(question);
  const upstream = await fetch("http://127.0.0.1:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "qwen2.5:7b-instruct-q4_K_M", stream: true,
      messages: [
        { role: "system", content: SYSTEM },
        { role: "user", content: `CONTEXT:\n${context}\n\nFRAGE: ${question}` }
      ]})});

  // Ollama streams newline-delimited JSON. A network chunk can end mid-line,
  // so keep the unfinished tail in `buf` until the rest of the line arrives.
  const decoder = new TextDecoder();
  let buf = "";
  for await (const chunk of upstream.body) {
    buf += decoder.decode(chunk, { stream: true });
    const lines = buf.split("\n");
    buf = lines.pop();
    for (const line of lines) {
      if (!line.trim()) continue;
      const tok = JSON.parse(line)?.message?.content ?? "";
      if (tok) res.write(`data: ${JSON.stringify(tok)}\n\n`);
    }
  }
  res.write("data: [DONE]\n\n"); res.end();
}).listen(3000);
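Because `/chat` only accepts POST, the browser's `EventSource` (which issues GET requests) does not fit; a client can read the response stream with `fetch` instead. The following is a minimal sketch under that assumption. `createSseParser` and `ask` are hypothetical names, and the parser handles only the `data: <JSON string>` frames and the `[DONE]` sentinel that the server above emits.

```javascript
// Incremental parser for "data: ...\n\n" frames; tolerates frames split across reads.
function createSseParser(onToken, onDone) {
  let buf = "";
  return function push(text) {
    buf += text;
    let end;
    while ((end = buf.indexOf("\n\n")) !== -1) {
      const frame = buf.slice(0, end);
      buf = buf.slice(end + 2);
      if (!frame.startsWith("data: ")) continue; // ignore anything that is not a data frame
      const payload = frame.slice("data: ".length);
      if (payload === "[DONE]") onDone();
      else onToken(JSON.parse(payload)); // tokens arrive JSON-encoded
    }
  };
}

// Hypothetical browser helper: POST the question, stream tokens to `onToken`.
async function ask(question, onToken) {
  const res = await fetch("/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ question }),
  });
  const push = createSseParser(onToken, () => {});
  const decoder = new TextDecoder();
  const reader = res.body.getReader();
  for (;;) {
    const { value, done } = await reader.read();
    if (done) break;
    push(decoder.decode(value, { stream: true }));
  }
}
```

A caller would typically append each token to the chat bubble, for example `ask("Wie setze ich mein Passwort zurück?", t => output.textContent += t)`, where `output` is whatever element displays the answer.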
The details that make it production-ready
Add per-user rate limits (a simple token bucket), a 30-second upstream timeout that ends with a polite SSE error event, request logging that omits message bodies (for GDPR), and a nightly job that flags answers users rated down as input for the next fine-tuning round. The model improves on a loop of your own data, and that loop never leaves your network.
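The first two items can be sketched in a few lines. Everything named here is an assumption for illustration: `TokenBucket`, `allow`, `sseEvent`, `chatWithTimeout`, the event name `upstream_error`, and the limits (a burst of 5 requests, refilled at one request every two seconds) are hypothetical choices, not values from the original service.

```javascript
// Minimal token bucket: up to `capacity` requests in a burst, refilled at `refillPerSec`.
class TokenBucket {
  constructor(capacity, refillPerSec, now = Date.now) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.now = now;          // injectable clock, handy for tests
    this.tokens = capacity;  // start full
    this.last = now();
  }
  tryTake() {
    const t = this.now();
    const elapsedSec = (t - this.last) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.last = t;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

const buckets = new Map(); // userId -> TokenBucket (in-memory, single process)
function allow(userId) {
  if (!buckets.has(userId)) buckets.set(userId, new TokenBucket(5, 0.5));
  return buckets.get(userId).tryTake();
}

// Format a named SSE event with a JSON payload.
function sseEvent(name, payload) {
  return `event: ${name}\ndata: ${JSON.stringify(payload)}\n\n`;
}

// Call Ollama with a 30-second budget; on failure, tell the client and close the stream.
async function chatWithTimeout(body, res) {
  try {
    return await fetch("http://127.0.0.1:11434/api/chat", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(body),
      signal: AbortSignal.timeout(30_000),
    });
  } catch (err) {
    // Placeholder message; no request content is echoed back or logged.
    res.write(sseEvent("upstream_error", { message: "The assistant is busy, please retry." }));
    res.end();
    return null;
  }
}
```

In the request handler, `if (!allow(userId)) { res.writeHead(429).end(); return; }` would run before any retrieval work, with `userId` coming from whatever authentication the Node layer already performs. The abort signal stays attached while the response body is being read, so the streaming loop would need a similar `try`/`catch` to cover a timeout that fires mid-answer.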
Localized AI fine-tunes small open models on your data and deploys them on your hardware — GDPR by architecture, zero per-token costs. Average setup: 72 hours.