Architecture

Browser → your Node.js service → Ollama on localhost. The Node layer owns authentication, the system prompt, retrieval over your ticket history, and streaming. Nothing in this chain resolves a public hostname.
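
One way to keep the "no public hostname" property from eroding over time is a startup guard. The sketch below is only an illustration: `UPSTREAMS`, `isLoopback` and `assertLocalOnly` are hypothetical names, and the two URLs are the ones used in this article's backend. It assumes you want to refuse to boot if any upstream is not a loopback IP literal.

```javascript
// Hypothetical startup guard: every upstream must be a loopback IP literal.
const UPSTREAMS = {
  ollama: "http://127.0.0.1:11434/api/chat",  // Ollama's chat endpoint
  search: "http://127.0.0.1:7700/search",     // placeholder vector store
};

function isLoopback(url) {
  const host = new URL(url).hostname;
  // IP literals only: "localhost" is a name and could be remapped in a hosts file.
  return /^127(\.\d{1,3}){3}$/.test(host) || host === "[::1]";
}

function assertLocalOnly(upstreams) {
  for (const [name, url] of Object.entries(upstreams)) {
    if (!isLoopback(url)) throw new Error(`${name} is not loopback: ${url}`);
  }
}

assertLocalOnly(UPSTREAMS);  // throws before the server ever starts listening
```

Rejecting `localhost` is a deliberate choice here, so that the guard never depends on name resolution at all; relax it if your setup needs it.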

A streaming chat backend with SSE

import http from "node:http";

const SYSTEM = `You are the support assistant of Muster GmbH.
Answer in German, cite ticket IDs from CONTEXT, and say
"Das ist nicht dokumentiert" when the context lacks the answer.`;

async function retrieve(q) {          // your vector store here
  const r = await fetch("http://127.0.0.1:7700/search", {
    method: "POST",
    headers: { "Content-Type": "application/json" },  // string bodies default to text/plain
    body: JSON.stringify({ q, k: 5 }) });
  return (await r.json()).hits.map(h => `[${h.id}] ${h.text}`).join("\n");
}

http.createServer(async (req, res) => {
  if (req.method !== "POST" || req.url !== "/chat") {
    res.writeHead(404).end(); return;
  }
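  let body = ""; for await (const c of req) body += c;
  let question;
  try { ({ question } = JSON.parse(body)); }           // malformed JSON must not crash the process
  catch { res.writeHead(400).end(); return; }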

  res.writeHead(200, { "Content-Type": "text/event-stream",
                       "Cache-Control": "no-cache" });

  const context = await retrieve(question);
  const upstream = await fetch("http://127.0.0.1:11434/api/chat", {
    method: "POST",
    body: JSON.stringify({
      model: "qwen2.5:7b-instruct-q4_K_M", stream: true,
      messages: [
        { role: "system", content: SYSTEM },
        { role: "user", content: `CONTEXT:\n${context}\n\nFRAGE: ${question}` }
      ]})});

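  // Ollama streams NDJSON as raw bytes; a line can straddle two chunks, so buffer it.
  const decoder = new TextDecoder();
  let pending = "";
  for await (const chunk of upstream.body) {
    pending += decoder.decode(chunk, { stream: true });
    const lines = pending.split("\n");
    pending = lines.pop();                 // keep the incomplete tail for the next chunk
    for (const line of lines) {
      if (!line.trim()) continue;
      const tok = JSON.parse(line)?.message?.content ?? "";
      if (tok) res.write(`data: ${JSON.stringify(tok)}\n\n`);
    }
  }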
  res.write("data: [DONE]\n\n"); res.end();
}).listen(3000);
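
On the browser side, `EventSource` only issues GET requests, so a POST endpoint like `/chat` has to be read with `fetch` and a stream reader instead. The following is a minimal sketch under that assumption; `parseSSE`, `ask` and the `onToken` callback are hypothetical names, not part of the backend above.

```javascript
// Pure parser: takes buffered text, returns complete tokens plus the unparsed tail.
function parseSSE(buffer) {
  const frames = buffer.split("\n\n");
  const rest = frames.pop();              // incomplete frame, if any
  const tokens = [];
  let done = false;
  for (const frame of frames) {
    if (!frame.startsWith("data: ")) continue;
    const payload = frame.slice(6);
    if (payload === "[DONE]") { done = true; break; }
    tokens.push(JSON.parse(payload));     // the server sent JSON.stringify(tok)
  }
  return { tokens, rest, done };
}

// Hypothetical browser usage; onToken is a callback you supply.
async function ask(question, onToken) {
  const res = await fetch("/chat", {
    method: "POST",
    body: JSON.stringify({ question }),
  });
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let pending = "";
  for (;;) {
    const { value, done: eof } = await reader.read();
    if (eof) break;
    pending += decoder.decode(value, { stream: true });
    const { tokens, rest, done } = parseSSE(pending);
    pending = rest;
    tokens.forEach(onToken);
    if (done) break;
  }
}
```

A call such as `ask("Wie setze ich mein Passwort zurück?", t => output.append(t))` would then append tokens to a page element as they arrive, where `output` stands for whatever element you render into.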

The details that make it production-ready

Before shipping, add four things: per-user rate limits (a simple token bucket is enough); a 30-second upstream timeout that surfaces to the client as a polite SSE error event; request logging that omits message bodies, for GDPR; and a nightly job that flags the answers users rated down, as input for the next fine-tuning round. The model improves on a loop of your own data, and the loop never leaves your network.
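
The first three of these can be sketched in a few dozen lines. Everything below is an assumption for illustration: the names (`TokenBucket`, `allow`, `sseError`, `chatUpstream`, `logEntry`), the bucket size of 5 with a refill of one request every two seconds, and the shape of the log record are placeholders to adapt, not a prescribed implementation.

```javascript
// Token bucket: bursts up to `capacity`, refilled at `ratePerSec`. Clock is injectable.
class TokenBucket {
  constructor(capacity, ratePerSec, now = Date.now) {
    this.capacity = capacity;
    this.ratePerSec = ratePerSec;
    this.now = now;
    this.tokens = capacity;
    this.last = now();
  }
  take() {
    const t = this.now();
    const refill = ((t - this.last) / 1000) * this.ratePerSec;
    this.tokens = Math.min(this.capacity, this.tokens + refill);
    this.last = t;
    if (this.tokens < 1) return false;    // caller should answer 429
    this.tokens -= 1;
    return true;
  }
}

const buckets = new Map();                // userId -> TokenBucket
function allow(userId) {
  if (!buckets.has(userId)) buckets.set(userId, new TokenBucket(5, 0.5));
  return buckets.get(userId).take();
}

// Named SSE event, so the client can tell an error apart from a token.
function sseError(message) {
  return `event: error\ndata: ${JSON.stringify({ message })}\n\n`;
}

// Rejects if Ollama has not sent response headers within timeoutMs.
// This guards the initial wait only; an idle-stream timeout would need
// a timer that is reset on every chunk.
async function chatUpstream(payload, timeoutMs = 30_000) {
  const ctrl = new AbortController();
  const timer = setTimeout(() => ctrl.abort(), timeoutMs);
  try {
    return await fetch("http://127.0.0.1:11434/api/chat", {
      method: "POST",
      body: JSON.stringify(payload),
      signal: ctrl.signal,
    });
  } finally {
    clearTimeout(timer);
  }
}

// Log metadata only: neither the question text nor the answer is stored.
function logEntry(userId, question, startedAt, status) {
  return { userId, chars: question.length, ms: Date.now() - startedAt, status };
}
```

In the request handler, this would amount to checking `allow(userId)` before doing any work, wrapping the `chatUpstream` call in a `try`/`catch` that writes `sseError(...)` and ends the response, and passing `logEntry(...)` to whatever logger you already use.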

Want this running inside your own VPN?

Localized AI fine-tunes small open models on your data and deploys them on your hardware — GDPR by architecture, zero per-token costs. Average setup: 72 hours.
