Ollama is production-grade — if you operate it like a service
Ollama wraps llama.cpp with a model registry and a clean HTTP API. The gap between laptop demo and department workhorse is configuration, not capability.
Systemd hardening
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
# bind to loopback only, never 0.0.0.0
Environment="OLLAMA_HOST=127.0.0.1:11434"
# concurrent requests per loaded model
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
# keep models resident: no cold starts at 9:00
Environment="OLLAMA_KEEP_ALIVE=24h"
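ProtectSystem=full
# /usr is now read-only; keep the model store writable (default install path, adjust if OLLAMA_MODELS is set)
ReadWritePaths=/usr/share/ollama
NoNewPrivileges=true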
# preload the workhorse at boot; retry until the API is listening
ExecStartPost=/usr/bin/curl -s --retry 10 --retry-connrefused --retry-delay 1 \
  http://127.0.0.1:11434/api/generate \
  -d '{"model":"qwen2.5:7b-instruct-q4_K_M","keep_alive":"24h"}'
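Whether the preload actually worked is easy to check from a monitoring probe. The following is a minimal sketch, assuming Ollama's `GET /api/ps` endpoint returns a `models` array whose entries carry a `name` field; `isLoaded` and `checkPreload` are hypothetical helper names, not part of Ollama.

```javascript
// Sketch: confirm that the preloaded model is resident after boot.
// Assumption: GET /api/ps answers with { models: [{ name, ... }] }.
const OLLAMA_URL = "http://127.0.0.1:11434";    // same address as OLLAMA_HOST
const WORKHORSE = "qwen2.5:7b-instruct-q4_K_M"; // model named in ExecStartPost

// Pure helper: is `model` listed in a parsed /api/ps response?
function isLoaded(psResponse, model) {
  const models = psResponse?.models ?? [];
  return models.some((m) => m.name === model);
}

// Hypothetical probe; call it from a timer or a health check.
async function checkPreload() {
  const res = await fetch(`${OLLAMA_URL}/api/ps`);
  if (!res.ok) throw new Error(`Ollama answered ${res.status}`);
  return isLoaded(await res.json(), WORKHORSE);
}
```

Calling `checkPreload()` shortly after boot and alerting on `false` catches a failed preload before the first user does. It relies on the global `fetch` available in Node 18 and later.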
A reverse proxy your IT team will accept
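Bind Ollama to localhost and put an authenticating proxy in front: TLS from your internal CA, an API key per department, request size limits, and structured logs for the audit trail. The minimal Node.js version below covers the key check, streaming pass-through, and the log line; TLS and size limits still have to be added on top.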
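import http from "node:http";
import { createHash } from "node:crypto";
// Comma-separated keys; an unset API_KEYS leaves the set empty, so every request gets 401
const KEYS = new Set((process.env.API_KEYS ?? "").split(",").filter(Boolean));
// Plain HTTP keeps the sketch short; for TLS, use node:https with a certificate from your internal CA
http.createServer((req, res) => {
  const key = req.headers["x-api-key"] ?? "";
  if (!KEYS.has(key)) {
    res.writeHead(401).end(); return;
  }
  // Drop the proxy credential and present a loopback Host header upstream
  const { "x-api-key": _key, ...headers } = req.headers;
  headers.host = "127.0.0.1:11434";
  const up = http.request(
    { host: "127.0.0.1", port: 11434, path: req.url,
      method: req.method, headers },
    (u) => { res.writeHead(u.statusCode, u.headers); u.pipe(res); });
  // An unreachable Ollama becomes a 502 instead of an unhandled error
  up.on("error", () => { if (!res.headersSent) res.writeHead(502); res.end(); });
  req.pipe(up); // streams pass straight through
  // Log a short hash as the department tag, never the key itself
  console.log(JSON.stringify({ t: Date.now(), p: req.url,
    dept: createHash("sha256").update(key).digest("hex").slice(0, 8) }));
}).listen(8443);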
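The request size limit mentioned above is not part of that proxy. One way to add it is sketched here, under stated assumptions: `MAX_BODY_BYTES`, `declaredTooLarge`, and `makeByteCounter` are hypothetical names, and the 1 MB cap is an arbitrary placeholder rather than a recommendation.

```javascript
// Hypothetical cap; size it to the largest prompt you expect.
const MAX_BODY_BYTES = 1_000_000;

// Reject early when the client declares an oversized body.
function declaredTooLarge(headers, maxBytes = MAX_BODY_BYTES) {
  const declared = Number(headers["content-length"] ?? 0);
  return Number.isFinite(declared) && declared > maxBytes;
}

// Chunked uploads declare no length, so count bytes as they stream in.
// The returned function yields false once the running total exceeds the cap.
function makeByteCounter(maxBytes = MAX_BODY_BYTES) {
  let seen = 0;
  return (chunk) => {
    seen += chunk.length;
    return seen <= maxBytes;
  };
}

// One possible wiring inside the request handler, before req.pipe(up):
//   if (declaredTooLarge(req.headers)) { res.writeHead(413).end(); return; }
//   const withinLimit = makeByteCounter();
//   req.on("data", (c) => { if (!withinLimit(c)) { req.destroy(); up.destroy(); } });
```

The header check handles well-behaved clients cheaply, while the counter covers bodies sent without a `Content-Length`. Destroying `up` can surface as an `error` event on it, so the upstream request needs an error handler for this wiring to be safe.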
Capacity math
One RTX 4090 (24 GB) serves a 7B-Q4 model at roughly 80–110 tokens/s for a single stream; with OLLAMA_NUM_PARALLEL=4, up to four concurrent users share that throughput, which is roughly 20–27 tokens/s each if the rate splits evenly and all four slots are busy. For a 40-person department with bursty usage, one such box is typically enough, and that hardware story is what makes the on-prem economics work.
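The arithmetic behind that estimate can be made explicit. This is a back-of-the-envelope sketch, not a benchmark: it assumes concurrent streams split the single-stream rate evenly, and the 400-token answer length is a hypothetical figure chosen for illustration.

```javascript
// Figures from the text: single-stream decode speed on one RTX 4090, 7B-Q4.
const SINGLE_STREAM_TPS = { low: 80, high: 110 };
const NUM_PARALLEL = 4; // OLLAMA_NUM_PARALLEL from the unit file

// Assumption: concurrent streams split the single-stream rate evenly.
function perStreamRate(totalTps, activeStreams) {
  return totalTps / Math.max(1, activeStreams);
}

// Seconds needed to generate `tokens` at a given rate.
function secondsFor(tokens, tps) {
  return tokens / tps;
}

const ANSWER_TOKENS = 400; // hypothetical answer length
const worst = perStreamRate(SINGLE_STREAM_TPS.low, NUM_PARALLEL);  // 20 tokens/s
const best = perStreamRate(SINGLE_STREAM_TPS.high, NUM_PARALLEL);  // 27.5 tokens/s

console.log(`per stream: ${worst}-${best} tokens/s`);
console.log(
  `${ANSWER_TOKENS}-token answer: ` +
  `${secondsFor(ANSWER_TOKENS, best).toFixed(1)}-${secondsFor(ANSWER_TOKENS, worst).toFixed(1)} s`
);
```

With all four slots busy, a 400-token answer takes about 14.5 to 20 seconds under these assumptions. Swapping in measured rates from your own hardware is the useful next step.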
Localized AI fine-tunes small open models on your data and deploys them on your hardware — GDPR by architecture, zero per-token costs. Average setup: 72 hours.