Running Ollama in Production: From Laptop Toy to Department Workhorse

Ollama is production-grade — if you operate it like a service

Ollama wraps llama.cpp with a model registry and a clean HTTP API. The gap between laptop demo and department workhorse is configuration, not capability.

Systemd hardening

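# /etc/systemd/system/ollama.service.d/override.conf
# systemd does not allow trailing comments, so each note sits on its own line
[Service]
# bind to loopback only, never 0.0.0.0
Environment="OLLAMA_HOST=127.0.0.1:11434"
# concurrent requests
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
# no cold starts at 9:00
Environment="OLLAMA_KEEP_ALIVE=24h"
ProtectSystem=full
# ProtectSystem=full makes /usr read-only; this path assumes the default model
# store of the Linux install script, so adjust it if OLLAMA_MODELS points elsewhere
ReadWritePaths=/usr/share/ollama
NoNewPrivileges=true

# preload the workhorse at boot; retry until the server is listening
ExecStartPost=/usr/bin/curl -s --retry 10 --retry-connrefused --retry-delay 1 \
  http://127.0.0.1:11434/api/generate \
  -d '{"model":"qwen2.5:7b-instruct-q4_K_M","keep_alive":"24h"}'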
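To confirm that the preload actually worked, you can ask the server which models are resident. The following is a minimal sketch, assuming Node 18+ (for the global `fetch`) and Ollama's `GET /api/ps` endpoint, which lists running models; the function names `isLoaded` and `checkPreload` are made up for illustration.

```javascript
// Assumption: `ps` has the shape returned by Ollama's GET /api/ps,
// i.e. { models: [{ name: "...", ... }] }.
function isLoaded(ps, modelName) {
  return (ps.models ?? []).some((m) => m.name === modelName);
}

// Queries the local server; the default base URL matches OLLAMA_HOST
// from the unit file above.
async function checkPreload(modelName, baseUrl = "http://127.0.0.1:11434") {
  const res = await fetch(`${baseUrl}/api/ps`);
  if (!res.ok) throw new Error(`Ollama answered ${res.status}`);
  return isLoaded(await res.json(), modelName);
}
```

Calling `checkPreload("qwen2.5:7b-instruct-q4_K_M").then(console.log)` on the server after boot should print `true` once the model is in memory; a monitoring job could run the same check on a schedule.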

A reverse proxy your IT team will accept

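Bind Ollama to localhost and put an authenticating proxy in front: TLS from your internal CA, an API key per department, request size limits, and structured logs for the audit trail. The minimal Node proxy below covers the key check, streaming pass-through, and logging. It listens on plain HTTP, so TLS termination and size limits still have to be added in front of it or inside it: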

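import http from "node:http";
import { createHash } from "node:crypto";

// Comma-separated keys, one per department. If API_KEYS is unset the set
// is empty and every request is rejected.
const KEYS = new Set((process.env.API_KEYS ?? "").split(",").filter(Boolean));

http.createServer((req, res) => {
  const key = req.headers["x-api-key"] ?? "";
  if (!KEYS.has(key)) {
    res.writeHead(401).end(); return;
  }
  // Drop the proxy credential and send the upstream its own Host header.
  const { "x-api-key": _key, ...rest } = req.headers;
  const headers = { ...rest, host: "127.0.0.1:11434" };
  const up = http.request(
    { host: "127.0.0.1", port: 11434, path: req.url,
      method: req.method, headers },
    (u) => { res.writeHead(u.statusCode, u.headers); u.pipe(res); });
  // Without this handler an unreachable Ollama would crash the proxy.
  up.on("error", () => { if (!res.headersSent) res.writeHead(502); res.end(); });
  req.pipe(up);                       // streams pass straight through
  // Log a hash prefix rather than part of the key, so logs hold no secrets.
  console.log(JSON.stringify({ t: Date.now(), p: req.url,
    dept: createHash("sha256").update(key).digest("hex").slice(0, 8) }));
}).listen(8443);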
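The proxy above does not yet enforce the request size limit mentioned earlier. One way to add it is to count body bytes as they arrive and answer 413 once a cap is exceeded. This is a sketch only: `makeByteLimiter` and `forwardWithLimit` are hypothetical helper names, the 1 MiB cap is an arbitrary placeholder, and the code assumes that `up` already has an `error` listener, since destroying an in-flight request can emit one.

```javascript
// Hypothetical cap; tune it to your longest legitimate prompt.
const MAX_BODY_BYTES = 1024 * 1024;

// Returns a function that adds up chunk sizes and reports whether the
// running total is still within maxBytes.
function makeByteLimiter(maxBytes) {
  let seen = 0;
  return (chunk) => {
    seen += chunk.length;
    return seen <= maxBytes;
  };
}

// Forwards the request body to `up` chunk by chunk instead of req.pipe(up).
// Simplification: backpressure from the upstream is ignored.
function forwardWithLimit(req, res, up, maxBytes = MAX_BODY_BYTES) {
  const withinLimit = makeByteLimiter(maxBytes);
  let rejected = false;
  req.on("data", (chunk) => {
    if (rejected) return;              // ignore the rest of an oversized body
    if (withinLimit(chunk)) { up.write(chunk); return; }
    rejected = true;
    up.destroy();                      // abandon the upstream request
    if (!res.headersSent) res.writeHead(413);
    res.end();
  });
  req.on("end", () => { if (!rejected) up.end(); });
}
```

In the handler, `forwardWithLimit(req, res, up)` would take the place of `req.pipe(up)`. Many teams enforce this limit in the TLS-terminating layer instead, which keeps the Node process smaller.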

Capacity math

One RTX 4090 (24 GB) serves a 7B-Q4 model at roughly 80–110 tokens/s for a single stream; with OLLAMA_NUM_PARALLEL=4, four simultaneous users share that throughput, so each sees roughly 20–28 tokens/s when all four slots are busy. For a 40-person department with bursty usage, where only a few people are generating at any given moment, one such box is typically enough. That is exactly the hardware story that makes the on-prem economics work.
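The arithmetic behind that estimate can be written out in a few lines. The sketch below takes the even-split simplification from the paragraph above at face value; the 300-token answer length is a hypothetical figure chosen for illustration, not a measurement.

```javascript
// Even split of single-stream throughput across concurrent users,
// the simplification used in the paragraph above.
function perUserTokensPerSec(singleStreamTps, parallel) {
  return singleStreamTps / parallel;
}

// Seconds needed to generate `tokens` tokens at a given rate.
function secondsForAnswer(tokens, tps) {
  return tokens / tps;
}

const PARALLEL = 4;          // OLLAMA_NUM_PARALLEL from the unit file
const ANSWER_TOKENS = 300;   // hypothetical answer length

for (const tps of [80, 110]) {
  const perUser = perUserTokensPerSec(tps, PARALLEL);
  const secs = secondsForAnswer(ANSWER_TOKENS, perUser).toFixed(1);
  console.log(`${tps} tok/s total -> ${perUser} tok/s per user, ${secs} s per answer`);
}
// 80 tok/s total -> 20 tok/s per user, 15.0 s per answer
// 110 tok/s total -> 27.5 tok/s per user, 10.9 s per answer
```

Swapping in your own measured single-stream rate and typical answer length shows quickly whether four parallel slots keep response times acceptable for your users.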

Want this running inside your own VPN?

Localized AI fine-tunes small open models on your data and deploys them on your hardware — GDPR by architecture, zero per-token costs. Average setup: 72 hours.
