Why a gateway, and why Go
Between users and your GPU you need authentication, fair queueing (the GPU serves N concurrent streams, not an unlimited number), timeouts, and metrics. Go's goroutines make a streaming proxy with backpressure almost embarrassingly short, and a single static binary deploys anywhere in the VPN.
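package main

import (
	"bytes"
	"crypto/subtle"
	"io"
	"log"
	"net/http"
	"os"
	"time"
)

// Fair queue: GPU concurrency is capped at 4; waiters time out.
var slots = make(chan struct{}, 4)

// validKey is a placeholder: it compares against one key read from the
// environment (GATEWAY_API_KEY is an assumed name). Swap in your own key store.
func validKey(key string) bool {
	want := os.Getenv("GATEWAY_API_KEY")
	return want != "" && subtle.ConstantTimeCompare([]byte(key), []byte(want)) == 1
}

func chat(w http.ResponseWriter, r *http.Request) {
	if !validKey(r.Header.Get("X-Api-Key")) {
		http.Error(w, "unauthorized", http.StatusUnauthorized)
		return
	}
	select {
	case slots <- struct{}{}: // acquired a GPU slot
		defer func() { <-slots }()
	case <-time.After(20 * time.Second):
		http.Error(w, "busy, try again", http.StatusTooManyRequests)
		return
	case <-r.Context().Done(): // client gave up while queued
		return
	}
	body, err := io.ReadAll(http.MaxBytesReader(w, r.Body, 1<<20))
	if err != nil {
		http.Error(w, "bad request body", http.StatusBadRequest)
		return
	}
	req, err := http.NewRequestWithContext(r.Context(), http.MethodPost,
		"http://127.0.0.1:11434/api/chat", bytes.NewReader(body))
	if err != nil {
		http.Error(w, "internal error", http.StatusInternalServerError)
		return
	}
	resp, err := http.DefaultClient.Do(req) // no client timeout: streams may run long
	if err != nil {
		http.Error(w, "upstream down", http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()
	w.Header().Set("Content-Type", "application/x-ndjson")
	w.WriteHeader(resp.StatusCode) // pass upstream errors through instead of masking them as 200
	fl, canFlush := w.(http.Flusher)
	buf := make([]byte, 4096)
	for { // token streaming + flush
		n, err := resp.Body.Read(buf)
		if n > 0 {
			w.Write(buf[:n])
			if canFlush {
				fl.Flush()
			}
		}
		if err != nil {
			return // client gone or EOF: the deferred release frees the slot
		}
	}
}

func main() {
	http.HandleFunc("/api/chat", chat)
	log.Fatal(http.ListenAndServe("127.0.0.1:8443", nil))
}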
The three behaviors that matter
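Backpressure: the buffered channel caps concurrent generations at four. Request #5 waits for a free slot, and any request that is still waiting after 20 seconds gets a clean 429 instead of melting the box. The cutoff is wait time, not position in line, so request #20 is rejected only if no slot frees up in time. Cancellation: r.Context() propagates a closed browser tab upstream, so abandoned generations stop burning GPU seconds. Streaming: flush per read; time-to-first-token is the metric users feel.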
What to add for production
Per-key token buckets, Prometheus counters (queue depth, TTFT, tokens/s), a circuit breaker that fails fast when Ollama restarts, and structured logs with request IDs but never message content. Our deployed version is about 400 lines in total, small enough that your IT team can read every line before it goes inside their network, which is precisely the point.
Localized AI fine-tunes small open models on your data and deploys them on your hardware — GDPR by architecture, zero per-token costs. Average setup: 72 hours.