A High-Performance Inference Gateway in Go: Routing, Queueing, Backpressure

Why a gateway, and why Go

Between users and your GPU you need authentication, fair queueing (the GPU serves N streams at a time, not an unlimited number), timeouts, and metrics. Go's goroutines make a streaming proxy with backpressure almost embarrassingly short, and a single static binary deploys anywhere inside the VPN.

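package main

import (
    "bytes"
    "crypto/subtle"
    "io"
    "log"
    "net/http"
    "os"
    "time"
)

// GPU concurrency is capped at 4; extra requests block until a slot frees or they time out.
var slots = make(chan struct{}, 4)

// validKey is a placeholder, not the deployed check: it compares the header
// against one key read from the (assumed) GATEWAY_API_KEY environment variable.
func validKey(key string) bool {
    want := os.Getenv("GATEWAY_API_KEY")
    return want != "" && subtle.ConstantTimeCompare([]byte(key), []byte(want)) == 1
}

func chat(w http.ResponseWriter, r *http.Request) {
    if !validKey(r.Header.Get("X-Api-Key")) {
        http.Error(w, "unauthorized", http.StatusUnauthorized)
        return
    }

    // Read the body before taking a slot so a slow upload cannot hold the GPU.
    body, err := io.ReadAll(http.MaxBytesReader(w, r.Body, 1<<20))
    if err != nil {
        http.Error(w, "bad or oversized body", http.StatusBadRequest)
        return
    }

    timer := time.NewTimer(20 * time.Second)
    defer timer.Stop()
    select {
    case slots <- struct{}{}: // acquired a GPU slot
        defer func() { <-slots }()
    case <-timer.C:
        http.Error(w, "busy, try again", http.StatusTooManyRequests)
        return
    case <-r.Context().Done(): // client gave up while queued
        return
    }

    req, err := http.NewRequestWithContext(r.Context(), http.MethodPost,
        "http://127.0.0.1:11434/api/chat", bytes.NewReader(body))
    if err != nil {
        http.Error(w, "internal error", http.StatusInternalServerError)
        return
    }

    // DefaultClient has no overall timeout; the request context bounds the stream.
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        http.Error(w, "upstream down", http.StatusBadGateway)
        return
    }
    defer resp.Body.Close()

    w.Header().Set("Content-Type", "application/x-ndjson")
    w.WriteHeader(resp.StatusCode) // pass upstream errors through instead of masking them as 200
    fl, canFlush := w.(http.Flusher)
    buf := make([]byte, 4096)
    for { // token streaming + flush
        n, err := resp.Body.Read(buf)
        if n > 0 {
            if _, werr := w.Write(buf[:n]); werr != nil {
                return // client gone
            }
            if canFlush {
                fl.Flush()
            }
        }
        if err != nil {
            return // EOF or cancelled context: the deferred release frees the slot
        }
    }
}

func main() {
    http.HandleFunc("/api/chat", chat)
    log.Fatal(http.ListenAndServe("127.0.0.1:8443", nil))
}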

The three behaviors that matter

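Backpressure: the buffered channel caps in-flight generations at four. Request #5 blocks on the channel send until a slot frees, and any request still waiting after 20 seconds gets a clean 429 instead of melting the box. The listing bounds how long a request waits, not how many requests wait. Cancellation: the upstream request carries r.Context(), which is cancelled when the client disconnects, so a closed browser tab aborts the upstream call and abandoned generations stop burning GPU seconds. Streaming: flush after every read; time-to-first-token (TTFT) is the metric users feel.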
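
The slot logic can be pulled out and exercised on its own. The following is a minimal sketch, not the deployed code: the gate type, its methods, and the two error values are hypothetical names introduced for illustration. It takes a done channel so that a handler could pass r.Context().Done() and stop waiting as soon as the client disconnects.

```go
package main

import (
    "errors"
    "fmt"
    "time"
)

var (
    errBusy      = errors.New("no slot before deadline")
    errCancelled = errors.New("caller gave up while queued")
)

// gate is a counting semaphore with a bounded wait (hypothetical helper).
type gate struct {
    slots chan struct{}
    wait  time.Duration
}

func newGate(n int, wait time.Duration) *gate {
    return &gate{slots: make(chan struct{}, n), wait: wait}
}

// acquire blocks until a slot is free, the wait elapses, or done is closed.
// A nil done channel never fires, so only the slot and the timer apply.
func (g *gate) acquire(done <-chan struct{}) error {
    timer := time.NewTimer(g.wait)
    defer timer.Stop()
    select {
    case g.slots <- struct{}{}:
        return nil
    case <-timer.C:
        return errBusy
    case <-done:
        return errCancelled
    }
}

// release frees one slot; call it exactly once per successful acquire.
func (g *gate) release() { <-g.slots }

func main() {
    g := newGate(1, 50*time.Millisecond)
    fmt.Println(g.acquire(nil)) // slot is free: <nil>
    fmt.Println(g.acquire(nil)) // slot is held: waits 50ms, then "no slot before deadline"
    g.release()
    fmt.Println(g.acquire(nil)) // free again: <nil>
}
```

In a handler, errBusy would map to the 429 and errCancelled to a silent return, since nobody is left to read the response. Blocked senders on a Go channel are served roughly in arrival order in the current runtime, but the language specification does not promise it, so strict fairness would need an explicit queue.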

What to add for production

Per-key token buckets, Prometheus counters (queue depth, TTFT, tokens/s), a circuit breaker that fails fast when Ollama restarts, and structured logs with request IDs but never message content. Our deployed version is about 400 lines in total, small enough that your IT team can read every line before it goes inside their network, which is precisely the point.
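
As an illustration of the first item, here is one way a per-key token bucket could look using only the standard library. This is a sketch under stated assumptions, not the deployed version: the limiter and bucket types, the rate and burst values, and the key names are all hypothetical.

```go
package main

import (
    "fmt"
    "sync"
    "time"
)

type bucket struct {
    tokens float64
    last   time.Time
}

// limiter holds one token bucket per API key (hypothetical helper).
type limiter struct {
    mu      sync.Mutex
    rate    float64 // tokens added per second
    burst   float64 // bucket capacity
    buckets map[string]*bucket
}

func newLimiter(rate, burst float64) *limiter {
    return &limiter{rate: rate, burst: burst, buckets: make(map[string]*bucket)}
}

// allow refills the key's bucket for the time elapsed since its last use,
// then spends one token if at least one is available.
func (l *limiter) allow(key string) bool {
    l.mu.Lock()
    defer l.mu.Unlock()
    now := time.Now()
    b, ok := l.buckets[key]
    if !ok {
        b = &bucket{tokens: l.burst, last: now} // new keys start with a full bucket
        l.buckets[key] = b
    }
    b.tokens += now.Sub(b.last).Seconds() * l.rate
    if b.tokens > l.burst {
        b.tokens = l.burst
    }
    b.last = now
    if b.tokens < 1 {
        return false
    }
    b.tokens--
    return true
}

func main() {
    l := newLimiter(0.5, 2) // one request every 2s, bursts of 2
    for i := 0; i < 3; i++ {
        fmt.Println("team-a", l.allow("team-a")) // true, true, false
    }
    fmt.Println("team-b", l.allow("team-b")) // true: keys do not share a bucket
}
```

In the handler, the check would sit right after authentication and before the slot wait, so a key that exceeds its budget is rejected without joining the queue. The map in this sketch grows with every key it sees; a long-running gateway would want to evict idle buckets, or use the golang.org/x/time/rate package in place of the hand-rolled arithmetic.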

Want this running inside your own VPN?

Localized AI fine-tunes small open models on your data and deploys them on your hardware — GDPR by architecture, zero per-token costs. Average setup: 72 hours.
