Guides from the machine room.

How small models, RAG pipelines, humanoid robots and world models actually work — written by the team that deploys them behind firewalls, with code you can run on your own hardware.

Engineering 08 Jun 2026 9 min Python

Structured Outputs and Function Calling With Local Models: JSON You Can Trust

Grammar-constrained decoding, JSON schema enforcement and tool calling with small models — no cloud required.

Engineering 07 Jun 2026 9 min C#

Local LLMs in .NET: Integrating On-Prem Models Into Enterprise C# Applications

Microsoft.Extensions.AI against an Ollama endpoint: dependency injection, streaming responses and structured output in a typical line-of-business app.

Engineering 05 Jun 2026 11 min Go

A High-Performance Inference Gateway in Go: Routing, Queueing, Backpressure

Why Go is the right tool between your users and your GPU: a gateway with token streaming, fair queueing and circuit breaking.

Engineering 02 Mar 2026 10 min Node.js

Build an On-Prem Support Chatbot with Node.js, Ollama and Your Ticket History

A streaming chat backend in Node.js: SSE, system prompts from your knowledge base, and zero external API calls.

Engineering 23 Feb 2026 7 min Bash

Quantization Demystified: How a 7B Model Fits in 5 GB Without Getting Dumb

GGUF, Q4_K_M, perplexity deltas and the honest accuracy trade-offs of running compressed models on modest hardware.

Engineering 09 Feb 2026 9 min Bash · Node.js

Running Ollama in Production: From Laptop Toy to Department Workhorse

Systemd hardening, model preloading, concurrency limits, and a reverse proxy in front of Ollama that your IT department will sign off on.