Fig. 1 — Inference stays inside. The boundary is not a policy. It is the architecture.
On-prem share
0%
Typical model size
≤0B params
Ongoing API cost
€0.00
Avg. deployment
0h
Process
From audit to go-live in three controlled steps.
01
Discovery
Use-case audit & feasibility
A structured workshop maps your use case, data landscape and compliance requirements. You leave with a concrete feasibility verdict and a fixed quote — no vague "it depends".
02
Tuning
Model selection & fine-tuning
We benchmark candidate models against your real documents and fine-tune the winner on your proprietary data — inside your environment or an air-gapped staging machine.
03
Sealing
VPN deployment & handover
The model goes live on your hardware, reachable only inside your VPN. We hand over weights, documentation, monitoring and a trained team. The system is yours — outright.
Services
Everything required to own your AI outright.
Core offering
Custom AI Model Deployment
A small open model — fine-tuned on your contracts, tickets, manuals or product data — running on a single GPU server in your rack. It answers in your terminology, follows your policies, and costs nothing per request. Most deployments replace four-figure monthly cloud-API bills with hardware you already own.
Use-case-specific fine-tuning on proprietary data
Runs on modest hardware — one workstation GPU is often enough
Full handover: weights, configs and documentation belong to you
SRV-02
VPN Infrastructure
We design and harden the network layer: WireGuard or IPsec tunnels, reverse proxies, certificates and access control — so the model is reachable for your team and invisible to everyone else.
SRV-03
Team Enablement
Hands-on workshops for your staff: prompt patterns for your specific model, escalation rules, and admin training so your IT team can operate and update the system independently.
SRV-04
Model Optimization
Quantization, speculative decoding and context-window tuning squeeze maximum throughput out of your hardware — often 3–5× faster inference without measurable quality loss.
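To illustrate why quantization matters for hardware sizing, here is a minimal sketch that estimates weight memory from parameter count and bits per weight. The function name, the 4.5 bits per weight for a Q4-style format and the flat 20% allowance for KV cache and runtime buffers are assumptions for illustration, not measured values.

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float, overhead: float = 0.20) -> float:
    """Rough VRAM estimate in GB: weight bytes plus a flat overhead fraction (assumed)."""
    weight_gb = params_billion * 1e9 * bits_per_weight / 8 / 1e9
    return weight_gb * (1 + overhead)


# Compare 16-bit weights with an assumed ~4.5 bits per weight for Q4-style formats.
for name, size in [("7B", 7.0), ("14B", 14.0), ("24B", 24.0)]:
    fp16 = estimate_vram_gb(size, 16)
    q4 = estimate_vram_gb(size, 4.5)
    print(f"{name}: FP16 ~{fp16:.1f} GB, Q4 ~{q4:.1f} GB")
```

Under these assumptions a 7B model at Q4 comes out a little under 5 GB, in the same range as the catalogue figures. Real usage depends on context length, batch size and runtime.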
Component catalogue
Small models. Serious capability.
We don't chase the biggest model — we specify the smallest one that solves your problem brilliantly. Current approved components:
Qwen 2.5
REF Q25 · 0.5–7B
The all-rounder. Outstanding multilingual quality — including German — and strong reasoning at tiny sizes. Default for assistants and document Q&A.
Reasoning
88
German fluency
92
Efficiency
90
Qwen 2.5 REF Q25 · 0.5–7B
License
Apache 2.0
VRAM (7B, Q4)
~5 GB
Context
128k tokens
Best for
Assistants, document Q&A, multilingual support
Mistral 7B
REF M7B · 7B
European engineering, Apache-licensed. Fast, predictable and a proven base for fine-tuning on industry-specific corpora.
Reasoning
84
Fine-tune fit
93
Efficiency
86
Mistral 7B REF M7B · 7B
License
Apache 2.0
VRAM (Q4)
~5 GB
Context
32k tokens
Best for
Industry fine-tunes, EU-sovereignty requirements
Phi-4 Mini
REF P4M · 3.8B
Punches far above its weight on logic and math. Ideal when reasoning quality matters and hardware budget is tight.
Reasoning
91
Math & logic
94
Efficiency
95
Phi-4 Mini REF P4M · 3.8B
License
MIT
VRAM (Q4)
~3 GB
Context
128k tokens
Best for
Reasoning on small hardware, structured extraction
Llama 3.2
REF L32 · 1–3B
Meta's compact line with a huge tooling ecosystem. Excellent for edge devices and lightweight internal copilots.
Reasoning
80
Ecosystem
96
Efficiency
92
Llama 3.2 REF L32 · 1–3B
License
Llama Community
VRAM (3B, Q4)
~2.5 GB
Context
128k tokens
Best for
Edge devices, lightweight copilots
DeepSeek R1
REF R1D · 1.5–14B
Distilled reasoning models that think step by step. Our pick for analysis, code review and complex multi-stage workflows.
Reasoning
96
Code & analysis
90
Efficiency
78
DeepSeek R1 REF R1D · 1.5–14B
License
MIT
VRAM (14B, Q4)
~10 GB
Context
64k tokens
Best for
Analysis, code review, multi-stage workflows
Gemma 3
REF G3 · 1–4B
Google DeepMind lineage with strong instruction following and vision variants. A solid pick for mixed text-and-image intake.
Reasoning
85
Vision intake
88
Efficiency
91
Gemma 3 REF G3 · 1–4B
License
Gemma Terms
VRAM (4B, Q4)
~3.5 GB
Context
128k tokens
Best for
Mixed text + image pipelines, form intake
SmolLM2
REF SL2 · 135M–1.7B
Hugging Face's fully open tiny models. When the task is narrow and latency is everything, smaller wins.
nomic-embed
REF NE1 · 137M
The retrieval backbone. Turns your documents into searchable vectors entirely on-prem — the foundation of every private RAG system we build.
Retrieval
89
Long context
87
Efficiency
98
nomic-embed REF NE1 · 137M
License
Apache 2.0
VRAM
<1 GB (CPU fine)
Context
8k tokens
Best for
Private RAG, semantic search, deduplication
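What "searchable vectors" means in practice can be sketched as follows: each document becomes a vector, and a query is matched by cosine similarity. The three-dimensional vectors and document names are hand-written stand-ins for real embedding output, which has hundreds of dimensions; this is an illustration, not our production pipeline.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=2):
    """Return the k (score, doc_id) pairs most similar to the query."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in index.items()]
    return sorted(scored, reverse=True)[:k]


# Toy index: hypothetical document names with made-up vectors.
index = {
    "warranty_policy": [0.9, 0.1, 0.0],
    "pump_manual": [0.1, 0.9, 0.2],
    "holiday_schedule": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # stand-in for the embedded user question

for score, doc_id in top_k(query, index):
    print(f"{doc_id}: {score:.3f}")
```

The best-matching documents are then handed to the language model as context, which is the retrieval half of a RAG system.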
Mixtral 8x7B
REF MX8 · 47B MoE
Mixture-of-experts: 47B of knowledge, only ~13B active per token. Near-large-model quality at mid-size speed — when a 7B tops out, this is the next rung.
Reasoning
93
Throughput
84
Efficiency
80
Mixtral 8x7B REF MX8 · 47B MoE
License
Apache 2.0
VRAM (Q4)
~28 GB
Context
32k tokens
Best for
Complex analysis, multi-step drafting, hard German prose
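The "only ~13B active per token" figure follows from sparse routing: a gate scores the experts and only the best few run for each token. Mixtral routes each token to 2 of 8 experts. The gate scores below are made-up numbers and the helper is a simplified sketch, not Mixtral's actual implementation.

```python
import math


def route_top_k(gate_logits, k=2):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    ranked = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    exps = [math.exp(gate_logits[i]) for i in ranked]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(ranked, exps)]


# Hypothetical gate scores for one token across 8 experts.
gate_logits = [0.1, 2.0, -1.0, 0.3, 1.5, -0.5, 0.0, 0.7]
chosen = route_top_k(gate_logits)

for expert, weight in chosen:
    print(f"expert {expert}: weight {weight:.2f}")
print(f"experts run for this token: {len(chosen)} of {len(gate_logits)}")
```

Compute per token tracks the active experts, but all expert weights still have to sit in memory, which is why the VRAM figure reflects the full 47B.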
Mistral Small 3
REF MS3 · 24B
The dense workhorse above 7B: strong instruction following and function calling with single-GPU deployability on a 24 GB card.
Reasoning
92
Function calling
91
Efficiency
82
Mistral Small 3 REF MS3 · 24B
License
Apache 2.0
VRAM (Q4)
~15 GB
Context
32k tokens
Best for
Agent backends, demanding assistants, tool use
Qwen 2.5 Coder
REF QC7 · 1.5–7B
Purpose-trained on code: completion, review, SQL generation. Your developers get an assistant that never uploads a line of proprietary source.
OLMo 2
REF OL2 · 7–13B
AllenAI's fully open model — weights, training data and code all published. When auditors ask "what is this trained on?", this one has an answer.
Reasoning
86
Auditability
99
Efficiency
84
OLMo 2 REF OL2 · 7–13B
License
Apache 2.0
VRAM (7B, Q4)
~5 GB
Context
4k tokens
Best for
Maximum-transparency deployments, research-adjacent work
Whisper Large v3
REF WH3 · 1.5B
OpenAI's speech-to-text, MIT-licensed and excellent in German. Meetings, support calls and voice notes transcribed without audio ever leaving the building.
German STT
94
Noise robustness
88
Efficiency
86
Whisper Large v3 REF WH3 · 1.5B
License
MIT
VRAM
~3 GB
Modality
Audio → text, 99 languages
Best for
Meeting transcription, call analysis, dictation
BGE-M3
REF BG3 · 568M
The multilingual retrieval heavyweight: dense, sparse and multi-vector search in one model, 100+ languages. Our pick when German and English documents mix.
Multilingual retrieval
95
Hybrid search
93
Efficiency
90
BGE-M3 REF BG3 · 568M
License
MIT
VRAM
~1.5 GB (CPU fine)
Context
8k tokens
Best for
Multilingual RAG, hybrid dense+sparse retrieval
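Hybrid dense+sparse retrieval can be pictured as blending two scores per document: a semantic (dense) score and a keyword (sparse) score. The sketch below uses a plain weighted sum. The scores, document names and the 0.6 weight are invented for illustration, and both score sets are assumed to be normalized to the 0 to 1 range.

```python
def hybrid_rank(dense_scores, sparse_scores, dense_weight=0.6):
    """Blend per-document dense and sparse scores with a weighted sum."""
    sparse_weight = 1.0 - dense_weight
    doc_ids = set(dense_scores) | set(sparse_scores)
    blended = {
        doc_id: dense_weight * dense_scores.get(doc_id, 0.0)
        + sparse_weight * sparse_scores.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    return sorted(blended.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores for an English query over a mixed German/English corpus.
dense = {"wartungsplan_de": 0.82, "maintenance_plan_en": 0.79, "invoice_2021": 0.30}
sparse = {"maintenance_plan_en": 0.90, "invoice_2021": 0.10}

for doc_id, score in hybrid_rank(dense, sparse):
    print(f"{doc_id}: {score:.3f}")
```

The German document has no keyword overlap with the English query, yet it still ranks second on its dense score alone. That is the practical benefit when languages mix.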
Security & compliance
GDPR isn't a feature here. It's the floor plan.
outbound-audit — your-server:~
Zero external APIs
No OpenAI, no cloud endpoints, no third-party processors. Inference happens on your silicon — there is simply no wire for data to leave on.
VPN-enclosed inference
The model is only reachable through your private network. Off-VPN, it doesn't exist. Access maps directly to your existing identity management.
GDPR by architecture
No data transfer means no transfer agreements, no US-cloud legal gymnastics, no Schrems headaches. Your DPO signs off in one meeting.
No telemetry
Nothing phones home — not the model, not the runtime, not our tooling. We verify it with outbound traffic audits and document it for your records.
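One building block of an outbound traffic audit can be sketched like this: take the observed connection destinations and flag any that are not private, loopback or link-local. The process names and addresses are hypothetical. In a real audit the list would come from firewall logs or a socket listing, and this check would be one part of a larger review.

```python
import ipaddress


def external_destinations(connections):
    """Return (process, address) pairs whose destination is a public address."""
    flagged = []
    for process, address in connections:
        ip = ipaddress.ip_address(address)
        if not (ip.is_private or ip.is_loopback or ip.is_link_local):
            flagged.append((process, address))
    return flagged


# Hypothetical sample of observed outbound connections.
observed = [
    ("inference-server", "10.8.0.12"),
    ("inference-server", "127.0.0.1"),
    ("metrics-agent", "192.168.10.5"),
    ("updater", "93.184.216.34"),
]

for process, address in external_destinations(observed):
    print(f"outbound to public address: {process} -> {address}")
```

An empty result over a representative capture window is the kind of evidence that can be documented for your records.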
Pricing
One-time engineering. Zero rent.
All prices net. After handover the model is yours — run it for years at the cost of electricity.
"Our support team answers tickets with a model trained on twelve years of our own resolutions. Quality went up, and our customer data never touched a cloud. The works council approved it in a single session."
Markus Steiner
IT Director, machinery manufacturer · Stuttgart
"As a Swiss fiduciary we simply cannot send client documents to US providers. Localized AI gave us a contract-analysis assistant that runs in our own server room. Setup to go-live took four days."
"We replaced a €3,200/month API bill with a fine-tuned 7B model on a single GPU box. Same task quality for our German-language product texts, fully amortized within the first quarter."
Thomas Eder
Head of E-Commerce, retail group · Vienna
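The amortization claim in the quote above is simple arithmetic, sketched here. The €3,200 monthly bill comes from the testimonial. The one-time cost is a hypothetical placeholder, not a quoted price, and electricity is ignored.

```python
import math


def payback_months(one_time_cost_eur: float, replaced_monthly_bill_eur: float) -> int:
    """Whole months until the avoided monthly bill covers the one-time cost."""
    return math.ceil(one_time_cost_eur / replaced_monthly_bill_eur)


monthly_bill = 3200.0           # figure from the testimonial
assumed_one_time_cost = 9000.0  # hypothetical, for illustration only

print(payback_months(assumed_one_time_cost, monthly_bill))
```

At €3,200 per month, any one-time cost up to €9,600 is recovered within three months.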
Fresh from the blog
What's inside a humanoid robot? Take it apart.
Our most popular guide is fully interactive: explore a humanoid component by component (actuators, IMU, the VLA brain, the physics of balance), each explained on hover or tap. Then see how the same model classes land in your business, minus the legs.
Tell us about your use case. We reply within one business day with an honest feasibility assessment — including "this doesn't need AI" when that's the truth.
localai — on-prem guide
runs on this page · no cloud