On-premises AI · Built for the DACH Mittelstand

The AI that never leaves the building.

We fine-tune small open models on your data and install them inside your VPN. No cloud. No per-token bills. No data crossing your perimeter — ever.

Fig. 1 (diagram): external AI APIs on the public internet sit outside, with 0 bytes going out; inside your perimeter (VPN), your model connects to ERP, documents, team and ticketing. Inference stays inside. The boundary is not a policy. It is the architecture.

Process

From audit to go-live in three controlled steps.

  1. Discovery

    Use-case audit & feasibility

    A structured workshop maps your use case, data landscape and compliance requirements. You leave with a concrete feasibility verdict and a fixed quote — no vague "it depends".

  2. Tuning

    Model selection & fine-tuning

    We benchmark candidate models against your real documents and fine-tune the winner on your proprietary data, either inside your environment or on an air-gapped staging machine.

  3. Sealing

    VPN deployment & handover

    The model goes live on your hardware, reachable only inside your VPN. We hand over weights, documentation, monitoring and a trained team. The system is yours — outright.
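Step 2's "benchmark, then fine-tune the winner" can be pictured with a minimal Python sketch. It assumes each candidate model is wrapped as a plain callable mapping a prompt to an answer, and it scores candidates by exact-match accuracy on a small labelled set drawn from your documents. The helper names, the toy candidates and the scoring rule are hypothetical stand-ins, not our actual evaluation harness; real benchmarks usually mix several metrics.

```python
from typing import Callable, Dict, List, Tuple

# A labelled example: (prompt taken from a real document, expected answer).
Example = Tuple[str, str]
# Hypothetical wrapper: any locally served model exposed as prompt -> answer.
Model = Callable[[str], str]


def exact_match_accuracy(model: Model, examples: List[Example]) -> float:
    """Share of examples where the answer matches the reference (case/whitespace-insensitive)."""
    if not examples:
        raise ValueError("need at least one labelled example")
    hits = sum(
        model(prompt).strip().lower() == expected.strip().lower()
        for prompt, expected in examples
    )
    return hits / len(examples)


def select_best(candidates: Dict[str, Model], examples: List[Example]) -> Tuple[str, Dict[str, float]]:
    """Score every candidate on the same examples; return the winner and all scores."""
    scores = {name: exact_match_accuracy(m, examples) for name, m in candidates.items()}
    winner = max(scores, key=scores.get)
    return winner, scores


if __name__ == "__main__":
    # Toy stand-ins for real models, for illustration only.
    examples = [("Lead time for part A?", "3 days"), ("Warranty for part B?", "24 months")]
    lookup = dict(examples)
    candidates = {
        "candidate-a": lambda p: lookup.get(p, ""),  # answers from the reference data
        "candidate-b": lambda p: "3 days",           # always gives the same answer
    }
    print(select_best(candidates, examples))
```

Scoring every candidate on the identical example set is the point of the design: the winner is chosen on your documents, not on public leaderboards.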

Services

Everything required to own your AI outright.

VPN Infrastructure

We design and harden the network layer: WireGuard or IPsec tunnels, reverse proxies, certificates and access control — so the model is reachable for your team and invisible to everyone else.
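"Reachable for your team, invisible to everyone else" ultimately comes down to an address check at the reverse proxy or inference server. Here is a minimal Python sketch of that check. The two address ranges, a WireGuard tunnel subnet of 10.8.0.0/24 and an admin LAN of 192.168.50.0/24, are hypothetical placeholders. In a real deployment the rule lives in the proxy or firewall configuration, and the model server additionally listens only on the VPN interface.

```python
import ipaddress

# Hypothetical address ranges: replace with your own VPN and office networks.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("10.8.0.0/24"),      # assumed WireGuard tunnel subnet
    ipaddress.ip_network("192.168.50.0/24"),  # assumed on-site admin LAN
]


def is_allowed(client_ip: str, networks=ALLOWED_NETWORKS) -> bool:
    """Return True only if the client address falls inside an allowed network."""
    try:
        addr = ipaddress.ip_address(client_ip)
    except ValueError:
        return False  # malformed address: deny by default
    # An IPv6 address never matches an IPv4 network (and vice versa).
    return any(addr in net for net in networks)


if __name__ == "__main__":
    for ip in ["10.8.0.17", "203.0.113.5", "not-an-ip"]:
        print(ip, "allow" if is_allowed(ip) else "deny")
```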

Team Enablement

Hands-on workshops for your staff: prompt patterns for your specific model, escalation rules, and admin training so your IT team can operate and update the system independently.

Model Optimization

Quantization, speculative decoding and context-window tuning squeeze maximum throughput out of your hardware — often 3–5× faster inference without measurable quality loss.
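The VRAM figures in the component catalogue follow from simple arithmetic. Weight memory is roughly parameter count × bits per weight ÷ 8 bytes, plus headroom for the KV cache and runtime buffers. Below is a back-of-envelope Python sketch. The assumed values are about 4.5 effective bits per weight for a typical 4-bit format (scales included) and a 1.2× overhead factor, both chosen for illustration. Real usage varies with runtime, quantization scheme and context length.

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    """Rough VRAM estimate in GB (10^9 bytes).

    weights = params * bits / 8 bytes; `overhead` is an assumed multiplier
    for KV cache, activations and runtime buffers at modest context lengths.
    """
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9


if __name__ == "__main__":
    # Assumed effective bit widths: FP16 = 16, a typical 4-bit quant ~4.5.
    for label, bits in [("FP16", 16), ("Q4", 4.5)]:
        print(f"7B @ {label}: ~{estimate_vram_gb(7, bits):.1f} GB")
```

The ratio matters more than the exact numbers: going from 16 to roughly 4.5 bits per weight cuts memory by about 3.5×. That is what moves a 7B model from a data-centre card onto a mid-range workstation GPU.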

Component catalogue

Small models. Serious capability.

We don't chase the biggest model — we specify the smallest one that solves your problem brilliantly. Current approved components:

Qwen 2.5

REF Q25 · 0.5–7B

The all-rounder. Outstanding multilingual quality — including German — and strong reasoning at tiny sizes. Default for assistants and document Q&A.

Reasoning: 88
German fluency: 92
Efficiency: 90

License: Apache 2.0 (except the 3B size: Qwen Research License)
VRAM (7B, Q4): ~5 GB
Context: 32k tokens (0.5–3B), 128k tokens (7B)
Best for: Assistants, document Q&A, multilingual support

Mistral 7B

REF M7B · 7B

European engineering, Apache-licensed. Fast, predictable and a proven base for fine-tuning on industry-specific corpora.

Reasoning: 84
Fine-tune fit: 93
Efficiency: 86

License: Apache 2.0
VRAM (Q4): ~5 GB
Context: 32k tokens
Best for: Industry fine-tunes, EU-sovereignty requirements

Phi-4 Mini

REF P4M · 3.8B

Punches far above its weight on logic and math. Ideal when reasoning quality matters and hardware budget is tight.

Reasoning: 91
Math & logic: 94
Efficiency: 95

License: MIT
VRAM (Q4): ~3 GB
Context: 128k tokens
Best for: Reasoning on small hardware, structured extraction

Llama 3.2

REF L32 · 1–3B

Meta's compact line with a huge tooling ecosystem. Excellent for edge devices and lightweight internal copilots.

Reasoning: 80
Ecosystem: 96
Efficiency: 92

License: Llama Community
VRAM (3B, Q4): ~2.5 GB
Context: 128k tokens
Best for: Edge devices, lightweight copilots

DeepSeek R1

REF R1D · 1.5–14B

Distilled reasoning models that think step by step. Our pick for analysis, code review and complex multi-stage workflows.

Reasoning: 96
Code & analysis: 90
Efficiency: 78

License: MIT (distilled models also inherit their Qwen 2.5 or Llama 3.1 base-model licenses)
VRAM (14B, Q4): ~10 GB
Context: 64k tokens
Best for: Analysis, code review, multi-stage workflows
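R1-style models write out their step-by-step reasoning before the final answer, wrapped in `<think>` and `</think>` tags. Some workflows log the reasoning for review but pass only the answer downstream. For those, here is a small Python sketch that separates the two. It also tolerates outputs where the opening tag was already inserted by the chat template, so only the closing tag appears. The function name is a hypothetical helper, not part of any runtime's API.

```python
from typing import Tuple

THINK_OPEN, THINK_CLOSE = "<think>", "</think>"


def split_reasoning(output: str) -> Tuple[str, str]:
    """Split a raw R1-style completion into (reasoning, answer).

    If no closing tag is present, the whole text is treated as the answer.
    """
    head, sep, tail = output.partition(THINK_CLOSE)
    if not sep:
        return "", output.strip()
    # The opening tag may be absent if the prompt template already supplied it.
    reasoning = head.strip().removeprefix(THINK_OPEN).strip()
    return reasoning, tail.strip()


if __name__ == "__main__":
    raw = "<think>Clause 4 caps liability; clause 9 overrides it.</think>Clause 9 governs liability."
    reasoning, answer = split_reasoning(raw)
    print("REASONING:", reasoning)
    print("ANSWER:", answer)
```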

Gemma 3

REF G3 · 1–4B

Google DeepMind lineage with strong instruction following and vision variants. A solid pick for mixed text-and-image intake.

Reasoning: 85
Vision intake: 88
Efficiency: 91

License: Gemma Terms
VRAM (4B, Q4): ~3.5 GB
Context: 32k tokens (1B), 128k tokens (4B)
Best for: Mixed text + image pipelines, form intake

SmolLM2

REF SL2 · 135M–1.7B

Hugging Face's fully open tiny models. When the task is narrow and latency is everything, smaller wins.

Reasoning: 68
Latency: 99
Efficiency: 99

License: Apache 2.0
VRAM (1.7B, Q4): ~1.5 GB
Context: 8k tokens
Best for: Classification, routing, autocomplete, CPU-only hosts

nomic-embed

REF NE1 · 137M

The retrieval backbone. Turns your documents into searchable vectors entirely on-prem — the foundation of every private RAG system we build.

Retrieval: 89
Long context: 87
Efficiency: 98

License: Apache 2.0
VRAM: <1 GB (CPU fine)
Context: 8k tokens
Best for: Private RAG, semantic search, deduplication
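Retrieval in a private RAG system boils down to nearest-neighbour search over embedding vectors. Here is a minimal Python sketch of that step, assuming the vectors have already been produced on-prem by an embedding model such as nomic-embed. The three-dimensional toy vectors and document names are made up for illustration, since real embeddings have hundreds of dimensions. A production system would also use a vector index rather than this linear scan.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float], index: Dict[str, Sequence[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k documents whose embeddings are most similar to the query."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


if __name__ == "__main__":
    # Toy vectors standing in for real embedding-model output.
    index = {
        "maintenance-manual.pdf": [0.9, 0.1, 0.0],
        "price-list.xlsx": [0.1, 0.9, 0.1],
        "service-contract.docx": [0.7, 0.2, 0.3],
    }
    query = [0.8, 0.1, 0.1]  # stands in for the embedding of a user question
    for doc_id, score in top_k(query, index):
        print(f"{score:.3f}  {doc_id}")
```

The retrieved passages are then placed into the generator model's prompt. Both steps run on your hardware, so neither the documents nor the questions leave the network.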

Mixtral 8x7B

REF MX8 · 47B MoE

Mixture-of-experts: ~47B parameters in total, but only ~13B active per token. Near-large-model quality at mid-size speed; when a 7B tops out, this is the next rung.

Reasoning: 93
Throughput: 84
Efficiency: 80

License: Apache 2.0
VRAM (Q4): ~28 GB
Context: 32k tokens
Best for: Complex analysis, multi-step drafting, hard German prose
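Why only ~13B of ~47B parameters do work per token: in each layer a small router scores all eight experts, and only the top two run, their outputs mixed by the router's weights. Shared weights such as attention and embeddings always run, so "active" means shared weights plus two experts, not 2/8 of the total. All experts still have to sit in memory, which is why the VRAM figure tracks the full model. Here is a toy Python sketch of top-2 routing. The "experts" are hypothetical scalar functions, whereas real experts are feed-forward blocks operating on vectors. The shared/expert split is solved from the rounded catalogue figures (shared + 8e ≈ 47, shared + 2e ≈ 13), so treat it as approximate.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(xs: Sequence[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x: float, router_logits: Sequence[float],
                experts: Sequence[Callable[[float], float]], k: int = 2) -> float:
    """One top-k mixture-of-experts step for a single token.

    Only the k experts with the highest router scores are evaluated; their
    outputs are mixed with softmax weights taken over the selected scores.
    """
    chosen = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    weights = softmax([router_logits[i] for i in chosen])
    return sum(w * experts[i](x) for w, i in zip(weights, chosen))


def param_counts(shared_b: float, per_expert_b: float, n_experts: int = 8, k: int = 2) -> Tuple[float, float]:
    """(total, active-per-token) parameters in billions for a top-k MoE model."""
    return shared_b + n_experts * per_expert_b, shared_b + k * per_expert_b


if __name__ == "__main__":
    # Eight toy "experts": expert i multiplies its input by i + 1.
    experts = [lambda x, s=i + 1: s * x for i in range(8)]
    logits = [0.1, 2.0, -1.0, 0.5, 1.0, -0.3, 0.0, 0.2]  # router scores for one token
    print(moe_forward(1.0, logits, experts))  # evaluates experts 1 and 4 only
    # Approximate split from ~47B total / ~13B active: ~1.67B shared, ~5.67B per expert.
    print(param_counts(1.67, 5.67))
```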

Mistral Small 3

REF MS3 · 24B

The dense workhorse above 7B: strong instruction following and function calling, and once quantized it runs on a single 24 GB GPU.

Reasoning: 92
Function calling: 91
Efficiency: 82

License: Apache 2.0
VRAM (Q4): ~15 GB
Context: 32k tokens
Best for: Agent backends, demanding assistants, tool use
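With function calling, the model replies with a structured request (a tool name plus arguments) instead of prose, and your backend executes it. Here is a minimal Python sketch of the backend side, assuming the model's tool call has already been extracted as a JSON string of the form `{"name": ..., "arguments": ...}`. The exact message format depends on the serving runtime. The `get_stock_level` tool and its inventory data are hypothetical. The explicit allow-list is the important design choice: the model can only trigger functions you registered.

```python
import json
from typing import Any, Callable, Dict


def get_stock_level(sku: str) -> Dict[str, Any]:
    """Hypothetical in-house tool; the dict stands in for an ERP lookup."""
    fake_inventory = {"PUMP-200": 14, "VALVE-35": 0}
    return {"sku": sku, "in_stock": fake_inventory.get(sku, 0)}


# Explicit allow-list: the model can only trigger functions registered here.
TOOLS: Dict[str, Callable[..., Any]] = {"get_stock_level": get_stock_level}


def dispatch(tool_call_json: str) -> str:
    """Execute one model-issued tool call and return a JSON result for the next turn."""
    call = json.loads(tool_call_json)
    name = call.get("name")
    if name not in TOOLS:
        return json.dumps({"error": f"unknown tool: {name}"})
    args = call.get("arguments", {})
    if isinstance(args, str):  # some runtimes send arguments as a JSON-encoded string
        args = json.loads(args)
    try:
        result = TOOLS[name](**args)
    except TypeError as exc:  # wrong or missing arguments
        return json.dumps({"error": str(exc)})
    return json.dumps(result)


if __name__ == "__main__":
    print(dispatch('{"name": "get_stock_level", "arguments": {"sku": "PUMP-200"}}'))
```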

Qwen 2.5 Coder

REF QC7 · 1.5–7B

Purpose-trained on code: completion, review, SQL generation. Your developers get an assistant that never uploads a line of proprietary source.

Code quality: 92
SQL & APIs: 90
Efficiency: 90

License: Apache 2.0 (except the 3B size: Qwen Research License)
VRAM (7B, Q4): ~5 GB
Context: 32k tokens (1.5–3B), 128k tokens (7B)
Best for: In-house coding assistant, SQL generation, code review

IBM Granite 3

REF GR3 · 2–8B

Enterprise-bred: trained on curated, legally vetted data with published provenance. The model your legal department approves fastest.

RAG fidelity: 90
Compliance story: 96
Efficiency: 88

License: Apache 2.0
VRAM (8B, Q4): ~5.5 GB
Context: 128k tokens (Granite 3.1 and later)
Best for: Regulated industries, enterprise RAG, audited deployments

OLMo 2

REF OL2 · 7–13B

AllenAI's fully open model — weights, training data and code all published. When auditors ask "what is this trained on?", this one has an answer.

Reasoning: 86
Auditability: 99
Efficiency: 84

License: Apache 2.0
VRAM (7B, Q4): ~5 GB
Context: 4k tokens
Best for: Maximum-transparency deployments, research-adjacent work

Whisper Large v3

REF WH3 · 1.5B

OpenAI's speech-to-text, MIT-licensed and excellent in German. Meetings, support calls and voice notes transcribed without audio ever leaving the building.

German STT: 94
Noise robustness: 88
Efficiency: 86

License: MIT
VRAM: ~3 GB
Modality: Audio → text, 99 languages
Best for: Meeting transcription, call analysis, dictation

BGE-M3

REF BG3 · 568M

The multilingual retrieval heavyweight: dense, sparse and multi-vector search in one model, 100+ languages. Our pick when German and English documents mix.

Multilingual retrieval: 95
Hybrid search: 93
Efficiency: 90

License: MIT
VRAM: ~1.5 GB (CPU fine)
Context: 8k tokens
Best for: Multilingual RAG, hybrid dense+sparse retrieval
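Hybrid search scores each document twice, once by dense vector similarity and once by lexical overlap of weighted terms, then blends the two. This keeps exact matches on part numbers or legal terms from being lost. Here is a minimal Python sketch of that fusion. It roughly mirrors BGE-M3's dense + sparse mode, where the lexical score sums the products of query and document weights over shared terms. The term weights, vectors and the 0.6/0.4 blend are hypothetical; in practice both representations come from the model, and the blend weights are tuned on your data.

```python
import math
from typing import Dict, Sequence


def dense_score(q: Sequence[float], d: Sequence[float]) -> float:
    """Cosine similarity of two dense embeddings."""
    dot = sum(x * y for x, y in zip(q, d))
    norm = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(y * y for y in d))
    return dot / norm if norm else 0.0


def sparse_score(q: Dict[str, float], d: Dict[str, float]) -> float:
    """Lexical match: sum of weight products over terms present in both."""
    return sum(w * d[t] for t, w in q.items() if t in d)


def hybrid_score(q_dense: Sequence[float], d_dense: Sequence[float],
                 q_sparse: Dict[str, float], d_sparse: Dict[str, float],
                 w_dense: float = 0.6, w_sparse: float = 0.4) -> float:
    """Weighted blend of dense and sparse relevance (blend weights are assumptions)."""
    return w_dense * dense_score(q_dense, d_dense) + w_sparse * sparse_score(q_sparse, d_sparse)


if __name__ == "__main__":
    query = {"dense": [0.2, 0.9, 0.1], "sparse": {"ventil": 0.5, "ds-4471": 0.8}}
    docs = {
        "datasheet DS-4471 (de)": {"dense": [0.3, 0.8, 0.2], "sparse": {"ds-4471": 0.9, "ventil": 0.4}},
        "general valve FAQ (en)": {"dense": [0.2, 0.9, 0.0], "sparse": {"valve": 0.6}},
    }
    ranked = sorted(
        docs,
        key=lambda name: hybrid_score(query["dense"], docs[name]["dense"],
                                      query["sparse"], docs[name]["sparse"]),
        reverse=True,
    )
    print(ranked)
```

In this toy data the FAQ has the slightly higher dense score. The datasheet still ranks first, because the exact part number contributes a strong lexical match.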

Security & compliance

GDPR isn't a feature here. It's the floor plan.

Zero external APIs

No OpenAI, no cloud endpoints, no third-party processors. Inference happens on your silicon — there is simply no wire for data to leave on.

VPN-enclosed inference

The model is only reachable through your private network. Off-VPN, it doesn't exist. Access maps directly to your existing identity management.

GDPR by architecture

No data transfer means no transfer agreements, no US-cloud legal gymnastics, no Schrems headaches. Your DPO signs off in one meeting.

No telemetry

Nothing phones home — not the model, not the runtime, not our tooling. We verify it with outbound traffic audits and document it for your records.
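One simple building block of an outbound traffic audit is a snapshot of established TCP connections whose remote end is not on a private network. Here is a minimal Python sketch for Linux hosts that reads the kernel's `/proc/net/tcp` table. It assumes IPv4 only and a little-endian CPU (x86-64, most ARM servers), because the table stores addresses in host byte order. It is an illustration, not our audit tooling. A real audit also covers DNS, UDP, IPv6 and firewall logs over time, not a single snapshot.

```python
import ipaddress
import os
from typing import Iterable, List, Tuple

TCP_ESTABLISHED = "01"  # state code used in /proc/net/tcp


def decode_addr(hex_addr: str) -> Tuple[ipaddress.IPv4Address, int]:
    """Decode 'AABBCCDD:PPPP' from /proc/net/tcp (IPv4 stored in host byte order)."""
    ip_hex, port_hex = hex_addr.split(":")
    ip = ipaddress.IPv4Address(bytes.fromhex(ip_hex)[::-1])  # reverse bytes: assumes little-endian host
    return ip, int(port_hex, 16)


def external_connections(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Return (remote_ip, remote_port) for established connections to non-private addresses."""
    found = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4 or fields[0] == "sl":
            continue  # skip the header and malformed lines
        remote_ip, remote_port = decode_addr(fields[2])
        if fields[3] == TCP_ESTABLISHED and not (remote_ip.is_private or remote_ip.is_loopback):
            found.append((str(remote_ip), remote_port))
    return found


if __name__ == "__main__":
    path = "/proc/net/tcp"
    if os.path.exists(path):
        with open(path) as table:
            hits = external_connections(table)
        print(hits or "no established IPv4 connections to public addresses")
    else:
        print("this sketch needs Linux (/proc/net/tcp not found)")
```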

Pricing

One-time engineering. Zero rent.

All prices net. After handover the model is yours — run it for years at the cost of electricity.
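Whether one-time engineering beats a per-token API bill is plain break-even arithmetic: months to amortize = one-time cost ÷ (monthly API bill − monthly running cost). Here is a Python sketch of that calculation. Every input in the example is a hypothetical placeholder, including the GPU box's average power draw and the electricity price. Add hardware to the one-time cost only if you buy new hardware for the project.

```python
import math


def monthly_power_cost(avg_kw: float, eur_per_kwh: float, hours: float = 24 * 30) -> float:
    """Electricity cost per month for a machine drawing `avg_kw` kilowatts on average."""
    return avg_kw * hours * eur_per_kwh


def breakeven_months(one_time_eur: float, api_bill_per_month: float, running_per_month: float) -> float:
    """Months until the one-time spend is recovered by avoided API fees."""
    savings = api_bill_per_month - running_per_month
    if savings <= 0:
        return math.inf  # on-prem never pays back at these numbers
    return one_time_eur / savings


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    power = monthly_power_cost(avg_kw=0.4, eur_per_kwh=0.30)
    months = breakeven_months(one_time_eur=9_000, api_bill_per_month=2_500, running_per_month=power)
    print(f"power ~ EUR {power:.0f}/month, break-even after ~ {months:.1f} months")
```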

Starter

€1,990 one-time

A focused proof of value for one use case.

  • 1 fine-tuned model, 1 use case
  • Deployment on your existing hardware
  • Basic VPN access configuration
  • Handover documentation
  • 30 days email support
Start with Starter

Enterprise

Custom, scoped per project

Multi-site, multi-model, audit-heavy environments.

  • Unlimited models & use cases
  • High-availability & air-gapped options
  • Compliance documentation pack (GDPR, ISO)
  • Dedicated engineer & SLA
  • Custom integrations (ERP, DMS, ticketing)
Talk to us

Acceptance reports

Signed off across DACH.

"Our support team answers tickets with a model trained on twelve years of our own resolutions. Quality went up, and our customer data never touched a cloud. The works council approved it in a single session."

Markus Steiner IT Director, machinery manufacturer · Stuttgart

"As a Swiss fiduciary we simply cannot send client documents to US providers. Localized AI gave us a contract-analysis assistant that runs in our own server room. Setup to go-live took four days."

Claudia Brunner Managing Partner, fiduciary services · Zürich

"We replaced a €3,200/month API bill with a fine-tuned 7B model on a single GPU box. Same task quality for our German-language product texts, fully amortized within the first quarter."

Thomas Eder Head of E-Commerce, retail group · Vienna

Fresh from the blog

What's inside a humanoid robot? Take it apart.

Our most popular guide is fully interactive: explore a humanoid component by component — actuators, IMU, the VLA brain, the physics of balance — each explained on hover or tap. Then see how the same model classes land in your business, minus the legs.

Work order

Request a deployment plan.

Tell us about your use case. We reply within one business day with an honest feasibility assessment — including "this doesn't need AI" when that's the truth.