On-premises AI · Built for the DACH Mittelstand

The AI that never leaves the building.

We fine-tune small open models on your data and install them inside your VPN. No cloud. No per-token bills. No data crossing your perimeter — ever.

Plan my deployment See the process

Fig. 1 — Inference stays inside. The boundary is not a policy. It is the architecture.

Ongoing API cost: €0.00

Process

From audit to go-live in three controlled steps.

Discovery

Use-case audit & feasibility

A structured workshop maps your use case, data landscape and compliance requirements. You leave with a concrete feasibility verdict and a fixed quote, not a vague "it depends".

Tuning

Model selection & fine-tuning

We benchmark candidate models against your real documents and fine-tune the winner on your proprietary data, either inside your environment or on an air-gapped staging machine.

Sealing

VPN deployment & handover

The model goes live on your hardware, reachable only inside your VPN. We hand over weights, documentation, monitoring and a trained team. The system is yours — outright.

Services

Everything required to own your AI outright.

Core offering

Custom AI Model Deployment

A small open model — fine-tuned on your contracts, tickets, manuals or product data — running on a single GPU server in your rack. It answers in your terminology, follows your policies, and costs nothing per request. Most deployments replace four-figure monthly cloud-API bills with hardware you already own.

Use-case-specific fine-tuning on proprietary data
Runs on modest hardware — one workstation GPU is often enough
Full handover: weights, configs and documentation belong to you

VPN Infrastructure

We design and harden the network layer: WireGuard or IPsec tunnels, reverse proxies, certificates and access control — so the model is reachable for your team and invisible to everyone else.
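To make the tunnel layout concrete, here is a minimal sketch of how a WireGuard server config for a small team might be generated. The subnet `10.8.0.0/24`, the port `51820`, the peer names and the key placeholders are illustrative assumptions, not values from a real deployment. Key generation, certificates and the reverse proxy are deliberately left out.

```python
import ipaddress


def render_wireguard_server_config(tunnel_cidr, listen_port, peers):
    """Render a wg-quick style server config with one /32 address per peer.

    peers is a list of (name, public_key) tuples; all values are hypothetical.
    """
    network = ipaddress.ip_network(tunnel_cidr)
    hosts = iter(network.hosts())
    server_ip = next(hosts)  # first usable address goes to the server
    lines = [
        "[Interface]",
        f"Address = {server_ip}/{network.prefixlen}",
        f"ListenPort = {listen_port}",
        "PrivateKey = <server-private-key>",  # placeholder, never commit real keys
    ]
    for name, public_key in peers:
        peer_ip = next(hosts)  # next free address in the tunnel subnet
        lines += [
            "",
            "[Peer]",
            f"# {name}",
            f"PublicKey = {public_key}",
            f"AllowedIPs = {peer_ip}/32",  # peer may only use its own address
        ]
    return "\n".join(lines) + "\n"


config = render_wireguard_server_config(
    "10.8.0.0/24",
    51820,
    [("alice-laptop", "<alice-public-key>"), ("bob-laptop", "<bob-public-key>")],
)
print(config)
```

Giving every peer a single `/32` in `AllowedIPs` keeps the mapping between a device and its tunnel address explicit, which makes later access-control rules easier to reason about.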

Team Enablement

Hands-on workshops for your staff: prompt patterns for your specific model, escalation rules, and admin training so your IT team can operate and update the system independently.

Model Optimization

Quantization, speculative decoding and context-window tuning squeeze maximum throughput out of your hardware, often delivering 3–5× faster inference with little to no measurable quality loss.
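The effect of quantization on memory can be sketched with simple arithmetic: weight size is roughly parameter count times bits per weight. The example below assumes an effective 4.5 bits per weight for a 4-bit format (4-bit values plus per-block scale factors); the function name and the figures are illustrative, and the estimate covers weights only, so KV cache and runtime overhead come on top.

```python
def weight_footprint_gb(params_billion, bits_per_weight):
    """Approximate size of the model weights alone, in GB (10^9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# A 7B model stored in 16-bit floats versus an assumed 4.5-bit quantized format.
fp16_gb = weight_footprint_gb(7, 16)
q4_gb = weight_footprint_gb(7, 4.5)

print(f"7B at 16 bit:  {fp16_gb:.1f} GB")
print(f"7B at ~4.5 bit: {q4_gb:.1f} GB")
print(f"Reduction: {fp16_gb / q4_gb:.1f}x")
```

Under these assumptions the weights shrink from about 14 GB to about 4 GB, which is the difference between needing a data-center card and fitting on a single workstation GPU.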

Component catalogue

Small models. Serious capability.

We don't chase the biggest model — we specify the smallest one that solves your problem brilliantly. Current approved components:

Qwen 2.5

REF Q25 · 0.5–7B

The all-rounder. Outstanding multilingual quality — including German — and strong reasoning at tiny sizes. Default for assistants and document Q&A.

Reasoning: 88
German fluency: 92
Efficiency: 90

License: Apache 2.0 (the 3B size ships under the Qwen Research License)
VRAM (7B, Q4): ~5 GB
Context: up to 128k tokens (7B; 32k for 0.5B–3B)
Best for: Assistants, document Q&A, multilingual support

Mistral 7B

REF M7B · 7B

European engineering, Apache-licensed. Fast, predictable and a proven base for fine-tuning on industry-specific corpora.

Reasoning: 84
Fine-tune fit: 93
Efficiency: 86

License: Apache 2.0
VRAM (Q4): ~5 GB
Context: 32k tokens
Best for: Industry fine-tunes, EU-sovereignty requirements

Phi-4 Mini

REF P4M · 3.8B

Punches far above its weight on logic and math. Ideal when reasoning quality matters and hardware budget is tight.

Reasoning: 91
Math & logic: 94
Efficiency: 95

License: MIT
VRAM (Q4): ~3 GB
Context: 128k tokens
Best for: Reasoning on small hardware, structured extraction

Llama 3.2

REF L32 · 1–3B

Meta's compact line with a huge tooling ecosystem. Excellent for edge devices and lightweight internal copilots.

Reasoning: 80
Ecosystem: 96
Efficiency: 92

License: Llama Community
VRAM (3B, Q4): ~2.5 GB
Context: 128k tokens
Best for: Edge devices, lightweight copilots

DeepSeek R1

REF R1D · 1.5–14B

Distilled reasoning models that think step by step. Our pick for analysis, code review and complex multi-stage workflows.

Reasoning: 96
Code & analysis: 90
Efficiency: 78

License: MIT
VRAM (14B, Q4): ~10 GB
Context: 64k tokens
Best for: Analysis, code review, multi-stage workflows

Gemma 3

REF G3 · 1–4B

Google DeepMind lineage with strong instruction following and vision variants. A solid pick for mixed text-and-image intake.

Reasoning: 85
Vision intake: 88
Efficiency: 91

License: Gemma Terms
VRAM (4B, Q4): ~3.5 GB
Context: 128k tokens (4B; 32k for the text-only 1B)
Best for: Mixed text + image pipelines, form intake

SmolLM2

REF SL2 · 135M–1.7B

Hugging Face's fully open tiny models. When the task is narrow and latency is everything, smaller wins.

Reasoning: 68
Latency: 99
Efficiency: 99

License: Apache 2.0
VRAM (1.7B, Q4): ~1.5 GB
Context: 8k tokens
Best for: Classification, routing, autocomplete, CPU-only hosts

nomic-embed

REF NE1 · 137M

The retrieval backbone. Turns your documents into searchable vectors entirely on-prem — the foundation of every private RAG system we build.

Retrieval: 89
Long context: 87
Efficiency: 98

License: Apache 2.0
VRAM: <1 GB (CPU fine)
Context: 8k tokens
Best for: Private RAG, semantic search, deduplication
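As a rough illustration of what "searchable vectors" means, the sketch below ranks documents by cosine similarity to a query vector. The three-dimensional vectors and document names are made up for readability; a real embedding model produces vectors with hundreds of dimensions, and a production system would typically use a vector index rather than a linear scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the ids of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical embeddings; in practice these come from the embedding model.
doc_vecs = {
    "maintenance-manual": [0.9, 0.1, 0.0],
    "holiday-policy": [0.0, 0.2, 0.9],
    "spare-parts-list": [0.8, 0.3, 0.1],
}
query_vec = [1.0, 0.2, 0.0]

results = top_k(query_vec, doc_vecs, k=2)
print(results)
```

The retrieved documents are then passed to the language model as context, which is the basic retrieval step behind a private RAG setup.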

Mixtral 8x7B

REF MX8 · 47B MoE

Mixture-of-experts: 47B of knowledge, only ~13B active per token. Near-large-model quality at mid-size speed — when a 7B tops out, this is the next rung.

Reasoning: 93
Throughput: 84
Efficiency: 80

License: Apache 2.0
VRAM (Q4): ~28 GB
Context: 32k tokens
Best for: Complex analysis, multi-step drafting, hard German prose
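The trade-off behind a mixture-of-experts model can be sketched numerically: memory scales with all parameters, while per-token compute scales with the active ones. The sketch assumes roughly 2 FLOPs per active parameter for a forward pass and an effective 4.5 bits per weight at Q4; both are common rules of thumb rather than measured values, and the function name is hypothetical.

```python
def moe_profile(total_params_b, active_params_b, bits_per_weight=4.5):
    """Return (weights in GB, forward-pass GFLOPs per token).

    Parameter counts are in billions, so billions * bits / 8 gives GB directly.
    """
    weights_gb = total_params_b * bits_per_weight / 8
    gflops_per_token = 2 * active_params_b  # rule of thumb: 2 FLOPs per active weight
    return weights_gb, gflops_per_token


# Mixtral-style MoE: 47B total, ~13B active per token.
moe_gb, moe_gflops = moe_profile(47, 13)
# A hypothetical dense model of the same total size, for comparison.
dense_gb, dense_gflops = moe_profile(47, 47)

print(f"MoE:   {moe_gb:.1f} GB weights, {moe_gflops} GFLOPs/token")
print(f"Dense: {dense_gb:.1f} GB weights, {dense_gflops} GFLOPs/token")
```

Under these assumptions both models need the same memory for weights, but the MoE does well under a third of the per-token compute, which is where the mid-size speed comes from.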

Mistral Small 3

REF MS3 · 24B

The dense workhorse above 7B: strong instruction following and function calling with single-GPU deployability on a 24 GB card.

Reasoning: 92
Function calling: 91
Efficiency: 82

License: Apache 2.0
VRAM (Q4): ~15 GB
Context: 32k tokens
Best for: Agent backends, demanding assistants, tool use

Qwen 2.5 Coder

REF QC7 · 1.5–7B

Purpose-trained on code: completion, review, SQL generation. Your developers get an assistant that never uploads a line of proprietary source.

Code quality: 92
SQL & APIs: 90
Efficiency: 90

License: Apache 2.0 (the 3B size ships under the Qwen Research License)
VRAM (7B, Q4): ~5 GB
Context: up to 128k tokens (7B; 32k for 1.5B and 3B)
Best for: In-house coding assistant, SQL generation, code review

IBM Granite 3

REF GR3 · 2–8B

Enterprise-bred: trained on curated, legally vetted data with published provenance. The model your legal department approves fastest.

RAG fidelity: 90
Compliance story: 96
Efficiency: 88

License: Apache 2.0
VRAM (8B, Q4): ~5.5 GB
Context: 128k tokens (Granite 3.1 and later; 3.0 shipped with 4k)
Best for: Regulated industries, enterprise RAG, audited deployments

OLMo 2

REF OL2 · 7–13B

AllenAI's fully open model — weights, training data and code all published. When auditors ask "what is this trained on?", this one has an answer.

Reasoning: 86
Auditability: 99
Efficiency: 84

License: Apache 2.0
VRAM (7B, Q4): ~5 GB
Context: 4k tokens
Best for: Maximum-transparency deployments, research-adjacent work

Whisper Large v3

REF WH3 · 1.5B

OpenAI's speech-to-text, MIT-licensed and excellent in German. Meetings, support calls and voice notes transcribed without audio ever leaving the building.

German STT: 94
Noise robustness: 88
Efficiency: 86

License: MIT
VRAM: ~3 GB
Modality: Audio → text, 99 languages
Best for: Meeting transcription, call analysis, dictation

BGE-M3

REF BG3 · 568M

The multilingual retrieval heavyweight: dense, sparse and multi-vector search in one model, 100+ languages. Our pick when German and English documents mix.

Multilingual retrieval: 95
Hybrid search: 93
Efficiency: 90

License: MIT
VRAM: ~1.5 GB (CPU fine)
Context: 8k tokens
Best for: Multilingual RAG, hybrid dense+sparse retrieval

Security & compliance

GDPR isn't a feature here. It's the floor plan.

outbound-audit — your-server:~

Zero external APIs

No OpenAI, no cloud endpoints, no third-party processors. Inference happens on your silicon — there is simply no wire for data to leave on.

VPN-enclosed inference

The model is only reachable through your private network. Off-VPN, it doesn't exist. Access maps directly to your existing identity management.

GDPR by architecture

No data transfer means no transfer agreements, no US-cloud legal gymnastics, no Schrems headaches. Your DPO signs off in one meeting.

No telemetry

Nothing phones home — not the model, not the runtime, not our tooling. We verify it with outbound traffic audits and document it for your records.
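One simple way to picture an outbound traffic audit is to take a list of observed connections and flag every destination that is not a private, loopback or link-local address. The sketch below is a hypothetical illustration of that idea, not the tooling used in an actual audit; the process names and addresses are invented, and collecting the connection list (from a firewall log or packet capture) is assumed to happen elsewhere.

```python
import ipaddress


def external_destinations(connections):
    """Return the (process, ip) pairs whose destination lies outside private ranges."""
    flagged = []
    for process, dest_ip in connections:
        addr = ipaddress.ip_address(dest_ip)
        internal = addr.is_private or addr.is_loopback or addr.is_link_local
        if not internal:
            flagged.append((process, dest_ip))
    return flagged


# Hypothetical connection log: (process name, destination IP).
observed = [
    ("inference-server", "10.8.0.12"),   # VPN peer
    ("inference-server", "127.0.0.1"),   # local health check
    ("unknown-agent", "8.8.8.8"),        # public address, should be flagged
]

violations = external_destinations(observed)
print(violations)
```

An empty result over a representative observation window is the kind of evidence that can be attached to the audit record.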

Pricing

One-time engineering. Zero rent.

All prices net. After handover the model is yours — run it for years at the cost of electricity.

Starter

€1,990 one-time

A focused proof of value for one use case.

1 fine-tuned model, 1 use case
Deployment on your existing hardware
Basic VPN access configuration
Handover documentation
30 days email support

Start with Starter

Recommended

Business

€4,490 + €490/mo

Production-grade private AI with ongoing care.

Up to 3 fine-tuned models / use cases
Private RAG over your document base
Hardened VPN infrastructure setup
Monthly monitoring, updates & re-tuning
Team workshop (half-day, on-site or remote)
Priority support, 8h response

Choose Business

Enterprise

Custom, scoped per project

Multi-site, multi-model, audit-heavy environments.

Unlimited models & use cases
High-availability & air-gapped options
Compliance documentation pack (GDPR, ISO)
Dedicated engineer & SLA
Custom integrations (ERP, DMS, ticketing)

Talk to us
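To see how one-time pricing compares with a recurring API bill, the break-even point can be estimated in a few lines. The sketch uses the Starter and Business prices listed above and the €3,200 monthly API bill quoted in one of the acceptance reports; the €60 monthly electricity cost is a purely hypothetical figure, and hardware is assumed to be already owned.

```python
import math


def months_to_break_even(one_time_fee, monthly_fee, monthly_power_cost,
                         replaced_monthly_bill):
    """Whole months until cumulative savings cover the one-time fee.

    Returns None when the on-prem setup does not save money each month.
    """
    monthly_saving = replaced_monthly_bill - monthly_fee - monthly_power_cost
    if monthly_saving <= 0:
        return None
    return math.ceil(one_time_fee / monthly_saving)


API_BILL = 3200   # EUR/month, taken from a customer quote on this page
POWER = 60        # EUR/month, assumed for illustration only

starter = months_to_break_even(1990, 0, POWER, API_BILL)
business = months_to_break_even(4490, 490, POWER, API_BILL)

print(f"Starter breaks even after {starter} month(s)")
print(f"Business breaks even after {business} month(s)")
```

With a much smaller cloud bill the function returns `None`, which is a useful reminder that the calculation only favors on-prem when there is meaningful recurring spend to replace.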

Acceptance reports

Signed off across DACH.

"Our support team answers tickets with a model trained on twelve years of our own resolutions. Quality went up, and our customer data never touched a cloud. The works council approved it in a single session."

Markus Steiner, IT Director, machinery manufacturer · Stuttgart

"As a Swiss fiduciary we simply cannot send client documents to US providers. Localized AI gave us a contract-analysis assistant that runs in our own server room. Setup to go-live took four days."

Claudia Brunner, Managing Partner, fiduciary services · Zürich

"We replaced a €3,200/month API bill with a fine-tuned 7B model on a single GPU box. Same task quality for our German-language product texts, fully amortized within the first quarter."

Thomas Eder, Head of E-Commerce, retail group · Vienna

Fresh from the blog

What's inside a humanoid robot? Take it apart.

Our most popular guide is fully interactive: explore a humanoid component by component (actuators, IMU, the VLA brain, the physics of balance), each explained on hover or tap. Then see how the same model classes land in your business, minus the legs.

Explore the robot All 26 guides

Work order

Request a deployment plan.

Tell us about your use case. We reply within one business day with an honest feasibility assessment — including "this doesn't need AI" when that's the truth.
