On-Prem AI vs. Cloud APIs: The Honest Cost Math for an SME

The two cost curves

Cloud APIs are a straight line through zero: every token billed, forever, scaling with success. On-prem is a step function: hardware once, then electricity and maintenance. The only honest question is where your volume puts you relative to the crossover.
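To make the two shapes concrete, here is a minimal sketch of both curves as cumulative spend, plus a helper that finds the first month in which the cloud line reaches the on-prem step. The function names and the placeholder figures are illustrative assumptions, chosen only to show the shape.

```python
from typing import Optional


def cloud_cumulative(month: int, monthly_bill: float) -> float:
    """Cloud spend after `month` months: a straight line through zero."""
    return monthly_bill * month


def onprem_cumulative(month: int, capex: float, monthly_running: float) -> float:
    """On-prem spend after `month` months: a one-off step, then a shallow slope."""
    return capex + monthly_running * month


def crossover_month(capex: float, monthly_bill: float, monthly_running: float,
                    horizon: int = 120) -> Optional[int]:
    """First whole month in which cumulative cloud spend reaches on-prem spend."""
    for month in range(1, horizon + 1):
        cloud = cloud_cumulative(month, monthly_bill)
        onprem = onprem_cumulative(month, capex, monthly_running)
        if cloud >= onprem:
            return month
    return None  # the cloud line never catches up within the horizon


# Placeholder figures, not taken from any quote.
print(crossover_month(capex=1000, monthly_bill=150, monthly_running=50))  # 10
print(crossover_month(capex=1000, monthly_bill=40, monthly_running=50))   # None
```

The second call is the case people forget: if the cloud bill is lower than the on-prem running cost, there is no crossover at all.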

A worked example: support assistant, 60-person company

Load: 400 requests/day at ~2,500 tokens each (prompt, retrieved RAG context and completion) → ~1M tokens/day, or ~30M tokens/month.

Cloud (mid-tier frontier model): blended ≈ €4–8 per 1M tokens → €120–240/month for 30M tokens at list price, and more like €150–400/month once retries and retrieval-inflated prompts are counted, rising with every new use case, plus the compliance overhead of data leaving your network.
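The cloud side is a two-line multiplication, sketched below with the load and price range from this example. The flat 30-day month and the function names are assumptions. The result is the raw product of volume and price, before any allowance for retries or further prompt growth.

```python
REQUESTS_PER_DAY = 400
TOKENS_PER_REQUEST = 2_500   # prompt + completion
DAYS_PER_MONTH = 30          # assumption: flat 30-day month


def monthly_tokens_millions(requests_per_day: int, tokens_per_request: int,
                            days: int = DAYS_PER_MONTH) -> float:
    """Monthly token volume, in millions of tokens."""
    return requests_per_day * tokens_per_request * days / 1_000_000


def cloud_bill(tokens_millions: float, price_per_million: float) -> float:
    """Monthly bill in EUR at a blended per-million-token price."""
    return tokens_millions * price_per_million


volume = monthly_tokens_millions(REQUESTS_PER_DAY, TOKENS_PER_REQUEST)
low, high = cloud_bill(volume, 4), cloud_bill(volume, 8)
print(f"{volume:.0f}M tokens/month -> EUR {low:.0f} to {high:.0f} at list price")
# 30M tokens/month -> EUR 120 to 240 at list price
```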

On-prem: one server with an RTX 4090-class GPU ≈ €4,000 capex; ~250 W average draw (≈ 183 kWh/month) at €0.30/kWh ≈ €55/month; maintenance amortized ≈ €40/month → ≈ €95/month after hardware, flat regardless of volume until the card's throughput is exhausted.
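The electricity line is worth checking by hand, because it is the only on-prem cost that depends on how hard the box works. A minimal sketch, assuming a 730-hour month (8,760 hours per year divided by 12): at €0.30/kWh, an average draw of about 250 W is what reproduces the ≈ €55 figure. The function names are illustrative.

```python
HOURS_PER_MONTH = 730  # assumption: 8,760 hours per year / 12


def monthly_power_cost(avg_watts: float, eur_per_kwh: float,
                       hours: float = HOURS_PER_MONTH) -> float:
    """Electricity cost in EUR for a machine running all month."""
    kwh = avg_watts / 1000 * hours
    return kwh * eur_per_kwh


def onprem_monthly(avg_watts: float, eur_per_kwh: float, maintenance: float) -> float:
    """Flat monthly running cost: electricity plus amortized maintenance."""
    return monthly_power_cost(avg_watts, eur_per_kwh) + maintenance


power = monthly_power_cost(250, 0.30)
total = onprem_monthly(250, 0.30, 40)
print(f"power: EUR {power:.2f}/month, running total: EUR {total:.2f}/month")
# power: EUR 54.75/month, running total: EUR 94.75/month
```

Plug in your own measured wattage and tariff. A box that sits at full load around the clock will land noticeably higher.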

# break-even months = capex / (cloud_monthly - onprem_monthly)
capex          = 4000         # EUR, one-off hardware cost
cloud_monthly  = 280          # EUR/month, mid estimate, current volume
onprem_monthly = 95           # EUR/month, electricity + maintenance

months = capex / (cloud_monthly - onprem_monthly)
print(f"break-even: {months:.1f} months")     # ~21.6 months

# now add the second use case on the SAME hardware:
cloud_monthly2 = cloud_monthly + 220    # marginal cloud cost of use case 2
months2 = capex / (cloud_monthly2 - onprem_monthly)
print(f"with 2 use cases: {months2:.1f} months")
# ~9.9 months, because the marginal on-prem cost of use case 2 is ~zero

What the spreadsheet usually misses

For cloud: prompt bloat (RAG can triple token counts), retries, and the legal cost of transfer assessments. For on-prem: someone must own the box (we price that honestly into our Business plan), and a task that genuinely needs the reasoning of a 100B+ parameter frontier model shouldn't be forced onto a 7B model.
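One way to put those missing lines back into the sheet is to treat them as multipliers and add-ons on each side. The sketch below is an assumption about how you might model it, not our pricing formula. Apart from the €95 running cost from the worked example, every input (retry rate, legal cost, owner hours, hourly rate) is a hypothetical placeholder to replace with your own.

```python
def adjusted_cloud_bill(base_tokens_millions: float, price_per_million: float,
                        rag_multiplier: float = 1.0, retry_rate: float = 0.0,
                        legal_monthly: float = 0.0) -> float:
    """Monthly cloud bill after prompt bloat, retries and amortized legal cost."""
    billed_tokens = base_tokens_millions * rag_multiplier * (1 + retry_rate)
    return billed_tokens * price_per_million + legal_monthly


def adjusted_onprem_bill(running_monthly: float, owner_hours: float,
                         hourly_rate: float) -> float:
    """Monthly on-prem cost once the person who owns the box is priced in."""
    return running_monthly + owner_hours * hourly_rate


# Hypothetical inputs: 10M tokens before retrieval, EUR 6 per 1M tokens.
naive = adjusted_cloud_bill(10, 6)
realistic = adjusted_cloud_bill(10, 6, rag_multiplier=3,
                                retry_rate=0.05, legal_monthly=50)
# 95 is the running cost from the worked example; hours and rate are placeholders.
onprem = adjusted_onprem_bill(95, owner_hours=2, hourly_rate=60)

print(f"cloud naive: {naive:.0f}, cloud adjusted: {realistic:.0f}, "
      f"on-prem with owner: {onprem:.0f}")
```

Both sides move when the hidden costs are added, which is the point: the comparison is only honest if you adjust both.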

The decision rule we give customers

Steady volume + sensitive data + tasks a fine-tuned ≤7B model handles (most extraction, drafting, support and search) → on-prem wins within 6–24 months and compounds with every added use case. Spiky volume, public data, frontier reasoning → cloud, or the hybrid where cloud orchestrates and on-prem touches the documents. It's arithmetic, not ideology: we'll show you the sheet with your numbers in the first call.
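Written as code, the rule is three yes/no questions. The sketch below is a hypothetical encoding: the `Workload` type and `recommend` function are illustrative names. The rule above only spells out the two clear-cut cases, so sending mixed answers to "hybrid" is an assumption on our part.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    steady_volume: bool      # False means spiky, hard-to-predict load
    sensitive_data: bool     # False means public data
    fits_small_model: bool   # True if a fine-tuned <=7B model handles the task


def recommend(w: Workload) -> str:
    """Map the three signals to a deployment recommendation."""
    if w.steady_volume and w.sensitive_data and w.fits_small_model:
        return "on-prem"
    if not (w.steady_volume or w.sensitive_data or w.fits_small_model):
        return "cloud"
    # Assumption: mixed signals default to the hybrid setup, where cloud
    # orchestrates and on-prem touches the documents.
    return "hybrid"


print(recommend(Workload(True, True, True)))     # on-prem
print(recommend(Workload(False, False, False)))  # cloud
print(recommend(Workload(True, True, False)))    # hybrid
```

Treat the output as the start of the conversation; the break-even arithmetic with your own numbers still decides.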

Want this running inside your own VPN?

Localized AI fine-tunes small open models on your data and deploys them on your hardware — GDPR by architecture, zero per-token costs. Average setup: 72 hours.
