The two cost curves
Cloud APIs are a straight line through zero: every token billed, forever, scaling with success. On-prem is a step function: hardware once, then electricity and maintenance. The only honest question is where your volume puts you relative to the crossover.
A worked example: support assistant, 60-person company
Load: 400 requests/day, ~2,500 tokens each (prompt + completion) → ~30M tokens/month including RAG context.
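The load figure is plain multiplication, sketched below so you can swap in your own numbers. The 30-day month and the function name `monthly_tokens` are our assumptions for illustration; the request and token counts are the ones from the example.

```python
# Inputs from the worked example; DAYS_PER_MONTH is an assumption (flat 30-day month).
REQUESTS_PER_DAY = 400
TOKENS_PER_REQUEST = 2_500  # prompt + completion, RAG context included
DAYS_PER_MONTH = 30


def monthly_tokens(requests_per_day: int, tokens_per_request: int,
                   days: int = DAYS_PER_MONTH) -> int:
    """Total tokens processed per month for a steady daily load."""
    return requests_per_day * tokens_per_request * days


tokens = monthly_tokens(REQUESTS_PER_DAY, TOKENS_PER_REQUEST)
print(f"{tokens / 1e6:.0f}M tokens/month")  # 30M tokens/month
```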
Cloud (mid-tier frontier model): blended ≈ €4–8 per 1M tokens with retrieval-inflated prompts → €150–400/month, rising with every new use case, plus the compliance overhead of data leaving your infrastructure.
On-prem: one server with an RTX 4090-class GPU ≈ €4,000 capex; ~250 W average draw at €0.30/kWh ≈ €55/month; maintenance amortized ≈ €40/month → ≈ €95/month after hardware, flat regardless of volume.
# break-even months = capex / (cloud_monthly - onprem_monthly)
capex = 4000
cloud_monthly = 280 # mid estimate, current volume
onprem_monthly = 95
months = capex / (cloud_monthly - onprem_monthly)
print(f"break-even: {months:.1f} months") # ~21.6 months
# now add the second use case on the SAME hardware:
cloud_monthly2 = 280 + 220 # marginal cloud cost of use case 2
print(f"with 2 use cases: {capex / (cloud_monthly2 - onprem_monthly):.1f} months")
# ~9.9 months — marginal on-prem cost of use case 2 is ~zero
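The same figures can also be read as two cumulative curves rather than a single division. The sketch below walks month by month and reports the first whole month in which total on-prem spend drops below total cloud spend. The function name `first_cheaper_month` and the 120-month horizon are our assumptions for illustration; the euro amounts are the ones used above.

```python
def first_cheaper_month(capex, cloud_monthly, onprem_monthly, horizon=120):
    """Return the first whole month where cumulative on-prem cost is
    below cumulative cloud cost, or None if that never happens within
    `horizon` months (an assumed cut-off)."""
    for month in range(1, horizon + 1):
        cloud_total = cloud_monthly * month              # straight line through zero
        onprem_total = capex + onprem_monthly * month    # step up front, then flat slope
        if onprem_total < cloud_total:
            return month
    return None


print(first_cheaper_month(4000, 280, 95))  # 22
print(first_cheaper_month(4000, 500, 95))  # 10
print(first_cheaper_month(4000, 90, 95))   # None
```

The last call shows the case the division hides: if the cloud bill is lower than the on-prem running cost, the curves never cross and the hardware never pays for itself.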
What the spreadsheet usually misses
For cloud: prompt bloat (RAG can triple token counts), retries, and the legal cost of transfer assessments. For on-prem: someone must own the box (we price that honestly into our Business plan), and a task that genuinely needs the reasoning of a 100B+ parameter frontier model shouldn't be forced onto a 7B model.
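To see how quickly the cloud-side items compound, here is a rough sketch of billed tokens once RAG context and retries are counted. The 3x multiplier mirrors the "triples" figure above; the 5% retry rate, the 10M-token baseline, and the name `billed_tokens` are purely illustrative assumptions, not measurements.

```python
def billed_tokens(base_tokens: float, rag_multiplier: float = 3.0,
                  retry_rate: float = 0.05) -> float:
    """Estimate billed tokens per month.

    base_tokens    -- tokens before retrieval context is added (assumed input)
    rag_multiplier -- prompt growth from RAG context (3x, per the text)
    retry_rate     -- fraction of requests sent twice (hypothetical 5%)
    """
    return base_tokens * rag_multiplier * (1 + retry_rate)


base = 10_000_000  # hypothetical pre-RAG monthly volume
print(f"{billed_tokens(base) / 1e6:.1f}M billed tokens/month")  # 31.5M billed tokens/month
```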
The decision rule we give customers
Steady volume + sensitive data + tasks a fine-tuned ≤7B handles (most extraction, drafting, support and search) → on-prem wins within 6–24 months and compounds with every added use case. Spiky volume, public data, frontier reasoning → cloud, or the hybrid where cloud orchestrates and on-prem touches the documents. It's arithmetic, not ideology — we'll show you the sheet with your numbers in the first call.
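Written as code, the rule is a three-input check. This is a minimal sketch, not a scoring tool: the function name `recommend` and the boolean inputs are our own illustration, and the rule above only covers the two clear-cut profiles, so the fallback for mixed profiles is an assumption.

```python
def recommend(steady_volume: bool, sensitive_data: bool,
              fits_small_model: bool) -> str:
    """Map the three criteria from the decision rule to a recommendation."""
    if steady_volume and sensitive_data and fits_small_model:
        return "on-prem"
    if not (steady_volume or sensitive_data or fits_small_model):
        # Spiky volume, public data, frontier reasoning.
        return "cloud or hybrid"
    # Mixed profiles are not covered by the rule; this fallback is an assumption.
    return "mixed: run the break-even numbers"


print(recommend(True, True, True))     # on-prem
print(recommend(False, False, False))  # cloud or hybrid
print(recommend(True, False, True))    # mixed: run the break-even numbers
```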