The idea: fewer bits per weight
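A 7B model stored as 16-bit floats needs about 14 GB for the weights alone. Quantization stores each weight in fewer bits (typically 4 to 8, sometimes as few as 2) plus one scale factor per block of weights, which cuts that to roughly 4–8 GB with surprisingly little quality loss. GGUF is the container format that llama.cpp and Ollama use; the suffix on a GGUF file name (for example, Q4_K_M) tells you the quantization scheme.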
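To make "per-block scale factors" concrete, here is a minimal sketch of symmetric block quantization in plain Python. It is an illustration under stated assumptions, not llama.cpp's implementation: the block size of 32, the helper names, and the random test weights are all chosen for the example, and the K-quant schemes in real GGUF files are more elaborate.

```python
import random

BLOCK_SIZE = 32  # weights per block (assumed for this sketch)

def quantize_block(block, bits=8):
    """Store one float scale plus a small signed integer code per weight."""
    qmax = 2 ** (bits - 1) - 1              # 127 for 8 bits, 7 for 4 bits
    peak = max(abs(w) for w in block)
    scale = peak / qmax if peak > 0 else 1.0
    codes = [round(w / scale) for w in block]
    return scale, codes

def dequantize_block(scale, codes):
    """Recover approximate weights from the scale and the integer codes."""
    return [scale * c for c in codes]

def max_roundtrip_error(weights, bits):
    """Largest absolute error after quantizing and dequantizing every block."""
    worst = 0.0
    for start in range(0, len(weights), BLOCK_SIZE):
        block = weights[start:start + BLOCK_SIZE]
        scale, codes = quantize_block(block, bits)
        restored = dequantize_block(scale, codes)
        worst = max(worst, max(abs(a - b) for a, b in zip(block, restored)))
    return worst

# Hypothetical weights: small values centred on zero, as a stand-in for a real tensor.
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]
err8 = max_roundtrip_error(weights, bits=8)
err4 = max_roundtrip_error(weights, bits=4)
print(f"max error at 8 bits: {err8:.6f}")
print(f"max error at 4 bits: {err4:.6f}")
```

Because each block carries its own scale, one unusually large weight only coarsens the other weights in its block rather than the whole tensor.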
Decoding the suffixes
| Variant | Bits/weight | 7B size | Verdict |
|---|---|---|---|
| Q8_0 | 8.5 | ~7.2 GB | Near-lossless; use if VRAM allows |
| Q5_K_M | 5.5 | ~4.8 GB | Excellent quality/size balance |
| Q4_K_M | 4.8 | ~4.4 GB | The production default |
| Q3_K_M | 3.9 | ~3.5 GB | Noticeable degradation; edge only |
| Q2_K | 2.6 | ~2.7 GB | Demos only — reasoning suffers |
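The fractional bits-per-weight figures come from amortizing each block's scale over its weights, and file size follows from bits per weight times parameter count. The sketch below works through both; the Q8_0 layout used (32 weights, 8-bit codes, one 16-bit scale) and the nominal 7e9 parameter count are assumptions for illustration. Real "7B" models differ in parameter count and keep some tensors at higher precision, so the results will not match the table exactly.

```python
def bits_per_weight(block_size, code_bits, scale_bits):
    """Effective bits per weight once the per-block scale is amortized."""
    return (block_size * code_bits + scale_bits) / block_size

def estimated_size_gb(n_params, bpw):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bpw / 8 / 1e9

N_PARAMS = 7e9  # nominal "7B"; an assumption, real models vary

# Assumed Q8_0-style block: 32 weights, 8-bit codes, one 16-bit scale.
q8_bpw = bits_per_weight(block_size=32, code_bits=8, scale_bits=16)
print(q8_bpw)  # 8.5

# Bits/weight values copied from the table above.
table_bpw = {"Q8_0": 8.5, "Q5_K_M": 5.5, "Q4_K_M": 4.8, "Q3_K_M": 3.9, "Q2_K": 2.6}
for name, bpw in table_bpw.items():
    print(f"{name}: ~{estimated_size_gb(N_PARAMS, bpw):.1f} GB")
```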
Quantizing your own fine-tune
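```bash
# Prerequisite: merge the LoRA adapter into the base model (./merged-model).
# Then convert the merged checkpoint to a 16-bit GGUF file.
python llama.cpp/convert_hf_to_gguf.py ./merged-model \
  --outtype f16 --outfile support-v1-f16.gguf

# Quantize the 16-bit file to Q4_K_M.
./llama.cpp/build/bin/llama-quantize \
  support-v1-f16.gguf support-v1-q4_k_m.gguf Q4_K_M

# Measure what you paid: perplexity on YOUR domain text,
# once for the 16-bit baseline and once for the quantized file.
./llama.cpp/build/bin/llama-perplexity \
  -m support-v1-f16.gguf -f sample_company_docs.txt
./llama.cpp/build/bin/llama-perplexity \
  -m support-v1-q4_k_m.gguf -f sample_company_docs.txt
```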
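A perplexity delta needs two numbers: one from the quantized file and one from the 16-bit baseline evaluated on the same text. The sketch below shows one way to turn two saved logs into a relative increase. The `Final estimate: PPL = ...` line format is an assumption about what your llama-perplexity build prints, so adjust the pattern if yours differs; the log excerpts are hypothetical, not measurements.

```python
import re

# Assumed format of the tool's summary line; adjust if your build prints it differently.
PPL_PATTERN = re.compile(r"Final estimate: PPL = ([0-9]+(?:\.[0-9]+)?)")

def parse_final_ppl(log_text):
    """Extract the final perplexity estimate from a captured log."""
    match = PPL_PATTERN.search(log_text)
    if match is None:
        raise ValueError("no final perplexity estimate found in log")
    return float(match.group(1))

def relative_ppl_increase(baseline_ppl, quant_ppl):
    """Fractional increase in perplexity of the quantized model over the baseline."""
    return (quant_ppl - baseline_ppl) / baseline_ppl

# Hypothetical log excerpts for illustration only.
f16_log = "Final estimate: PPL = 6.0000 +/- 0.03000"
q4_log = "Final estimate: PPL = 6.1500 +/- 0.03100"

delta = relative_ppl_increase(parse_final_ppl(f16_log), parse_final_ppl(q4_log))
print(f"perplexity increase: {delta:.1%}")
```

Perplexity is a proxy, so treat the delta as a screening signal on your own text and confirm anything that matters with task-level evals.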
The honest trade-off
Q4_K_M typically costs 1–3% on benchmark scores and is invisible in most business tasks — but the loss concentrates in long chains of reasoning and rare tokens (think: legal citations). Our rule: ship Q4_K_M, keep a Q8_0 of the same model for the eval harness, and promote any task that shows a measurable gap to the bigger quant. Disk is cheaper than wrong answers.
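The promotion rule can be sketched as a small per-task comparison. Everything specific here is hypothetical: the task names, the scores, and the 0.02 threshold standing in for "measurable gap". Pick a threshold that reflects the noise in your own eval harness.

```python
def choose_quants(scores_q4, scores_q8, min_gap=0.02):
    """Serve each task with Q4_K_M unless Q8_0 beats it by more than min_gap."""
    assignment = {}
    for task, q8_score in scores_q8.items():
        gap = q8_score - scores_q4[task]
        assignment[task] = "Q8_0" if gap > min_gap else "Q4_K_M"
    return assignment

# Hypothetical eval-harness accuracies per task, not real results.
scores_q4 = {"ticket_routing": 0.94, "faq_answers": 0.91, "legal_citations": 0.78}
scores_q8 = {"ticket_routing": 0.94, "faq_answers": 0.92, "legal_citations": 0.86}

plan = choose_quants(scores_q4, scores_q8)
for task, quant in plan.items():
    print(f"{task}: {quant}")
```

With these made-up numbers only the citation-heavy task crosses the threshold, which mirrors where the text above says the loss tends to concentrate.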
Localized AI fine-tunes small open models on your data and deploys them on your hardware — GDPR by architecture, zero per-token costs. Average setup: 72 hours.