The idea: fewer bits per weight

A 7B model in 16-bit floats needs ~14 GB (7 billion weights × 2 bytes each). Quantization stores weights in 4–8 bits with per-block scale factors, cutting that to roughly 4–8 GB with surprisingly little quality loss. GGUF is the container format llama.cpp and Ollama use; the suffix on a quant's name (Q4_K_M, Q8_0, and so on) tells you the scheme.
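To make "per-block scale factors" concrete, here is a minimal Python sketch of the simplest scheme, modelled on Q8_0: 32 weights share one scale, and each weight is stored as a signed 8-bit code. The function names are illustrative, not llama.cpp's API, and the scale is kept as a plain Python float rather than the 16-bit float a real file would store. The K-quants (Q4_K_M and friends) use a more elaborate layout that this sketch does not attempt to reproduce.

```python
BLOCK_SIZE = 32  # weights that share one scale factor (Q8_0-style block)


def quantize_block(weights):
    """Map one block of floats to (scale, list of codes in [-127, 127])."""
    amax = max(abs(w) for w in weights)
    # An all-zero block has no meaningful scale; 1.0 is an arbitrary choice.
    scale = amax / 127 if amax > 0 else 1.0
    codes = [round(w / scale) for w in weights]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from a scale and its integer codes."""
    return [scale * c for c in codes]


def bits_per_weight(code_bits=8, scale_bits=16, block_size=BLOCK_SIZE):
    """Storage cost per weight, including the amortized per-block scale."""
    return code_bits + scale_bits / block_size


def model_size_gb(n_params, bpw):
    """Approximate file size in GB for n_params weights at bpw bits each."""
    return n_params * bpw / 8 / 1e9


if __name__ == "__main__":
    block = [0.01 * (i - 16) for i in range(BLOCK_SIZE)]
    scale, codes = quantize_block(block)
    restored = dequantize_block(scale, codes)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print("worst round-trip error:", worst)
    print("bits per weight:", bits_per_weight())  # 8.5
```

With 8 bits per code plus a 16-bit scale shared by 32 weights, the cost works out to 8 + 16/32 = 8.5 bits per weight. The rounding error on any single weight is at most half a scale step, which is why a block with one large outlier loses more precision than a block of similar-sized weights.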

Decoding the suffixes

Variant   Bits/weight   7B size   Verdict
Q8_0      8.5           ~7.2 GB   Near-lossless; use if VRAM allows
Q5_K_M    5.5           ~4.8 GB   Excellent quality/size balance
Q4_K_M    4.8           ~4.4 GB   The production default
Q3_K_M    3.9           ~3.5 GB   Noticeable degradation; edge only
Q2_K      2.6           ~2.7 GB   Demos only; reasoning suffers
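One way to read the table is as a lookup: given a memory budget, take the largest variant that still fits. The sketch below does that in Python using the 7B sizes listed above. The `headroom_gb` parameter is an assumption standing in for KV cache and runtime overhead, which depend on context length and backend, so treat the 1 GB default as a placeholder, not a measured value.

```python
# Approximate 7B file sizes in GB, copied from the table above.
VARIANT_SIZE_GB = {
    "Q8_0": 7.2,
    "Q5_K_M": 4.8,
    "Q4_K_M": 4.4,
    "Q3_K_M": 3.5,
    "Q2_K": 2.7,
}


def pick_variant(memory_gb, headroom_gb=1.0):
    """Return the largest variant that fits in memory_gb, or None.

    headroom_gb is a hypothetical allowance for KV cache and runtime
    overhead; tune it for your context length and backend.
    """
    budget = memory_gb - headroom_gb
    fitting = {name: size for name, size in VARIANT_SIZE_GB.items()
               if size <= budget}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


if __name__ == "__main__":
    for gb in (4, 6, 8, 12):
        print(gb, "GB ->", pick_variant(gb))
```

Picking by size alone ignores the verdict column: a Q3_K_M that fits is still a step down in quality, so on tight hardware a smaller model at Q4_K_M may be the better option.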

Quantizing your own fine-tune

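# Assumes the LoRA adapter is already merged into the base model at ./merged-model.
# Step 1: convert the merged Hugging Face checkpoint to a 16-bit GGUF file
python llama.cpp/convert_hf_to_gguf.py ./merged-model \
       --outtype f16 --outfile support-v1-f16.gguf

# Step 2: quantize the f16 GGUF down to Q4_K_M
./llama.cpp/build/bin/llama-quantize \
       support-v1-f16.gguf support-v1-q4_k_m.gguf Q4_K_M

# Step 3: measure what you paid: perplexity delta on YOUR domain text.
# Run the f16 baseline first, then the quantized file, and compare the two.
./llama.cpp/build/bin/llama-perplexity \
       -m support-v1-f16.gguf -f sample_company_docs.txt
./llama.cpp/build/bin/llama-perplexity \
       -m support-v1-q4_k_m.gguf -f sample_company_docs.txt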
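A perplexity number only means something next to a baseline, so the useful figure is the relative change between the f16 file and the quantized one on the same text. The Python sketch below extracts the final estimate from captured `llama-perplexity` output and computes that change. It assumes the tool ends with a line of the form `Final estimate: PPL = <value> +/- <error>`; check this against your build's output before relying on it. The sample strings in the usage section are made up for illustration, not measured results.

```python
import re

# Assumed shape of the summary line; verify against your llama.cpp build.
PPL_LINE = re.compile(r"Final estimate: PPL = ([0-9.]+)")


def parse_ppl(output):
    """Extract the final perplexity estimate from captured tool output."""
    match = PPL_LINE.search(output)
    if match is None:
        raise ValueError("no 'Final estimate: PPL = ...' line found")
    return float(match.group(1))


def ppl_delta_percent(baseline_ppl, quant_ppl):
    """Relative perplexity increase of the quantized model, in percent."""
    return (quant_ppl - baseline_ppl) / baseline_ppl * 100.0


if __name__ == "__main__":
    # Hypothetical captured outputs, for illustration only.
    f16_out = "...\nFinal estimate: PPL = 6.0000 +/- 0.03000\n"
    q4_out = "...\nFinal estimate: PPL = 6.1200 +/- 0.03100\n"
    delta = ppl_delta_percent(parse_ppl(f16_out), parse_ppl(q4_out))
    print(f"perplexity increase: {delta:.2f}%")
```

Both runs have to use the same text file and the same context settings, otherwise the two numbers are not comparable.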

The honest trade-off

Q4_K_M typically costs 1–3% on benchmark scores and is invisible in most business tasks, but the loss concentrates in long chains of reasoning and rare tokens (think: legal citations). Our rule: ship Q4_K_M, keep a Q8_0 build of the same model in the eval harness, and move any task that shows a measurable gap between the two over to the larger quant. Disk is cheaper than wrong answers.
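The promotion rule can be written down as a small comparison over per-task eval scores. The Python sketch below assumes the harness produces one score per task for each quant, with higher being better. The task names, the scores, and the `min_gap` threshold are all hypothetical; what counts as "measurable" depends on how noisy your evals are.

```python
def tasks_to_promote(q4_scores, q8_scores, min_gap=0.02):
    """Return the tasks where Q8_0 beats Q4_K_M by more than min_gap.

    Both arguments map task name -> score (higher is better). Tasks that
    are missing from either mapping are skipped, not guessed at.
    """
    promoted = []
    for task, q8 in q8_scores.items():
        q4 = q4_scores.get(task)
        if q4 is None:
            continue
        if q8 - q4 > min_gap:
            promoted.append(task)
    return sorted(promoted)


if __name__ == "__main__":
    # Hypothetical eval results, for illustration only.
    q8 = {"faq": 0.91, "legal_citations": 0.88, "summaries": 0.90}
    q4 = {"faq": 0.90, "legal_citations": 0.80, "summaries": 0.895}
    print(tasks_to_promote(q4, q8))
```

Everything not on the returned list stays on Q4_K_M, so only the tasks that demonstrably need the extra bits pay for them.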
