Quantization Demystified: How a 7B Model Fits in 5 GB Without Getting Dumb

The idea: fewer bits per weight

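A 7B model in 16-bit floats needs ~14 GB (7 billion weights × 2 bytes each). Quantization stores each weight in fewer bits, typically 4 to 8, plus one scale factor shared by each small block of weights. That cuts the file to roughly 4–8 GB with surprisingly little quality loss. GGUF is the file format llama.cpp and Ollama use; the suffix on a model file (Q4_K_M, Q8_0, and so on) tells you the quantization scheme.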
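To make "per-block scale factors" concrete, here is a minimal pure-Python sketch of a Q8_0-style scheme: 32 weights share one scale, and each weight becomes a signed 8-bit code. The function names are illustrative, not llama.cpp's API, and the real implementation works on packed binary tensors rather than Python lists.

```python
BLOCK_SIZE = 32  # weights per block, as in a Q8_0-style layout


def quantize_block(weights):
    """Return (scale, codes): one float scale plus one int8-range code per weight."""
    amax = max(abs(w) for w in weights)
    scale = amax / 127 if amax > 0 else 1.0  # map the largest magnitude to +/-127
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate weights from the scale and the integer codes."""
    return [scale * c for c in codes]


def bits_per_weight(code_bits=8, scale_bits=16, block_size=BLOCK_SIZE):
    """Effective storage cost: the codes plus the shared scale, averaged per weight."""
    return (code_bits * block_size + scale_bits) / block_size


# Illustrative block of evenly spaced weights (made-up values).
block = [0.02 * (i - 16) for i in range(BLOCK_SIZE)]
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
max_err = max(abs(a - b) for a, b in zip(block, restored))

print(bits_per_weight())  # 8.5: 8-bit codes plus a 16-bit scale per 32 weights
print(max_err <= scale / 2 + 1e-12)  # rounding error is at most half a step
```

The 8.5 bits/weight figure falls straight out of the layout: the shared scale adds 16 bits of overhead per 32 weights. Lower-bit schemes shrink the codes and use more elaborate block structures, but the idea is the same.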

Decoding the suffixes

Variant	Bits/weight	7B size	Verdict
Q8_0	8.5	~7.2 GB	Near-lossless; use if VRAM allows
Q5_K_M	5.5	~4.8 GB	Excellent quality/size balance
Q4_K_M	4.8	~4.4 GB	The production default
Q3_K_M	3.9	~3.5 GB	Noticeable degradation; edge only
Q2_K	2.6	~2.7 GB	Demos only — reasoning suffers
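As a rough cross-check on the table, file size can be estimated as parameters × bits per weight ÷ 8. The sketch below assumes a nominal 7e9 parameters and ignores metadata, so expect it to land near the table's sizes rather than on them: real "7B" checkpoints have somewhat different parameter counts, and quantized files keep some tensors at a different precision.

```python
def estimate_size_gb(n_params, bits_per_weight):
    """Approximate file size in GB (10^9 bytes), ignoring metadata overhead."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 7e9  # nominal "7B"; an assumption, real models differ

# Bits-per-weight values taken from the table above, plus unquantized F16.
VARIANTS = [
    ("F16", 16.0),
    ("Q8_0", 8.5),
    ("Q5_K_M", 5.5),
    ("Q4_K_M", 4.8),
    ("Q3_K_M", 3.9),
    ("Q2_K", 2.6),
]

for name, bpw in VARIANTS:
    print(f"{name:7s} {estimate_size_gb(N_PARAMS, bpw):5.2f} GB")
```

The same function answers the sizing question for other models: plug in the parameter count of a 13B or 70B model to see which variant fits your memory budget.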

Quantizing your own fine-tune

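# Assumes the LoRA adapter is already merged into the base model at ./merged-model.
# Convert the merged Hugging Face checkpoint to a 16-bit GGUF file.
python llama.cpp/convert_hf_to_gguf.py ./merged-model \
       --outfile support-v1-f16.gguf --outtype f16

# Quantize the f16 file to Q4_K_M.
./llama.cpp/build/bin/llama-quantize \
       support-v1-f16.gguf support-v1-q4_k_m.gguf Q4_K_M

# Measure what you paid: perplexity on YOUR domain text. Run the f16 baseline
# first, then the quantized file; the delta is the difference between the two.
./llama.cpp/build/bin/llama-perplexity \
       -m support-v1-f16.gguf -f sample_company_docs.txt
./llama.cpp/build/bin/llama-perplexity \
       -m support-v1-q4_k_m.gguf -f sample_company_docs.txt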
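A perplexity number only means something next to a baseline, so the useful figure is the relative increase of the quantized model over the f16 file on the same text. The sketch below assumes the tool's log ends with a summary line of the form "Final estimate: PPL = <value> +/- <error>"; check that against your build's output. The log strings and values are made up for illustration.

```python
import re

# Assumed summary-line format; adjust the pattern if your build prints something else.
PPL_PATTERN = re.compile(r"Final estimate: PPL = ([0-9.]+)")


def parse_ppl(log_text):
    """Extract the final perplexity value from captured log text."""
    match = PPL_PATTERN.search(log_text)
    if match is None:
        raise ValueError("no perplexity summary found in log")
    return float(match.group(1))


def relative_increase(baseline_ppl, quant_ppl):
    """Fractional perplexity increase of the quantized model over the baseline."""
    return (quant_ppl - baseline_ppl) / baseline_ppl


# Hypothetical captured output from the f16 and Q4_K_M runs.
f16_log = "Final estimate: PPL = 6.0000 +/- 0.03"
q4_log = "Final estimate: PPL = 6.1200 +/- 0.03"

delta = relative_increase(parse_ppl(f16_log), parse_ppl(q4_log))
print(f"{delta:.1%}")  # 2.0%
```

Compare the increase against the reported error margin before drawing conclusions; on a small sample file the two runs may differ by less than the noise.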

The honest trade-off

Q4_K_M typically costs 1–3% on benchmark scores and is invisible in most business tasks. The loss is not spread evenly, though: it concentrates in long chains of reasoning and rare tokens (think: legal citations). Our rule: ship Q4_K_M, keep a Q8_0 of the same model for the eval harness, and move any task where Q4_K_M measurably trails Q8_0 up to the bigger quant. Disk is cheaper than wrong answers.
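The promotion rule can be sketched as a per-task comparison of eval scores. Everything specific here is hypothetical: the task names, the scores, and the 0.02 threshold standing in for "measurable gap". In practice the threshold should reflect the noise of your own eval harness.

```python
def tasks_to_promote(q4_scores, q8_scores, min_gap=0.02):
    """Return tasks where the Q8_0 score beats the Q4_K_M score by more than min_gap."""
    promoted = []
    for task, q8 in q8_scores.items():
        q4 = q4_scores[task]
        if q8 - q4 > min_gap:
            promoted.append(task)
    return sorted(promoted)


# Hypothetical accuracy per task from the same eval set, one run per quant.
q8_scores = {"ticket_routing": 0.94, "faq_answering": 0.91, "legal_citation": 0.88}
q4_scores = {"ticket_routing": 0.94, "faq_answering": 0.90, "legal_citation": 0.81}

print(tasks_to_promote(q4_scores, q8_scores))  # ['legal_citation']
```

Tasks that come back from this check are served by the Q8_0 file; everything else stays on Q4_K_M.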

Want this running inside your own VPN?

Localized AI fine-tunes small open models on your data and deploys them on your hardware — GDPR by architecture, zero per-token costs. Average setup: 72 hours.
