Quantization Trade-offs in Production — 4-bit vs 8-bit vs Full Precision
Quantization reduces model precision to use less memory. A 70B model at full precision needs ~140 GB VRAM. At 4-bit, it needs ~35 GB. The question isn’t whether to quantize — it’s how much quality you’re willing to trade for the memory savings.
What quantization actually does
Neural network weights are stored as floating-point numbers. Full precision (FP16) uses 16 bits per parameter. Quantization maps these values to lower-precision representations:
- FP16 → INT8: Map 65,536 possible values to 256 values
- FP16 → INT4: Map 65,536 possible values to 16 values
The mapping isn’t random — quantization algorithms find the optimal way to represent the weight distribution with fewer bits. Better algorithms (like AWQ and GPTQ) analyze which weights matter most and preserve their precision.
Think of it like image compression: JPEG reduces file size by discarding information humans can’t easily perceive. Quantization reduces model size by discarding precision that doesn’t significantly affect outputs.
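As a toy illustration, here is absmax (symmetric) INT8 quantization in plain Python. This is the simplest possible scheme; production quantizers like GPTQ and AWQ quantize per-group and weight the rounding error by activation importance, but the core map-to-fewer-values idea is the same:

```python
def quantize_int8(weights):
    """Absmax (symmetric) quantization: scale so the largest weight
    maps to +/-127, then round every weight to the nearest integer."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the integer codes."""
    return [v * scale for v in q]

weights = [0.42, -1.37, 0.05, 0.91, -0.63]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Round-trip error is bounded by half a quantization step (scale / 2)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

One outlier weight stretches the scale for everything else, which is exactly why smarter methods quantize in small groups with a scale per group.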
Quality impact at each level
Here’s what benchmarks consistently show across multiple model families:
| Precision | Bits/param | Memory savings | Quality loss (avg) | Perplexity increase |
|---|---|---|---|---|
| FP16 | 16 | Baseline | None | Baseline |
| INT8 | 8 | 50% | ~0.5-1% | +0.01-0.03 |
| Q6_K | 6.5 | 60% | ~1-2% | +0.02-0.05 |
| Q5_K_M | 5.5 | 66% | ~2-3% | +0.05-0.10 |
| Q4_K_M | 4.5 | 72% | ~3-5% | +0.10-0.20 |
| Q4_0 | 4.0 | 75% | ~5-7% | +0.15-0.30 |
| Q3_K_M | 3.5 | 78% | ~7-10% | +0.30-0.60 |
| Q2_K | 2.5 | 84% | ~15-25% | +1.0-2.0 |
The sweet spot for most use cases is Q4_K_M — it offers 72% memory savings with quality loss that’s barely perceptible in practice.
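The memory-savings column follows directly from bits per parameter. The helper below is illustrative; real quantized files carry small per-group scale overhead, which is why Q6_K is often quoted nearer 6.56 bits, so the rounded figures can differ from the table by a point:

```python
def memory_savings_pct(bits_per_param, baseline_bits=16):
    """Percent smaller than the FP16 baseline, from bits per parameter."""
    return 100 * (1 - bits_per_param / baseline_bits)

for name, bits in [("INT8", 8), ("Q5_K_M", 5.5), ("Q4_K_M", 4.5), ("Q2_K", 2.5)]:
    print(f"{name}: {memory_savings_pct(bits):.0f}% smaller than FP16")
```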
When quality loss actually matters
Not all tasks are equally sensitive to quantization:
Low sensitivity (quantize aggressively):
- Code completion and generation
- Summarization
- Translation
- General chat and Q&A
- Following structured instructions
Medium sensitivity (use Q4_K_M or higher):
- Complex reasoning chains
- Mathematical proofs
- Nuanced creative writing
- Multi-step planning
High sensitivity (use Q6 or FP16):
- Precise numerical computation
- Tasks requiring exact recall of training data
- Benchmarks and evaluations
- Fine-tuning (always use FP16 or BF16)
For coding tasks specifically, Q4_K_M quantization typically loses less than 2% on HumanEval and similar benchmarks. The model still understands syntax, patterns, and logic — it just has slightly less precision in edge cases.
Benchmark data: real degradation numbers
Testing Qwen 3.5 27B across quantization levels:
| Quant | HumanEval | MBPP | MT-Bench | MMLU | VRAM |
|---|---|---|---|---|---|
| FP16 | 82.3% | 76.1% | 8.7 | 79.2% | 54 GB |
| Q8_0 | 82.1% | 75.8% | 8.6 | 79.0% | 27 GB |
| Q6_K | 81.5% | 75.2% | 8.6 | 78.5% | 21 GB |
| Q5_K_M | 80.8% | 74.5% | 8.5 | 77.8% | 18 GB |
| Q4_K_M | 79.6% | 73.2% | 8.4 | 76.9% | 16 GB |
| Q3_K_M | 76.2% | 69.8% | 8.1 | 74.1% | 13 GB |
| Q2_K | 68.4% | 61.2% | 7.3 | 67.5% | 10 GB |
Key observations:
- FP16 to Q4_K_M: only a 2.7-point drop on HumanEval, with 38 GB of VRAM saved
- Q4_K_M to Q3_K_M: a 3.4-point drop — the cliff starts here
- Q2_K: catastrophic degradation — avoid for anything serious
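One way to see where the cliff starts is to compute benchmark points lost per GB of VRAM saved at each step down. A quick sketch using the HumanEval and VRAM numbers from the table above:

```python
# (quant level, HumanEval %, VRAM GB) copied from the table above
results = [
    ("FP16", 82.3, 54), ("Q8_0", 82.1, 27), ("Q6_K", 81.5, 21),
    ("Q5_K_M", 80.8, 18), ("Q4_K_M", 79.6, 16),
    ("Q3_K_M", 76.2, 13), ("Q2_K", 68.4, 10),
]

cost_per_gb = {}
for (_, acc_hi, vram_hi), (name, acc_lo, vram_lo) in zip(results, results[1:]):
    # benchmark points lost per GB of VRAM saved at this step down
    cost_per_gb[name] = (acc_hi - acc_lo) / (vram_hi - vram_lo)

for name, cost in cost_per_gb.items():
    print(f"stepping down to {name}: {cost:.2f} pts lost per GB saved")
```

The cost per GB rises by orders of magnitude from the Q8_0 step (about 0.01 pts/GB) to the Q2_K step (about 2.6 pts/GB), which is the cliff in numerical form.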
Choosing the right quantization for your use case
Rule of thumb: use the largest model that fits your VRAM at Q4_K_M.
A 27B model at Q4_K_M almost always outperforms a 7B model at FP16, despite using similar VRAM. Model size matters more than precision for most tasks.
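A quick sanity check on the "similar VRAM" claim, counting weight bytes only (runtime overhead such as KV cache is ignored here):

```python
def weight_gb(params_billions, bits_per_param):
    """GB of model weights alone: parameters * bits / 8."""
    return params_billions * bits_per_param / 8

print(weight_gb(27, 4.5))  # 27B at Q4_K_M: ~15.2 GB
print(weight_gb(7, 16))    # 7B at FP16:    14.0 GB
```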
Decision framework:
If VRAM >= fp16_size → use FP16 (no reason to quantize)
If VRAM >= fp16_size / 2 → use Q8_0 (negligible quality loss)
If VRAM >= fp16_size / 3 → use Q4_K_M (best trade-off)
If VRAM < fp16_size / 3 → use a smaller model at Q4_K_M
Here fp16_size is the model's full-precision footprint: roughly 2 bytes per parameter.
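The same framework as a small function (sizes in GB; the function name is illustrative):

```python
def pick_quant(vram_gb, fp16_size_gb):
    """Pick a quantization level from available VRAM and the model's
    FP16 footprint (~2 bytes per parameter), per the framework above."""
    if vram_gb >= fp16_size_gb:
        return "FP16"
    if vram_gb >= fp16_size_gb / 2:
        return "Q8_0"
    if vram_gb >= fp16_size_gb / 3:
        return "Q4_K_M"
    return "smaller model at Q4_K_M"

print(pick_quant(24, 54))  # RTX 4090 + 27B model
```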
Quantization methods compared
Different quantization algorithms produce different quality at the same bit width:
| Method | Format | Best for | Quality | Speed |
|---|---|---|---|---|
| GGUF | .gguf | Ollama, llama.cpp | Good | Fast (CPU+GPU) |
| GPTQ | .safetensors | GPU inference (vLLM, TGI) | Good | Fast (GPU only) |
| AWQ | .safetensors | GPU inference | Slightly better | Fast (GPU only) |
| bitsandbytes | On-the-fly | Python/HuggingFace | Good | Moderate |
| AQLM | .safetensors | Extreme compression (2-bit) | Best at low bits | Slower |
For local use with Ollama, GGUF is the standard. For GPU serving with vLLM, use AWQ or GPTQ. See our detailed format comparison.
Practical recommendations by hardware
| Hardware | VRAM | Best strategy | Models that fit well |
|---|---|---|---|
| RTX 4090 | 24 GB | Q4_K_M | 27B models comfortably |
| RTX 4080 | 16 GB | Q4_K_M | 22B models |
| RTX 4070 | 12 GB | Q4_K_M | 14B models, 22B tight |
| RTX 4060 | 8 GB | Q4_K_S | 7-9B models |
| Mac M4 Pro | 24 GB | Q4_K_M | 27B models |
| Mac M4 Max | 48-128 GB | Q6_K or Q8 | 70B+ models |
For Mac users, the larger unified memory pool means you can afford higher-precision quants (Q6_K or Q8_0) than a discrete GPU of similar price. A Mac with 48 GB can run a 70B model at Q4_K_M (roughly 42 GB of weights), though that leaves limited headroom for long contexts.
When NOT to quantize
- Fine-tuning — Always train at FP16/BF16. Quantization errors compound during gradient updates.
- Evaluation/benchmarking — Use FP16 for fair comparisons.
- When you have the VRAM — If the full model fits, there’s no reason to quantize.
- Embedding models — small errors in embedding vectors shift nearest-neighbor rankings, so quantization hurts retrieval accuracy more than it hurts generation quality.
FAQ
Does quantization make models worse?
Yes, but the degree depends on the level. At Q4_K_M (the most common choice), quality loss is typically 3-5% on benchmarks — barely noticeable in practice for coding, chat, and most tasks. The trade-off is worth it: you can run a 27B model on a 16 GB GPU instead of needing 54 GB. A larger quantized model almost always beats a smaller full-precision model.
What’s the best quantization for coding tasks?
Q4_K_M is the sweet spot for coding. Benchmarks show only 2-3% degradation on HumanEval and MBPP at this level. Code has strong structural patterns that survive quantization well — syntax, indentation, and common patterns are preserved. If you have extra VRAM, Q5_K_M or Q6_K give marginal improvements. Avoid Q3 and below for coding.
Can I quantize any model?
Most modern transformer models can be quantized. Pre-quantized versions are available on HuggingFace for popular models in GGUF, GPTQ, and AWQ formats. You can also quantize models yourself using llama.cpp (for GGUF) or AutoGPTQ/AutoAWQ (for GPU formats). Some architectures quantize better than others — models with grouped-query attention (GQA) tend to be more robust to quantization than older multi-head attention (MHA) models.
Related: GGUF vs GPTQ vs AWQ Formats · How Much VRAM for AI? · Best AI Models Under 16GB VRAM · Ollama Complete Guide