Quantization Trade-offs in Production — 4-bit vs 8-bit vs Full Precision
Quantization reduces model precision to use less memory. A 70B model at full precision needs ~140 GB VRAM. At 4-bit, it needs ~35 GB. The question isn’t whether to quantize — it’s how much quality you’re willing to trade for the memory savings.
What quantization actually does
Neural network weights are stored as floating-point numbers. Full precision (FP16) uses 16 bits per parameter. Quantization maps these values to lower-precision representations:
- FP16 → INT8: Map 65,536 possible values to 256 values
- FP16 → INT4: Map 65,536 possible values to 16 values
The mapping isn’t random — quantization algorithms find the optimal way to represent the weight distribution with fewer bits. Better algorithms (like AWQ and GPTQ) analyze which weights matter most and preserve their precision.
Think of it like image compression: JPEG reduces file size by discarding information humans can’t easily perceive. Quantization reduces model size by discarding precision that doesn’t significantly affect outputs.
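As a toy illustration, here is absmax (symmetric) INT8 quantization in plain Python. This is the simplest possible scheme; production quantizers like GPTQ and AWQ quantize per-group and weight the rounding error by activation importance, but the core map-to-fewer-values idea is the same:

```python
def quantize_int8(weights):
    """Absmax (symmetric) quantization: scale so the largest weight
    maps to +/-127, then round every weight to the nearest integer."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the integer codes."""
    return [v * scale for v in q]

weights = [0.42, -1.37, 0.05, 0.91, -0.63]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Round-trip error is bounded by half a quantization step (scale / 2)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

One outlier weight stretches the scale for everything else, which is exactly why smarter methods quantize in small groups with a scale per group.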
Quality impact at each level
Here’s what benchmarks consistently show across multiple model families:
| Precision | Bits/param | Memory savings | Quality loss (avg) | Perplexity increase |
|---|---|---|---|---|
| FP16 | 16 | Baseline | None | Baseline |
| INT8 | 8 | 50% | ~0.5-1% | +0.01-0.03 |
| Q6_K | 6.5 | 60% | ~1-2% | +0.02-0.05 |
| Q5_K_M | 5.5 | 66% | ~2-3% | +0.05-0.10 |
| Q4_K_M | 4.5 | 72% | ~3-5% | +0.10-0.20 |
| Q4_0 | 4.0 | 75% | ~5-7% | +0.15-0.30 |
| Q3_K_M | 3.5 | 78% | ~7-10% | +0.30-0.60 |
| Q2_K | 2.5 | 84% | ~15-25% | +1.0-2.0 |
The sweet spot for most use cases is Q4_K_M — it offers 72% memory savings with quality loss that’s barely perceptible in practice.
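The memory-savings column follows directly from bits per parameter. The helper below is illustrative; real quantized files carry small per-group scale overhead, which is why Q6_K is often quoted nearer 6.56 bits, so the rounded figures can differ from the table by a point:

```python
def memory_savings_pct(bits_per_param, baseline_bits=16):
    """Percent smaller than the FP16 baseline, from bits per parameter."""
    return 100 * (1 - bits_per_param / baseline_bits)

for name, bits in [("INT8", 8), ("Q5_K_M", 5.5), ("Q4_K_M", 4.5), ("Q2_K", 2.5)]:
    print(f"{name}: {memory_savings_pct(bits):.0f}% smaller than FP16")
```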
When quality loss actually matters
Not all tasks are equally sensitive to quantization:
Low sensitivity (quantize aggressively):
- Code completion and generation
- Summarization
- Translation
- General chat and Q&A
- Following structured instructions
Medium sensitivity (use Q4_K_M or higher):
- Complex reasoning chains
- Mathematical proofs
- Nuanced creative writing
- Multi-step planning
High sensitivity (use Q6 or FP16):
- Precise numerical computation
- Tasks requiring exact recall of training data
- Benchmarks and evaluations
- Fine-tuning (always use FP16 or BF16)
For coding tasks specifically, Q4_K_M quantization typically loses less than 2% on HumanEval and similar benchmarks. The model still understands syntax, patterns, and logic — it just has slightly less precision in edge cases.
Benchmark data: real degradation numbers
Testing Qwen 3.5 27B across quantization levels:
| Quant | HumanEval | MBPP | MT-Bench | MMLU | VRAM |
|---|---|---|---|---|---|
| FP16 | 82.3% | 76.1% | 8.7 | 79.2% | 54 GB |
| Q8_0 | 82.1% | 75.8% | 8.6 | 79.0% | 27 GB |
| Q6_K | 81.5% | 75.2% | 8.6 | 78.5% | 21 GB |
| Q5_K_M | 80.8% | 74.5% | 8.5 | 77.8% | 18 GB |
| Q4_K_M | 79.6% | 73.2% | 8.4 | 76.9% | 16 GB |
| Q3_K_M | 76.2% | 69.8% | 8.1 | 74.1% | 13 GB |
| Q2_K | 68.4% | 61.2% | 7.3 | 67.5% | 10 GB |
Key observations:
- FP16 to Q4_K_M: only a 2.7-point drop on HumanEval, with 38 GB of VRAM saved
- Q4_K_M to Q3_K_M: a 3.4-point drop — the cliff starts here
- Q2_K: catastrophic degradation — avoid for anything serious
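One way to see where the cliff starts is to compute benchmark points lost per GB of VRAM saved at each step down. A quick sketch using the HumanEval and VRAM numbers from the table above:

```python
# (quant level, HumanEval %, VRAM GB) copied from the table above
results = [
    ("FP16", 82.3, 54), ("Q8_0", 82.1, 27), ("Q6_K", 81.5, 21),
    ("Q5_K_M", 80.8, 18), ("Q4_K_M", 79.6, 16),
    ("Q3_K_M", 76.2, 13), ("Q2_K", 68.4, 10),
]

cost_per_gb = {}
for (_, acc_hi, vram_hi), (name, acc_lo, vram_lo) in zip(results, results[1:]):
    # benchmark points lost per GB of VRAM saved at this step down
    cost_per_gb[name] = (acc_hi - acc_lo) / (vram_hi - vram_lo)

for name, cost in cost_per_gb.items():
    print(f"stepping down to {name}: {cost:.2f} pts lost per GB saved")
```

The cost per GB rises by orders of magnitude from the Q8_0 step (about 0.01 pts/GB) to the Q2_K step (about 2.6 pts/GB), which is the cliff in numerical form.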
Choosing the right quantization for your use case
Rule of thumb: use the largest model that fits your VRAM at Q4_K_M.
A 27B model at Q4_K_M almost always outperforms a 7B model at FP16, despite using similar VRAM. Model size matters more than precision for most tasks.
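A quick sanity check on the "similar VRAM" claim, counting weight bytes only (runtime overhead such as KV cache is ignored here):

```python
def weight_gb(params_billions, bits_per_param):
    """GB of model weights alone: parameters * bits / 8."""
    return params_billions * bits_per_param / 8

print(weight_gb(27, 4.5))  # 27B at Q4_K_M: ~15.2 GB
print(weight_gb(7, 16))    # 7B at FP16:    14.0 GB
```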
Decision framework:
If VRAM >= fp16_size → use FP16 (no reason to quantize)
If VRAM >= fp16_size / 2 → use Q8_0 (negligible quality loss)
If VRAM >= fp16_size / 3 → use Q4_K_M (best trade-off)
If VRAM < fp16_size / 3 → use a smaller model at Q4_K_M
Here fp16_size is the model's full-precision footprint: roughly 2 bytes per parameter.
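The same framework as a small function (sizes in GB; the function name is illustrative):

```python
def pick_quant(vram_gb, fp16_size_gb):
    """Pick a quantization level from available VRAM and the model's
    FP16 footprint (~2 bytes per parameter), per the framework above."""
    if vram_gb >= fp16_size_gb:
        return "FP16"
    if vram_gb >= fp16_size_gb / 2:
        return "Q8_0"
    if vram_gb >= fp16_size_gb / 3:
        return "Q4_K_M"
    return "smaller model at Q4_K_M"

print(pick_quant(24, 54))  # RTX 4090 + 27B model
```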
Quantization methods compared
Different quantization algorithms produce different quality at the same bit width:
| Method | Format | Best for | Quality | Speed |
|---|---|---|---|---|
| GGUF | .gguf | Ollama, llama.cpp | Good | Fast (CPU+GPU) |
| GPTQ | .safetensors | GPU inference (vLLM, TGI) | Good | Fast (GPU only) |
| AWQ | .safetensors | GPU inference | Slightly better | Fast (GPU only) |
| bitsandbytes | On-the-fly | Python/HuggingFace | Good | Moderate |
| AQLM | .safetensors | Extreme compression (2-bit) | Best at low bits | Slower |
For local use with Ollama, GGUF is the standard. For GPU serving with vLLM, use AWQ or GPTQ. See our detailed format comparison.
Practical recommendations by hardware
| Hardware | VRAM | Best strategy | Models that fit well |
|---|---|---|---|
| RTX 4090 | 24 GB | Q4_K_M | 27B models comfortably |
| RTX 4080 | 16 GB | Q4_K_M | 22B models |
| RTX 4070 | 12 GB | Q4_K_M | 14B models, 22B tight |
| RTX 4060 | 8 GB | Q4_K_S | 7-9B models |
| Mac M4 Pro | 24 GB | Q4_K_M | 27B models |
| Mac M4 Max | 48-128 GB | Q6_K or Q8 | 70B+ models |
For Mac users, the larger unified memory pool means you can afford higher-precision quants (Q6_K or Q8_0) than a discrete GPU of similar price. A Mac with 48 GB can run a 70B model at Q4_K_M (roughly 42 GB of weights), though that leaves limited headroom for long contexts.
When NOT to quantize
- Fine-tuning — Always train at FP16/BF16. Quantization errors compound during gradient updates.
- Evaluation/benchmarking — Use FP16 for fair comparisons.
- When you have the VRAM — If the full model fits, there’s no reason to quantize.
- Embedding models — small errors in embedding vectors shift nearest-neighbor rankings, so quantization hurts retrieval accuracy more than it hurts generation quality.
FAQ
Does quantization make models worse?
Yes, but the degree depends on the level. At Q4_K_M (the most common choice), quality loss is typically 3-5% on benchmarks — barely noticeable in practice for coding, chat, and most tasks. The trade-off is worth it: you can run a 27B model on a 16 GB GPU instead of needing 54 GB. A larger quantized model almost always beats a smaller full-precision model.
What’s the best quantization for coding tasks?
Q4_K_M is the sweet spot for coding. Benchmarks show only 2-3% degradation on HumanEval and MBPP at this level. Code has strong structural patterns that survive quantization well — syntax, indentation, and common patterns are preserved. If you have extra VRAM, Q5_K_M or Q6_K give marginal improvements. Avoid Q3 and below for coding.
Can I quantize any model?
Most modern transformer models can be quantized. Pre-quantized versions are available on HuggingFace for popular models in GGUF, GPTQ, and AWQ formats. You can also quantize models yourself using llama.cpp (for GGUF) or AutoGPTQ/AutoAWQ (for GPU formats). Some architectures quantize better than others — models with grouped-query attention (GQA) tend to be more robust to quantization than older multi-head attention (MHA) models.
Related: GGUF vs GPTQ vs AWQ Formats · How Much VRAM for AI? · Best AI Models Under 16GB VRAM · Ollama Complete Guide