May 9, 2026 · 5 min read

Last updated on Apr 20, 2026

GGUF vs GPTQ vs AWQ — LLM Quantization Formats Explained (2026)

You found a model on Hugging Face. The download page lists six variants: Q4_K_M.gguf, GPTQ-4bit, AWQ-4bit, EXL2-4.0bpw… and you have no idea which one to grab. This guide explains every major LLM quantization format in plain English so you can pick the right file in under two minutes.

What Is Quantization?

Large language models store their weights as 16-bit (FP16) or 32-bit (FP32) floating-point numbers. A 7B-parameter model in FP16 needs roughly 14 GB of memory — that’s already more than most consumer GPUs can handle, and bigger models only get worse.
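The 14 GB figure is just parameter count times bytes per weight. Here is a minimal sketch of that arithmetic; the function name is made up for illustration, and it counts weights only, ignoring the KV cache and runtime overhead.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# The same 7B model at different precisions.
for bits in (32, 16, 8, 4):
    print(f"7B at {bits}-bit: {weight_memory_gb(7, bits):.1f} GB")
```

Real files come out a little larger than this estimate because quantized formats also store scales and metadata.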

Quantization shrinks those weights to lower precision (8-bit, 4-bit, even 2-bit integers) so the model fits in less RAM or VRAM. The trade-off is some loss in output quality, which grows as the bit-width drops. Modern quantization methods keep that loss small at 4-bit and above, which is why a 70B model, roughly 35 GB of weights at 4 bits per weight, can now run on a single 48 GB GPU.
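To make "lower precision" concrete, here is a toy sketch of the simplest possible scheme: round-to-nearest with one shared scale. None of the formats in this guide work exactly this way (they use blocks, calibration data, or mixed bit-widths), and the helper names and sample weights are invented for the example.

```python
def quantize_symmetric(weights, bits):
    """Map floats to signed integer codes using a single shared scale."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax     # largest weight maps to qmax
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]


weights = [0.70, -0.33, 0.12, -0.04]
codes, scale = quantize_symmetric(weights, bits=4)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, restored, max_error)
```

The gap between `weights` and `restored` is the quality loss the rest of this guide is about; the formats below differ mainly in how they keep that gap small.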

If you want a deeper look at the quality-vs-size trade-off in real deployments, see Quantization Trade-offs in Production.

GGUF — The Universal Format

GGUF (GPT-Generated Unified Format) is the file format used by llama.cpp and every tool built on top of it — including Ollama and LM Studio.

Why it dominates the local-LLM scene:

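CPU + GPU hybrid inference. GGUF models can run entirely on CPU, entirely on GPU, or split layers across both. If your GPU has 8 GB of VRAM and the model needs 12 GB, you can offload as many layers as fit to the GPU (the --n-gpu-layers option in llama.cpp) and run the rest from system RAM; front ends such as Ollama estimate the split for you.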
Single-file packaging. Tokenizer, metadata, and weights are all inside one .gguf file. Download it, point your tool at it, done.
Wide quantization menu. You’ll see names like Q4_K_M, Q5_K_S, Q8_0, etc. The naming convention works like this:

Label	Bits per weight (approx.)	Notes
`Q2_K`	~2.6	Very lossy — last resort
`Q3_K_M`	~3.9	Usable for drafts, not great
`Q4_K_M`	~4.8	Sweet spot for most users
`Q5_K_M`	~5.7	Noticeably better quality
`Q6_K`	~6.6	Near-FP16 quality
`Q8_0`	~8.5	Virtually lossless, large files

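The K means k-quant, a block-wise scheme that groups weights into super-blocks, each with its own scales, and gives lower error than the older Q4_0-style types at a similar size. The suffix _S / _M / _L stands for small, medium, or large, referring to how many tensors are stored at a higher-bit type than the rest. That overhead is why the bits-per-weight figures sit above the nominal number in the label. Q4_K_M is the default recommendation for most hardware.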
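The naming convention is regular enough to parse. The sketch below is a hypothetical helper, not part of llama.cpp, and it only understands the label shapes shown in the table above (for example, it does not handle other families of GGUF quant names).

```python
import re

# Matches labels such as Q4_K_M, Q6_K and Q8_0 (legacy types end in a digit).
LABEL_RE = re.compile(
    r"^Q(?P<bits>\d)_(?:(?P<k>K)(?:_(?P<size>[SML]))?|(?P<legacy>\d))$"
)
SIZE_NAMES = {"S": "small", "M": "medium", "L": "large"}


def parse_quant_label(label: str) -> dict:
    """Split a GGUF quant label into nominal bits, k-quant flag and mix size."""
    m = LABEL_RE.match(label.upper())
    if m is None:
        raise ValueError(f"unrecognised label: {label}")
    return {
        "nominal_bits": int(m.group("bits")),
        "k_quant": m.group("k") is not None,
        "mix": SIZE_NAMES.get(m.group("size")),  # None when there is no suffix
    }


print(parse_quant_label("Q4_K_M"))
print(parse_quant_label("Q8_0"))
```

Note that `nominal_bits` is the number in the label, not the effective bits per weight from the table.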

Best for: Anyone running models locally, especially if you don’t have a high-end GPU. Check How Much VRAM Do You Need for AI? to size your setup.
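For the CPU + GPU split mentioned earlier, a rough way to guess how many layers to put on the GPU is to divide the usable VRAM by the per-layer size. This is a back-of-the-envelope sketch: it assumes all layers are the same size, and the 1.5 GB reserve for the KV cache and buffers is a placeholder, not a measured value. The function name and the 40-layer example model are hypothetical.

```python
def gpu_layers_that_fit(model_gb, n_layers, vram_gb, reserve_gb=1.5):
    """Estimate how many equally sized layers fit in VRAM; the rest stay in RAM."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(0.0, vram_gb - reserve_gb)
    return min(n_layers, int(usable_gb // per_layer_gb))


# A 12 GB model with 40 layers on an 8 GB card.
layers = gpu_layers_that_fit(model_gb=12, n_layers=40, vram_gb=8)
print(f"offload about {layers} of 40 layers to the GPU")
```

The result is the kind of number you would pass to llama.cpp's --n-gpu-layers option, then adjust up or down depending on whether the model actually loads.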

GPTQ — The OG GPU Quantizer

GPTQ (Generative Pre-Trained Transformer Quantization) was one of the first post-training quantization methods that actually worked well at 4-bit. Published in 2022, it became the go-to format for GPU inference before AWQ and EXL2 arrived.

How it works in brief: GPTQ quantizes weights one layer at a time, using a small calibration dataset to minimize the error introduced by rounding. The result is a model that runs on GPU through frameworks like AutoGPTQ or Transformers with GPTQ backend support.
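The idea of "minimizing the error introduced by rounding" can be shown with a toy. The sketch below brute-forces every round-up or round-down choice for a tiny weight vector and keeps the one whose outputs best match the original on some calibration inputs. This is not GPTQ's algorithm (GPTQ reaches a similar goal efficiently using approximate second-order information); it only illustrates the objective, and all names and numbers are invented.

```python
import math
from itertools import product


def output_error(weights, quantized, calib_inputs):
    """Sum of squared differences between original and quantized outputs."""
    err = 0.0
    for x in calib_inputs:
        y = sum(w * xi for w, xi in zip(weights, x))
        y_q = sum(q * xi for q, xi in zip(quantized, x))
        err += (y - y_q) ** 2
    return err


def best_rounding(weights, step, calib_inputs):
    """Try rounding each weight down or up to the grid; keep the best combination."""
    choices = [
        (math.floor(w / step) * step, math.ceil(w / step) * step) for w in weights
    ]
    return min(product(*choices), key=lambda q: output_error(weights, q, calib_inputs))


weights = [0.35, 0.35]
calib = [[1.0, 1.0], [1.0, 0.8], [0.8, 1.0]]   # the two inputs tend to move together
step = 0.25

nearest = [round(w / step) * step for w in weights]   # plain round-to-nearest
tuned = best_rounding(weights, step, calib)

print("nearest:", nearest, output_error(weights, nearest, calib))
print("tuned:  ", list(tuned), output_error(weights, tuned, calib))
```

Round-to-nearest pushes both weights down, so the errors add up; the calibrated choice rounds one weight up and one down so the errors partly cancel on realistic inputs.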

Key characteristics:

GPU-only. No CPU fallback — the model must fit entirely in VRAM.
Good quality at 4-bit. Slightly behind AWQ and EXL2 in head-to-head benchmarks, but the difference is small.
Mature ecosystem. Widely supported in Hugging Face Transformers, text-generation-webui, and vLLM.
Often slower than AWQ in published benchmarks, although the gap depends heavily on which GPU kernel the inference engine uses.

GPTQ is still perfectly usable, but for new projects most people reach for AWQ or EXL2 instead.

AWQ — Faster GPU Inference

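AWQ (Activation-aware Weight Quantization) improves on GPTQ with a key insight: not all weights matter equally. AWQ uses activation statistics from a small calibration set to find the most “salient” weight channels, the ones multiplied by the largest activations, and protects them by scaling them up before quantization (with the inverse scale applied to the activations). Every weight is still stored at the same low bit-width, which keeps the kernels simple and fast.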
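One way to picture how a salient channel gets protected is per-channel scaling, the approach described in the AWQ paper: multiply the important weight by a factor before rounding and divide its activation by the same factor, so the product is unchanged but the rounding error shrinks. The sketch below is a toy with hand-picked numbers chosen so the effect is visible; the function names are hypothetical, and the fixed factor of 2 stands in for the search AWQ performs.

```python
def quantize_dequantize(weights, bits=4):
    """Round-to-nearest with one shared scale, returned as floats."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]


def output_error(weights, activations, channel_scales):
    """Output error after scaling weights up and activations down per channel."""
    scaled_w = [w * s for w, s in zip(weights, channel_scales)]
    scaled_x = [x / s for x, s in zip(activations, channel_scales)]
    exact = sum(w * x for w, x in zip(weights, activations))
    approx = sum(q * x for q, x in zip(quantize_dequantize(scaled_w), scaled_x))
    return abs(exact - approx)


weights = [0.14, 0.70, -0.33, 0.04]
activations = [10.0, 1.0, 1.0, 1.0]   # channel 0 sees by far the largest activation

plain = output_error(weights, activations, [1.0, 1.0, 1.0, 1.0])
protected = output_error(weights, activations, [2.0, 1.0, 1.0, 1.0])
print(f"no scaling: {plain:.2f}   channel 0 scaled by 2: {protected:.2f}")
```

The trick works here because scaling channel 0 does not change the largest weight in the group, so the quantization grid stays the same while the salient weight lands closer to a grid point relative to its size.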

Why people prefer it over GPTQ:

Faster inference. AWQ kernels (especially via vLLM and TensorRT-LLM) are significantly faster than GPTQ equivalents.
Better quality at the same bit-width. The activation-aware approach preserves more model capability per bit.
Great for serving. If you’re running an API endpoint with vLLM, AWQ is often the recommended quantization format.
GPU-only, same as GPTQ.

AWQ has become the default choice for GPU-only 4-bit quantization in production and high-throughput scenarios.

EXL2 — Best Quality Per Bit

EXL2 is the quantization format used by ExLlamaV2, a highly optimized inference engine for NVIDIA GPUs. It takes a different approach: instead of fixed 4-bit or 8-bit, EXL2 uses variable bits-per-weight (bpw) across layers.

You’ll see files labeled 3.0bpw, 4.0bpw, 5.0bpw, etc. The quantizer allocates more bits to sensitive layers and fewer to redundant ones, squeezing out the best possible quality for a given file size.

Highest quality per bit of any format in most independent benchmarks.
Flexible sizing. You can quantize to any target bpw (e.g., 3.5, 4.65, 6.0) to exactly fill your available VRAM.
Fast inference on NVIDIA GPUs, competitive with AWQ.
Smaller ecosystem. Only works with ExLlamaV2 and tools that integrate it (like text-generation-webui). Not supported in vLLM or Ollama.

EXL2 is the enthusiast’s choice — if you have an NVIDIA GPU and want the absolute best quality for your VRAM budget, this is it.
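Picking a target bpw for your card is simple arithmetic. The sketch below assumes you reserve a fixed amount of VRAM for the KV cache and runtime buffers; the 2 GB default is a placeholder you would tune for your context length, and the function name is made up for the example.

```python
import math


def max_bpw_for_vram(params_billion, vram_gb, overhead_gb=2.0):
    """Largest bits-per-weight whose weights fit after reserving overhead_gb."""
    usable_bytes = max(0.0, vram_gb - overhead_gb) * 1e9
    return usable_bytes * 8 / (params_billion * 1e9)


# A 34B model on a 24 GB card; round down so the result never overshoots.
bpw = max_bpw_for_vram(params_billion=34, vram_gb=24)
print(f"target about {math.floor(bpw * 100) / 100} bpw")
```

In practice you would pick the nearest published bpw at or below this number, or quantize the model yourself to that target.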

Comparison Table

Feature	GGUF	GPTQ	AWQ	EXL2
Runs on CPU	✅ Yes	❌ No	❌ No	❌ No
Runs on GPU	✅ Yes	✅ Yes	✅ Yes	✅ NVIDIA only
CPU+GPU split	✅ Yes	❌ No	❌ No	❌ No
Quality at 4-bit	Good	Good	Better	Best
Inference speed	Moderate	Moderate	Fast	Fast
Variable bpw	Via k-quants	❌ No	❌ No	✅ Yes
Ecosystem	llama.cpp, Ollama, LM Studio	AutoGPTQ, vLLM, TGI	vLLM, TRT-LLM, AutoAWQ	ExLlamaV2
Best for	Local / mixed hardware	Legacy GPU setups	GPU serving / APIs	Max quality enthusiasts

Which Format Should You Pick?

Use this decision tree:

No dedicated GPU, or not enough VRAM to fit the whole model? → GGUF (Q4_K_M to start). It’ll use your CPU and whatever GPU you have. See Best AI Models Under 16 GB VRAM for model suggestions.
NVIDIA GPU with enough VRAM, and you want the best quality? → EXL2 at the highest bpw that fits. Pair it with ExLlamaV2 or text-generation-webui.
Running a production API or high-throughput server? → AWQ. It’s the fastest for batched serving via vLLM or similar engines.
Already using a GPTQ model and it works fine? → No need to switch. GPTQ is still solid — just not the first choice for new setups.

Quick rule of thumb: If you’re using Ollama or LM Studio, you’re already on GGUF. If you’re serving with vLLM, go AWQ. If you’re tweaking for maximum quality on your RTX card, try EXL2.
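The decision tree can also be written as a small function. This is only a sketch of the rules above: the parameter names are invented, the order in which serving and EXL2 are checked is an assumption (the text does not rank them), and the final fallback to GGUF for non-NVIDIA GPUs is also an assumption.

```python
def pick_format(has_gpu, fits_in_vram, nvidia, serving_api, already_on_gptq=False):
    """Return a suggested quantization format for the given setup."""
    if not has_gpu or not fits_in_vram:
        return "GGUF"          # CPU or CPU + GPU split
    if already_on_gptq:
        return "GPTQ"          # works fine, no need to switch
    if serving_api:
        return "AWQ"           # batched serving with vLLM or similar
    if nvidia:
        return "EXL2"          # best quality for the VRAM budget
    return "GGUF"              # assumption: safest default for other GPUs


print(pick_format(has_gpu=True, fits_in_vram=False, nvidia=True, serving_api=False))
print(pick_format(has_gpu=True, fits_in_vram=True, nvidia=True, serving_api=True))
```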

Related Reading

Quantization Trade-offs in Production: when quality loss actually matters
Ollama Complete Guide (2026): easiest way to run GGUF models
LM Studio Complete Guide: GUI-based GGUF runner
How Much VRAM Do You Need for AI?: sizing your hardware
Best AI Models Under 16 GB VRAM: practical model picks
vLLM vs Ollama vs llama.cpp vs TGI: inference engine comparison
