GGUF vs GPTQ vs AWQ: LLM Quantization Formats Explained (2026)
You found a model on Hugging Face. The download page lists six variants: Q4_K_M.gguf, GPTQ-4bit, AWQ-4bit, EXL2-4.0bpw... and you have no idea which one to grab. This guide explains every major LLM quantization format in plain English so you can pick the right file in under two minutes.
What Is Quantization?
Large language models store their weights as 16-bit (FP16) or 32-bit (FP32) floating-point numbers. A 7B-parameter model in FP16 needs roughly 14 GB of memory (7 billion weights at 2 bytes each). That's already more than most consumer GPUs can hold, and bigger models only get worse.
Quantization shrinks those weights to lower precision (8-bit, 4-bit, even 2-bit integers) so the model fits in less RAM or VRAM. The trade-off is a small loss in output quality. Modern quantization methods are remarkably good at minimizing that loss, which is why running 70B models on a single high-memory GPU is now practical.
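The arithmetic behind those numbers is simple enough to sketch. The helper below is illustrative only (the function name is made up for this example) and counts weight storage alone, ignoring the KV cache and runtime overhead:

```python
def estimate_weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Assumption: only the weights are counted; KV cache, activations,
    and framework overhead are ignored.
    """
    return n_params * bits_per_weight / 8 / 1e9


# Compare a 7B-parameter model at a few precisions.
for bits in (16, 8, 4):
    print(f"7B at {bits}-bit: ~{estimate_weight_gb(7e9, bits):.1f} GB")
```

Real files come out somewhat larger than this estimate because quantized formats also store per-block scales and metadata.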
If you want a deeper look at the quality-vs-size trade-off in real deployments, see Quantization Trade-offs in Production.
GGUF: The Universal Format
GGUF (GPT-Generated Unified Format) is the file format used by llama.cpp and the tools built on top of it, including Ollama and LM Studio.
Why it dominates the local-LLM scene:
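- CPU + GPU hybrid inference. GGUF models can run entirely on CPU, entirely on GPU, or with layers split across both. If your GPU has 8 GB of VRAM and the model needs 12 GB, llama.cpp can keep as many layers as fit on the GPU and run the rest on the CPU from system RAM. You control the split with its GPU-layers setting; front ends such as Ollama pick a value for you.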
- Single-file packaging. Tokenizer, metadata, and weights are all inside one `.gguf` file. Download it, point your tool at it, done.
- Wide quantization menu. You'll see names like `Q4_K_M`, `Q5_K_S`, `Q8_0`, etc. The naming convention works like this:
| Label | Bits per weight (approx.) | Notes |
|---|---|---|
| Q2_K | ~2.6 | Very lossy; last resort |
| Q3_K_M | ~3.3 | Usable for drafts, not great |
| Q4_K_M | ~4.8 | Sweet spot for most users |
| Q5_K_M | ~5.5 | Noticeably better quality |
| Q6_K | ~6.6 | Near-FP16 quality |
| Q8_0 | ~8.5 | Virtually lossless, large files |
The K means k-quant (a smarter quantization scheme that assigns more bits to the tensors that matter most). The suffix _S / _M / _L stands for small, medium, or large, referring to how many tensors get the higher-bit treatment. Q4_K_M is the default recommendation for most hardware.
Best for: Anyone running models locally, especially if you don't have a high-end GPU. Check How Much VRAM Do You Need for AI? to size your setup.
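To make the naming convention concrete, here is a small parser for the labels shown in the table. It is a sketch that covers only those label shapes (for example Q4_K_M, Q6_K, Q8_0); the `QuantLabel` type and `parse_label` function are invented for this example and are not part of llama.cpp:

```python
import re
from typing import NamedTuple, Optional


class QuantLabel(NamedTuple):
    bits: int             # nominal bit width from the label
    k_quant: bool         # True when the label contains "_K"
    size: Optional[str]   # "S", "M", "L", or None when absent


# Matches Q<bits>_K, Q<bits>_K_<S|M|L>, and legacy Q<bits>_<digit>.
_PATTERN = re.compile(r"^Q(\d)_(?:(K)(?:_([SML]))?|(\d))$")


def parse_label(label: str) -> QuantLabel:
    match = _PATTERN.match(label)
    if match is None:
        raise ValueError(f"unrecognized quantization label: {label!r}")
    bits, k, size, _legacy_variant = match.groups()
    return QuantLabel(int(bits), k == "K", size)


for name in ("Q4_K_M", "Q5_K_S", "Q6_K", "Q8_0"):
    print(name, parse_label(name))
```

Note that the nominal bit width in the label is lower than the effective bits per weight in the table, since the file also stores scales alongside the quantized values.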
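The 8 GB / 12 GB case from the hybrid-inference bullet can be turned into a rough layer count. This is a back-of-the-envelope sketch under loud assumptions: all layers are treated as the same size, the 32-layer count is hypothetical, and the 1 GB reserve for KV cache and buffers is a guess rather than a measured figure:

```python
def gpu_layer_count(model_gb: float, n_layers: int,
                    vram_gb: float, reserve_gb: float = 1.0) -> int:
    """Estimate how many layers fit on the GPU.

    Assumptions: every layer is the same size, and `reserve_gb` of VRAM
    is held back for the KV cache and scratch buffers.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


# 12 GB model, hypothetical 32 layers, 8 GB of VRAM.
layers_on_gpu = gpu_layer_count(model_gb=12.0, n_layers=32, vram_gb=8.0)
print(f"offload about {layers_on_gpu} of 32 layers to the GPU")
```

Treat the result as a starting point for the GPU-layers setting and adjust down if you hit out-of-memory errors at longer context lengths.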
GPTQ: The OG GPU Quantizer
GPTQ (Generative Pre-Trained Transformer Quantization) was one of the first post-training quantization methods that actually worked well at 4-bit. Published in 2022, it became the go-to format for GPU inference before AWQ and EXL2 arrived.
How it works in brief: GPTQ quantizes weights one layer at a time, using a small calibration dataset to minimize the error introduced by rounding. The result is a model that runs on GPU through frameworks like AutoGPTQ or Transformers with GPTQ backend support.
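For intuition about the rounding error mentioned above, the sketch below shows plain round-to-nearest 4-bit quantization of one small group of weights. To be clear, this is not GPTQ: it is the naive baseline, and GPTQ's calibration-driven error compensation is left out entirely. The function names and sample weights are invented for the example:

```python
from typing import List, Tuple


def quantize_group(weights: List[float], bits: int = 4) -> Tuple[List[int], float, float]:
    """Asymmetric min-max quantization of one group to unsigned integer codes."""
    levels = (1 << bits) - 1                  # 15 for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0         # fall back to 1.0 for a constant group
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes: List[int], scale: float, lo: float) -> List[float]:
    """Map integer codes back to approximate float weights."""
    return [c * scale + lo for c in codes]


weights = [0.12, -0.53, 0.98, 0.31, -0.07, 0.45, -0.91, 0.66]
codes, scale, lo = quantize_group(weights)
restored = dequantize_group(codes, scale, lo)
worst = max(abs(w - r) for w, r in zip(weights, restored))
print("codes:", codes)
print(f"worst-case rounding error: {worst:.4f} (step size {scale:.4f})")
```

With round-to-nearest, each weight is off by at most half a step. Methods like GPTQ aim to make the layer's output, rather than each individual weight, as close as possible to the original.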
Key characteristics:
- GPU-only. No CPU fallback: the model must fit entirely in VRAM.
- Good quality at 4-bit. Slightly behind AWQ and EXL2 in head-to-head benchmarks, but the difference is small.
- Mature ecosystem. Widely supported in HuggingFace Transformers, text-generation-webui, and vLLM.
- Slower inference than AWQ in most benchmarks due to less efficient kernel implementations.
GPTQ is still perfectly usable, but for new projects most people reach for AWQ or EXL2 instead.
AWQ: Faster GPU Inference
AWQ (Activation-aware Weight Quantization) improves on GPTQ with a key insight: not all weights matter equally. AWQ uses calibration activations to identify the most "salient" weight channels (the ones multiplied by the largest activations) and protects them by rescaling those channels before quantization, so they lose less precision while every weight is still stored in the same low-bit format.
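A toy example can show why rescaling a salient channel before quantization helps. In the published AWQ method, protection comes from per-channel scaling rather than from storing a separate high-precision copy of those weights. Everything below is a simplified illustration: the weight values and the scale of 2 are made up, whereas real AWQ searches for per-channel scales on calibration data.

```python
from typing import List


def fake_quant(weights: List[float], bits: int = 4) -> List[float]:
    """Symmetric round-to-nearest: quantize one group, then dequantize it."""
    qmax = (1 << (bits - 1)) - 1                       # 7 for 4-bit
    step = max(abs(w) for w in weights) / qmax or 1.0  # guard against all-zero groups
    return [round(w / step) * step for w in weights]


def error_on_channel(weights: List[float], idx: int, s: float = 1.0) -> float:
    """Quantization error on one weight when it is scaled by `s` first.

    The weight is multiplied by `s` before quantization and divided by `s`
    afterwards, mimicking the inverse scale applied on the activation side.
    """
    scaled = list(weights)
    scaled[idx] *= s
    restored = fake_quant(scaled)[idx] / s
    return abs(restored - weights[idx])


group = [0.9, 0.2, -0.4, 0.26]      # pretend index 1 is the salient channel
plain = error_on_channel(group, 1)
protected = error_on_channel(group, 1, s=2.0)
print(f"error without scaling: {plain:.4f}")
print(f"error with s=2:        {protected:.4f}")
```

The scaled weight lands on a finer effective grid as long as scaling does not change the group's maximum, which is why the error on that channel shrinks here (roughly 0.057 down to 0.007). The improvement is not guaranteed for every value, which is one reason the real method tunes the scales against calibration data.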
Why people prefer it over GPTQ:
- Faster inference. AWQ kernels (especially via vLLM and TensorRT-LLM) are significantly faster than GPTQ equivalents.
- Better quality at the same bit-width. The activation-aware approach preserves more model capability per bit.
- Great for serving. If you're running an API endpoint with vLLM, AWQ is often the recommended quantization format.
- GPU-only, same as GPTQ.
AWQ has become the default choice for GPU-only 4-bit quantization in production and high-throughput scenarios.
EXL2: Best Quality Per Bit
EXL2 is the quantization format used by ExLlamaV2, a highly optimized inference engine for NVIDIA GPUs. It takes a different approach: instead of fixed 4-bit or 8-bit, EXL2 uses variable bits-per-weight (bpw) across layers.
You'll see files labeled 3.0bpw, 4.0bpw, 5.0bpw, etc. The quantizer allocates more bits to sensitive layers and fewer to redundant ones, squeezing out the best possible quality for a given file size.
- Highest quality per bit of any format in most independent benchmarks.
- Flexible sizing. You can quantize to any target bpw (e.g., 3.5, 4.65, 6.0) to exactly fill your available VRAM.
- Fast inference on NVIDIA GPUs, competitive with AWQ.
- Smaller ecosystem. Only works with ExLlamaV2 and tools that integrate it (like text-generation-webui). Not supported in vLLM or Ollama.
EXL2 is the enthusiast's choice: if you have an NVIDIA GPU and want the absolute best quality for your VRAM budget, this is it.
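Picking "the highest bpw that fits" comes down to dividing your VRAM budget by the parameter count. The sketch below is a rough estimate only: the `max_bpw` helper is hypothetical, and the 2 GB set aside for the KV cache and runtime buffers is an assumption you should adjust for your context length.

```python
import math


def max_bpw(n_params: float, vram_gb: float, overhead_gb: float = 2.0) -> float:
    """Largest average bits-per-weight that fits, rounded down to 2 decimals.

    Assumption: `overhead_gb` of VRAM is reserved for KV cache and buffers;
    the rest is available for weights (1 GB = 1e9 bytes).
    """
    budget_bits = max(vram_gb - overhead_gb, 0.0) * 1e9 * 8
    return math.floor(budget_bits / n_params * 100) / 100


# A 13B-parameter model on a 12 GB card.
print(f"target about {max_bpw(13e9, 12.0)} bpw")
```

In practice you would round down further to the nearest bpw that someone has published, or quantize the model yourself to that target.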
Comparison Table
| Feature | GGUF | GPTQ | AWQ | EXL2 |
|---|---|---|---|---|
| Runs on CPU | Yes | No | No | No |
| Runs on GPU | Yes | Yes | Yes | NVIDIA only |
| CPU+GPU split | Yes | No | No | No |
| Quality at 4-bit | Good | Good | Better | Best |
| Inference speed | Moderate | Moderate | Fast | Fast |
| Variable bpw | Via k-quants | No | No | Yes |
| Ecosystem | llama.cpp, Ollama, LM Studio | AutoGPTQ, vLLM, TGI | vLLM, TRT-LLM, AutoAWQ | ExLlamaV2 |
| Best for | Local / mixed hardware | Legacy GPU setups | GPU serving / APIs | Max quality enthusiasts |
Which Format Should You Pick?
Use this decision tree:
- No dedicated GPU, or not enough VRAM to fit the whole model? → GGUF (`Q4_K_M` to start). It'll use your CPU and whatever GPU you have. See Best AI Models Under 16 GB VRAM for model suggestions.
- NVIDIA GPU with enough VRAM, and you want the best quality? → EXL2 at the highest bpw that fits. Pair it with ExLlamaV2 or text-generation-webui.
- Running a production API or high-throughput server? → AWQ. It's the fastest for batched serving via vLLM or similar engines.
- Already using a GPTQ model and it works fine? → No need to switch. GPTQ is still solid, just not the first choice for new setups.
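The same decision tree can be written as a small function. This is one possible encoding, not a definitive rule: the parameter names are invented, the order in which the branches are checked is an assumption, and the final fallback to GGUF for non-NVIDIA GPUs is a guess based on the comparison table rather than something the tree states.

```python
def pick_format(has_gpu: bool, fits_in_vram: bool, nvidia: bool = False,
                serving_api: bool = False, already_on_gptq: bool = False) -> str:
    """Suggest a quantization format from a few yes/no answers."""
    if not has_gpu or not fits_in_vram:
        return "GGUF"   # CPU or CPU+GPU split is the only option here
    if already_on_gptq:
        return "GPTQ"   # working setup, no need to switch
    if serving_api:
        return "AWQ"    # batched serving via vLLM or similar engines
    if nvidia:
        return "EXL2"   # best quality per bit, NVIDIA only
    return "GGUF"       # assumption: safe default for other GPUs


print(pick_format(has_gpu=False, fits_in_vram=False))
print(pick_format(has_gpu=True, fits_in_vram=True, nvidia=True))
print(pick_format(has_gpu=True, fits_in_vram=True, nvidia=True, serving_api=True))
```

Serving is checked before EXL2 here because EXL2 is not supported in vLLM, so a production endpoint would land on AWQ even on an NVIDIA card.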
Quick rule of thumb: If you're using Ollama or LM Studio, you're already on GGUF. If you're serving with vLLM, go AWQ. If you're tweaking for maximum quality on your RTX card, try EXL2.
Related Reading
- Quantization Trade-offs in Production: when quality loss actually matters
- Ollama Complete Guide (2026): easiest way to run GGUF models
- LM Studio Complete Guide: GUI-based GGUF runner
- How Much VRAM Do You Need for AI?: sizing your hardware
- Best AI Models Under 16 GB VRAM: practical model picks
- vLLM vs Ollama vs llama.cpp vs TGI: inference engine comparison