Llama 4 Maverick is Meta's largest open model: 400 billion parameters with a 10 million token context window. Running it locally is possible but requires serious hardware. This guide covers your options.
What you're working with
| Spec | Value |
|---|---|
| Parameters | 400B total (MoE) |
| Active parameters | ~100B per inference |
| Context window | 10M tokens |
| License | Llama License |
| Modalities | Text + Image |
| GGUF available | ✅ |
For comparison, Llama 4 Scout (the smaller sibling) has 17B active parameters and runs on a single GPU. Maverick is a different beast entirely.
Hardware requirements
Minimum viable setup
| Quantization | Total RAM/VRAM needed | Example setup |
|---|---|---|
| FP16 | ~800 GB | 10x A100 80GB |
| Q8 | ~400 GB | 5x A100 80GB |
| Q4_K_M | ~200 GB | 3x A100 80GB or 4x RTX 4090 |
| Q3_K_M | ~150 GB | 2x A100 80GB or 3x RTX 4090 |
| Q2_K | ~100 GB | 2x RTX 4090 (24GB each) |
The realistic minimum for home use: 2-3x RTX 4090 (24GB each) with Q3 or Q2 quantization, offloading whatever doesn't fit in VRAM to system RAM (see Optimizing performance below). That's $3,000-4,500 in GPUs alone.
The realistic minimum for quality: 4x RTX 4090 or 2x A100 with Q4_K_M quantization. This preserves most of the model's capability.
If you don't have multi-GPU hardware at home, cloud GPU providers let you rent 4x A100 setups by the hour, often the most practical way to try Maverick without a $4,000+ investment.
If this sounds expensive, check our used GPU buying guide: used A100s have dropped significantly in price.
Method 1: llama.cpp (multi-GPU)
llama.cpp supports splitting models across multiple GPUs. This is the most accessible way to run Maverick locally.
Build with multi-GPU support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# NVIDIA multi-GPU
make -j$(nproc) GGML_CUDA=1
Download the quantized model
# Q4_K_M: best balance (~200 GB)
huggingface-cli download meta-llama/Llama-4-Maverick-400B-GGUF \
llama-4-maverick-400b-Q4_K_M.gguf \
--local-dir ./models
# Q2_K: smallest viable (~100 GB)
huggingface-cli download meta-llama/Llama-4-Maverick-400B-GGUF \
llama-4-maverick-400b-Q2_K.gguf \
--local-dir ./models
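Files this large are usually published as several split GGUF shards rather than one file. If the repo above follows that convention (the exact shard names are an assumption), a glob pattern pulls a whole quantization in one command:
# Download every shard of the Q4_K_M quantization
huggingface-cli download meta-llama/Llama-4-Maverick-400B-GGUF \
--include "*Q4_K_M*" \
--local-dir ./models
llama.cpp only needs the path to the first shard (the -00001-of-... file); it loads the remaining shards automatically.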
Run across multiple GPUs
# Split across 4 GPUs, offload all layers
./llama-cli \
-m models/llama-4-maverick-400b-Q4_K_M.gguf \
-ngl 999 \
--tensor-split 0.25,0.25,0.25,0.25 \
-c 8192 \
-p "Explain the architectural differences between MoE and dense transformers"
The --tensor-split flag distributes the model evenly across GPUs. Adjust the ratios if your GPUs have different VRAM sizes.
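If the cards are mismatched, pass proportions that track each GPU's VRAM and llama.cpp will normalize them. A sketch for a hypothetical box with one 80 GB card and two 24 GB cards:
# Uneven split: weight the shares by VRAM (80 GB, 24 GB, 24 GB)
./llama-cli \
-m models/llama-4-maverick-400b-Q4_K_M.gguf \
-ngl 999 \
--tensor-split 80,24,24 \
-c 8192 \
-p "Summarize the tradeoffs of 4-bit quantization"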
Expected performance
| Setup | Quantization | Speed |
|---|---|---|
| 4x RTX 4090 | Q4_K_M | ~5-8 tok/s |
| 2x A100 80GB | Q4_K_M | ~10-15 tok/s |
| 3x RTX 4090 | Q3_K_M | ~6-10 tok/s |
| CPU only (256GB RAM) | Q4_K_M | ~0.5-1 tok/s |
CPU-only inference is technically possible with enough RAM but painfully slow. A GPU setup is strongly recommended.
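If you still want to benchmark the CPU-only path, the run below keeps every layer in system RAM and uses all available cores; the flags are standard llama.cpp options with the same Q4_K_M file as above:
# CPU-only run: no layers offloaded to a GPU
./llama-cli \
-m models/llama-4-maverick-400b-Q4_K_M.gguf \
-ngl 0 \
--threads $(nproc) \
-c 4096 \
-p "Explain the difference between MoE and dense transformers in one paragraph"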
Method 2: vLLM (production serving)
For serving Maverick to multiple users:
pip install vllm
vllm serve meta-llama/Llama-4-Maverick-400B \
--tensor-parallel-size 4 \
--max-model-len 32768 \
--gpu-memory-utilization 0.95 \
--quantization awq  # assumes you are serving an AWQ-quantized build of the checkpoint
vLLM handles batching, KV cache management, and multi-GPU coordination automatically. It's the best option for production deployments.
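Once the server is up, it listens on port 8000 by default and speaks the OpenAI API, so a quick smoke test from the same machine is just a curl call (the model name is whatever you passed to vllm serve):
# Query the OpenAI-compatible endpoint that vLLM exposes
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-4-Maverick-400B",
"messages": [{"role": "user", "content": "Explain expert routing in one sentence."}]
}'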
Method 3: Cloud GPU rental (cheapest to try)
If you don't have the hardware, renting cloud GPUs is the fastest way to try Maverick:
| Provider | Setup | Cost/hour | Speed |
|---|---|---|---|
| RunPod | 4x A100 80GB | ~$8/hr | 10-15 tok/s |
| Lambda | 4x A100 80GB | ~$7/hr | 10-15 tok/s |
| Vast.ai | 4x RTX 4090 | ~$4/hr | 5-8 tok/s |
Quick cloud setup (RunPod example)
- Create a RunPod account
- Launch a pod with 4x A100 80GB
- SSH in and run:
pip install vllm
# As in Method 2, point this at a 4-bit (e.g. AWQ) build of the model; full-precision 400B weights will not fit in 4x 80GB
vllm serve meta-llama/Llama-4-Maverick-400B \
--tensor-parallel-size 4 \
--max-model-len 8192
This gives you an OpenAI-compatible API endpoint you can use from anywhere.
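If the pod only exposes SSH, a local port forward gives you that endpoint on your own machine without opening the port publicly; the user and host below are placeholders for your pod's SSH details:
# Forward the pod's vLLM port (8000 by default) to localhost
ssh -N -L 8000:localhost:8000 root@<pod-ssh-address>
# In another terminal, confirm the model is being served
curl http://localhost:8000/v1/models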
Maverick vs Scout: when do you need 400B?
| | Scout (17B active) | Maverick (100B active) |
|---|---|---|
| RAM needed (Q4) | 12 GB | 200 GB |
| Speed | 15-25 tok/s | 5-15 tok/s |
| Quality | Good | Excellent |
| Context | 10M | 10M |
| Runs on laptop | ✅ | ❌ |
For most tasks, Scout is good enough. Maverick's advantage shows up in:
- Complex reasoning: multi-step logic, mathematical proofs, legal analysis
- Long document understanding: using the full 10M context with high accuracy
- Code generation: complex multi-file projects with deep understanding
- Creative writing: more nuanced, coherent long-form content
If you're not sure you need Maverick, start with Scout. You can always upgrade later.
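If you want to compare them yourself before buying hardware, Scout runs with the same llama.cpp commands on a single GPU. The filename below is illustrative, so substitute whichever Scout GGUF quantization you actually download:
# Llama 4 Scout at Q4 needs ~12 GB (per the table above) and fits on one card
./llama-cli \
-m models/llama-4-scout-Q4_K_M.gguf \
-ngl 999 \
-c 8192 \
-p "Draft a short changelog entry for a bugfix release"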
Maverick vs other large models
| Model | Active params | Context | Local? | Quality |
|---|---|---|---|---|
| Llama 4 Maverick | ~100B | 10M | Multi-GPU | ★★★★★ |
| Qwen 3.5 Plus 110B | ~30B | 128K | Multi-GPU | ★★★★ |
| Gemma 4 31B | 31B | 256K | Single GPU | ★★★★ |
| MiMo V2 Pro | 42B | 1M | API only | ★★★★★ |
Maverick is the most powerful open model you can run locally. The tradeoff is hardware cost. If you don't need the absolute best, Gemma 4 31B delivers 80% of the quality on a laptop.
Optimizing performance
Reduce context length
The 10M context window is impressive, but it uses enormous memory. If you don't need it:
# Cap the context at 8K instead of the full window; KV cache memory scales with context length
./llama-cli -m maverick-Q4.gguf -c 8192 -ngl 999
Use Flash Attention
./llama-cli -m maverick-Q4.gguf -fa -ngl 999
Flash Attention reduces memory usage for long contexts by 2-4x.
Offload to CPU
If you don't have enough VRAM, offload some layers to system RAM:
# Offload 40 layers to GPU, rest stays in RAM
./llama-cli -m maverick-Q4.gguf -ngl 40 -c 4096
This is slower but lets you run on less VRAM. You need enough system RAM to hold the remaining layers.
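Before picking a layer count, check how much VRAM is actually free on each card; nvidia-smi reports it directly:
# Per-GPU memory: total vs. currently used
nvidia-smi --query-gpu=index,memory.total,memory.used --format=csv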
Is it worth it?
For most developers: no. The hardware cost is high and smaller models handle 90% of tasks well enough.
For researchers, AI enthusiasts, and teams building products that need the best open model available: yes. Maverick at Q4 on 4x RTX 4090 is a legitimate alternative to proprietary APIs, with full privacy, no rate limits, and no per-token costs once the hardware is paid for.
The cheapest way to run AI locally is still a small model on a laptop. But if you want the frontier, Maverick is it.