πŸ€– AI Tools
Β· 5 min read

How to Run Llama 4 Maverick (400B) Locally β€” Setup Guide (2026)


Llama 4 Maverick is Meta’s largest open-weight model: a 400-billion-parameter mixture-of-experts with a 1 million token context window. Running it locally is possible but requires serious hardware. This guide covers your options.

What you’re working with

Spec | Value
Parameters | 400B total (MoE, 128 experts)
Active parameters | 17B per token
Context window | 1M tokens
License | Llama 4 Community License
Modalities | Text + image input
GGUF available | ✅

For comparison, Llama 4 Scout (the smaller sibling) uses the same 17B active parameters but totals only ~109B, so a single 80 GB GPU can hold it at 4-bit. Maverick is a different beast entirely.

Hardware requirements

Minimum viable setup

Quantization | Total RAM/VRAM needed | Example setup
FP16 | ~800 GB | 10x A100 80GB
Q8 | ~400 GB | 5x A100 80GB
Q4_K_M | ~200 GB | 3x A100 80GB, or 4x RTX 4090 + CPU offload
Q3_K_M | ~150 GB | 2x A100 80GB, or 3x RTX 4090 + CPU offload
Q2_K | ~100 GB | 2x RTX 4090 (24GB each) + CPU offload

The realistic minimum for home use: 2-3x RTX 4090 (24GB each) with Q3 or Q2 quantization, offloading whatever doesn’t fit into system RAM. That’s $3,000-4,500 in GPUs alone.

The realistic minimum for quality: 4x RTX 4090 or 2x A100 80GB with Q4_K_M quantization, again with partial CPU offload on the 4090s. This preserves most of the model’s capability.
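
If you want to sanity-check the numbers in the table yourself, the weight footprint is roughly the parameter count times the bits per weight. A quick back-of-the-envelope sketch (the ~4.5 bits/weight figure for Q4_K_M is an approximation, not an exact spec):

# Weight size in GB ≈ params (billions) × bits per weight ÷ 8
python3 -c "print(400 * 4.5 / 8)"   # ≈ 225 GB, the same ballpark as the ~200 GB above
python3 -c "print(225 / 4)"         # ≈ 56 GB per card on a 4-GPU split

On 24 GB cards, that shortfall is exactly what ends up offloaded to system RAM.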

If you don’t have multi-GPU hardware at home, cloud GPU providers let you rent 4x A100 setups by the hour β€” often the most practical way to try Maverick without a $4,000+ investment.

If this sounds expensive, check our used GPU buying guide β€” used A100s have dropped significantly in price.

Method 1: llama.cpp (multi-GPU)

llama.cpp supports splitting models across multiple GPUs. This is the most accessible way to run Maverick locally.

Build with multi-GPU support

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# NVIDIA multi-GPU build with CUDA (the old Makefile build has been replaced by CMake)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
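
Before pulling a 200 GB download, it’s worth confirming the machine and the build actually see your GPUs. A minimal check, assuming the NVIDIA driver is installed (llama-cli also prints the CUDA devices it detects when it starts up):

# List the GPUs the driver can see
nvidia-smi -L

# Confirm the freshly built binary runs and prints its build info
./build/bin/llama-cli --version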

Download the quantized model

# Q4_K_M β€” best balance (~200 GB)
huggingface-cli download meta-llama/Llama-4-Maverick-400B-GGUF \
  llama-4-maverick-400b-Q4_K_M.gguf \
  --local-dir ./models

# Q2_K β€” smallest viable (~100 GB)
huggingface-cli download meta-llama/Llama-4-Maverick-400B-GGUF \
  llama-4-maverick-400b-Q2_K.gguf \
  --local-dir ./models
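
Files this large are often uploaded as multi-part shards rather than a single .gguf. If the repo you’re pulling from does that, an include pattern grabs every piece (the shard naming here is illustrative; check the repo’s file list):

# Download every shard matching the quant you want
huggingface-cli download meta-llama/Llama-4-Maverick-400B-GGUF \
  --include "*Q4_K_M*.gguf" \
  --local-dir ./models

# llama.cpp only needs the first shard on the command line;
# it picks up the remaining parts from the same directory
ls -lh ./models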

Run across multiple GPUs

# Split across 4 GPUs, offload all layers
./build/bin/llama-cli \
  -m models/llama-4-maverick-400b-Q4_K_M.gguf \
  -ngl 999 \
  --tensor-split 0.25,0.25,0.25,0.25 \
  -c 8192 \
  -p "Explain the architectural differences between MoE and dense transformers"

The --tensor-split flag distributes the model evenly across GPUs. Adjust the ratios if your GPUs have different VRAM sizes.
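
With mixed cards, weight the split by VRAM rather than evenly. A sketch assuming one 80 GB A100 alongside two 24 GB RTX 4090s (the proportions don’t have to sum to 1):

# Weight the split by available VRAM: 80 GB + 24 GB + 24 GB
./build/bin/llama-cli \
  -m models/llama-4-maverick-400b-Q4_K_M.gguf \
  -ngl 999 \
  --tensor-split 80,24,24 \
  -c 8192 \
  -p "Summarize the tradeoffs of expert parallelism"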

Expected performance

Setup | Quantization | Speed
4x RTX 4090 | Q4_K_M | ~5-8 tok/s
2x A100 80GB | Q4_K_M | ~10-15 tok/s
3x RTX 4090 | Q3_K_M | ~6-10 tok/s
CPU only (256 GB RAM) | Q4_K_M | ~0.5-1 tok/s

CPU-only inference is technically possible with enough RAM but painfully slow. A GPU setup is strongly recommended.
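
If you want to try the CPU path anyway, the flags below are a reasonable starting point rather than a tuned setup; -ngl 0 keeps every layer in system RAM:

# CPU-only: no GPU layers, one thread per core, short context to limit the KV cache
./build/bin/llama-cli \
  -m models/llama-4-maverick-400b-Q4_K_M.gguf \
  -ngl 0 \
  -t $(nproc) \
  -c 4096 \
  -p "Hello"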

Method 2: vLLM (production serving)

For serving Maverick to multiple users:

pip install vllm

vllm serve meta-llama/Llama-4-Maverick-400B \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.95 \
  --quantization awq

vLLM handles batching, KV cache management, and multi-GPU coordination automatically, which makes it the best option for production deployments. Note that --quantization awq expects a checkpoint that has already been quantized with AWQ; point it at an AWQ repo, or drop the flag to load the full-precision weights.
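
Once the server is up it speaks the OpenAI wire format on port 8000 by default, so a quick way to confirm it is serving is to list the loaded models:

# Should return a JSON list containing the Maverick model ID
curl http://localhost:8000/v1/models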

Method 3: Cloud GPU rental (cheapest to try)

If you don’t have the hardware, renting cloud GPUs is the fastest way to try Maverick:

Provider | Setup | Cost/hour | Speed
RunPod | 4x A100 80GB | ~$8/hr | 10-15 tok/s
Lambda | 4x A100 80GB | ~$7/hr | 10-15 tok/s
Vast.ai | 4x RTX 4090 | ~$4/hr | 5-8 tok/s

Quick cloud setup (RunPod example)

  1. Create a RunPod account
  2. Launch a pod with 4x A100 80GB
  3. SSH in and run:
pip install vllm
vllm serve meta-llama/Llama-4-Maverick-400B \
  --tensor-parallel-size 4 \
  --max-model-len 8192

This gives you an OpenAI-compatible API endpoint you can use from anywhere.
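
For example, a chat completion request against the pod looks like any other OpenAI-style call; swap in your pod’s public IP or proxy URL (the address below is a placeholder):

curl http://YOUR_POD_IP:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-4-Maverick-400B",
    "messages": [{"role": "user", "content": "Give me a one-paragraph summary of MoE routing."}],
    "max_tokens": 200
  }'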

Maverick vs Scout: when do you need 400B?

Spec | Scout | Maverick
Total parameters | 109B (16 experts) | 400B (128 experts)
Active parameters | 17B | 17B
RAM needed (Q4) | ~65 GB | ~200 GB
Speed | 15-25 tok/s | 5-15 tok/s
Quality | Good | Excellent
Context window | 10M | 1M
Runs on a single 80 GB GPU | ✅ | ❌

For most tasks, Scout is good enough. Maverick’s advantage shows on:

  • Complex reasoning β€” multi-step logic, mathematical proofs, legal analysis
  • Long document understanding — using the full 1M context with high accuracy
  • Code generation β€” complex multi-file projects with deep understanding
  • Creative writing β€” more nuanced, coherent long-form content

If you’re not sure you need Maverick, start with Scout. You can always upgrade later.

Maverick vs other large models

Model | Active params | Context | Local? | Quality
Llama 4 Maverick | 17B (400B total) | 1M | Multi-GPU | ⭐⭐⭐⭐⭐
Qwen 3.5 Plus 110B | ~30B | 128K | Multi-GPU | ⭐⭐⭐⭐
Gemma 4 31B | 31B | 256K | Single GPU | ⭐⭐⭐⭐
MiMo V2 Pro | 42B | 1M | API only | ⭐⭐⭐⭐⭐

Maverick is the most powerful open model you can run locally. The tradeoff is hardware cost. If you don’t need the absolute best, Gemma 4 31B delivers 80% of the quality on a single laptop.

Optimizing performance

Reduce context length

The 1M token context window is impressive, but a long context means an enormous KV cache. If you don’t need it:

# Use an 8K context; the KV cache shrinks to a small fraction of the full-window allocation
./build/bin/llama-cli -m maverick-Q4.gguf -c 8192 -ngl 999

Use Flash Attention

./build/bin/llama-cli -m maverick-Q4.gguf -fa -ngl 999

Flash Attention substantially reduces the memory attention needs at long contexts and usually speeds up prompt processing as well.
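
If you are pushing the context further, quantizing the KV cache stacks with Flash Attention. A sketch assuming a recent llama.cpp build that exposes the cache-type flags (the quantized V cache generally requires -fa):

# An 8-bit KV cache roughly halves the cache footprint versus f16
./build/bin/llama-cli -m maverick-Q4.gguf -fa -ctk q8_0 -ctv q8_0 -c 131072 -ngl 999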

Offload to CPU

If you don’t have enough VRAM, offload some layers to system RAM:

# Offload 40 layers to GPU, rest stays in RAM
./build/bin/llama-cli -m maverick-Q4.gguf -ngl 40 -c 4096

This is slower but lets you run on less VRAM. You need enough system RAM to hold the remaining layers.
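
Because Maverick is a mixture-of-experts, there is also a more surgical option than dropping whole layers: keep the attention and shared weights on the GPUs and push only the expert tensors to system RAM. This is a sketch assuming a recent llama.cpp build with --override-tensor and the usual *_exps naming for expert tensors; check your GGUF’s tensor names if it errors out:

# Offload all layers, then override just the MoE expert tensors back to CPU memory
./build/bin/llama-cli \
  -m models/llama-4-maverick-400b-Q4_K_M.gguf \
  -ngl 999 \
  --override-tensor ".ffn_.*_exps.=CPU" \
  -c 8192

Since only a few experts fire per token, this often ends up faster than a plain -ngl cut at the same VRAM budget, though results vary by hardware.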

Is it worth it?

For most developers: no. The hardware cost is high and smaller models handle 90% of tasks well enough.

For researchers, AI enthusiasts, and teams building products that need the best open model available: yes. Maverick at Q4 on 4x RTX 4090 is a legitimate alternative to proprietary APIs, with full privacy, no rate limits, and no per-token costs once the hardware is paid for.

The cheapest way to run AI locally is still a small model on a laptop. But if you want the frontier β€” Maverick is it.