πŸ€– AI Tools
Β· 5 min read

How to Run Llama 4 Maverick (400B) Locally β€” Setup Guide (2026)


Llama 4 Maverick is Meta’s largest open-weight model: a 400-billion-parameter mixture-of-experts with a 1 million token context window. Running it locally is possible but requires serious hardware. This guide covers your options.

What you’re working with

Spec | Value
Parameters | 400B total (MoE, 128 experts)
Active parameters | 17B per token
Context window | 1M tokens
License | Llama 4 Community License
Modalities | Text + image input
GGUF available | ✅

For comparison, Llama 4 Scout (the smaller sibling) uses the same 17B active parameters but totals only ~109B, so a single 80 GB GPU can hold it at 4-bit. Maverick is a different beast entirely.

Hardware requirements

Minimum viable setup

Quantization | Total RAM/VRAM needed | Example setup
FP16 | ~800 GB | 10x A100 80GB
Q8 | ~400 GB | 5x A100 80GB
Q4_K_M | ~200 GB | 3x A100 80GB, or 4x RTX 4090 + CPU offload
Q3_K_M | ~150 GB | 2x A100 80GB, or 3x RTX 4090 + CPU offload
Q2_K | ~100 GB | 2x RTX 4090 (24GB each) + CPU offload

The realistic minimum for home use: 2-3x RTX 4090 (24GB each) with Q3 or Q2 quantization, offloading whatever doesn’t fit into system RAM. That’s $3,000-4,500 in GPUs alone.

The realistic minimum for quality: 4x RTX 4090 or 2x A100 80GB with Q4_K_M quantization, again with partial CPU offload on the 4090s. This preserves most of the model’s capability.
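
If you want to sanity-check the numbers in the table yourself, the weight footprint is roughly the parameter count times the bits per weight. A quick back-of-the-envelope sketch (the ~4.5 bits/weight figure for Q4_K_M is an approximation, not an exact spec):

# Weight size in GB ≈ params (billions) × bits per weight ÷ 8
python3 -c "print(400 * 4.5 / 8)"   # ≈ 225 GB, the same ballpark as the ~200 GB above
python3 -c "print(225 / 4)"         # ≈ 56 GB per card on a 4-GPU split

On 24 GB cards, that shortfall is exactly what ends up offloaded to system RAM.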

If you don’t have multi-GPU hardware at home, cloud GPU providers let you rent 4x A100 setups by the hour β€” often the most practical way to try Maverick without a $4,000+ investment.

If this sounds expensive, check our used GPU buying guide β€” used A100s have dropped significantly in price.

Method 1: llama.cpp (multi-GPU)

llama.cpp supports splitting models across multiple GPUs. This is the most accessible way to run Maverick locally.

Build with multi-GPU support

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# NVIDIA multi-GPU build with CUDA (the old Makefile build has been replaced by CMake)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
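
Before pulling a 200 GB download, it’s worth confirming the machine and the build actually see your GPUs. A minimal check, assuming the NVIDIA driver is installed (llama-cli also prints the CUDA devices it detects when it starts up):

# List the GPUs the driver can see
nvidia-smi -L

# Confirm the freshly built binary runs and prints its build info
./build/bin/llama-cli --version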

Download the quantized model

# Q4_K_M β€” best balance (~200 GB)
huggingface-cli download meta-llama/Llama-4-Maverick-400B-GGUF \
  llama-4-maverick-400b-Q4_K_M.gguf \
  --local-dir ./models

# Q2_K β€” smallest viable (~100 GB)
huggingface-cli download meta-llama/Llama-4-Maverick-400B-GGUF \
  llama-4-maverick-400b-Q2_K.gguf \
  --local-dir ./models
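
Files this large are often uploaded as multi-part shards rather than a single .gguf. If the repo you’re pulling from does that, an include pattern grabs every piece (the shard naming here is illustrative; check the repo’s file list):

# Download every shard matching the quant you want
huggingface-cli download meta-llama/Llama-4-Maverick-400B-GGUF \
  --include "*Q4_K_M*.gguf" \
  --local-dir ./models

# llama.cpp only needs the first shard on the command line;
# it picks up the remaining parts from the same directory
ls -lh ./models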

Run across multiple GPUs

# Split across 4 GPUs, offload all layers
./build/bin/llama-cli \
  -m models/llama-4-maverick-400b-Q4_K_M.gguf \
  -ngl 999 \
  --tensor-split 0.25,0.25,0.25,0.25 \
  -c 8192 \
  -p "Explain the architectural differences between MoE and dense transformers"

The --tensor-split flag distributes the model evenly across GPUs. Adjust the ratios if your GPUs have different VRAM sizes.
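
With mixed cards, weight the split by VRAM rather than evenly. A sketch assuming one 80 GB A100 alongside two 24 GB RTX 4090s (the proportions don’t have to sum to 1):

# Weight the split by available VRAM: 80 GB + 24 GB + 24 GB
./build/bin/llama-cli \
  -m models/llama-4-maverick-400b-Q4_K_M.gguf \
  -ngl 999 \
  --tensor-split 80,24,24 \
  -c 8192 \
  -p "Summarize the tradeoffs of expert parallelism"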

Expected performance

Setup | Quantization | Speed
4x RTX 4090 | Q4_K_M | ~5-8 tok/s
2x A100 80GB | Q4_K_M | ~10-15 tok/s
3x RTX 4090 | Q3_K_M | ~6-10 tok/s
CPU only (256 GB RAM) | Q4_K_M | ~0.5-1 tok/s

CPU-only inference is technically possible with enough RAM but painfully slow. A GPU setup is strongly recommended.
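
If you want to try the CPU path anyway, the flags below are a reasonable starting point rather than a tuned setup; -ngl 0 keeps every layer in system RAM:

# CPU-only: no GPU layers, one thread per core, short context to limit the KV cache
./build/bin/llama-cli \
  -m models/llama-4-maverick-400b-Q4_K_M.gguf \
  -ngl 0 \
  -t $(nproc) \
  -c 4096 \
  -p "Hello"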

Method 2: vLLM (production serving)

For serving Maverick to multiple users:

pip install vllm

vllm serve meta-llama/Llama-4-Maverick-400B \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.95 \
  --quantization awq

vLLM handles batching, KV cache management, and multi-GPU coordination automatically, which makes it the best option for production deployments. Note that --quantization awq expects a checkpoint that has already been quantized with AWQ; point it at an AWQ repo, or drop the flag to load the full-precision weights.
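
Once the server is up it speaks the OpenAI wire format on port 8000 by default, so a quick way to confirm it is serving is to list the loaded models:

# Should return a JSON list containing the Maverick model ID
curl http://localhost:8000/v1/models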

Method 3: Cloud GPU rental (cheapest to try)

If you don’t have the hardware, renting cloud GPUs is the fastest way to try Maverick:

Provider | Setup | Cost/hour | Speed
RunPod | 4x A100 80GB | ~$8/hr | 10-15 tok/s
Lambda | 4x A100 80GB | ~$7/hr | 10-15 tok/s
Vast.ai | 4x RTX 4090 | ~$4/hr | 5-8 tok/s

Quick cloud setup (RunPod example)

  1. Create a RunPod account
  2. Launch a pod with 4x A100 80GB
  3. SSH in and run:
pip install vllm
vllm serve meta-llama/Llama-4-Maverick-400B \
  --tensor-parallel-size 4 \
  --max-model-len 8192

This gives you an OpenAI-compatible API endpoint you can use from anywhere.
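
For example, a chat completion request against the pod looks like any other OpenAI-style call; swap in your pod’s public IP or proxy URL (the address below is a placeholder):

curl http://YOUR_POD_IP:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-4-Maverick-400B",
    "messages": [{"role": "user", "content": "Give me a one-paragraph summary of MoE routing."}],
    "max_tokens": 200
  }'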

Maverick vs Scout: when do you need 400B?

Spec | Scout | Maverick
Total parameters | 109B (16 experts) | 400B (128 experts)
Active parameters | 17B | 17B
RAM needed (Q4) | ~65 GB | ~200 GB
Speed | 15-25 tok/s | 5-15 tok/s
Quality | Good | Excellent
Context window | 10M | 1M
Runs on a single 80 GB GPU | ✅ | ❌

For most tasks, Scout is good enough. Maverick’s advantage shows on:

  • Complex reasoning β€” multi-step logic, mathematical proofs, legal analysis
  • Long document understanding — using the full 1M context with high accuracy
  • Code generation β€” complex multi-file projects with deep understanding
  • Creative writing β€” more nuanced, coherent long-form content

If you’re not sure you need Maverick, start with Scout. You can always upgrade later.

Maverick vs other large models

Model | Active params | Context | Local? | Quality
Llama 4 Maverick | 17B (400B total) | 1M | Multi-GPU | ⭐⭐⭐⭐⭐
Qwen 3.5 Plus 110B | ~30B | 128K | Multi-GPU | ⭐⭐⭐⭐
Gemma 4 31B | 31B | 256K | Single GPU | ⭐⭐⭐⭐
MiMo V2 Pro | 42B | 1M | API only | ⭐⭐⭐⭐⭐

Maverick is the most powerful open model you can run locally. The tradeoff is hardware cost. If you don’t need the absolute best, Gemma 4 31B delivers 80% of the quality on a single laptop.

Optimizing performance

Reduce context length

The 1M token context window is impressive, but a long context means an enormous KV cache. If you don’t need it:

# Use an 8K context; the KV cache shrinks to a small fraction of the full-window allocation
./build/bin/llama-cli -m maverick-Q4.gguf -c 8192 -ngl 999

Use Flash Attention

./build/bin/llama-cli -m maverick-Q4.gguf -fa -ngl 999

Flash Attention substantially reduces the memory attention needs at long contexts and usually speeds up prompt processing as well.
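
If you are pushing the context further, quantizing the KV cache stacks with Flash Attention. A sketch assuming a recent llama.cpp build that exposes the cache-type flags (the quantized V cache generally requires -fa):

# An 8-bit KV cache roughly halves the cache footprint versus f16
./build/bin/llama-cli -m maverick-Q4.gguf -fa -ctk q8_0 -ctv q8_0 -c 131072 -ngl 999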

Offload to CPU

If you don’t have enough VRAM, offload some layers to system RAM:

# Offload 40 layers to GPU, rest stays in RAM
./build/bin/llama-cli -m maverick-Q4.gguf -ngl 40 -c 4096

This is slower but lets you run on less VRAM. You need enough system RAM to hold the remaining layers.
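
Because Maverick is a mixture-of-experts, there is also a more surgical option than dropping whole layers: keep the attention and shared weights on the GPUs and push only the expert tensors to system RAM. This is a sketch assuming a recent llama.cpp build with --override-tensor and the usual *_exps naming for expert tensors; check your GGUF’s tensor names if it errors out:

# Offload all layers, then override just the MoE expert tensors back to CPU memory
./build/bin/llama-cli \
  -m models/llama-4-maverick-400b-Q4_K_M.gguf \
  -ngl 999 \
  --override-tensor ".ffn_.*_exps.=CPU" \
  -c 8192

Since only a few experts fire per token, this often ends up faster than a plain -ngl cut at the same VRAM budget, though results vary by hardware.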

Is it worth it?

For most developers: no. The hardware cost is high and smaller models handle 90% of tasks well enough.

For researchers, AI enthusiasts, and teams building products that need the best open model available: yes. Maverick at Q4 on 4x RTX 4090 is a legitimate alternative to proprietary APIs, with full privacy, no rate limits, and no per-token costs once the hardware is paid for.

The cheapest way to run AI locally is still a small model on a laptop. But if you want the frontier β€” Maverick is it.