Apr 2, 2026

How to Run Llama 4 Locally — Scout and Maverick Setup Guide

Llama 4 Scout and Maverick are Meta’s latest open-weight models. Scout has a 10 million token context window. Maverick has 400B parameters with 17B active. Both are free to download and run locally under Meta’s license.

The models

	Llama 4 Scout	Llama 4 Maverick
Total params	109B	400B
Active params	17B	17B
Context window	10M tokens	1M tokens
Architecture	MoE (16 experts)	MoE (128 experts)
Multimodal	Yes	Yes
Languages	200	200
Memory for weights (Q4)	~55-70GB	~200-250GB

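Scout is the long-context specialist: in principle its window can hold an entire codebase, a book series, or years of chat history in a single prompt, though local memory usually limits the usable context to far less. Maverick is the higher-quality model of the two, with stronger reasoning and coding.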

Run with Ollama

# Scout: all 109B weights must be loaded, roughly 55-70GB at Q4
ollama run llama4:scout

# Maverick: needs serious hardware, roughly 200-250GB at Q4
ollama run llama4:maverick

Scout is the more practical choice for most developers. At Q4 it fits on a single 80GB-class GPU or a high-memory Mac, and its long context window is useful for repository-level code understanding.
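
Ollama applies its own default context length, which is much smaller than Scout’s maximum, so a larger window has to be requested explicitly. One way is a Modelfile that derives a variant with a bigger num_ctx. The sketch below is a minimal example; the helper name write_modelfile and the variant name scout-32k are hypothetical, and the base model name should be whichever one you pulled.

```shell
# write_modelfile <base_model> <context_tokens> <output_path>
# Writes a two-line Modelfile that sets the context length for a base model.
write_modelfile() {
  printf 'FROM %s\nPARAMETER num_ctx %s\n' "$1" "$2" > "$3"
}

# Example: a 32K-context variant of Scout
write_modelfile "llama4:scout" 32768 ./Modelfile

# Then, with Ollama installed and the base model pulled:
#   ollama create scout-32k -f ./Modelfile
#   ollama run scout-32k
```

A larger num_ctx increases memory use, so it is worth raising it in steps rather than jumping straight to the largest value.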

Run with llama.cpp

# Download quantized Scout (confirm the exact GGUF repo and file names on the Hugging Face hub first)
huggingface-cli download meta-llama/Llama-4-Scout-GGUF \
  --include "*Q4_K_M*" \
  --local-dir ./models

# Start server
llama-server \
  --model ./models/Llama-4-Scout-Q4_K_M.gguf \
  --ctx-size 32768 \
  --threads 8 \
  --port 8080

Note: the 10M context window is the model’s maximum capability. In practice, you’ll set a smaller context size, because the KV cache grows linearly with context length and competes with the weights for memory. 32K-128K is practical for most hardware.
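
To see why the full window is out of reach locally, it helps to estimate the KV cache size for a given context. The sketch below uses the standard upper-bound formula (2 x layers x KV heads x head dimension x bytes per value x tokens). The architecture values passed in (48 layers, 8 KV heads, head dimension 128) are assumptions for illustration and should be checked against the model’s config.json; the function name kv_cache_gib is hypothetical, and the estimate ignores any attention optimizations the runtime applies.

```shell
# kv_cache_gib <ctx_tokens> <layers> <kv_heads> <head_dim> <bytes_per_value>
# Prints an upper-bound KV cache size in GiB (keys + values, hence the factor 2).
kv_cache_gib() {
  awk -v t="$1" -v l="$2" -v h="$3" -v d="$4" -v b="$5" \
    'BEGIN { printf "%.1f\n", 2 * l * h * d * b * t / (1024 * 1024 * 1024) }'
}

# Assumed architecture values, 16-bit cache entries (2 bytes each)
kv_cache_gib 32768    48 8 128 2   # 32K context
kv_cache_gib 131072   48 8 128 2   # 128K context
kv_cache_gib 10000000 48 8 128 2   # full 10M context
```

Under these assumptions the three calls print 6.0, 24.0, and 1831.1, which is why the full window is not a realistic target on local hardware.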

Hardware requirements

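Scout (recommended):

Minimum: 16-24GB GPU (RTX 4080, RTX 4090) plus about 64GB of system RAM, with most layers offloaded to the CPU; expect slow generation
Recommended: an 80GB-class GPU or a Mac with 96GB+ unified memory, enough to hold the roughly 55-70GB of Q4 weights
Optimal: 128GB+ for extended context lengths

Maverick:

Minimum: roughly 256GB of combined memory (high-memory Mac Studio or multi-GPU) for the roughly 200-250GB of Q4 weights
Recommended: extra headroom beyond that for the KV cache and longer contexts
Not practical on consumer GPUs, even with quantization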
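
Memory for a mixture-of-experts model follows the total parameter count, not the active count, because every expert has to be loaded even though only 17B parameters run per token. A rough weight-only estimate is parameters x bits per weight / 8. The sketch below assumes an average of 4.5 bits per weight for a 4-bit quantization (actual formats vary); the function name weight_gb is hypothetical.

```shell
# weight_gb <params_in_billions> <bits_per_weight>
# Prints approximate weight memory in GB (1 GB = 10^9 bytes), excluding KV cache.
weight_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

weight_gb 109 4.5   # Scout, all experts loaded
weight_gb 400 4.5   # Maverick, all experts loaded
weight_gb 17 4.5    # active parameters only: compute per token, not memory
```

The first two calls print 61.3 and 225.0. The third prints 9.6, which reflects the per-token compute footprint and explains the speed of these models, but it is not the amount of memory they need.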

If Maverick’s memory requirements exceed your local hardware, cloud GPU providers offer high-memory instances that can run it without the upfront cost of a Mac Ultra or multi-GPU rig.

For most developers, Scout is the right choice. Its memory needs are roughly a quarter of Maverick’s, and the 10M context capability (even if you only use a fraction of it) is unusual among self-hostable models.

Connect to your IDE

# Start Ollama with Scout
ollama run llama4:scout

# In VS Code with Continue, add to config:
# Provider: Ollama
# Model: llama4:scout
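
Before pointing the IDE at Ollama, it can help to confirm that the server is running and actually lists the model. The sketch below checks the JSON returned by Ollama’s /api/tags endpoint; it assumes the default port 11434, and the helper name has_model is hypothetical.

```shell
# has_model <name>
# Reads Ollama's /api/tags JSON on stdin; succeeds if a model with that name is listed.
has_model() {
  grep -Eq "\"name\"[[:space:]]*:[[:space:]]*\"$1\""
}

# Usage, with Ollama running locally:
#   curl -s http://localhost:11434/api/tags | has_model "llama4:scout" \
#     && echo "model available" || echo "model not found"
```

If the check fails, the IDE extension will typically fail too, so pulling the model or starting the server first saves some debugging.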

Llama 4 vs Qwen 3.5 for self-hosting

Both are strong self-hosted options. Key differences:

Context: Llama 4 Scout wins (10M vs 256K)
Benchmarks: Qwen 3.5 wins on most tasks
Model sizes: Qwen has more options (0.8B to 397B)
License: Qwen is Apache 2.0 (more permissive); Llama uses Meta’s community license, which requires a separate license above 700M monthly active users
Multimodal: Both support vision natively

If context length is your priority, go with Llama 4 Scout. For everything else, Qwen 3.5 is generally stronger.
