
How to Run Llama 4 Locally — Scout and Maverick Setup Guide


Llama 4 Scout and Maverick are Meta’s latest open-weight models. Scout has a 10 million token context window. Maverick has 400B parameters with 17B active. Both are free to download and run locally under Meta’s license.

The models

|                  | Llama 4 Scout    | Llama 4 Maverick  |
|------------------|------------------|-------------------|
| Total params     | 109B             | 400B              |
| Active params    | 17B              | 17B               |
| Context window   | 10M tokens       | 1M tokens         |
| Architecture     | MoE (16 experts) | MoE (128 experts) |
| Multimodal       | Yes              | Yes               |
| Languages        | 200              | 200               |
| VRAM needed (Q4) | ~16-20GB         | ~60-80GB          |

Scout is the long-context specialist — it can process entire codebases, book series, or years of chat history in a single prompt. Maverick is the quality leader with stronger reasoning and coding.

Run with Ollama

# Scout — runs on a 24GB GPU or a 32GB Mac
ollama run llama4:scout

# Maverick — needs serious hardware
ollama run llama4:maverick

Scout is the more practical choice for most developers. It fits on a single GPU and the 10M context window is genuinely useful for repository-level code understanding.
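Once the model is pulled, Ollama serves an HTTP API on its default port (11434). A minimal sketch of a client, using only the standard library — the model tag here assumes the `ollama run` command above; check `ollama list` for the exact tag on your install:

```python
# Minimal sketch: query a locally running Ollama instance over its HTTP API.
# Assumes Ollama's default port (11434) and the llama4:scout tag pulled above.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str) -> bytes:
    """Serialize a non-streaming request body for Ollama's /api/generate."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return json.dumps(payload).encode("utf-8")

def ask(model: str, prompt: str) -> str:
    """Send a prompt and return the model's full response text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    print(ask("llama4:scout", "Explain this project's build system in one paragraph."))
```

Setting `"stream": False` returns one JSON object with the complete response; omit it to receive newline-delimited JSON chunks as the model generates.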

Run with llama.cpp

# Download a Q4_K_M quant of Scout (GGUF builds are community-published;
# check Hugging Face for the exact repo name)
huggingface-cli download meta-llama/Llama-4-Scout-GGUF \
  --include "*Q4_K_M*" \
  --local-dir ./models

# Start server
llama-server \
  --model ./models/Llama-4-Scout-Q4_K_M.gguf \
  --ctx-size 32768 \
  --threads 8 \
  --port 8080

Note: the 10M context window is the model’s maximum capability, not a free default. The KV cache grows with context length, so in practice you’ll set `--ctx-size` based on your available VRAM; 32K-128K is practical on most hardware.
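llama-server exposes an OpenAI-compatible endpoint, so any OpenAI-style client works against it. A minimal sketch using only the standard library — the port matches the `--port 8080` flag above, and when a single model is loaded the server answers regardless of the `model` field:

```python
# Minimal sketch: call the llama-server started above via its
# OpenAI-compatible /v1/chat/completions endpoint on port 8080.
import json
import urllib.request

def chat_payload(prompt: str, max_tokens: int = 256) -> bytes:
    """Build a /v1/chat/completions request body."""
    body = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return json.dumps(body).encode("utf-8")

def chat(prompt: str, host: str = "http://localhost:8080") -> str:
    """Send one user message and return the assistant's reply."""
    req = urllib.request.Request(
        host + "/v1/chat/completions",
        data=chat_payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.loads(resp.read())
    return data["choices"][0]["message"]["content"]
```

Because the endpoint follows the OpenAI schema, you can also point the official `openai` Python client at `http://localhost:8080/v1` instead of hand-rolling requests.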

Hardware requirements

Scout (recommended):

  • Minimum: 16GB VRAM (RTX 4080, M-series Mac with 16GB)
  • Recommended: 24GB VRAM (RTX 4090) for larger context
  • Optimal: 64GB+ Mac for extended context lengths

Maverick:

  • Minimum: 64GB unified memory (Mac Studio) or multi-GPU
  • Recommended: 128GB+ for comfortable operation
  • Not practical on consumer GPUs without quantization

For most developers, Scout is the right choice. It runs on hardware you probably already have and the 10M context capability (even if you only use a fraction of it) is unique among self-hostable models.

Connect to your IDE

# Start Ollama with Scout
ollama run llama4:scout

# In VS Code with Continue, add to config:
# Provider: Ollama
# Model: llama4:scout
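A sketch of what the corresponding entry in Continue’s `config.json` might look like — set `model` to whatever tag `ollama list` shows for your pull, and `title` is just a display label:

```json
{
  "models": [
    {
      "title": "Llama 4 Scout (local)",
      "provider": "ollama",
      "model": "llama4:scout"
    }
  ]
}
```

With this in place, Continue routes completions and chat through your local Ollama server instead of a hosted API.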

Llama 4 vs Qwen 3.5 for self-hosting

Both are strong self-hosted options. Key differences:

  • Context: Llama 4 Scout wins (10M vs 256K)
  • Benchmarks: Qwen 3.5 wins on most tasks
  • Model sizes: Qwen has more options (0.8B to 397B)
  • License: Qwen is Apache 2.0 (more permissive), Llama has Meta’s license (700M MAU limit)
  • Multimodal: Both support vision natively

If context length is your priority, go Llama 4 Scout. For everything else, Qwen 3.5 is generally stronger.