πŸ€– AI Tools
Β· 2 min read

How to Run Llama 4 Locally β€” Scout and Maverick Setup Guide


Llama 4 Scout and Maverick are Meta’s latest open-weight models. Scout has a 10 million token context window. Maverick has 400B parameters with 17B active. Both are free to download and run locally under Meta’s license.

The models

Llama 4 ScoutLlama 4 Maverick
Total params109B400B
Active params17B17B
Context window10M tokens1M tokens
ArchitectureMoE (16 experts)MoE (128 experts)
MultimodalYesYes
Languages200200
VRAM needed (Q4)~16-20GB~60-80GB

Scout is the long-context specialist β€” it can process entire codebases, book series, or years of chat history in a single prompt. Maverick is the quality leader with stronger reasoning and coding.

Run with Ollama

# Scout β€” runs on 24GB GPU or 32GB Mac
ollama run llama4-scout

# Maverick β€” needs serious hardware
ollama run llama4-maverick

Scout is the more practical choice for most developers. It fits on a single GPU and the 10M context window is genuinely useful for repository-level code understanding.

Run with llama.cpp

# Download quantized Scout
huggingface-cli download meta-llama/Llama-4-Scout-GGUF \
  --include "*Q4_K_M*" \
  --local-dir ./models

# Start server
llama-server \
  --model ./models/Llama-4-Scout-Q4_K_M.gguf \
  --ctx-size 32768 \
  --threads 8 \
  --port 8080

Note: the 10M context window is the model’s maximum capability. In practice, you’ll set a smaller context size based on your available VRAM. 32K-128K is practical for most hardware.

Hardware requirements

Scout (recommended):

  • Minimum: 16GB VRAM (RTX 4080, M-series Mac with 16GB)
  • Recommended: 24GB VRAM (RTX 4090) for larger context
  • Optimal: 64GB+ Mac for extended context lengths

Maverick:

  • Minimum: 64GB unified memory (Mac Studio) or multi-GPU
  • Recommended: 128GB+ for comfortable operation
  • Not practical on consumer GPUs without quantization

If Maverick’s memory requirements exceed your local hardware, cloud GPU providers offer high-memory instances that can run it without the upfront cost of a Mac Ultra or multi-GPU rig.

For most developers, Scout is the right choice. It runs on hardware you probably already have and the 10M context capability (even if you only use a fraction of it) is unique among self-hostable models.

Connect to your IDE

# Start Ollama with Scout
ollama run llama4-scout

# In VS Code with Continue, add to config:
# Provider: Ollama
# Model: llama4-scout

Llama 4 vs Qwen 3.5 for self-hosting

Both are strong self-hosted options. Key differences:

  • Context: Llama 4 Scout wins (10M vs 256K)
  • Benchmarks: Qwen 3.5 wins on most tasks
  • Model sizes: Qwen has more options (0.8B to 397B)
  • License: Qwen is Apache 2.0 (more permissive), Llama has Meta’s license (700M MAU limit)
  • Multimodal: Both support vision natively

If context length is your priority, go Llama 4 Scout. For everything else, Qwen 3.5 is generally stronger.

Related: How To Run Llama 4 Maverick Locally