Llama 4 Scout and Maverick are Metaβs latest open-weight models. Scout has a 10 million token context window. Maverick has 400B parameters with 17B active. Both are free to download and run locally under Metaβs license.
The models
| Llama 4 Scout | Llama 4 Maverick | |
|---|---|---|
| Total params | 109B | 400B |
| Active params | 17B | 17B |
| Context window | 10M tokens | 1M tokens |
| Architecture | MoE (16 experts) | MoE (128 experts) |
| Multimodal | Yes | Yes |
| Languages | 200 | 200 |
| VRAM needed (Q4) | ~16-20GB | ~60-80GB |
Scout is the long-context specialist β it can process entire codebases, book series, or years of chat history in a single prompt. Maverick is the quality leader with stronger reasoning and coding.
Run with Ollama
# Scout β runs on 24GB GPU or 32GB Mac
ollama run llama4-scout
# Maverick β needs serious hardware
ollama run llama4-maverick
Scout is the more practical choice for most developers. It fits on a single GPU and the 10M context window is genuinely useful for repository-level code understanding.
Run with llama.cpp
# Download quantized Scout
huggingface-cli download meta-llama/Llama-4-Scout-GGUF \
--include "*Q4_K_M*" \
--local-dir ./models
# Start server
llama-server \
--model ./models/Llama-4-Scout-Q4_K_M.gguf \
--ctx-size 32768 \
--threads 8 \
--port 8080
Note: the 10M context window is the modelβs maximum capability. In practice, youβll set a smaller context size based on your available VRAM. 32K-128K is practical for most hardware.
Hardware requirements
Scout (recommended):
- Minimum: 16GB VRAM (RTX 4080, M-series Mac with 16GB)
- Recommended: 24GB VRAM (RTX 4090) for larger context
- Optimal: 64GB+ Mac for extended context lengths
Maverick:
- Minimum: 64GB unified memory (Mac Studio) or multi-GPU
- Recommended: 128GB+ for comfortable operation
- Not practical on consumer GPUs without quantization
If Maverickβs memory requirements exceed your local hardware, cloud GPU providers offer high-memory instances that can run it without the upfront cost of a Mac Ultra or multi-GPU rig.
For most developers, Scout is the right choice. It runs on hardware you probably already have and the 10M context capability (even if you only use a fraction of it) is unique among self-hostable models.
Connect to your IDE
# Start Ollama with Scout
ollama run llama4-scout
# In VS Code with Continue, add to config:
# Provider: Ollama
# Model: llama4-scout
Llama 4 vs Qwen 3.5 for self-hosting
Both are strong self-hosted options. Key differences:
- Context: Llama 4 Scout wins (10M vs 256K)
- Benchmarks: Qwen 3.5 wins on most tasks
- Model sizes: Qwen has more options (0.8B to 397B)
- License: Qwen is Apache 2.0 (more permissive), Llama has Metaβs license (700M MAU limit)
- Multimodal: Both support vision natively
If context length is your priority, go Llama 4 Scout. For everything else, Qwen 3.5 is generally stronger.
Related
- Best Self-Hosted AI Models in 2026
- How to Run Qwen 3.5 Locally
- Best Open-Source AI Model in 2026
- Ollama vs llama.cpp vs vLLM β Which Should You Use?
Related: How To Run Llama 4 Maverick Locally