Llama 4 Complete Guide: Scout, Maverick, and Behemoth Explained (2026)
Meta's Llama 4, released April 2025, brought two major innovations to open-weight AI: a mixture-of-experts (MoE) architecture for efficient inference, and a 10-million-token context window on Scout, the longest of any publicly released model. Maverick competes with GPT-4o and Gemini 2.0 Flash on benchmarks while being fully open-weight.
The Llama 4 family
| Model | Total params | Active params | Context | Architecture | Status |
|---|---|---|---|---|---|
| Scout | 109B | 17B | 10M tokens | 16 experts, 1 active | ✅ Released |
| Maverick | 400B | 17B | 1M tokens | 128 experts, 1 active | ✅ Released |
| Behemoth | 2T+ | ~288B | TBD | MoE | 🔄 Training |
The MoE architecture is key: Maverick has 400B total parameters but only activates 17B per token. This means it runs at roughly the cost and speed of a 17B model while having the knowledge of a 400B model.
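The "activate only a few parameters per token" idea can be sketched in a few lines. This is a toy top-1 router, not Meta's actual implementation (Llama 4 also uses a shared expert and softmax-weighted gating), and all sizes here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts = 8, 16   # toy sizes; Scout uses 16 experts

# Each "expert" is a feed-forward weight matrix; a learned router picks one per token
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts))

def moe_layer(x):
    """Top-1 routing: each token activates exactly one expert."""
    scores = x @ router                  # (n_tokens, n_experts) gating scores
    chosen = scores.argmax(axis=-1)      # one expert index per token
    out = np.empty_like(x)
    for i, e in enumerate(chosen):
        out[i] = experts[e] @ x[i]       # only the chosen expert's weights run
    return out

tokens = rng.standard_normal((4, d_model))
activations = moe_layer(tokens)
```

The total parameter count grows with the number of experts, but the per-token compute stays constant: that is why Maverick's 400B parameters cost roughly as much to run as a 17B dense model.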
Scout: the long-context specialist
Scout offers 10 million tokens of context, enough to process entire codebases, book-length documents, or months of conversation history in a single request. It uses 16 experts with 1 active per token, and Meta states it fits on a single H100 GPU with Int4 quantization.
Best for: large codebase analysis, document processing, long-form research.
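To gauge whether a codebase actually fits in Scout's window, a rough character-count estimate is usually enough. This sketch assumes the common ~4-characters-per-token heuristic for code and English text; for exact counts you would use the model's real tokenizer:

```python
import os

CHARS_PER_TOKEN = 4  # rough heuristic: ~4 characters per token

def estimate_tokens(root, exts=(".py", ".js", ".md")):
    """Rough token count for a source tree, to sanity-check
    whether it fits in Scout's 10M-token window."""
    total_chars = 0
    for dirpath, _, files in os.walk(root):
        for name in files:
            if name.endswith(exts):
                total_chars += os.path.getsize(os.path.join(dirpath, name))
    return total_chars // CHARS_PER_TOKEN
```

By this heuristic, 10M tokens corresponds to roughly 40MB of source text, which covers all but the largest monorepos.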
Maverick: the generalist
Maverick is the flagship. It surpassed 1400 on LMArena, outperforming GPT-4o, Gemini 2.0 Flash, and DeepSeek V3 at launch. With 128 experts and 17B active parameters, it balances quality and efficiency.
Best for: general coding, chat, reasoning, multimodal tasks.
Behemoth: the teacher (unreleased)
Still in training. Behemoth is a 2T+ parameter model designed as a teacher for distilling knowledge into smaller models. Not intended for direct deployment.
Benchmarks
| Benchmark | Maverick | Scout | GPT-4o | Gemini 2.0 Flash |
|---|---|---|---|---|
| LMArena | 1400+ | 1350+ | 1380 | 1370 |
| MMLU | 85.2% | 82.1% | 86.5% | 84.3% |
| HumanEval | 82.4% | 78.1% | 84.2% | 80.5% |
| Context length | 1M | 10M | 128K | 1M |
Maverick is competitive with frontier proprietary models. Scout trades some quality for the massive context window.
Running Llama 4 locally
Scout (needs ~64GB RAM with Q4 quantization)
```shell
# With Ollama (the default tag pulls Q4_K_M weights, roughly 65GB)
ollama pull llama4:scout
ollama run llama4:scout "Analyze this codebase for architectural issues"
```
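Once the model is pulled, you can also drive it programmatically through Ollama's local REST API (it listens on port 11434 by default). A minimal sketch, assuming an Ollama server is running locally and the `llama4:scout` tag is installed:

```python
import json
import urllib.request

def build_payload(prompt, model="llama4:scout"):
    """Request body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt, model="llama4:scout", host="http://localhost:11434"):
    """Send a one-shot prompt to a locally running Ollama server."""
    data = json.dumps(build_payload(prompt, model)).encode()
    req = urllib.request.Request(
        f"{host}/api/generate", data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Setting `"stream": False` returns the whole completion in one JSON object; leave it out if you want token-by-token streaming.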
Maverick (needs GPU cluster or heavy quantization)
```shell
# Maverick is large: even at Q4 quantization the weights are
# roughly 240GB, which rules out consumer hardware
ollama pull llama4:maverick

# Better option for most developers: use it via API
# (available on OpenRouter, Together AI, Fireworks AI)
```
For hardware requirements, see our VRAM guide. For Ollama setup, see our Ollama complete guide.
API access
Llama 4 is available through multiple providers since it's open-weight:
| Provider | Scout price | Maverick price | Notes |
|---|---|---|---|
| Together AI | ~$0.10/1M | ~$0.49/1M | Cheapest |
| Fireworks AI | ~$0.15/1M | ~$0.50/1M | Fast |
| OpenRouter | Varies | Varies | Routes to cheapest |
| AWS Bedrock | ~$0.20/1M | ~$0.65/1M | Enterprise |
| Self-hosted | GPU cost only | GPU cost only | Full control |
At $0.49/1M tokens for Maverick, it's roughly 1/5th the cost of GPT-5.4 ($2.50/1M) for comparable quality on many tasks.
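The cost gap is easy to make concrete. A back-of-the-envelope monthly estimate, using the per-million-token prices above (volumes here are illustrative):

```python
def monthly_cost(tokens_per_day, price_per_million):
    """Approximate monthly API spend, assuming a 30-day month."""
    return tokens_per_day * 30 * price_per_million / 1_000_000

# 10M tokens/day at Maverick's ~$0.49/1M vs GPT-5.4's $2.50/1M
maverick = monthly_cost(10_000_000, 0.49)  # ~$147/month
gpt = monthly_cost(10_000_000, 2.50)       # ~$750/month
```

At that volume the difference is about $600/month, which compounds quickly for agentic workloads that burn tokens continuously.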
Llama 4 vs the competition
| | Llama 4 Maverick | GPT-5.4 | Claude Sonnet | GLM-5.1 |
|---|---|---|---|---|
| License | Llama license (open) | Proprietary | Proprietary | MIT (open) |
| Self-host | ✅ | ❌ | ❌ | ✅ |
| API cost | ~$0.49/1M | $2.50/1M | $3.00/1M | Free tier |
| Context | 1M | 1M | 1M | 128K |
| Coding | Good | Good | Best | Best (open) |
| Reasoning | Good | Good | Good | Good |
Llama 4's advantage: open-weight + cheap API + self-hostable. Its disadvantage: it's not the best at any single task, but a strong generalist.
Scout vs Maverick: which to use
| If you need… | Use |
|---|---|
| Maximum context (10M tokens) | Scout |
| Best overall quality | Maverick |
| Cheapest inference | Scout (fewer active params) |
| Coding tasks | Maverick |
| Document analysis | Scout |
| Running locally on 64GB+ RAM | Scout (with Q4 quantization) |
| Running on a multi-GPU server | Maverick (with quantization) |
For most developers, Maverick is the default choice. Use Scout when you specifically need the massive context window.
The Llama license
Llama 4 uses Meta's custom license, not MIT or Apache. Key points:
- Free for commercial use under 700M monthly active users
- Must include Meta's attribution
- Cannot use to train competing models (without permission)
- More restrictive than GLM-5.1 (MIT) or Qwen 3.5 (Apache 2.0)
For most startups and individual developers, the license is fine. For enterprises, review the terms carefully.
Multimodal capabilities
Both Scout and Maverick natively understand images and text. No separate vision model needed:
```python
from openai import OpenAI

client = OpenAI(base_url="https://api.together.xyz/v1", api_key="...")

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this screenshot?"},
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        ],
    }],
)
```
This makes Llama 4 useful for UI analysis, document OCR, and visual debugging β tasks that previously required separate models.
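In practice the image usually starts as raw bytes (a screenshot, a scanned page), so you need to wrap it as a base64 data URL before building the message. A small helper sketch, following the OpenAI-style content-part format used above (function names here are illustrative):

```python
import base64

def image_data_url(png_bytes):
    """Encode raw PNG bytes as the base64 data URL the API expects."""
    return "data:image/png;base64," + base64.b64encode(png_bytes).decode()

def vision_message(question, png_bytes):
    """Build a user message mixing a text part and an image part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": image_data_url(png_bytes)}},
        ],
    }
```

The returned dict drops straight into the `messages` list of the chat-completions call shown above.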
FAQ
Is Llama 4 free?
Yes, Llama 4 is free to download and use under Meta's Llama license, which permits commercial use for organizations with fewer than 700 million monthly active users. You can self-host it at no cost beyond your own hardware, or access it cheaply through API providers like Together AI starting at ~$0.10 per million tokens.
Can I run Llama 4 locally?
Yes, Scout can run locally on a machine with 64GB+ RAM using Q4 quantization through Ollama; the quantized weights are roughly 65GB. Maverick's weights come to roughly 240GB even at Q4, making it more practical to access via API for most developers.
How does Llama 4 compare to GPT-5?
Llama 4 Maverick is competitive with GPT-4o on benchmarks but falls short of GPT-5.4 on most tasks, particularly computer use and complex reasoning. However, Maverick costs roughly 1/5th the price of GPT-5.4 via API and can be self-hosted, making it a strong choice when budget or data privacy matters more than peak performance.
Whatβs the difference between Scout and Maverick?
Scout has a 10 million token context window (the longest of any public model) with 109B total parameters across 16 experts, making it ideal for processing entire codebases or long documents. Maverick has a 1 million token context with 400B total parameters across 128 experts, delivering higher quality output and better benchmark scores for general coding and reasoning tasks.
Related: How to Run Llama 4 Locally · How to Run Llama 4 Maverick Locally · Gemma 4 vs Llama 4 vs Qwen 3.5 · Falcon vs Llama vs Qwen · Best Free AI Models · Best Open Source Coding Models · VRAM Guide