Llama 4 Complete Guide: Scout, Maverick, and Behemoth Explained (2026)
Meta's Llama 4, released April 2025, brought two major innovations to open-weight AI: a mixture-of-experts (MoE) architecture for efficient inference, and a 10-million-token context window on Scout, the longest of any publicly released model. Maverick competes with GPT-4o and Gemini 2.0 Flash on benchmarks while being fully open-weight.
The Llama 4 family
| Model | Total params | Active params | Context | Architecture | Status |
|---|---|---|---|---|---|
| Scout | 109B | 17B | 10M tokens | 16 experts, 1 active | ✅ Released |
| Maverick | 400B | 17B | 1M tokens | 128 experts, 1 active | ✅ Released |
| Behemoth | 2T+ | ~288B | TBD | MoE | 🔄 Training |
The MoE architecture is key: Maverick has 400B total parameters but only activates 17B per token. This means it runs at roughly the cost and speed of a 17B model while having the knowledge of a 400B model.
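The "activate only a few parameters per token" idea can be sketched in a few lines. This is a toy top-1 router, not Meta's actual implementation (Llama 4 also uses a shared expert and softmax-weighted gating), and all sizes here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts = 8, 16   # toy sizes; Scout uses 16 experts

# Each "expert" is a feed-forward weight matrix; a learned router picks one per token
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts))

def moe_layer(x):
    """Top-1 routing: each token activates exactly one expert."""
    scores = x @ router                  # (n_tokens, n_experts) gating scores
    chosen = scores.argmax(axis=-1)      # one expert index per token
    out = np.empty_like(x)
    for i, e in enumerate(chosen):
        out[i] = experts[e] @ x[i]       # only the chosen expert's weights run
    return out

tokens = rng.standard_normal((4, d_model))
activations = moe_layer(tokens)
```

The total parameter count grows with the number of experts, but the per-token compute stays constant: that is why Maverick's 400B parameters cost roughly as much to run as a 17B dense model.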
Scout: the long-context specialist
Scout offers 10 million tokens of context, enough to process entire codebases, book-length documents, or months of conversation history in a single request. It uses 16 experts with 1 active per token, and Meta states it fits on a single H100 GPU with Int4 quantization.
Best for: large codebase analysis, document processing, long-form research.
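To gauge whether a codebase actually fits in Scout's window, a rough character-count estimate is usually enough. This sketch assumes the common ~4-characters-per-token heuristic for code and English text; for exact counts you would use the model's real tokenizer:

```python
import os

CHARS_PER_TOKEN = 4  # rough heuristic: ~4 characters per token

def estimate_tokens(root, exts=(".py", ".js", ".md")):
    """Rough token count for a source tree, to sanity-check
    whether it fits in Scout's 10M-token window."""
    total_chars = 0
    for dirpath, _, files in os.walk(root):
        for name in files:
            if name.endswith(exts):
                total_chars += os.path.getsize(os.path.join(dirpath, name))
    return total_chars // CHARS_PER_TOKEN
```

By this heuristic, 10M tokens corresponds to roughly 40MB of source text, which covers all but the largest monorepos.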
Maverick: the generalist
Maverick is the flagship. It surpassed 1400 on LMArena, outperforming GPT-4o, Gemini 2.0 Flash, and DeepSeek V3 at launch. With 128 experts and 17B active parameters, it balances quality and efficiency.
Best for: general coding, chat, reasoning, multimodal tasks.
Behemoth: the teacher (unreleased)
Still in training. Behemoth is a 2T+ parameter model designed as a teacher for distilling knowledge into smaller models. Not intended for direct deployment.
Benchmarks
| Benchmark | Maverick | Scout | GPT-4o | Gemini 2.0 Flash |
|---|---|---|---|---|
| LMArena | 1400+ | 1350+ | 1380 | 1370 |
| MMLU | 85.2% | 82.1% | 86.5% | 84.3% |
| HumanEval | 82.4% | 78.1% | 84.2% | 80.5% |
| Context length | 1M | 10M | 128K | 1M |
Maverick is competitive with frontier proprietary models. Scout trades some quality for the massive context window.
Running Llama 4 locally
Scout (needs ~64GB RAM with Q4 quantization)
```shell
# With Ollama (the default tag pulls Q4_K_M weights, roughly 65GB)
ollama pull llama4:scout
ollama run llama4:scout "Analyze this codebase for architectural issues"
```
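Once the model is pulled, you can also drive it programmatically through Ollama's local REST API (it listens on port 11434 by default). A minimal sketch, assuming an Ollama server is running locally and the `llama4:scout` tag is installed:

```python
import json
import urllib.request

def build_payload(prompt, model="llama4:scout"):
    """Request body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt, model="llama4:scout", host="http://localhost:11434"):
    """Send a one-shot prompt to a locally running Ollama server."""
    data = json.dumps(build_payload(prompt, model)).encode()
    req = urllib.request.Request(
        f"{host}/api/generate", data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Setting `"stream": False` returns the whole completion in one JSON object; leave it out if you want token-by-token streaming.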
Maverick (needs GPU cluster or heavy quantization)
```shell
# Maverick is large: even at Q4 quantization the weights are
# roughly 240GB, which rules out consumer hardware
ollama pull llama4:maverick

# Better option for most developers: use it via API
# (available on OpenRouter, Together AI, Fireworks AI)
```
For hardware requirements, see our VRAM guide. For Ollama setup, see our Ollama complete guide.
API access
Llama 4 is available through multiple providers since it's open-weight:
| Provider | Scout price | Maverick price | Notes |
|---|---|---|---|
| Together AI | ~$0.10/1M | ~$0.49/1M | Cheapest |
| Fireworks AI | ~$0.15/1M | ~$0.50/1M | Fast |
| OpenRouter | Varies | Varies | Routes to cheapest |
| AWS Bedrock | ~$0.20/1M | ~$0.65/1M | Enterprise |
| Self-hosted | GPU cost only | GPU cost only | Full control |
At $0.49/1M tokens for Maverick, it's roughly 1/5th the cost of GPT-5.4 ($2.50/1M) for comparable quality on many tasks.
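The cost gap is easy to make concrete. A back-of-the-envelope monthly estimate, using the per-million-token prices above (volumes here are illustrative):

```python
def monthly_cost(tokens_per_day, price_per_million):
    """Approximate monthly API spend, assuming a 30-day month."""
    return tokens_per_day * 30 * price_per_million / 1_000_000

# 10M tokens/day at Maverick's ~$0.49/1M vs GPT-5.4's $2.50/1M
maverick = monthly_cost(10_000_000, 0.49)  # ~$147/month
gpt = monthly_cost(10_000_000, 2.50)       # ~$750/month
```

At that volume the difference is about $600/month, which compounds quickly for agentic workloads that burn tokens continuously.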
Llama 4 vs the competition
| | Llama 4 Maverick | GPT-5.4 | Claude Sonnet | GLM-5.1 |
|---|---|---|---|---|
| License | Llama license (open) | Proprietary | Proprietary | MIT (open) |
| Self-host | ✅ | ❌ | ❌ | ✅ |
| API cost | ~$0.49/1M | $2.50/1M | $3.00/1M | Free tier |
| Context | 1M | 1M | 1M | 128K |
| Coding | Good | Good | Best | Best (open) |
| Reasoning | Good | Good | Good | Good |
Llama 4's advantage: open-weight + cheap API + self-hostable. Its disadvantage: it's not the best at any single task, but a strong generalist.
Scout vs Maverick: which to use
| If you need… | Use |
|---|---|
| Maximum context (10M tokens) | Scout |
| Best overall quality | Maverick |
| Cheapest inference | Scout (fewer active params) |
| Coding tasks | Maverick |
| Document analysis | Scout |
| Running locally on 64GB+ RAM | Scout (with Q4 quantization) |
| Running on a multi-GPU server | Maverick (with quantization) |
For most developers, Maverick is the default choice. Use Scout when you specifically need the massive context window.
The Llama license
Llama 4 uses Meta's custom license, not MIT or Apache. Key points:
- Free for commercial use under 700M monthly active users
- Must include Meta's attribution
- Cannot use to train competing models (without permission)
- More restrictive than GLM-5.1 (MIT) or Qwen 3.5 (Apache 2.0)
For most startups and individual developers, the license is fine. For enterprises, review the terms carefully.
Multimodal capabilities
Both Scout and Maverick natively understand images and text. No separate vision model needed:
```python
from openai import OpenAI

client = OpenAI(base_url="https://api.together.xyz/v1", api_key="...")

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this screenshot?"},
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        ],
    }],
)
```
This makes Llama 4 useful for UI analysis, document OCR, and visual debugging β tasks that previously required separate models.
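In practice the image usually starts as raw bytes (a screenshot, a scanned page), so you need to wrap it as a base64 data URL before building the message. A small helper sketch, following the OpenAI-style content-part format used above (function names here are illustrative):

```python
import base64

def image_data_url(png_bytes):
    """Encode raw PNG bytes as the base64 data URL the API expects."""
    return "data:image/png;base64," + base64.b64encode(png_bytes).decode()

def vision_message(question, png_bytes):
    """Build a user message mixing a text part and an image part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": image_data_url(png_bytes)}},
        ],
    }
```

The returned dict drops straight into the `messages` list of the chat-completions call shown above.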
FAQ
Is Llama 4 free?
Yes, Llama 4 is free to download and use under Meta's Llama license, which permits commercial use for organizations with fewer than 700 million monthly active users. You can self-host it at no cost beyond your own hardware, or access it cheaply through API providers like Together AI starting at ~$0.10 per million tokens.
Can I run Llama 4 locally?
Yes, Scout can run locally on a machine with 64GB+ RAM using Q4 quantization through Ollama; the quantized weights are roughly 65GB. Maverick's weights come to roughly 240GB even at Q4, making it more practical to access via API for most developers.
How does Llama 4 compare to GPT-5?
Llama 4 Maverick is competitive with GPT-4o on benchmarks but falls short of GPT-5.4 on most tasks, particularly computer use and complex reasoning. However, Maverick costs roughly 1/5th the price of GPT-5.4 via API and can be self-hosted, making it a strong choice when budget or data privacy matters more than peak performance.
Whatβs the difference between Scout and Maverick?
Scout has a 10 million token context window (the longest of any public model) with 109B total parameters across 16 experts, making it ideal for processing entire codebases or long documents. Maverick has a 1 million token context with 400B total parameters across 128 experts, delivering higher quality output and better benchmark scores for general coding and reasoning tasks.
Related: How to Run Llama 4 Locally · How to Run Llama 4 Maverick Locally · Gemma 4 vs Llama 4 vs Qwen 3.5 · Falcon vs Llama vs Qwen · Best Free AI Models · Best Open Source Coding Models · VRAM Guide