
Llama 4 Complete Guide: Scout, Maverick, and Behemoth Explained (2026)


Meta’s Llama 4, released in April 2025, brought two major innovations to open-weight AI: a mixture-of-experts (MoE) architecture for efficient inference, and a 10 million token context window on Scout, the longest of any publicly released model. Maverick competes with GPT-4o and Gemini 2.0 Flash on benchmarks while being fully open-weight.

The Llama 4 family

| Model | Total params | Active params | Context | Architecture | Status |
|---|---|---|---|---|---|
| Scout | 109B | 17B | 10M tokens | 16 experts, 1 active | βœ… Released |
| Maverick | 400B | 17B | 1M tokens | 128 experts, 1 active | βœ… Released |
| Behemoth | 2T+ | ~288B | TBD | MoE | πŸ”„ Training |

The MoE architecture is key: Maverick has 400B total parameters but only activates 17B per token. This means it runs at roughly the cost and speed of a 17B model while having the knowledge of a 400B model.
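A toy top-1 router makes this concrete. This is an illustrative NumPy sketch, not Meta's implementation: a linear router scores every expert for each token, and only the winning expert's weights are actually multiplied, so compute scales with active parameters while total parameters scale with the expert count.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_experts = 8, 4  # toy sizes; Maverick uses 128 experts

# Each "expert" is a small feed-forward weight matrix; the router is a linear scorer.
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts))

def moe_forward(token_vec):
    """Route a token to its top-1 expert; only that expert's weights run."""
    scores = token_vec @ router      # one score per expert
    winner = int(np.argmax(scores))  # top-1 routing
    return experts[winner] @ token_vec, winner

out, chosen = moe_forward(rng.standard_normal(d_model))
# Only 1 of the 4 expert matrices was touched for this token: per-token cost
# looks like a dense 17B model even though the full network stores 400B.
```

Real MoE layers add load-balancing losses and (in Llama 4's case) a shared expert alongside the routed one, but the cost argument is the same.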

Scout: the long-context specialist

10 million tokens of context β€” enough to process entire codebases, book-length documents, or months of conversation history in a single request. Scout uses 16 experts with 1 active, making it efficient enough to run on 4x A100 GPUs.

Best for: large codebase analysis, document processing, long-form research.
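A 10M-token window often means skipping chunking and retrieval entirely. A minimal sketch of packing a whole repository into a single prompt (file-selection logic is illustrative; sending it to a Scout endpoint is left to whichever provider you use):

```python
import pathlib

def build_repo_prompt(root: str, exts=(".py", ".md")) -> str:
    """Concatenate every matching file into one prompt for a long-context model."""
    parts = []
    for path in sorted(pathlib.Path(root).rglob("*")):
        if path.is_file() and path.suffix in exts:
            parts.append(f"### {path}\n{path.read_text(errors='ignore')}")
    return "\n\n".join(parts)

# With Scout's 10M-token window, even large repos can fit in one request;
# send the result as a single user message instead of building a RAG pipeline.
prompt = build_repo_prompt(".") + "\n\nAnalyze this codebase for architectural issues."
```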

Maverick: the generalist

Maverick is the flagship. It surpassed 1400 on LMArena, outperforming GPT-4o, Gemini 2.0 Flash, and DeepSeek V3 at launch. With 128 experts and 17B active parameters, it balances quality and efficiency.

Best for: general coding, chat, reasoning, multimodal tasks.

Behemoth: the teacher (unreleased)

Still in training. Behemoth is a 2T+ parameter model designed as a teacher for distilling knowledge into smaller models. Not intended for direct deployment.

Benchmarks

| Benchmark | Maverick | Scout | GPT-4o | Gemini 2.0 Flash |
|---|---|---|---|---|
| LMArena | 1400+ | 1350+ | 1380 | 1370 |
| MMLU | 85.2% | 82.1% | 86.5% | 84.3% |
| HumanEval | 82.4% | 78.1% | 84.2% | 80.5% |
| Context length | 1M | 10M | 128K | 1M |

Maverick is competitive with frontier proprietary models. Scout trades some quality for the massive context window.

Running Llama 4 locally

Scout (needs ~32GB RAM with quantization)

```shell
# With Ollama
ollama pull llama4-scout
ollama run llama4-scout "Analyze this codebase for architectural issues"

# With specific quantization
ollama pull llama4-scout:q4_k_m  # ~25GB, fits in 32GB RAM
```
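Once pulled, the model can also be called programmatically through Ollama's local REST API (`/api/generate` on port 11434). A small sketch, assuming `ollama serve` is running and the `llama4-scout` tag from the commands above is installed:

```python
import json
import urllib.request

def ollama_generate(prompt: str, model: str = "llama4-scout",
                    host: str = "http://localhost:11434") -> str:
    """Call a locally running Ollama server's /api/generate endpoint."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(f"{host}/api/generate", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Requires a running Ollama server with the model pulled:
# print(ollama_generate("Summarize the MoE architecture in one sentence."))
```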

Maverick (needs GPU cluster or heavy quantization)

```shell
# Maverick is large β€” needs quantization for consumer hardware
ollama pull llama4-maverick:q4_k_m  # ~60GB, needs 64GB+ RAM

# Better option: use via API
# Available on OpenRouter, Together AI, Fireworks AI
```

For hardware requirements, see our VRAM guide. For Ollama setup, see our Ollama complete guide.

API access

Llama 4 is available through multiple providers since it’s open-weight:

| Provider | Scout price | Maverick price | Notes |
|---|---|---|---|
| Together AI | ~$0.10/1M | ~$0.49/1M | Cheapest |
| Fireworks AI | ~$0.15/1M | ~$0.50/1M | Fast |
| OpenRouter | Varies | Varies | Routes to cheapest |
| AWS Bedrock | ~$0.20/1M | ~$0.65/1M | Enterprise |
| Self-hosted | GPU cost only | GPU cost only | Full control |

At $0.49/1M tokens for Maverick, it’s roughly 1/5th the cost of GPT-5.4 ($2.50/1M) for comparable quality on many tasks.
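The gap compounds at volume. A back-of-envelope calculation using the per-million prices above (the 50M tokens/month workload is an illustrative assumption):

```python
# Monthly cost at 50M tokens/month, using the per-million prices quoted above.
PRICES_PER_M = {"Maverick (Together AI)": 0.49, "GPT-5.4": 2.50}

monthly_tokens_m = 50  # 50 million tokens
costs = {name: price * monthly_tokens_m for name, price in PRICES_PER_M.items()}
# Maverick: $24.50/month vs GPT-5.4: $125.00/month, i.e. roughly 5x cheaper.
savings_ratio = costs["GPT-5.4"] / costs["Maverick (Together AI)"]
```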

Llama 4 vs the competition

| | Llama 4 Maverick | GPT-5.4 | Claude Sonnet | GLM-5.1 |
|---|---|---|---|---|
| License | Llama license (open) | Proprietary | Proprietary | MIT (open) |
| Self-host | βœ… | ❌ | ❌ | βœ… |
| API cost | ~$0.49/1M | $2.50/1M | $3.00/1M | Free tier |
| Context | 1M | 1M | 1M | 128K |
| Coding | Good | Good | Best | Best (open) |
| Reasoning | Good | Good | Good | Good |

Llama 4’s advantage: open-weight + cheap API + self-hostable. Its disadvantage: not the best at any single task β€” it’s a strong generalist.

Scout vs Maverick: which to use

| If you need… | Use |
|---|---|
| Maximum context (10M tokens) | Scout |
| Best overall quality | Maverick |
| Cheapest inference | Scout (smaller total model, cheaper to host) |
| Coding tasks | Maverick |
| Document analysis | Scout |
| Running locally on 32GB RAM | Scout (with quantization) |
| Running locally on 64GB+ RAM | Maverick (with quantization) |

For most developers, Maverick is the default choice. Use Scout when you specifically need the massive context window.

The Llama license

Llama 4 uses Meta’s custom license, not MIT or Apache. Key points:

  • Free for commercial use under 700M monthly active users
  • Must include Meta’s attribution
  • Cannot use to train competing models (without permission)
  • More restrictive than GLM-5.1 (MIT) or Qwen 3.5 (Apache 2.0)

For most startups and individual developers, the license is fine. For enterprises, review the terms carefully.

Multimodal capabilities

Both Scout and Maverick natively understand images and text. No separate vision model needed:

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.together.xyz/v1", api_key="...")

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this screenshot?"},
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        ],
    }],
)
```

This makes Llama 4 useful for UI analysis, document OCR, and visual debugging β€” tasks that previously required separate models.
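The `image_url` field in the example above expects a base64 data URL. Building one from a local file is straightforward (the helper name is our own; the data-URL format itself is standard):

```python
import base64

def to_data_url(path: str, mime: str = "image/png") -> str:
    """Encode a local image file as a base64 data URL for an image_url content part."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{b64}"

# Usage: {"type": "image_url", "image_url": {"url": to_data_url("screenshot.png")}}
```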

FAQ

Is Llama 4 free?

Yes, Llama 4 is free to download and use under Meta’s Llama license, which permits commercial use for organizations with fewer than 700 million monthly active users. You can self-host it at no cost beyond your own hardware, or access it cheaply through API providers like Together AI starting at ~$0.10 per million tokens.

Can I run Llama 4 locally?

Yes, Scout can run locally on a machine with 32GB+ RAM using Q4 quantization through Ollama. Maverick requires 64GB+ RAM with heavy quantization, making it more practical to access via API for most developers.

How does Llama 4 compare to GPT-5?

Llama 4 Maverick is competitive with GPT-4o on benchmarks but falls short of GPT-5.4 on most tasks, particularly computer use and complex reasoning. However, Maverick costs roughly 1/5th the price of GPT-5.4 via API and can be self-hosted, making it a strong choice when budget or data privacy matters more than peak performance.

What’s the difference between Scout and Maverick?

Scout has a 10 million token context window (the longest of any public model) with 109B total parameters across 16 experts, making it ideal for processing entire codebases or long documents. Maverick has a 1 million token context with 400B total parameters across 128 experts, delivering higher quality output and better benchmark scores for general coding and reasoning tasks.

Related: How to Run Llama 4 Locally Β· How to Run Llama 4 Maverick Locally Β· Gemma 4 vs Llama 4 vs Qwen 3.5 Β· Falcon vs Llama vs Qwen Β· Best Free AI Models Β· Best Open Source Coding Models Β· VRAM Guide