DeepSeek V4 Flash Complete Guide: 284B MoE, 13B Active, $0.28/1M Output (2026)
DeepSeek V4 Flash is the cheapest frontier model you can use right now. It packs 284 billion total parameters into a Mixture-of-Experts architecture that only activates 13 billion per forward pass. The result: frontier-class performance at a fraction of the cost.
The numbers speak for themselves. V4 Flash handles 1 million tokens of context, ships under an MIT license, and costs just $0.14 per million input tokens (cache miss) and $0.28 per million output tokens. In its most powerful reasoning configuration, Flash Max mode, it scores 79.0% on SWE-bench Verified. That puts it in the same league as models costing 10 to 100 times more.
Whether you are building production applications on a budget, self-hosting on modest hardware, or looking for an open-weight model that punches well above its weight class, V4 Flash deserves your attention. This guide covers everything you need to know.
For the full-power variant, see our V4 Pro guide. For API setup instructions, check the V4 API guide.
Architecture Deep Dive
V4 Flash builds on the MoE foundation that made DeepSeek competitive with closed-source labs, but pushes efficiency further than any prior release.
Core Specifications
- Total parameters: 284 billion
- Active parameters per forward pass: 13 billion
- Layers: 43 transformer layers
- Hidden dimension: 4096
- Routed experts: 256
- Active experts per token: 6
- Training data: 32 trillion tokens
- Context window: 1 million tokens
- License: MIT
Hybrid Attention: CSA + HCA
V4 Flash introduces a hybrid attention mechanism that combines Compressed Shared Attention (CSA) with Hierarchical Chunked Attention (HCA). This is the key architectural innovation that separates it from V3.2.
CSA compresses key-value representations across attention heads, reducing redundant storage. HCA splits long sequences into hierarchical chunks, allowing the model to attend to distant context without the quadratic cost of full attention.
The practical impact at 1 million token context length:
- FLOPs: 10% of what V3.2 requires for the same context
- KV cache memory: 7% of V3.2's footprint
This means you can actually use the full 1M context window in production without needing a data center. The efficiency gains compound at longer contexts, making V4 Flash the first model where million-token inference is genuinely affordable.
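To get a feel for why the 7% figure matters, here is a back-of-envelope KV cache estimate at 1M-token context using the specs above (43 layers, hidden dimension 4096). This assumes a plain FP16 key-value cache sized directly from those dimensions; the model's actual head and KV dimensions are not public, so treat it as an order-of-magnitude sketch, not a spec.

```python
# Back-of-envelope KV cache estimate at 1M-token context.
# Assumes a naive FP16 KV cache sized from the published specs
# (43 layers, hidden dim 4096). Real head/KV dims are not public,
# so this is an order-of-magnitude sketch only.

LAYERS = 43
HIDDEN = 4096
BYTES_FP16 = 2
TOKENS = 1_000_000

# Keys + values: one vector of size HIDDEN per layer per token.
baseline_bytes = 2 * LAYERS * HIDDEN * BYTES_FP16 * TOKENS
baseline_gb = baseline_bytes / 1024**3

# The article cites a KV footprint of ~7% of V3.2's.
flash_gb = baseline_gb * 0.07

print(f"naive FP16 KV cache: {baseline_gb:.0f} GB")
print(f"at ~7% footprint:    {flash_gb:.1f} GB")
```

Even under these rough assumptions, the difference is between "needs a multi-node cluster just for the cache" and "fits alongside the weights on one node."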
Expert Routing
With 256 routed experts and only 6 active per token, V4 Flash achieves extreme sparsity. Each token is routed to the 6 most relevant experts based on a learned gating function. This keeps compute costs low while maintaining access to the full 284B parameter knowledge base.
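The gating step described above can be sketched in a few lines. The real gating network's architecture and load-balancing details are not public; this only shows the mechanics every top-k MoE router shares: score all experts, keep the top k, and renormalize their weights with a softmax.

```python
# Minimal sketch of top-k expert routing (6 of 256), as described above.
# The actual gating network is not public; this shows only the generic
# top-k mechanics: score every expert, keep the k best, renormalize.
import numpy as np

def route(token_hidden: np.ndarray, gate_w: np.ndarray, k: int = 6):
    """Return (expert_indices, expert_weights) for one token."""
    scores = token_hidden @ gate_w               # (num_experts,) gating logits
    top = np.argsort(scores)[-k:][::-1]          # indices of the k best experts
    w = np.exp(scores[top] - scores[top].max())  # stable softmax over the k
    return top, w / w.sum()

rng = np.random.default_rng(0)
hidden, experts = 4096, 256
idx, weights = route(rng.standard_normal(hidden),
                     rng.standard_normal((hidden, experts)))
print(idx, weights)
```

Only the 6 selected experts' feed-forward blocks run for that token, which is where the 13B-active-of-284B-total sparsity comes from.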
The routing mechanism has been refined from V3.2 to reduce expert load imbalance, which previously caused some experts to be over-utilized while others sat idle.
Three Reasoning Modes
V4 Flash ships with three distinct reasoning modes, each trading off speed against depth of reasoning. You select the mode via the API or by adjusting the thinking budget parameter.
| Mode | Thinking Budget | Speed | Best For |
|---|---|---|---|
| Flash Non-Think | None (direct) | Fastest | Simple queries, classification, extraction |
| Flash High | Medium | Balanced | General coding, analysis, multi-step tasks |
| Flash Max | Maximum | Slowest | Hard math, competitive programming, complex debugging |
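In practice, mode selection is just a field on the request. The field names below (`model`, `reasoning_mode`) are illustrative placeholders, not confirmed API parameters; check the V4 API guide for the real schema before shipping anything.

```python
# Hypothetical request payloads for the three reasoning modes.
# The parameter names ("model", "reasoning_mode") are illustrative
# assumptions -- consult the V4 API guide for the real field names.

def build_request(prompt: str, mode: str) -> dict:
    assert mode in {"non-think", "high", "max"}
    return {
        "model": "deepseek-v4-flash",
        "reasoning_mode": mode,  # hypothetical field controlling thinking budget
        "messages": [{"role": "user", "content": prompt}],
    }

fast = build_request("Classify this ticket as bug or feature.", "non-think")
deep = build_request("Find the race condition in this queue code.", "max")
```

The point is that mode is a per-request decision, so a single deployment can serve cheap classification traffic and expensive debugging traffic from the same model.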
Benchmark Comparison Across Modes
| Benchmark | Flash Non-Think | Flash High | Flash Max |
|---|---|---|---|
| SWE-bench Verified | 51.2% | 68.4% | 79.0% |
| AIME 2025 | 42.8% | 71.3% | 85.6% |
| GPQA Diamond | 54.1% | 65.7% | 72.3% |
| HumanEval+ | 84.2% | 89.7% | 92.1% |
| LiveCodeBench v5 | 48.9% | 63.2% | 74.8% |
| MATH-500 | 88.4% | 94.1% | 97.2% |
The jump from Non-Think to Max is substantial. Flash Non-Think is competitive with GPT-4o-class models on most tasks, while Flash Max approaches frontier reasoning performance. The flexibility to switch modes per request makes V4 Flash uniquely versatile for mixed workloads.
Benchmarks vs V4 Pro and Frontier Models
How does V4 Flash stack up against its bigger sibling and the top closed-source models? Here is the comparison using Flash Max mode, which represents V4 Flash at full power.
| Benchmark | V4 Flash Max | V4 Pro | GPT-5.5 | Claude Opus 4 | Gemini 2.5 Pro |
|---|---|---|---|---|---|
| SWE-bench Verified | 79.0% | 84.2% | 81.5% | 82.8% | 80.1% |
| AIME 2025 | 85.6% | 91.3% | 88.7% | 87.2% | 86.9% |
| GPQA Diamond | 72.3% | 78.6% | 76.1% | 77.4% | 74.8% |
| HumanEval+ | 92.1% | 94.8% | 93.5% | 93.1% | 92.7% |
| LiveCodeBench v5 | 74.8% | 81.2% | 78.3% | 79.6% | 76.4% |
| MATH-500 | 97.2% | 98.1% | 97.8% | 97.5% | 97.0% |
V4 Flash Max trails V4 Pro by roughly 5 to 7 percentage points on the hardest benchmarks, but lands within striking distance of GPT-5.5 and Claude Opus 4 on most tasks. On MATH-500, the gap essentially disappears.
For a detailed head-to-head breakdown, see our V4 Pro vs Flash comparison.
The key takeaway: V4 Flash Max delivers roughly 90 to 95% of frontier model performance at less than 1% of the cost. For most real-world applications, that tradeoff is a no-brainer.
Pricing Breakdown
V4 Flash is not just cheap. It is absurdly cheap relative to what it can do.
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Cache Hit (per 1M) |
|---|---|---|---|
| DeepSeek V4 Flash | $0.14 | $0.28 | $0.028 |
| DeepSeek V4 Pro | $0.80 | $2.40 | $0.16 |
| GPT-5.5 | $5.00 | $30.00 | N/A |
| Claude Opus 4 | $5.00 | $25.00 | $0.50 |
| Gemini 2.5 Pro | $2.50 | $15.00 | N/A |
Let those numbers sink in. V4 Flash output tokens cost $0.28 per million. GPT-5.5 charges $30 per million. That is a 107x price difference.
Even compared to other budget options, V4 Flash wins. The cache hit price of $0.028 per million input tokens means that repeated or similar prompts become nearly free.
Cost Example
Processing 10 million input tokens and generating 2 million output tokens per day:
- V4 Flash: $1.40 input + $0.56 output = $1.96/day
- GPT-5.5: $50 input + $60 output = $110/day
- Claude Opus 4: $50 input + $50 output = $100/day
That is $58.80/month with V4 Flash versus $3,300/month with GPT-5.5. For tips on driving costs even lower, read our guide on how to reduce LLM API costs.
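The arithmetic above is worth wrapping in a helper so you can plug in your own traffic numbers. Prices are the per-1M-token rates from the pricing table; this ignores cache hits, which would pull the Flash number down further.

```python
# The daily-cost arithmetic above as a small helper.
# Prices are the per-1M-token rates from the pricing table;
# cache-hit discounts are ignored for simplicity.

def daily_cost(input_m: float, output_m: float,
               in_price: float, out_price: float) -> float:
    """Cost per day given millions of input/output tokens and per-1M prices."""
    return input_m * in_price + output_m * out_price

flash = daily_cost(10, 2, 0.14, 0.28)   # DeepSeek V4 Flash
gpt   = daily_cost(10, 2, 5.00, 30.00)  # GPT-5.5
opus  = daily_cost(10, 2, 5.00, 25.00)  # Claude Opus 4

print(f"Flash: ${flash:.2f}/day  GPT-5.5: ${gpt:.2f}/day  Opus: ${opus:.2f}/day")
```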
When to Use Flash vs Pro
Choosing between V4 Flash and V4 Pro comes down to your performance requirements and budget constraints.
Use V4 Flash when:
- You need to keep API costs under control at scale
- Your tasks are well-served by Flash High or Flash Max reasoning
- You are building user-facing products where per-query cost matters
- You want to self-host on a smaller GPU setup
- You are running high-volume batch processing
- Your accuracy requirements are "very good" rather than "absolute best"
Use V4 Pro when:
- You need peak performance on the hardest reasoning tasks
- You are working on competitive programming or research-level math
- The 5 to 7 percentage point accuracy gap on hard benchmarks matters for your use case
- Cost is secondary to quality for your application
For most developers and startups, V4 Flash is the right default choice. Switch to Pro for the specific tasks where you need that extra edge. Many teams run Flash for 90%+ of their requests and route only the hardest queries to Pro.
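The Flash-default, Pro-escalation pattern can be as simple as a routing function in front of your API client. The keyword heuristic below is a deliberate placeholder; production routers usually use a cheap classifier or task metadata, not string matching.

```python
# Minimal sketch of the "Flash for 90%+, Pro for the hardest queries"
# pattern described above. The keyword heuristic is a placeholder --
# real routers typically use a cheap classifier or task metadata.

HARD_HINTS = ("prove", "olympiad", "competition", "formal verification")

def pick_model(prompt: str) -> str:
    if any(hint in prompt.lower() for hint in HARD_HINTS):
        return "deepseek-v4-pro"     # the few queries needing peak accuracy
    return "deepseek-v4-flash"       # the cheap default for everything else

print(pick_model("Refactor this function to use async IO"))
print(pick_model("Prove this lemma about balanced binary trees"))
```

Because both models sit behind the same API shape, the router only needs to swap the model string.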
Self-Hosting V4 Flash
One of V4 Flash's biggest advantages is self-hosting feasibility. With only 13 billion active parameters per forward pass, the compute requirements during inference are dramatically lower than what the 284B total parameter count might suggest.
Hardware Requirements
Minimum viable setup (FP8 weights, full model):
- 4x NVIDIA A100 80GB or equivalent (284B parameters at 8-bit precision is roughly 284 GB; FP16 would need about 568 GB and an 8-GPU node)
- The full 284B parameters must be resident in memory, but only 13B are active per token
Recommended production setup:
- 8x NVIDIA A100 80GB or 4x NVIDIA H100
- Allows comfortable headroom for KV cache at longer context lengths
Quantized deployment (AWQ/GPTQ 4-bit):
- 2x NVIDIA A100 80GB can handle the quantized model
- Minimal quality loss on most benchmarks
- Best option for budget self-hosting
Frameworks
V4 Flash is supported by vLLM, SGLang, and TensorRT-LLM out of the box. The MIT license means no restrictions on commercial deployment.
For step-by-step setup instructions, see our guide on how to run V4 locally.
Self-Hosting vs API Cost Comparison
If you are processing more than roughly $500/month in API calls, self-hosting on rented GPUs starts to make financial sense. At $1,000+/month in API usage, self-hosting typically saves 40 to 60% depending on your GPU provider and utilization rate.
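A quick break-even sketch makes the rule of thumb concrete. The $1.00/hr A100 rental rate below is an assumption for illustration; plug in your provider's actual pricing and your real utilization, since both swing the answer substantially.

```python
# Rough break-even sketch for self-hosting vs API spend.
# The $1.00/hr A100 rate is an assumed example rental price --
# substitute your provider's real pricing and utilization.

def gpu_monthly_cost(hourly_rate: float, num_gpus: int,
                     hours: int = 730) -> float:
    """Monthly cost of an always-on rented GPU node."""
    return hourly_rate * num_gpus * hours

def savings(api_monthly_spend: float, gpu_cost: float) -> float:
    """Positive when self-hosting is cheaper than the API bill."""
    return api_monthly_spend - gpu_cost

# 2x A100 80GB: the 4-bit quantized deployment from the section above.
cost = gpu_monthly_cost(1.00, 2)
print(f"GPU node: ${cost:.0f}/month")
print(f"vs a $3,000/month API bill: saves ${savings(3000, cost):.0f}")
```

Under these assumptions the crossover lands in the low four figures of monthly API spend, which is consistent with the rough guidance above, but an idle node erases the savings fast.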
FAQ
Is V4 Flash good enough for production coding tasks?
Yes. In Flash High mode, it scores 68.4% on SWE-bench Verified, which is competitive with last-generation frontier models. In Flash Max mode, 79.0% puts it ahead of most alternatives. For everyday coding tasks like writing functions, debugging, code review, and refactoring, even Flash Non-Think performs well. Check our list of best budget AI models for more options.
How does the 1M context window actually perform?
The hybrid CSA+HCA attention mechanism means V4 Flash maintains strong recall and reasoning across the full million-token window. The 7% KV cache footprint relative to V3.2 makes this practical rather than theoretical. In needle-in-a-haystack tests, V4 Flash retrieves information reliably up to around 900K tokens, with some degradation in the final 100K.
Can I fine-tune V4 Flash?
Yes. The MIT license permits fine-tuning for any purpose, including commercial use. With only 13B active parameters, LoRA fine-tuning is feasible on a single A100 GPU. Full fine-tuning of all 284B parameters requires a larger cluster, but most use cases are well-served by LoRA or QLoRA approaches targeting the active parameter set.
Should I use V4 Flash or wait for V5?
Use V4 Flash now. There is no public timeline for V5, and V4 Flash already delivers exceptional value. The MIT license and low cost mean there is minimal risk in building on it today. If V5 launches with a better price-performance ratio, migration through the API is straightforward since DeepSeek maintains backward-compatible endpoints.
Bottom Line
DeepSeek V4 Flash rewrites the cost-performance equation for AI. At $0.28 per million output tokens, it delivers 90 to 95% of frontier model quality at roughly 1% of the price. The 13B active parameter design makes self-hosting realistic, the MIT license removes legal friction, and the three reasoning modes let you tune the speed-quality tradeoff per request.
For most teams, V4 Flash should be the default model. Use the V4 API guide to get started in minutes.