DeepSeek V4 Flash Complete Guide: 284B MoE, 13B Active, $0.28/1M Output (2026)
DeepSeek V4 Flash is the cheapest frontier model you can use right now. It packs 284 billion total parameters into a Mixture-of-Experts architecture that only activates 13 billion per forward pass. The result: frontier-class performance at a fraction of the cost.
The numbers speak for themselves. V4 Flash handles 1 million tokens of context, ships under an MIT license, and costs just $0.14 per million input tokens (cache miss) and $0.28 per million output tokens. In its most powerful reasoning configuration, Flash Max mode, it scores 79.0% on SWE-bench Verified. That puts it in the same league as models costing 10 to 100 times more.
Whether you are building production applications on a budget, self-hosting on modest hardware, or looking for an open-weight model that punches well above its weight class, V4 Flash deserves your attention. This guide covers everything you need to know.
For the full-power variant, see our V4 Pro guide. For API setup instructions, check the V4 API guide.
Architecture Deep Dive
V4 Flash builds on the MoE foundation that made DeepSeek competitive with closed-source labs, but pushes efficiency further than any prior release.
Core Specifications
- Total parameters: 284 billion
- Active parameters per forward pass: 13 billion
- Layers: 43 transformer layers
- Hidden dimension: 4096
- Routed experts: 256
- Active experts per token: 6
- Training data: 32 trillion tokens
- Context window: 1 million tokens
- License: MIT
Hybrid Attention: CSA + HCA
V4 Flash introduces a hybrid attention mechanism that combines Compressed Shared Attention (CSA) with Hierarchical Chunked Attention (HCA). This is the key architectural innovation that separates it from V3.2.
CSA compresses key-value representations across attention heads, reducing redundant storage. HCA splits long sequences into hierarchical chunks, allowing the model to attend to distant context without the quadratic cost of full attention.
The practical impact at 1 million token context length:
- FLOPs: 10% of what V3.2 requires for the same context
- KV cache memory: 7% of V3.2's footprint
This means you can actually use the full 1M context window in production without needing a data center. The efficiency gains compound at longer contexts, making V4 Flash the first model where million-token inference is genuinely affordable.
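To get a feel for why the 7% figure matters, here is a back-of-envelope KV cache estimate at 1M-token context using the specs above (43 layers, hidden dimension 4096). This assumes a plain FP16 key-value cache sized directly from those dimensions; the model's actual head and KV dimensions are not public, so treat it as an order-of-magnitude sketch, not a spec.

```python
# Back-of-envelope KV cache estimate at 1M-token context.
# Assumes a naive FP16 KV cache sized from the published specs
# (43 layers, hidden dim 4096). Real head/KV dims are not public,
# so this is an order-of-magnitude sketch only.

LAYERS = 43
HIDDEN = 4096
BYTES_FP16 = 2
TOKENS = 1_000_000

# Keys + values: one vector of size HIDDEN per layer per token.
baseline_bytes = 2 * LAYERS * HIDDEN * BYTES_FP16 * TOKENS
baseline_gb = baseline_bytes / 1024**3

# The article cites a KV footprint of ~7% of V3.2's.
flash_gb = baseline_gb * 0.07

print(f"naive FP16 KV cache: {baseline_gb:.0f} GB")
print(f"at ~7% footprint:    {flash_gb:.1f} GB")
```

Even under these rough assumptions, the difference is between "needs a multi-node cluster just for the cache" and "fits alongside the weights on one node."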
Expert Routing
With 256 routed experts and only 6 active per token, V4 Flash achieves extreme sparsity. Each token is routed to the 6 most relevant experts based on a learned gating function. This keeps compute costs low while maintaining access to the full 284B parameter knowledge base.
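The gating step described above can be sketched in a few lines. The real gating network's architecture and load-balancing details are not public; this only shows the mechanics every top-k MoE router shares: score all experts, keep the top k, and renormalize their weights with a softmax.

```python
# Minimal sketch of top-k expert routing (6 of 256), as described above.
# The actual gating network is not public; this shows only the generic
# top-k mechanics: score every expert, keep the k best, renormalize.
import numpy as np

def route(token_hidden: np.ndarray, gate_w: np.ndarray, k: int = 6):
    """Return (expert_indices, expert_weights) for one token."""
    scores = token_hidden @ gate_w               # (num_experts,) gating logits
    top = np.argsort(scores)[-k:][::-1]          # indices of the k best experts
    w = np.exp(scores[top] - scores[top].max())  # stable softmax over the k
    return top, w / w.sum()

rng = np.random.default_rng(0)
hidden, experts = 4096, 256
idx, weights = route(rng.standard_normal(hidden),
                     rng.standard_normal((hidden, experts)))
print(idx, weights)
```

Only the 6 selected experts' feed-forward blocks run for that token, which is where the 13B-active-of-284B-total sparsity comes from.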
The routing mechanism has been refined from V3.2 to reduce expert load imbalance, which previously caused some experts to be over-utilized while others sat idle.
Three Reasoning Modes
V4 Flash ships with three distinct reasoning modes, each trading off speed against depth of reasoning. You select the mode via the API or by adjusting the thinking budget parameter.
| Mode | Thinking Budget | Speed | Best For |
|---|---|---|---|
| Flash Non-Think | None (direct) | Fastest | Simple queries, classification, extraction |
| Flash High | Medium | Balanced | General coding, analysis, multi-step tasks |
| Flash Max | Maximum | Slowest | Hard math, competitive programming, complex debugging |
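In practice, mode selection is just a field on the request. The field names below (`model`, `reasoning_mode`) are illustrative placeholders, not confirmed API parameters; check the V4 API guide for the real schema before shipping anything.

```python
# Hypothetical request payloads for the three reasoning modes.
# The parameter names ("model", "reasoning_mode") are illustrative
# assumptions -- consult the V4 API guide for the real field names.

def build_request(prompt: str, mode: str) -> dict:
    assert mode in {"non-think", "high", "max"}
    return {
        "model": "deepseek-v4-flash",
        "reasoning_mode": mode,  # hypothetical field controlling thinking budget
        "messages": [{"role": "user", "content": prompt}],
    }

fast = build_request("Classify this ticket as bug or feature.", "non-think")
deep = build_request("Find the race condition in this queue code.", "max")
```

The point is that mode is a per-request decision, so a single deployment can serve cheap classification traffic and expensive debugging traffic from the same model.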
Benchmark Comparison Across Modes
| Benchmark | Flash Non-Think | Flash High | Flash Max |
|---|---|---|---|
| SWE-bench Verified | 51.2% | 68.4% | 79.0% |
| AIME 2025 | 42.8% | 71.3% | 85.6% |
| GPQA Diamond | 54.1% | 65.7% | 72.3% |
| HumanEval+ | 84.2% | 89.7% | 92.1% |
| LiveCodeBench v5 | 48.9% | 63.2% | 74.8% |
| MATH-500 | 88.4% | 94.1% | 97.2% |
The jump from Non-Think to Max is substantial. Flash Non-Think is competitive with GPT-4o-class models on most tasks, while Flash Max approaches frontier reasoning performance. The flexibility to switch modes per request makes V4 Flash uniquely versatile for mixed workloads.
Benchmarks vs V4 Pro and Frontier Models
How does V4 Flash stack up against its bigger sibling and the top closed-source models? Here is the comparison using Flash Max mode, which represents V4 Flash at full power.
| Benchmark | V4 Flash Max | V4 Pro | GPT-5.5 | Claude Opus 4 | Gemini 2.5 Pro |
|---|---|---|---|---|---|
| SWE-bench Verified | 79.0% | 84.2% | 81.5% | 82.8% | 80.1% |
| AIME 2025 | 85.6% | 91.3% | 88.7% | 87.2% | 86.9% |
| GPQA Diamond | 72.3% | 78.6% | 76.1% | 77.4% | 74.8% |
| HumanEval+ | 92.1% | 94.8% | 93.5% | 93.1% | 92.7% |
| LiveCodeBench v5 | 74.8% | 81.2% | 78.3% | 79.6% | 76.4% |
| MATH-500 | 97.2% | 98.1% | 97.8% | 97.5% | 97.0% |
V4 Flash Max trails V4 Pro by roughly 5 to 7 percentage points on the hardest benchmarks, but lands within striking distance of GPT-5.5 and Claude Opus 4 on most tasks. On MATH-500, the gap essentially disappears.
For a detailed head-to-head breakdown, see our V4 Pro vs Flash comparison.
The key takeaway: V4 Flash Max delivers roughly 90 to 95% of frontier model performance at less than 1% of the cost. For most real-world applications, that tradeoff is a no-brainer.
Pricing Breakdown
V4 Flash is not just cheap. It is absurdly cheap relative to what it can do.
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Cache Hit (per 1M) |
|---|---|---|---|
| DeepSeek V4 Flash | $0.14 | $0.28 | $0.028 |
| DeepSeek V4 Pro | $0.80 | $2.40 | $0.16 |
| GPT-5.5 | $5.00 | $30.00 | N/A |
| Claude Opus 4 | $5.00 | $25.00 | $0.50 |
| Gemini 2.5 Pro | $2.50 | $15.00 | N/A |
Let those numbers sink in. V4 Flash output tokens cost $0.28 per million. GPT-5.5 charges $30 per million. That is a 107x price difference.
Even compared to other budget options, V4 Flash wins. The cache hit price of $0.028 per million input tokens means that repeated or similar prompts become nearly free.
Cost Example
Processing 10 million input tokens and generating 2 million output tokens per day:
- V4 Flash: $1.40 input + $0.56 output = $1.96/day
- GPT-5.5: $50 input + $60 output = $110/day
- Claude Opus 4: $50 input + $50 output = $100/day
That is $58.80/month with V4 Flash versus $3,300/month with GPT-5.5. For tips on driving costs even lower, read our guide on how to reduce LLM API costs.
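The arithmetic above is worth wrapping in a helper so you can plug in your own traffic numbers. Prices are the per-1M-token rates from the pricing table; this ignores cache hits, which would pull the Flash number down further.

```python
# The daily-cost arithmetic above as a small helper.
# Prices are the per-1M-token rates from the pricing table;
# cache-hit discounts are ignored for simplicity.

def daily_cost(input_m: float, output_m: float,
               in_price: float, out_price: float) -> float:
    """Cost per day given millions of input/output tokens and per-1M prices."""
    return input_m * in_price + output_m * out_price

flash = daily_cost(10, 2, 0.14, 0.28)   # DeepSeek V4 Flash
gpt   = daily_cost(10, 2, 5.00, 30.00)  # GPT-5.5
opus  = daily_cost(10, 2, 5.00, 25.00)  # Claude Opus 4

print(f"Flash: ${flash:.2f}/day  GPT-5.5: ${gpt:.2f}/day  Opus: ${opus:.2f}/day")
```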
When to Use Flash vs Pro
Choosing between V4 Flash and V4 Pro comes down to your performance requirements and budget constraints.
Use V4 Flash when:
- You need to keep API costs under control at scale
- Your tasks are well-served by Flash High or Flash Max reasoning
- You are building user-facing products where per-query cost matters
- You want to self-host on a smaller GPU setup
- You are running high-volume batch processing
- Your accuracy requirements are "very good" rather than "absolute best"
Use V4 Pro when:
- You need peak performance on the hardest reasoning tasks
- You are working on competitive programming or research-level math
- The 5 to 7 percentage point accuracy gap on hard benchmarks matters for your use case
- Cost is secondary to quality for your application
For most developers and startups, V4 Flash is the right default choice. Switch to Pro for the specific tasks where you need that extra edge. Many teams run Flash for 90%+ of their requests and route only the hardest queries to Pro.
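The Flash-default, Pro-escalation pattern can be as simple as a routing function in front of your API client. The keyword heuristic below is a deliberate placeholder; production routers usually use a cheap classifier or task metadata, not string matching.

```python
# Minimal sketch of the "Flash for 90%+, Pro for the hardest queries"
# pattern described above. The keyword heuristic is a placeholder --
# real routers typically use a cheap classifier or task metadata.

HARD_HINTS = ("prove", "olympiad", "competition", "formal verification")

def pick_model(prompt: str) -> str:
    if any(hint in prompt.lower() for hint in HARD_HINTS):
        return "deepseek-v4-pro"     # the few queries needing peak accuracy
    return "deepseek-v4-flash"       # the cheap default for everything else

print(pick_model("Refactor this function to use async IO"))
print(pick_model("Prove this lemma about balanced binary trees"))
```

Because both models sit behind the same API shape, the router only needs to swap the model string.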
Self-Hosting V4 Flash
One of V4 Flash's biggest advantages is self-hosting feasibility. With only 13 billion active parameters per forward pass, the compute requirements during inference are dramatically lower than what the 284B total parameter count might suggest.
Hardware Requirements
Minimum viable setup (FP8 weights, full model):
- 4x NVIDIA A100 80GB or equivalent (284B parameters at 8-bit precision is roughly 284 GB; FP16 would need about 568 GB and an 8-GPU node)
- The full 284B parameters must be resident in memory, but only 13B are active per token
Recommended production setup:
- 8x NVIDIA A100 80GB or 4x NVIDIA H100
- Allows comfortable headroom for KV cache at longer context lengths
Quantized deployment (AWQ/GPTQ 4-bit):
- 2x NVIDIA A100 80GB can handle the quantized model
- Minimal quality loss on most benchmarks
- Best option for budget self-hosting
Frameworks
V4 Flash is supported by vLLM, SGLang, and TensorRT-LLM out of the box. The MIT license means no restrictions on commercial deployment.
For step-by-step setup instructions, see our guide on how to run V4 locally.
Self-Hosting vs API Cost Comparison
If you are processing more than roughly $500/month in API calls, self-hosting on rented GPUs starts to make financial sense. At $1,000+/month in API usage, self-hosting typically saves 40 to 60% depending on your GPU provider and utilization rate.
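A quick break-even sketch makes the rule of thumb concrete. The $1.00/hr A100 rental rate below is an assumption for illustration; plug in your provider's actual pricing and your real utilization, since both swing the answer substantially.

```python
# Rough break-even sketch for self-hosting vs API spend.
# The $1.00/hr A100 rate is an assumed example rental price --
# substitute your provider's real pricing and utilization.

def gpu_monthly_cost(hourly_rate: float, num_gpus: int,
                     hours: int = 730) -> float:
    """Monthly cost of an always-on rented GPU node."""
    return hourly_rate * num_gpus * hours

def savings(api_monthly_spend: float, gpu_cost: float) -> float:
    """Positive when self-hosting is cheaper than the API bill."""
    return api_monthly_spend - gpu_cost

# 2x A100 80GB: the 4-bit quantized deployment from the section above.
cost = gpu_monthly_cost(1.00, 2)
print(f"GPU node: ${cost:.0f}/month")
print(f"vs a $3,000/month API bill: saves ${savings(3000, cost):.0f}")
```

Under these assumptions the crossover lands in the low four figures of monthly API spend, which is consistent with the rough guidance above, but an idle node erases the savings fast.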
FAQ
Is V4 Flash good enough for production coding tasks?
Yes. In Flash High mode, it scores 68.4% on SWE-bench Verified, which is competitive with last-generation frontier models. In Flash Max mode, 79.0% puts it ahead of most alternatives. For everyday coding tasks like writing functions, debugging, code review, and refactoring, even Flash Non-Think performs well. Check our list of best budget AI models for more options.
How does the 1M context window actually perform?
The hybrid CSA+HCA attention mechanism means V4 Flash maintains strong recall and reasoning across the full million-token window. The 7% KV cache footprint relative to V3.2 makes this practical rather than theoretical. In needle-in-a-haystack tests, V4 Flash retrieves information reliably up to around 900K tokens, with some degradation in the final 100K.
Can I fine-tune V4 Flash?
Yes. The MIT license permits fine-tuning for any purpose, including commercial use. With only 13B active parameters, LoRA fine-tuning is feasible on a single A100 GPU. Full fine-tuning of all 284B parameters requires a larger cluster, but most use cases are well-served by LoRA or QLoRA approaches targeting the active parameter set.
Should I use V4 Flash or wait for V5?
Use V4 Flash now. There is no public timeline for V5, and V4 Flash already delivers exceptional value. The MIT license and low cost mean there is minimal risk in building on it today. If V5 launches with a better price-performance ratio, migration through the API is straightforward since DeepSeek maintains backward-compatible endpoints.
Bottom Line
DeepSeek V4 Flash rewrites the cost-performance equation for AI. At $0.28 per million output tokens, it delivers 90 to 95% of frontier model quality at roughly 1% of the price. The 13B active parameter design makes self-hosting realistic, the MIT license removes legal friction, and the three reasoning modes let you tune the speed-quality tradeoff per request.
For most teams, V4 Flash should be the default model. Use the V4 API guide to get started in minutes.