DeepSeek V4 landed in April 2026 with two variants: V4-Pro and V4-Flash. Both replace V3.2 with larger capacity, longer context, and a redesigned attention mechanism. If you are running V3.2 in production, the clock is ticking. V3 endpoints retire on July 24, 2026.
This guide breaks down every difference between V3.2 and the V4 family so you can decide which model fits your workload and plan your migration. For deeper dives, see the V4 Pro guide and the V4 Flash guide.
## Model specs at a glance
| Spec | V3.2 | V4-Pro | V4-Flash |
|---|---|---|---|
| Total parameters | 671B | 1.6T | 284B |
| Active parameters | 37B | 49B | 13B |
| Max context window | 128K tokens | 1M tokens | 1M tokens |
| Expert precision | FP8 | FP4 | FP4 |
| Attention type | MLA | CSA + HCA hybrid | CSA + HCA hybrid |
| Optimizer | AdamW | Muon | Muon |
| Release date | 2025-09 | 2026-04 | 2026-04 |
V4-Pro is the flagship. It more than doubles total parameter count while only bumping active parameters from 37B to 49B, keeping per-token compute manageable. V4-Flash targets cost-sensitive workloads with just 13B active parameters and a smaller total footprint than V3.2 itself.
Both V4 models push the context window from 128K to 1M tokens, which is the headline feature for anyone working with large codebases or long documents.
## Architecture changes
### Hybrid attention: CSA + HCA
V3.2 used Multi-head Latent Attention (MLA) across all layers. V4 replaces this with a hybrid of two new mechanisms:
- Chunked Sliding-window Attention (CSA) handles local context. It processes tokens in fixed-size chunks with a sliding window, reducing memory use for nearby tokens.
- Hierarchical Condensed Attention (HCA) handles global context. It compresses distant tokens into summary representations, letting the model attend over 1M tokens without quadratic cost.
The combination means V4 can process long sequences efficiently while still capturing fine-grained local patterns. This is the core reason the 1M context window is practical rather than theoretical.
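As a rough mental model, the two patterns look like this in toy form: a causal sliding-window mask standing in for CSA, and per-block mean-pooled summaries standing in for HCA's condensed representations. Everything here (window size, block size, mean-pooling) is illustrative, not DeepSeek's actual implementation:

```python
import numpy as np

def csa_mask(seq_len: int, window: int) -> np.ndarray:
    """Toy CSA: each token attends causally to at most `window` nearby tokens."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (i - j < window)

def hca_summaries(hidden: np.ndarray, block: int) -> np.ndarray:
    """Toy HCA: condense distant tokens into one summary vector per block."""
    seq_len, dim = hidden.shape
    n_blocks = seq_len // block
    return hidden[: n_blocks * block].reshape(n_blocks, block, dim).mean(axis=1)

hidden = np.random.randn(4096, 64)
local = csa_mask(4096, window=512)           # ~2M attended pairs vs ~16.8M for full attention
summaries = hca_summaries(hidden, block=64)  # 64 global keys instead of 4096
print(local.sum(), summaries.shape)
```

The local mask grows linearly with sequence length and the summary count grows as seq_len / block, which is why the combination stays sub-quadratic all the way to 1M tokens.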
### Multi-head Condensation (mHC)
mHC sits between CSA and HCA layers. It pools attention heads into condensed representations before passing them to the global attention stage. This reduces the number of key-value pairs HCA needs to process, which is the main driver behind the KV cache savings described below.
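DeepSeek has not published mHC internals, so treat the following as a sketch of the idea only: pooling many heads' key-value states into a few condensed groups shrinks the KV set the global stage attends over. The head count, group count, and mean-pooling are all assumptions for illustration:

```python
import numpy as np

def condense_heads(kv: np.ndarray, groups: int) -> np.ndarray:
    """Toy mHC: pool per-head KV states into `groups` condensed representations.

    kv: (heads, seq_len, head_dim) -- mean-pooling is an illustrative stand-in
    for whatever learned condensation mHC actually applies.
    """
    heads, seq_len, head_dim = kv.shape
    return kv.reshape(groups, heads // groups, seq_len, head_dim).mean(axis=1)

kv = np.random.randn(128, 1024, 64)       # 128 heads' KV states
condensed = condense_heads(kv, groups=16) # 16 condensed streams -> 8x fewer KV pairs
print(condensed.shape)                    # (16, 1024, 64)
```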
### Muon optimizer
V4 training switched from AdamW to the Muon optimizer. Muon uses momentum-based updates with unit-norm constraints, which stabilized training at the 1.6T parameter scale. DeepSeek reported fewer loss spikes and faster convergence compared to their V3.2 training runs.
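DeepSeek's training code is not public, but the openly published Muon recipe matches this description: accumulate momentum, then orthogonalize the update via a Newton-Schulz iteration before applying it (the "unit-norm" flavor of constraint). A minimal NumPy sketch of that public recipe for a single 2D weight matrix, not DeepSeek's implementation:

```python
import numpy as np

def newton_schulz(g: np.ndarray, steps: int = 5, eps: float = 1e-7) -> np.ndarray:
    """Approximately orthogonalize a 2D update via Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315   # coefficients from the public Muon recipe
    x = g / (np.linalg.norm(g) + eps)   # scale so the iteration converges
    if x.shape[0] > x.shape[1]:
        x = x.T                          # iterate on the wide orientation
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if g.shape[0] > g.shape[1] else x

def muon_step(w, grad, buf, lr=0.02, beta=0.95):
    """One Muon update: momentum accumulation, then an orthogonalized step."""
    buf = beta * buf + grad
    return w - lr * newton_schulz(buf), buf
```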
### FP4 experts
All Mixture-of-Experts layers in V4 use FP4-quantized weights, down from FP8 in V3.2. This halves expert memory per parameter, which is how V4-Pro fits 1.6T total parameters on the same serving infrastructure that ran V3.2's 671B.
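The serving-infrastructure claim is easy to sanity-check: FP4 stores half a byte per parameter and FP8 one byte, so the two models' weight footprints land in the same ballpark (treating all parameters as expert weights for simplicity):

```python
GB = 1e9
v32 = 671e9 * 1.0 / GB     # 671B params at FP8 (1 byte each)    -> ~671 GB
v4pro = 1.6e12 * 0.5 / GB  # 1.6T params at FP4 (0.5 bytes each) -> ~800 GB
print(f"V3.2 ~{v32:.0f} GB of weights, V4-Pro ~{v4pro:.0f} GB")
```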
## Efficiency gains
The architecture changes translate directly into compute and memory savings, especially at long context lengths.
| Metric (at 1M context) | V4-Pro vs V3.2 | V4-Flash vs V3.2 |
|---|---|---|
| FLOPs per token | 27% of V3.2 | 10% of V3.2 |
| KV cache size | 10% of V3.2 | 7% of V3.2 |
V4-Pro uses only 27% of the FLOPs V3.2 would need at 1M context, and V4-Flash drops that to 10%. KV cache reduction is even more dramatic: V4-Pro needs just 10% of V3.2's cache, and V4-Flash needs 7%. This is what makes 1M-token inference viable on current hardware.
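To put numbers on the ratios, here is the cache math with a hypothetical V3.2 baseline (DeepSeek publishes only the relative figures, so the 400 GB below is an assumed placeholder; only the 10% and 7% come from the table):

```python
baseline_gb = 400.0  # ASSUMED V3.2 KV cache at 1M tokens, for illustration only
for model, ratio in [("V4-Pro", 0.10), ("V4-Flash", 0.07)]:
    print(f"{model}: ~{baseline_gb * ratio:.0f} GB KV cache vs ~{baseline_gb:.0f} GB for V3.2")
```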
At shorter context lengths (under 32K), the efficiency gap narrows. V4-Pro runs at roughly the same cost as V3.2 for short prompts, while V4-Flash stays cheaper at every context length thanks to its smaller active parameter count.
## Benchmark improvements
| Benchmark | V3.2 | V4-Pro | V4-Flash |
|---|---|---|---|
| MMLU-Pro | 75.9 | 82.4 | 78.1 |
| HumanEval+ | 76.2 | 85.7 | 80.3 |
| LiveCodeBench (2026-Q1) | 41.8 | 54.6 | 47.2 |
| MATH-500 | 82.1 | 89.3 | 85.0 |
| GPQA-Diamond | 46.3 | 58.1 | 51.7 |
| LongBench v2 (128K) | 68.4 | 79.2 | 74.8 |
| RULER (1M) | N/A | 91.3 | 88.6 |
V4-Pro leads across every benchmark. The biggest jumps are in coding (LiveCodeBench +12.8 points) and long-context tasks (LongBench +10.8 points). V4-Flash consistently lands between V3.2 and V4-Pro, making it a solid middle ground.
The RULER 1M benchmark is new for V4 since V3.2 could not handle that context length. Both V4 models score above 88, confirming the 1M window is functional and not just a spec sheet number.
For a comparison with reasoning models, see V4 vs R1.
## Pricing comparison
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Cache hits (per 1M tokens) |
|---|---|---|---|
| V3.2 | $0.27 | $1.10 | $0.07 |
| V4-Pro | $0.40 | $1.60 | $0.10 |
| V4-Flash | $0.10 | $0.40 | $0.03 |
V4-Pro costs roughly 45% more than V3.2 per token, but the efficiency gains at long context lengths can offset that. If your average prompt exceeds 64K tokens, V4-Pro may actually cost less per request than V3.2 due to lower compute overhead.
V4-Flash is the budget option. At $0.10 per million input tokens, it undercuts V3.2 by more than 60% while delivering better benchmark scores. For high-volume workloads that do not need peak accuracy, V4-Flash is the obvious pick.
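If you want to run the numbers for your own traffic mix, a small helper built from the table above does the job; the 50K/2K/25K example split is arbitrary:

```python
# Prices from the table above, in dollars per 1M tokens.
PRICES = {
    "v3.2":     {"input": 0.27, "output": 1.10, "cached": 0.07},
    "v4-pro":   {"input": 0.40, "output": 1.60, "cached": 0.10},
    "v4-flash": {"input": 0.10, "output": 0.40, "cached": 0.03},
}

def request_cost(model, input_tok, output_tok, cached_tok=0):
    """Dollar cost of one request; cached tokens are billed at the cache-hit rate."""
    p = PRICES[model]
    fresh = input_tok - cached_tok
    return (fresh * p["input"] + cached_tok * p["cached"] + output_tok * p["output"]) / 1e6

# 50K-token prompt, 2K-token answer, half the prompt served from cache:
for m in PRICES:
    print(f"{m}: ${request_cost(m, 50_000, 2_000, 25_000):.4f}")
```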
Full API details, rate limits, and endpoint configuration are covered in the V4 API guide.
## Migration guide
### Timeline
DeepSeek announced that all V3.x API endpoints will be retired on July 24, 2026. After that date, requests to the `deepseek-v3` or `deepseek-v3.2` model identifiers will return 404 errors.
### Steps to migrate
- Pick your model. Use `deepseek-v4-pro` for tasks requiring maximum accuracy or long context. Use `deepseek-v4-flash` for cost-sensitive or latency-sensitive workloads.
- Update model identifiers. Replace `deepseek-v3` or `deepseek-v3.2` with `deepseek-v4-pro` or `deepseek-v4-flash` in your API calls (a minimal sketch follows this list).
- Test with longer context. If you were truncating inputs to fit V3.2's 128K window, try sending full documents. The 1M window may improve output quality for your use case.
- Adjust token budgets. V4-Pro outputs tend to be slightly more verbose than V3.2. If you have strict `max_tokens` limits, verify they still produce complete responses.
- Monitor costs. Run a parallel shadow deployment for a week to compare per-request costs before cutting over fully.
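For the identifier swap itself, a minimal before/after sketch, assuming you call DeepSeek through the OpenAI-compatible client the V3.x API supported and that V4 keeps the same base URL and chat-completions interface (verify both against the V4 API guide):

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",
    base_url="https://api.deepseek.com",  # assumes V4 keeps the V3-era base URL
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",  # was: "deepseek-v3.2"
    temperature=0.6,            # set explicitly -- the default changed in V4 (see below)
    messages=[{"role": "user", "content": "Summarize this changelog."}],
)
print(response.choices[0].message.content)
```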
### Breaking changes
- The `deepseek-v3.2` model ID will stop working on July 24, 2026.
- V4 models return a new `usage.cache_creation_tokens` field in API responses. If your parsing code uses strict schema validation, update it (see the sketch after this list).
- Default temperature changed from 1.0 (V3.2) to 0.6 (V4). Set temperature explicitly if your application depends on a specific value.
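The last two items are cheap to defend against in code: read usage fields leniently and pin the temperature instead of relying on the default. A sketch, assuming an OpenAI-compatible response shape:

```python
def parse_usage(usage: dict) -> dict:
    """Read usage defensively: pick the fields you need, default missing ones to 0."""
    return {
        "prompt_tokens": usage.get("prompt_tokens", 0),
        "completion_tokens": usage.get("completion_tokens", 0),
        # New in V4 -- absent from V3.2 responses, so never assume it exists.
        "cache_creation_tokens": usage.get("cache_creation_tokens", 0),
    }

request = {
    "model": "deepseek-v4-pro",
    "temperature": 1.0,  # pin the old V3.2 default if your prompts were tuned for it
    "messages": [{"role": "user", "content": "..."}],
}
```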
## FAQ
### Can I run V4 models locally?
V4-Flash open weights are available and can run on a single node with 4x A100 80GB or equivalent. V4-Pro weights have not been released as of April 2026. Check the V4 Pro guide for updates on weight availability.
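The hardware requirement is consistent with the weight math, assuming the open weights ship in FP4 like the hosted model:

```python
weights_gb = 284e9 * 0.5 / 1e9  # 284B params at FP4 (0.5 bytes each) -> ~142 GB
vram_gb = 4 * 80                # 4x A100 80GB -> 320 GB total
print(f"~{weights_gb:.0f} GB of weights in {vram_gb} GB of VRAM")  # leaves room for KV cache
```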
### Is V4-Flash good enough for coding tasks?
V4-Flash scores 80.3 on HumanEval+ and 47.2 on LiveCodeBench, both above V3.2. For routine code generation, refactoring, and review, V4-Flash handles the job well. For complex multi-file reasoning or very long codebases, V4-Pro is the better choice. See the V4 Flash guide for coding-specific benchmarks.
### Should I switch from V3.2 now or wait?
Switch now. V3 endpoints retire in three months, and V4 models are already stable. Early migration gives you time to tune prompts and catch any behavioral differences before the deadline. Start with V4-Flash if you want a low-risk drop-in replacement, then evaluate V4-Pro for workloads that benefit from higher accuracy or longer context.
## Bottom line
V4 is not an incremental update. The hybrid attention system, 1M context window, and 10x KV cache reduction represent a generational shift from V3.2. V4-Pro is the best DeepSeek model available today for accuracy-critical work. V4-Flash delivers better quality than V3.2 at a fraction of the cost.
With V3 endpoints shutting down on July 24, 2026, migration is not optional. The sooner you start testing, the smoother the transition will be. Pick your model, swap the endpoint, and verify your outputs. The V4 API guide has everything you need to get started.