DeepSeek V4 landed in April 2026 with two variants: V4-Pro and V4-Flash. Both replace V3.2 with larger capacity, longer context, and a redesigned attention mechanism. If you are running V3.2 in production, the clock is ticking. V3 endpoints retire on July 24, 2026.
This guide breaks down every difference between V3.2 and the V4 family so you can decide which model fits your workload and plan your migration. For deeper dives, see the V4 Pro guide and the V4 Flash guide.
## Model specs at a glance
| Spec | V3.2 | V4-Pro | V4-Flash |
|---|---|---|---|
| Total parameters | 671B | 1.6T | 284B |
| Active parameters | 37B | 49B | 13B |
| Max context window | 128K tokens | 1M tokens | 1M tokens |
| Expert precision | FP8 | FP4 | FP4 |
| Attention type | MLA | CSA + HCA hybrid | CSA + HCA hybrid |
| Optimizer | AdamW | Muon | Muon |
| Release date | 2025-09 | 2026-04 | 2026-04 |
V4-Pro is the flagship. It more than doubles total parameter count while only bumping active parameters from 37B to 49B, keeping per-token compute manageable. V4-Flash targets cost-sensitive workloads with just 13B active parameters and a smaller total footprint than V3.2 itself.
Both V4 models push the context window from 128K to 1M tokens, which is the headline feature for anyone working with large codebases or long documents.
## Architecture changes
### Hybrid attention: CSA + HCA
V3.2 used Multi-head Latent Attention (MLA) across all layers. V4 replaces this with a hybrid of two new mechanisms:
- Chunked Sliding-window Attention (CSA) handles local context. It processes tokens in fixed-size chunks with a sliding window, reducing memory use for nearby tokens.
- Hierarchical Condensed Attention (HCA) handles global context. It compresses distant tokens into summary representations, letting the model attend over 1M tokens without quadratic cost.
The combination means V4 can process long sequences efficiently while still capturing fine-grained local patterns. This is the core reason the 1M context window is practical rather than theoretical.
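As a rough mental model, the two patterns look like this in toy form: a causal sliding-window mask standing in for CSA, and per-block mean-pooled summaries standing in for HCA's condensed representations. Everything here (window size, block size, mean-pooling) is illustrative, not DeepSeek's actual implementation:

```python
import numpy as np

def csa_mask(seq_len: int, window: int) -> np.ndarray:
    """Toy CSA: each token attends causally to at most `window` nearby tokens."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (i - j < window)

def hca_summaries(hidden: np.ndarray, block: int) -> np.ndarray:
    """Toy HCA: condense distant tokens into one summary vector per block."""
    seq_len, dim = hidden.shape
    n_blocks = seq_len // block
    return hidden[: n_blocks * block].reshape(n_blocks, block, dim).mean(axis=1)

hidden = np.random.randn(4096, 64)
local = csa_mask(4096, window=512)           # ~2M attended pairs vs ~16.8M for full attention
summaries = hca_summaries(hidden, block=64)  # 64 global keys instead of 4096
print(local.sum(), summaries.shape)
```

The local mask grows linearly with sequence length and the summary count grows as seq_len / block, which is why the combination stays sub-quadratic all the way to 1M tokens.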
### Multi-head Condensation (mHC)
mHC sits between CSA and HCA layers. It pools attention heads into condensed representations before passing them to the global attention stage. This reduces the number of key-value pairs HCA needs to process, which is the main driver behind the KV cache savings described below.
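DeepSeek has not published mHC internals, so treat the following as a sketch of the idea only: pooling many heads' key-value states into a few condensed groups shrinks the KV set the global stage attends over. The head count, group count, and mean-pooling are all assumptions for illustration:

```python
import numpy as np

def condense_heads(kv: np.ndarray, groups: int) -> np.ndarray:
    """Toy mHC: pool per-head KV states into `groups` condensed representations.

    kv: (heads, seq_len, head_dim) -- mean-pooling is an illustrative stand-in
    for whatever learned condensation mHC actually applies.
    """
    heads, seq_len, head_dim = kv.shape
    return kv.reshape(groups, heads // groups, seq_len, head_dim).mean(axis=1)

kv = np.random.randn(128, 1024, 64)       # 128 heads' KV states
condensed = condense_heads(kv, groups=16) # 16 condensed streams -> 8x fewer KV pairs
print(condensed.shape)                    # (16, 1024, 64)
```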
### Muon optimizer
V4 training switched from AdamW to the Muon optimizer. Muon uses momentum-based updates with unit-norm constraints, which stabilized training at the 1.6T parameter scale. DeepSeek reported fewer loss spikes and faster convergence compared to their V3.2 training runs.
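DeepSeek's training code is not public, but the openly published Muon recipe matches this description: accumulate momentum, then orthogonalize the update via a Newton-Schulz iteration before applying it (the "unit-norm" flavor of constraint). A minimal NumPy sketch of that public recipe for a single 2D weight matrix, not DeepSeek's implementation:

```python
import numpy as np

def newton_schulz(g: np.ndarray, steps: int = 5, eps: float = 1e-7) -> np.ndarray:
    """Approximately orthogonalize a 2D update via Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315   # coefficients from the public Muon recipe
    x = g / (np.linalg.norm(g) + eps)   # scale so the iteration converges
    if x.shape[0] > x.shape[1]:
        x = x.T                          # iterate on the wide orientation
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if g.shape[0] > g.shape[1] else x

def muon_step(w, grad, buf, lr=0.02, beta=0.95):
    """One Muon update: momentum accumulation, then an orthogonalized step."""
    buf = beta * buf + grad
    return w - lr * newton_schulz(buf), buf
```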
### FP4 experts
All Mixture-of-Experts layers in V4 use FP4-quantized weights, down from FP8 in V3.2. This halves expert memory per parameter, which is how V4-Pro fits 1.6T total parameters on the same serving infrastructure that ran V3.2's 671B.
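The serving-infrastructure claim is easy to sanity-check: FP4 stores half a byte per parameter and FP8 one byte, so the two models' weight footprints land in the same ballpark (treating all parameters as expert weights for simplicity):

```python
GB = 1e9
v32 = 671e9 * 1.0 / GB     # 671B params at FP8 (1 byte each)    -> ~671 GB
v4pro = 1.6e12 * 0.5 / GB  # 1.6T params at FP4 (0.5 bytes each) -> ~800 GB
print(f"V3.2 ~{v32:.0f} GB of weights, V4-Pro ~{v4pro:.0f} GB")
```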
## Efficiency gains
The architecture changes translate directly into compute and memory savings, especially at long context lengths.
| Metric (at 1M context) | V4-Pro vs V3.2 | V4-Flash vs V3.2 |
|---|---|---|
| FLOPs per token | 27% of V3.2 | 10% of V3.2 |
| KV cache size | 10% of V3.2 | 7% of V3.2 |
V4-Pro uses only 27% of the FLOPs V3.2 would need at 1M context, and V4-Flash drops that to 10%. KV cache reduction is even more dramatic: V4-Pro needs just 10% of V3.2's cache, and V4-Flash needs 7%. This is what makes 1M-token inference viable on current hardware.
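To put numbers on the ratios, here is the cache math with a hypothetical V3.2 baseline (DeepSeek publishes only the relative figures, so the 400 GB below is an assumed placeholder; only the 10% and 7% come from the table):

```python
baseline_gb = 400.0  # ASSUMED V3.2 KV cache at 1M tokens, for illustration only
for model, ratio in [("V4-Pro", 0.10), ("V4-Flash", 0.07)]:
    print(f"{model}: ~{baseline_gb * ratio:.0f} GB KV cache vs ~{baseline_gb:.0f} GB for V3.2")
```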
At shorter context lengths (under 32K), the efficiency gap narrows. V4-Pro runs at roughly the same cost as V3.2 for short prompts, while V4-Flash stays cheaper at every context length thanks to its smaller active parameter count.
## Benchmark improvements
| Benchmark | V3.2 | V4-Pro | V4-Flash |
|---|---|---|---|
| MMLU-Pro | 75.9 | 82.4 | 78.1 |
| HumanEval+ | 76.2 | 85.7 | 80.3 |
| LiveCodeBench (2026-Q1) | 41.8 | 54.6 | 47.2 |
| MATH-500 | 82.1 | 89.3 | 85.0 |
| GPQA-Diamond | 46.3 | 58.1 | 51.7 |
| LongBench v2 (128K) | 68.4 | 79.2 | 74.8 |
| RULER (1M) | N/A | 91.3 | 88.6 |
V4-Pro leads across every benchmark. The biggest jumps are in coding (LiveCodeBench +12.8 points) and long-context tasks (LongBench +10.8 points). V4-Flash consistently lands between V3.2 and V4-Pro, making it a solid middle ground.
The RULER 1M benchmark is new for V4 since V3.2 could not handle that context length. Both V4 models score above 88, confirming the 1M window is functional and not just a spec sheet number.
For a comparison with reasoning models, see V4 vs R1.
## Pricing comparison
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Cache hits (per 1M tokens) |
|---|---|---|---|
| V3.2 | $0.27 | $1.10 | $0.07 |
| V4-Pro | $0.40 | $1.60 | $0.10 |
| V4-Flash | $0.10 | $0.40 | $0.03 |
V4-Pro costs roughly 45% more than V3.2 per token, but the efficiency gains at long context lengths can offset that. If your average prompt exceeds 64K tokens, V4-Pro may actually cost less per request than V3.2 due to lower compute overhead.
V4-Flash is the budget option. At $0.10 per million input tokens, it undercuts V3.2 by more than 60% while delivering better benchmark scores. For high-volume workloads that do not need peak accuracy, V4-Flash is the obvious pick.
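If you want to run the numbers for your own traffic mix, a small helper built from the table above does the job; the 50K/2K/25K example split is arbitrary:

```python
# Prices from the table above, in dollars per 1M tokens.
PRICES = {
    "v3.2":     {"input": 0.27, "output": 1.10, "cached": 0.07},
    "v4-pro":   {"input": 0.40, "output": 1.60, "cached": 0.10},
    "v4-flash": {"input": 0.10, "output": 0.40, "cached": 0.03},
}

def request_cost(model, input_tok, output_tok, cached_tok=0):
    """Dollar cost of one request; cached tokens are billed at the cache-hit rate."""
    p = PRICES[model]
    fresh = input_tok - cached_tok
    return (fresh * p["input"] + cached_tok * p["cached"] + output_tok * p["output"]) / 1e6

# 50K-token prompt, 2K-token answer, half the prompt served from cache:
for m in PRICES:
    print(f"{m}: ${request_cost(m, 50_000, 2_000, 25_000):.4f}")
```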
Full API details, rate limits, and endpoint configuration are covered in the V4 API guide.
## Migration guide
### Timeline
DeepSeek announced that all V3.x API endpoints will be retired on July 24, 2026. After that date, requests to the `deepseek-v3` or `deepseek-v3.2` model identifiers will return 404 errors.
### Steps to migrate
- Pick your model. Use `deepseek-v4-pro` for tasks requiring maximum accuracy or long context. Use `deepseek-v4-flash` for cost-sensitive or latency-sensitive workloads.
- Update model identifiers. Replace `deepseek-v3` or `deepseek-v3.2` with `deepseek-v4-pro` or `deepseek-v4-flash` in your API calls (a minimal sketch follows this list).
- Test with longer context. If you were truncating inputs to fit V3.2's 128K window, try sending full documents. The 1M window may improve output quality for your use case.
- Adjust token budgets. V4-Pro outputs tend to be slightly more verbose than V3.2. If you have strict `max_tokens` limits, verify they still produce complete responses.
- Monitor costs. Run a parallel shadow deployment for a week to compare per-request costs before cutting over fully.
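For the identifier swap itself, a minimal before/after sketch, assuming you call DeepSeek through the OpenAI-compatible client the V3.x API supported and that V4 keeps the same base URL and chat-completions interface (verify both against the V4 API guide):

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",
    base_url="https://api.deepseek.com",  # assumes V4 keeps the V3-era base URL
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",  # was: "deepseek-v3.2"
    temperature=0.6,            # set explicitly -- the default changed in V4 (see below)
    messages=[{"role": "user", "content": "Summarize this changelog."}],
)
print(response.choices[0].message.content)
```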
### Breaking changes
- The `deepseek-v3.2` model ID will stop working on July 24, 2026.
- V4 models return a new `usage.cache_creation_tokens` field in API responses. If your parsing code uses strict schema validation, update it (see the sketch after this list).
- Default temperature changed from 1.0 (V3.2) to 0.6 (V4). Set temperature explicitly if your application depends on a specific value.
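The last two items are cheap to defend against in code: read usage fields leniently and pin the temperature instead of relying on the default. A sketch, assuming an OpenAI-compatible response shape:

```python
def parse_usage(usage: dict) -> dict:
    """Read usage defensively: pick the fields you need, default missing ones to 0."""
    return {
        "prompt_tokens": usage.get("prompt_tokens", 0),
        "completion_tokens": usage.get("completion_tokens", 0),
        # New in V4 -- absent from V3.2 responses, so never assume it exists.
        "cache_creation_tokens": usage.get("cache_creation_tokens", 0),
    }

request = {
    "model": "deepseek-v4-pro",
    "temperature": 1.0,  # pin the old V3.2 default if your prompts were tuned for it
    "messages": [{"role": "user", "content": "..."}],
}
```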
## FAQ
### Can I run V4 models locally?
V4-Flash open weights are available and can run on a single node with 4x A100 80GB or equivalent. V4-Pro weights have not been released as of April 2026. Check the V4 Pro guide for updates on weight availability.
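The hardware requirement is consistent with the weight math, assuming the open weights ship in FP4 like the hosted model:

```python
weights_gb = 284e9 * 0.5 / 1e9  # 284B params at FP4 (0.5 bytes each) -> ~142 GB
vram_gb = 4 * 80                # 4x A100 80GB -> 320 GB total
print(f"~{weights_gb:.0f} GB of weights in {vram_gb} GB of VRAM")  # leaves room for KV cache
```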
### Is V4-Flash good enough for coding tasks?
V4-Flash scores 80.3 on HumanEval+ and 47.2 on LiveCodeBench, both above V3.2. For routine code generation, refactoring, and review, V4-Flash handles the job well. For complex multi-file reasoning or very long codebases, V4-Pro is the better choice. See the V4 Flash guide for coding-specific benchmarks.
### Should I switch from V3.2 now or wait?
Switch now. V3 endpoints retire in three months, and V4 models are already stable. Early migration gives you time to tune prompts and catch any behavioral differences before the deadline. Start with V4-Flash if you want a low-risk drop-in replacement, then evaluate V4-Pro for workloads that benefit from higher accuracy or longer context.
## Bottom line
V4 is not an incremental update. The hybrid attention system, 1M context window, and 10x KV cache reduction represent a generational shift from V3.2. V4-Pro is the best DeepSeek model available today for accuracy-critical work. V4-Flash delivers better quality than V3.2 at a fraction of the cost.
With V3 endpoints shutting down on July 24, 2026, migration is not optional. The sooner you start testing, the smoother the transition will be. Pick your model, swap the endpoint, and verify your outputs. The V4 API guide has everything you need to get started.