DeepSeek V4 ships with a 1 million token context window on every tier. Not a premium add-on. Not a beta feature. It is the default for both V4-Pro and V4-Flash, available at no extra cost compared to shorter-context models. If you have API access, you have the full million tokens.
This guide breaks down how the architecture makes that possible, what actually fits inside 1M tokens, and where the model excels (or struggles) at long-range retrieval. For a broader look at the V4-Pro model itself, see the DeepSeek V4 Pro complete guide.
## What fits in 1 million tokens
A million tokens translates to roughly:
- ~750,000 words of English text
- ~1,500 pages of standard documents (500 words per page)
- An entire mid-size codebase (think 50,000+ lines across hundreds of files)
- Full Git repositories, including history diffs, README files, and test suites
- Multiple books loaded simultaneously for cross-referencing
- Months of chat logs or agent interaction history
For context on how token limits compare across models, see our AI context window explained overview.
To put it practically: you can load an entire monorepo into a single prompt and ask V4 to trace a bug across service boundaries. Or feed it a 1,200-page legal contract and ask about clause interactions on page 900. The constraint shifts from "what fits" to "what is the model actually good at retrieving."
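The word-and-page figures above can be sanity-checked with some quick arithmetic. This sketch assumes the common rule of thumb of ~0.75 English words per token and the article's 500-words-per-page convention; real counts depend on the tokenizer.

```python
# Rough token estimates for common long-context payloads.
# Assumes ~0.75 English words per token and ~500 words per page;
# actual counts vary by tokenizer, so treat these as ballpark figures.

WORDS_PER_TOKEN = 0.75
WORDS_PER_PAGE = 500

def tokens_for_words(words: int) -> int:
    """Approximate token count for a given English word count."""
    return round(words / WORDS_PER_TOKEN)

def tokens_for_pages(pages: int) -> int:
    """Approximate token count for standard 500-word pages."""
    return tokens_for_words(pages * WORDS_PER_PAGE)

print(tokens_for_words(750_000))  # 1,000,000 tokens: the full window
print(tokens_for_pages(1_200))    # 800,000 tokens: a 1,200-page contract fits
```

By this estimate, even the 1,200-page contract example leaves about 200K tokens of headroom for the conversation itself.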
## Architecture: CSA + HCA interleaved attention
The million-token window is not brute force. DeepSeek V4 uses two complementary attention mechanisms, interleaved across layers:
### Compressed Sparse Attention (CSA)
CSA handles fine-grained retrieval. Instead of attending to every token in the sequence, it performs a top-k sparse lookup over compressed key-value representations. Think of it as a search index over the context: the model identifies which chunks of the input are most relevant to the current query, then attends densely only to those chunks.
This keeps compute sub-quadratic for most layers while preserving the ability to pull exact details from anywhere in the context.
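The lookup described above can be sketched in a few lines of NumPy. This is a toy illustration of the top-k pattern, not DeepSeek's implementation: mean pooling stands in for the model's learned key compression, and real CSA operates over learned KV representations inside each layer.

```python
import numpy as np

def sparse_topk_attention(q, keys, values, chunk_size=4, k=2):
    """Toy CSA-style lookup: score compressed chunk summaries against the
    query, then attend densely only within the top-k matching chunks."""
    n, d = keys.shape
    n_chunks = n // chunk_size
    # Compress each chunk to one mean key (stand-in for learned compression).
    chunk_keys = keys[: n_chunks * chunk_size].reshape(n_chunks, chunk_size, d).mean(axis=1)
    # Pick the k chunks whose summaries best match the query.
    chunk_scores = chunk_keys @ q
    top_chunks = np.argsort(chunk_scores)[-k:]
    # Dense attention restricted to tokens inside the selected chunks.
    idx = np.concatenate([np.arange(c * chunk_size, (c + 1) * chunk_size) for c in top_chunks])
    scores = keys[idx] @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values[idx]

rng = np.random.default_rng(0)
out = sparse_topk_attention(rng.normal(size=8), rng.normal(size=(32, 8)), rng.normal(size=(32, 8)))
print(out.shape)  # (8,)
```

The key property: the dense softmax only ever sees `k * chunk_size` tokens, regardless of how long the full sequence is.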
### Heavy Compressed Attention (HCA)
HCA provides the global view. It aggressively compresses the full context into a much smaller set of summary representations, then attends over those summaries. This gives every layer access to a bird's-eye view of the entire input, even if the fine-grained details are handled by CSA.
The compression ratio is high (roughly 64:1 in most configurations), which is what makes the KV cache so small relative to standard attention.
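To make the 64:1 figure concrete, here is a minimal sketch of what that kind of compression does to KV cache size. Mean pooling is a hypothetical stand-in for the model's learned compression; the point is only the shape arithmetic.

```python
import numpy as np

def compress_context(kv, ratio=64):
    """Toy HCA-style compression: pool every `ratio` KV rows into one
    summary row, shrinking the cache by ~ratio x. Mean pooling stands in
    for the model's learned compression function."""
    n, d = kv.shape
    n_groups = n // ratio
    return kv[: n_groups * ratio].reshape(n_groups, ratio, d).mean(axis=1)

kv_cache = np.zeros((1_024, 16))        # 1,024 token positions
summaries = compress_context(kv_cache)  # 16 summary positions
print(kv_cache.shape[0] // summaries.shape[0])  # 64
```

At a 64:1 ratio, a million-token context yields roughly 15,600 summary positions, which is why attending over the compressed view stays cheap.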
### How they work together
CSA and HCA layers alternate throughout the network. A typical forward pass looks like:
- HCA layer compresses the full context into global summaries
- CSA layer does sparse retrieval over the detailed representations
- Repeat, with each layer refining based on what previous layers surfaced
This interleaving means the model never relies on just one strategy. It always has both a compressed global map and a sparse fine-grained index available.
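The alternation described above can be written down as a layer schedule. The strict 1:1 alternation here is an assumption for illustration; the actual interleaving ratio is not published.

```python
def interleaved_schedule(n_layers: int) -> list[str]:
    """Return the attention type at each layer under a simple alternating
    HCA/CSA schedule (an assumed 1:1 pattern; the real ratio may differ).
    Even layers refresh the global summary view, odd layers do sparse
    fine-grained retrieval over what the previous layers surfaced."""
    return ["HCA" if i % 2 == 0 else "CSA" for i in range(n_layers)]

print(interleaved_schedule(6))  # ['HCA', 'CSA', 'HCA', 'CSA', 'HCA', 'CSA']
```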
## Efficiency gains over V3.2
The architectural changes translate directly into compute and memory savings:
| Metric | V4-Pro vs V3.2 | V4-Flash vs V3.2 |
|---|---|---|
| FLOPs (at 1M context) | 27% of V3.2 | 10% of V3.2 |
| KV cache size | 10% of V3.2 | 7% of V3.2 |
| Latency per token (1M) | ~3x faster | ~6x faster |
V4-Flash is the efficiency champion here. At 7% KV cache and 10% FLOPs, it can serve million-token requests on hardware that would choke on V3.2 at 256K. V4-Pro trades some of that efficiency for stronger retrieval accuracy, but still runs at roughly a quarter of V3.2's compute budget.
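To see what those cache ratios mean in hardware terms, here is a back-of-the-envelope sizing calculation. The baseline parameters (layers, KV heads, head dimension, fp16) are illustrative assumptions, not published V3.2 specs; only the 10% and 7% ratios come from the table.

```python
def kv_cache_gb(tokens, layers=60, kv_heads=8, head_dim=128, bytes_per=2):
    """Standard-attention KV cache size in GB: K and V tensors (x2)
    stored per layer per token, at fp16 (2 bytes per value)."""
    return tokens * layers * kv_heads * head_dim * bytes_per * 2 / 1e9

baseline = kv_cache_gb(1_000_000)
print(f"baseline at 1M tokens: {baseline:.1f} GB")   # ~245.8 GB
print(f"V4-Pro   (10%): {baseline * 0.10:.1f} GB")   # ~24.6 GB
print(f"V4-Flash  (7%): {baseline * 0.07:.1f} GB")   # ~17.2 GB
```

Under these assumptions, the compressed cache drops from "multi-GPU only" territory to something a single accelerator can hold.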
For a comparison of how these efficiency numbers stack up against Google's latest, see DeepSeek V4 vs Gemini 3.1 Pro.
## Retrieval accuracy: MRCR benchmark
Raw context length means nothing if the model cannot actually use it. The Multi-Range Context Retrieval (MRCR) benchmark tests whether models can find specific facts buried at various depths in long inputs.
| Context length | V4-Pro accuracy | V4-Flash accuracy |
|---|---|---|
| 128K tokens | 94% | 89% |
| 512K tokens | 82% | 74% |
| 1M tokens | 66% | 58% |
At 128K, V4-Pro is near-perfect. Performance degrades as context grows, which is expected with any sparse attention approach. The 66% accuracy at 1M tokens means roughly one-third of needle-in-a-haystack queries will miss. This is a real limitation worth planning around.
Practical takeaway: If your task requires guaranteed retrieval of a specific detail from a million-token input, consider chunking and re-ranking rather than relying on a single pass. For strategies on handling this, check out our context window management guide.
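The chunk-and-rerank fallback mentioned above can be sketched as follows. Everything here is a minimal illustration: `score` is whatever relevance function you supply (embedding similarity, BM25, etc.), and the word-overlap scorer is a toy stand-in.

```python
def chunk_and_rerank(document: str, query: str, score, chunk_chars=2000, top_n=5):
    """Split a long document into fixed-size chunks, score each against
    the query, and keep only the best chunks for a short, high-accuracy
    prompt instead of one million-token pass."""
    chunks = [document[i:i + chunk_chars] for i in range(0, len(document), chunk_chars)]
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    return ranked[:top_n]

def overlap(query: str, chunk: str) -> int:
    """Toy relevance score: count query words appearing in the chunk."""
    return sum(w in chunk.lower() for w in query.lower().split())

doc = "alpha section. " * 300 + "the indemnity clause caps liability. " + "beta section. " * 300
best = chunk_and_rerank(doc, "indemnity liability cap", overlap, top_n=2)
print(any("indemnity" in c for c in best))  # True
```

The retrieved chunks then go into a prompt well under 128K tokens, back in the range where MRCR accuracy is near-perfect.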
## CorpusQA: long-document question answering
CorpusQA tests a different skill: answering questions that require synthesizing information across a long document, not just retrieving a single fact.
| Model | CorpusQA accuracy (1M tokens) |
|---|---|
| Claude Opus 4 | 71.7% |
| DeepSeek V4-Pro | 62.0% |
| Gemini 2.5 Pro | 53.8% |
| GPT-5 | 51.2% |
V4-Pro lands solidly in second place. It beats Gemini by over 8 points but trails Opus by nearly 10. For tasks that require deep synthesis across very long inputs, Opus remains the stronger choice. For everything else, V4-Pro offers a compelling balance of cost, speed, and accuracy.
## Practical use cases
### Repo-level coding
Load an entire repository (source files, tests, configs, docs) into a single prompt. Ask V4 to:
- Trace how a function is called across multiple services
- Identify dead code or unused imports across the full codebase
- Generate a migration plan that accounts for all downstream dependencies
- Write integration tests that reference actual helper functions and types
This is where the million-token window changes workflows most dramatically. No more manually selecting which files to include.
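A minimal sketch of loading a repository into one prompt, assuming a plain concatenate-with-path-headers format (the header convention and extension filter are illustrative choices; real pipelines also skip vendored code and enforce a token budget):

```python
from pathlib import Path

def pack_repo(root: str, exts=(".py", ".md", ".toml")) -> str:
    """Concatenate a repo's text files into one prompt string, with a
    path header before each file so the model can reference files by
    name when tracing calls across service boundaries."""
    parts = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in exts:
            parts.append(f"### FILE: {path}\n{path.read_text(errors='replace')}")
    return "\n\n".join(parts)
```

The resulting string goes straight into the prompt; with the full window available, the file-selection step that shorter-context models force on you simply disappears.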
### Long document analysis
Legal contracts, research papers, financial filings, technical specifications. Load the full document and ask questions that span sections. V4-Pro handles cross-referencing well, though retrieval of specific clauses deep in the document may require explicit page or section references in your prompt.
### Agent loops and extended conversations
Autonomous agents that run for hundreds of turns accumulate massive context. With 1M tokens, an agent can maintain its full interaction history, tool call results, and reasoning traces without summarization or truncation. This reduces the "amnesia" problem that plagues agents on shorter-context models.
## YaRN: extending beyond 1M
DeepSeek V4 natively supports up to 1.01M tokens. For inputs that push right up against that boundary, the model uses YaRN (Yet another RoPE extensioN) to interpolate positional encodings beyond the training distribution.
YaRN works by rescaling the rotary position embeddings so that positions beyond the training length map into the learned range. It is not a free lunch: accuracy degrades faster in the extended region compared to the native range. But it provides a soft boundary rather than a hard cutoff, which is useful for edge cases where your input is just slightly over 1M.
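The rescaling idea can be sketched with the rotary frequency formula. This shows only uniform interpolation, the core intuition; actual YaRN scales frequency bands non-uniformly and adds an attention temperature term, and the parameters here are illustrative.

```python
import numpy as np

def yarn_scaled_freqs(dim, train_len, target_len, base=10000.0):
    """Sketch of YaRN-style RoPE interpolation: divide the rotary
    frequencies by target_len / train_len so positions up to
    `target_len` rotate no further than the model saw in training."""
    scale = target_len / train_len  # e.g. 1.01M target over 1.0M trained
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    return freqs / scale

native = yarn_scaled_freqs(64, 1_000_000, 1_000_000)
extended = yarn_scaled_freqs(64, 1_000_000, 1_010_000)
print(np.all(extended < native))  # True: slower rotation covers more positions
```

Slower rotation is exactly the "soft boundary": positions past the training length stay in-distribution, at the cost of slightly blurring fine positional distinctions everywhere.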
In practice, most users will not need to think about YaRN. The 1.01M native limit covers nearly all realistic use cases. It matters most for pipeline builders who need to handle variable-length inputs without pre-truncation logic.
## FAQ
### Does the 1M context window cost more than shorter contexts?
No. The million-token context is the default for both V4-Pro and V4-Flash. You pay per token as usual, but there is no premium tier or feature flag to enable it. If your prompt is 1M tokens, you pay for 1M input tokens at the standard rate.
### Should I always use the full 1M tokens if I can?
Not necessarily. Retrieval accuracy drops at longer contexts (66% at 1M vs 94% at 128K on MRCR). If your task only needs 200K tokens of context, sending 200K will give you better accuracy and lower latency. Use the full window when you genuinely need the breadth, not as a default.
### How does V4's context window compare to Gemini 3.1 Pro?
Both offer million-token context, but the architectures differ. V4 uses CSA+HCA interleaving while Gemini uses a different sparse attention variant. On CorpusQA at 1M tokens, V4-Pro scores 62.0% vs Gemini's 53.8%. On raw retrieval (MRCR), the gap is narrower. See our full comparison for detailed benchmarks.