Gemini offers a 1M-token context window. Claude offers 200K. Kimi K2.5 offers 256K. The temptation: just dump everything in and let the model figure it out.
This doesn’t work. Here’s why — and what to do instead.
Problem 1: Lost in the middle
Research consistently shows that LLMs pay most attention to the beginning and end of their context. Information in the middle gets less attention. This isn’t a minor effect — it’s a well-documented phenomenon that significantly impacts answer quality.
A 2023 Stanford/Berkeley paper (“Lost in the Middle”) demonstrated that when relevant information is placed in the middle of a long context, model accuracy drops by 20-30% compared to when the same information is at the beginning or end. This finding has been replicated across multiple model families and context lengths.
What this means in practice: a 200K context window with your answer buried at position 100K may perform worse than a 4K window with only the relevant content. The model isn’t ignoring the middle entirely — it’s just paying less attention to it. For factual retrieval tasks, this attention drop-off can be the difference between a correct and incorrect answer.
The effect is worse with certain types of information. Specific facts (dates, numbers, names) are more likely to be missed in the middle than general themes or patterns. If you’re stuffing a long context with reference material hoping the model will find a specific detail, you’re gambling against the attention curve.
Problem 2: Cost scales linearly (or worse)
Every input token costs money. Here’s what large context windows actually cost with current pricing:
| Context size | Claude Sonnet | Claude Opus | GPT-4.5 | Gemini 1.5 Pro |
|---|---|---|---|---|
| 10K tokens | $0.03 | $0.15 | $0.08 | $0.01 |
| 50K tokens | $0.15 | $0.75 | $0.38 | $0.06 |
| 100K tokens | $0.30 | $1.50 | $0.75 | $0.13 |
| 500K tokens | $1.50 | $7.50 | $3.75 | $0.63 |
| 1M tokens | $3.00 | $15.00 | $7.50 | $1.25 |
Send 100K tokens of context to Claude Opus 10 times and you’ve spent $15 — for context that’s likely 90% irrelevant to the actual question. Over a day of active development, that adds up fast.
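The per-million rates implied by the table make this easy to sanity-check yourself. A minimal sketch (the rates below are copied from the table above as assumptions, not authoritative pricing; check your provider's current price list):

```python
# Input-token cost calculator using the per-1M-token rates implied by
# the table above (assumed rates -- verify against current pricing).

RATE_PER_M = {  # USD per 1M input tokens
    "claude-sonnet": 3.00,
    "claude-opus": 15.00,
    "gpt-4.5": 7.50,
    "gemini-1.5-pro": 1.25,
}

def input_cost(model: str, tokens: int, requests: int = 1) -> float:
    """USD cost of sending `tokens` input tokens, `requests` times."""
    return RATE_PER_M[model] / 1_000_000 * tokens * requests

# 100K tokens to Opus, 10 times -> the $15 from the example above
print(round(input_cost("claude-opus", 100_000, requests=10), 2))
```

Running the same curated 10K context through Opus ten times instead would cost $1.50, a tenth of the price.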
Context engineering — selecting only what matters — is cheaper AND more effective than brute-forcing with large windows. A well-curated 10K context often outperforms a lazy 100K context both in quality and cost. See our guide on how tokenizers work to understand exactly what you’re paying for.
Problem 3: Latency increases significantly
Prefill time — the time the model spends processing your input before generating the first output token — scales with context length. This isn’t a subtle effect:
| Context size | Approximate prefill time | Time to first token |
|---|---|---|
| 4K tokens | ~0.2s | ~0.5s |
| 32K tokens | ~1.5s | ~2s |
| 100K tokens | ~5s | ~6s |
| 500K tokens | ~25s | ~27s |
| 1M tokens | ~50s | ~52s |
For interactive coding, waiting 5-6 seconds before the first token appears breaks your flow. At 100K+ tokens, the delay becomes genuinely disruptive. You’re sitting there watching a spinner while the model processes context that’s mostly irrelevant to your question.
The KV cache helps with subsequent turns in a conversation (the model doesn’t re-process cached context), but the first request with a large context always pays the full prefill cost. And if you’re making one-off requests — which is common in coding workflows — every request pays full price.
Problem 4: Quality degradation at extreme lengths
Beyond the lost-in-the-middle effect, there’s a broader quality degradation at very long contexts. Models trained on shorter sequences and fine-tuned for longer ones don’t maintain the same quality across the entire window.
Benchmarks that test “needle in a haystack” retrieval show near-perfect performance, but these are artificial tests. Real-world tasks — summarization, analysis, code understanding — show measurable quality drops as context grows:
- Summarization accuracy drops ~15% when going from 10K to 100K tokens
- Instruction following becomes less reliable with very long system prompts
- Reasoning chains are more likely to contain errors when the model is juggling large amounts of context
- Hallucination rates increase as the model has more material to confuse or conflate
This doesn’t mean long context is useless — it means you should use it deliberately, not as a default.
Problem 5: False confidence
When you give a model a huge context, it becomes more confident in its answers — even when the relevant information isn’t there. It’ll synthesize an answer from tangentially related content rather than saying “I don’t know.”
This is particularly dangerous in coding contexts. If you dump an entire codebase into the context and ask about a specific function, the model might construct a plausible-sounding answer based on patterns it sees in other parts of the code, even if the actual function behaves differently. The large context gives it enough material to be convincingly wrong.
With a smaller, curated context, the model is more likely to acknowledge uncertainty because it has less material to confabulate from. Paradoxically, giving the model less information can lead to more honest and accurate responses.
What to do instead
1. Context packing
Select only the information relevant to the current task. This is the core of context engineering — treating your context window as a precious resource and filling it with signal, not noise.
For coding tasks, this means including only the files and functions relevant to the current change, not the entire repository. Tools like Aider do this automatically with repo maps — they understand the structure of your codebase and include only what’s needed.
2. RAG (Retrieval-Augmented Generation)
Instead of pre-loading everything into context, retrieve relevant information on demand. RAG lets you search a large knowledge base and inject only the relevant results into a small context window.
A RAG pipeline with a 4K retrieval window feeding into a 32K context often outperforms dumping 200K tokens of raw documents into the context. The retrieval step acts as a quality filter, ensuring the model sees only relevant information. Build one locally with our Ollama RAG pipeline guide.
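The retrieve-then-prompt shape can be shown in miniature. A real pipeline uses embeddings and a vector store rather than word overlap, and the corpus below is hypothetical; this sketch only shows the control flow:

```python
# Bare-bones retrieve-then-prompt sketch. Real RAG uses embeddings and
# a vector store; word overlap stands in for similarity search here.

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    q = set(query.lower().split())
    ranked = sorted(docs,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

docs = [
    "The billing service retries failed charges three times.",
    "Login sessions expire after 24 hours of inactivity.",
    "Deploys run through the staging environment first.",
]

question = "how long do login sessions last"
hits = retrieve(question, docs)

# Only the retrieved snippets enter the (small) context window:
prompt = "Answer using only this context:\n" + "\n".join(hits) \
         + f"\n\nQ: {question}"
print(hits[0])
```

Swap the overlap score for an embedding similarity and the `docs` list for a vector index, and the structure is unchanged.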
3. Summarization and compression
For long conversations or documents, summarize older content instead of keeping it verbatim. A 500-token summary of a 50K-token conversation preserves the key decisions and context while freeing up space for new information.
Hierarchical summarization works well: summarize each section independently, then create a meta-summary. The model gets the gist of everything without the token cost of everything.
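The two-pass control flow looks like this. `summarize` is a stand-in for a real LLM call (an assumption made so the sketch is runnable); here it just truncates, which preserves the pipeline's structure but obviously not its quality:

```python
# Hierarchical summarization control flow. `summarize` is a placeholder
# for an LLM call; truncation keeps the sketch runnable.

def summarize(text: str, max_words: int = 20) -> str:
    return " ".join(text.split()[:max_words])  # replace with an LLM call

def hierarchical_summary(sections: list[str]) -> str:
    # Pass 1: summarize each section independently.
    partials = [summarize(s) for s in sections]
    # Pass 2: summarize the concatenation of the partial summaries.
    return summarize(" ".join(partials), max_words=40)

long_sections = ["word " * 500, "word " * 500, "word " * 500]
print(len(hierarchical_summary(long_sections).split()), "words")
```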
4. MCP for dynamic context
Use the Model Context Protocol (MCP) to pull in context dynamically via tools. Instead of pre-loading your database schema, API docs, and configuration files, let the model request what it needs through MCP tools. This is lazy loading for AI context — fetch on demand, not upfront.
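The lazy-loading idea, stripped of the protocol details, is just a tool registry the model can call into. The tool names and dispatch below are hypothetical plain Python, not the MCP SDK API:

```python
# The lazy-loading idea behind MCP tools, sketched in plain Python.
# Tool names and the registry are hypothetical, not the MCP SDK API.

TOOLS = {
    "get_schema": lambda: "CREATE TABLE users (id INT, email TEXT);",
    "get_config": lambda: "timeout = 30\nretries = 3",
}

def handle_tool_call(name: str) -> str:
    """Fetch context on demand instead of pre-loading everything."""
    return TOOLS[name]()

# The model asks for the schema only when a question requires it:
print(handle_tool_call("get_schema"))
```

Nothing enters the context window until the model decides it is needed, which keeps the baseline request small.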
5. Chunking and windowing
For tasks that require processing long documents, break them into chunks and process each chunk independently. Then combine the results. This is more work to implement but avoids the quality degradation of extreme context lengths.
For code review, review each file independently rather than loading the entire PR into one context. For document analysis, process each section and then synthesize findings.
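The chunk-process-combine pattern in miniature, with `analyze` as a placeholder for a per-chunk LLM call (an assumption; a real implementation would also overlap chunks to avoid cutting context mid-thought):

```python
# Chunk-process-combine: split a long document into fixed-size word
# chunks, process each independently, then merge the per-chunk results.
# `analyze` stands in for a per-chunk LLM call.

def chunk(text: str, size: int = 500) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def analyze(piece: str) -> str:
    return f"{len(piece.split())} words"  # placeholder per-chunk result

def process_document(text: str) -> list[str]:
    return [analyze(c) for c in chunk(text)]

print(process_document("lorem " * 1200))  # -> ['500 words', '500 words', '200 words']
```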
When large context IS useful
Large context windows aren’t useless — they’re a tool with specific use cases:
- Long documents: Analyzing a 50-page contract, spec, or research paper where you need to understand the whole thing holistically. Chunking would lose cross-references and overall structure.
- Full codebase review: Security audits or architecture reviews where understanding the relationships between components matters more than individual file analysis.
- Multi-file refactoring: When changes span many files and the model needs to see all of them to maintain consistency.
- Conversation continuity: Long coding sessions where maintaining history of decisions and changes prevents the model from contradicting earlier work.
- Translation and localization: Processing entire documents where consistency of terminology across sections is critical.
The key: use large context windows when you genuinely need all that information simultaneously, not as a substitute for good context engineering.
The 128K sweet spot
For most practical AI coding workflows, 128K tokens is more than enough — and often more than you should use. Here’s why:
A typical codebase interaction involves 5-15 relevant files, each averaging 200-500 lines. That’s roughly 10K-30K tokens of code, plus a system prompt (1-2K tokens), conversation history (2-10K tokens), and tool results (2-5K tokens). Total: 15K-47K tokens. Well within 128K.
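Adding up those ranges is a useful habit before every request. A trivial budget check using the estimates above:

```python
# Rough token-budget check for a typical coding request,
# using the (low, high) ranges estimated above.

budget = {
    "code": (10_000, 30_000),
    "system_prompt": (1_000, 2_000),
    "history": (2_000, 10_000),
    "tool_results": (2_000, 5_000),
}

low = sum(lo for lo, _ in budget.values())
high = sum(hi for _, hi in budget.values())
print(low, high)  # -> 15000 47000
```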
The cases where you genuinely need 200K+ tokens are rare: massive monorepo refactors, very long documents, or conversations that have been running for hours. For these, use the large window. For everything else, a well-chosen model with good context engineering at 32-64K tokens will give you better results at lower cost.
FAQ
Do models actually use the full context window?
They process all of it, but they don’t attend to all of it equally. The lost-in-the-middle effect means information at the beginning and end of the context gets more attention than information in the middle. For practical purposes, a model with a 200K context window might effectively use 200K tokens of storage but give meaningful attention to perhaps 60-70% of it. The remaining 30-40% is processed but with reduced attention, making it unreliable for factual retrieval. This is why curating what goes into the context matters more than how big the window is.
Is 1M context worth the cost?
For most use cases, no. The cost of sending 1M tokens per request is substantial ($1.25-$15 depending on the model), the latency is significant (30-60 seconds of prefill time), and the quality degradation at extreme lengths means you’re paying more for worse results. The exceptions are specific analytical tasks — processing entire books, analyzing complete codebases for security vulnerabilities, or legal document review where missing a clause has real consequences. For day-to-day coding and development work, 32K-128K with good context engineering is more effective and dramatically cheaper.
When should I use RAG instead of long context?
Use RAG when your knowledge base is larger than your context window, when the relevant information is a small fraction of the total, or when your knowledge base changes frequently. If you have 10,000 pages of documentation but any given question only needs 2-3 pages, RAG is dramatically more efficient than stuffing everything into context. Use long context when you need holistic understanding of a complete document, when the relationships between sections matter, or when the total content fits comfortably in the window with room to spare. In practice, most production AI systems use RAG for knowledge retrieval and reserve long context for the current working set.
Related: What is Context Engineering? · AI Context Window Explained · Build a Local RAG Pipeline with Ollama · How Tokenizers Work · AI Model Comparison · Context Packing Strategies · How to Reduce LLM API Costs · KV Cache Explained