Every token in your context window costs money and competes for the model's attention. Stuffing everything in doesn't work: quality degrades as context grows. Here's how to manage context effectively.
The problem
| Model | Context window | Approx. input cost at full context |
|---|---|---|
| Claude Sonnet | 200K tokens | ~$0.60/request |
| GPT-5 | 128K tokens | ~$0.64/request |
| Qwen 3.6 Plus | 1M tokens | Free (preview) |
| DeepSeek | 128K tokens | ~$0.07/request |
Filling a 200K context window on every request is expensive and often counterproductive. Models perform better with focused, relevant context than with everything dumped in.
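The per-request figures in the table are simple arithmetic: prompt tokens times the input price. A minimal sketch, assuming a rate of $3 per million input tokens (the rate implied by ~$0.60 for 200K tokens; check your provider's current pricing) and a hypothetical helper name `input_cost_usd`:

```python
def input_cost_usd(prompt_tokens: int, usd_per_million_tokens: float) -> float:
    """Input-side cost of one request; output tokens are billed separately."""
    return prompt_tokens / 1_000_000 * usd_per_million_tokens

# Assumed rate: $3 per million input tokens, implied by ~$0.60 for 200K tokens.
ASSUMED_RATE = 3.00

full_window = input_cost_usd(200_000, ASSUMED_RATE)  # the whole window, every request
focused = input_cost_usd(20_000, ASSUMED_RATE)       # a trimmed, task-specific prompt
print(f"full window: ${full_window:.2f}, focused: ${focused:.2f}")
```

Under these assumptions, a prompt one tenth the size costs one tenth as much on the input side.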
Strategy 1: Include only what's relevant
This is the simplest optimization: instead of sending your entire codebase, send only the files related to the task.
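def get_relevant_context(task, codebase_path):
    # extract_file_references, get_dependencies, get_test_files and read_file
    # are placeholder helpers you supply for your own codebase.
    # Bad: send everything
    # all_files = read_all_files(codebase_path)
    # Good: send only relevant files
    relevant = []
    # 1. Files mentioned in the task
    mentioned = extract_file_references(task)
    relevant.extend(mentioned)
    # 2. Files that import/depend on mentioned files
    deps = get_dependencies(mentioned)
    relevant.extend(deps)
    # 3. Relevant test files
    tests = get_test_files(mentioned)
    relevant.extend(tests)
    # Drop duplicates (a file can be both a dependency and a test), keeping order
    relevant = list(dict.fromkeys(relevant))
    return "\n\n".join(read_file(f) for f in relevant)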
This typically reduces context by 80-90% while keeping everything the model needs.
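The first step, finding files mentioned in the task, can be as simple as a regular expression. The sketch below is one possible `extract_file_references`, not a definitive implementation: the extension list is an assumption, and it will miss files referred to only by description.

```python
import re

# Assumed set of extensions; adjust for your codebase.
_FILE_REF = re.compile(
    r"[\w./-]+\.(?:py|js|ts|go|rs|java|md|json|yaml|yml|toml)\b"
)

def extract_file_references(task: str) -> list[str]:
    """Return file paths mentioned in the task text, in order, without duplicates."""
    return list(dict.fromkeys(_FILE_REF.findall(task)))

task = (
    "Fix the timeout in src/auth/login.py and update tests/test_login.py. "
    "Also touch src/auth/login.py again."
)
mentioned = extract_file_references(task)
```

Here `mentioned` holds the two distinct paths in the order they first appear; the trailing sentence period is not captured because the pattern must end on a known extension.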
Strategy 2: Summarize old context
For long conversations or agent sessions, summarize older messages:
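def manage_conversation_context(messages, max_tokens=50000):
    # count_tokens and call_llm are placeholders for your tokenizer and LLM client
    total_tokens = count_tokens(messages)
    if total_tokens <= max_tokens:
        return messages
    # Keep system prompt + last 10 messages
    system = messages[0]
    rest = messages[1:]
    recent = rest[-10:]
    old = rest[:-10]
    if not old:
        # Nothing older than the recent window, so there is nothing to summarize
        return messages
    # Summarize old messages as a readable transcript, not a Python list repr
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary = call_llm(
        "Summarize this conversation, keeping key decisions and code changes:\n"
        + transcript
    )
    return [
        system,
        {"role": "system", "content": f"Previous conversation summary:\n{summary}"},
        *recent,
    ]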
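The function above relies on a `count_tokens` helper. A rough stand-in, assuming about four characters per token (a common heuristic for English text, not an exact count) and that each message's `content` is a plain string:

```python
import math

def count_tokens(messages: list[dict]) -> int:
    """Rough token estimate: ~4 characters per token (assumed heuristic)."""
    chars = sum(len(m["content"]) for m in messages)
    return math.ceil(chars / 4)

history = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "x" * 400},
]
estimate = count_tokens(history)
```

For billing-accurate numbers, use the tokenizer or token-counting endpoint your provider offers; the estimate is only meant to decide when to summarize.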
Strategy 3: Hierarchical context
Give the model a high-level overview first, then details only for the relevant section:
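# Approximate token budget per section:
#   project_readme  ~500
#   file_tree       ~200
#   relevant_files  ~2,000-10,000
#   git_log         ~500
# The budget notes live in comments, not inside the f-string,
# so they are not sent to the model as part of the prompt.
context = f"""
## Project overview (always included)
{project_readme}
## File tree (always included)
{file_tree}
## Relevant file contents (task-specific)
{relevant_files}
## Recent changes (last 5 commits)
{git_log}
"""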
The model uses the overview to understand the project and the detailed files to do the actual work.
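To keep a layered prompt within a budget, the sections can be assembled in priority order. This is a sketch under stated assumptions: `build_context` and `estimate_tokens` are hypothetical names, and the four-characters-per-token estimate is a rough heuristic.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token (assumed heuristic)."""
    return (len(text) + 3) // 4

def build_context(sections: list[tuple[str, str]], max_tokens: int) -> str:
    """Join (heading, body) sections in priority order, skipping any that do not fit."""
    parts, used = [], 0
    for heading, body in sections:
        block = f"## {heading}\n{body}"
        cost = estimate_tokens(block)
        if used + cost > max_tokens:
            continue  # skip this section; a later, smaller one may still fit
        parts.append(block)
        used += cost
    return "\n\n".join(parts)

sections = [
    ("Project overview", "x" * 40),
    ("File tree", "y" * 40),
    ("Relevant file contents", "z" * 4000),  # too large for the budget below
    ("Recent changes", "w" * 40),
]
context = build_context(sections, max_tokens=100)
```

Skipping an oversized section rather than stopping means the overview layers always survive; a real implementation might truncate or summarize the large section instead of dropping it.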
Strategy 4: Prompt caching
If your system prompt is large and shared across requests, use prompt caching to avoid re-processing it:
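# Anthropic prompt caching
import anthropic
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,  # required by the Messages API
    system=[{
        "type": "text",
        "text": large_system_prompt,
        "cache_control": {"type": "ephemeral"},  # cache the prompt up to this block
    }],
    messages=user_messages,
)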
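Cached tokens cost 90% less on subsequent requests. If your system prompt is 10K tokens and you make 100 requests, caching saves roughly $2.70 per batch at $3 per million input tokens (about $0.027 per request), slightly less once the premium for the first cache write is counted.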
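The savings can be checked with a little arithmetic. The sketch below assumes a base price of $3 per million input tokens, a 1.25x premium on the request that writes the cache, and that every later request arrives while the cache is still warm; treat all three as assumptions to verify against current pricing.

```python
PRICE_PER_MTOK = 3.00    # assumed base input price, USD per million tokens
CACHE_WRITE_MULT = 1.25  # assumed premium on the request that writes the cache
CACHE_READ_MULT = 0.10   # cached reads cost 90% less

def prompt_cost(tokens: int, requests: int, cached: bool) -> float:
    """Total cost of sending the same system prompt on `requests` calls."""
    if requests <= 0:
        return 0.0
    base = tokens / 1_000_000 * PRICE_PER_MTOK
    if not cached:
        return base * requests
    # First request writes the cache; the rest read from it.
    return base * CACHE_WRITE_MULT + base * CACHE_READ_MULT * (requests - 1)

uncached = prompt_cost(10_000, 100, cached=False)
cached = prompt_cost(10_000, 100, cached=True)
saving = uncached - cached
```

With these numbers the uncached batch costs $3.00 and the cached batch about $0.33, a saving of about $2.67 on the system prompt alone.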
Strategy 5: RAG instead of context stuffing
Instead of putting all your documentation in the context, use RAG to retrieve only the relevant parts:
# Bad: stuff all docs in context (50K tokens)
context = load_all_documentation()
# Good: retrieve relevant docs (2K tokens)
relevant_docs = vector_db.search(user_query, top_k=3)
context = "\n\n".join(relevant_docs)
See our RAG guide and vector database comparison for implementation.
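To make the retrieval step concrete, here is a toy in-memory stand-in for `vector_db.search`. It is only a sketch: word-count vectors and cosine similarity take the place of a real embedding model and vector database, and the function names are hypothetical.

```python
import math
import re
from collections import Counter

def _vector(text: str) -> Counter:
    """Bag-of-words vector; a stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def search(docs: list[str], query: str, top_k: int = 3) -> list[str]:
    """Return the top_k documents most similar to the query."""
    q = _vector(query)
    ranked = sorted(docs, key=lambda d: _cosine(q, _vector(d)), reverse=True)
    return ranked[:top_k]

docs = [
    "How to rotate API keys and manage secrets",
    "Deploying the service with Docker",
    "Billing and invoices FAQ",
]
relevant_docs = search(docs, "rotate api keys", top_k=1)
context = "\n\n".join(relevant_docs)
```

Only the matching document reaches the prompt; the other two never cost a token.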
How much context is too much?
Research shows model quality degrades with very long contexts, especially for information in the middle (the "lost in the middle" problem):
| Context length | Quality | Cost |
|---|---|---|
| 1-10K tokens | ✅ Best | Low |
| 10-50K tokens | ✅ Good | Medium |
| 50-100K tokens | ⚠️ Starts degrading | High |
| 100K+ tokens | ⚠️ Middle content often ignored | Very high |
Rule of thumb: Keep context under 50K tokens for best quality. Use RAG or summarization to stay within this range.
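The thresholds in the table are easy to enforce as a pre-flight check. A minimal sketch, with a hypothetical `context_warning` helper that mirrors the table's bands; the cutoffs are this article's rule of thumb, not hard limits.

```python
from typing import Optional

def context_warning(token_count: int) -> Optional[str]:
    """Return a warning for oversized contexts, or None when under 50K tokens."""
    if token_count > 100_000:
        return "over 100K tokens: content in the middle is often ignored"
    if token_count > 50_000:
        return "over 50K tokens: quality starts degrading; consider RAG or summarization"
    return None

warning = context_warning(60_000)
if warning:
    print(f"Context check: {warning}")
```

Calling this before each request turns the rule of thumb into something a log or CI check can surface.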
The practical checklist
Before every LLM call, ask:
- Is everything in the context actually needed for this task?
- Can old conversation history be summarized?
- Can I use RAG instead of including all documents?
- Is the system prompt cached?
- Am I including file contents that the model won't use?
Cutting context by 50% cuts input-token costs by roughly 50%, typically with no quality loss. Often quality improves, because the model focuses on what matters.
Related: Context Engineering Explained · Prompt Caching Explained · What is RAG? · How to Reduce LLM API Costs · Agent Memory Patterns