May 15, 2026 · 3 min read

Context Window Management — How to Fit More Into Your LLM's Memory

Every token in your context window costs money and competes for the model’s attention. Stuffing everything in doesn’t work — quality degrades as context grows. Here’s how to manage context effectively.

The problem

Model	Context window	Approx. input cost at full context
Claude Sonnet	200K tokens	~$0.60/request
GPT-5	128K tokens	~$0.64/request
Qwen 3.6 Plus	1M tokens	Free (preview)
DeepSeek	128K tokens	~$0.07/request

Filling a 200K context window on every request is expensive and often counterproductive. Models perform better with focused, relevant context than with everything dumped in.
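The per-request figures in the table are token count times input price. Here is a minimal sketch of that arithmetic. It assumes $3 per million input tokens (the rate implied by the Claude Sonnet row, so check your provider's current pricing), and `input_cost` is a hypothetical helper name:

```python
def input_cost(tokens, price_per_million_usd):
    """Return the input-token cost in USD for a single request."""
    return tokens / 1_000_000 * price_per_million_usd


# Assumed price: $3 per million input tokens (implied by the table above).
full_window = input_cost(200_000, 3.0)
focused = input_cost(20_000, 3.0)
print(f"full window: ${full_window:.2f}, focused: ${focused:.2f}")
```

Output tokens are billed separately, so this covers only the part of the bill that context management can shrink.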

Strategy 1: Include only what’s relevant

The simplest optimization. Instead of sending your entire codebase, send only the files related to the task.

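def get_relevant_context(task, codebase_path):
    # extract_file_references, get_dependencies, get_test_files and read_file
    # are project-specific helpers you supply.

    # Bad: send everything
    # all_files = read_all_files(codebase_path)

    # Good: send only relevant files
    relevant = []

    # 1. Files mentioned in the task
    mentioned = extract_file_references(task)
    relevant.extend(mentioned)

    # 2. Files that import/depend on mentioned files
    deps = get_dependencies(mentioned)
    relevant.extend(deps)

    # 3. Relevant test files
    tests = get_test_files(mentioned)
    relevant.extend(tests)

    # A file can show up in more than one step: drop duplicates, keep order
    relevant = list(dict.fromkeys(relevant))

    # Label each file with its path so the model can tell them apart
    return "\n\n".join(f"# {f}\n{read_file(f)}" for f in relevant)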

On a large codebase this can reduce context by 80-90% while keeping the files the model needs for the task.
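The helpers in that function are left undefined. As one possible starting point for step 1, here is a rough sketch of `extract_file_references` that pulls path-like strings out of the task text with a regular expression. The extension list is an assumption for illustration, so adjust it to your project:

```python
import re

# Assumed set of extensions worth treating as file references.
FILE_PATTERN = re.compile(r"[\w./-]+\.(?:py|js|ts|md|json|yaml|yml|toml)\b")


def extract_file_references(task):
    """Return path-like strings found in the task, in order, without duplicates."""
    matches = FILE_PATTERN.findall(task)
    return list(dict.fromkeys(matches))


task = "Fix the bug in src/auth.py, then update tests/test_auth.py."
print(extract_file_references(task))
```

A regex only finds files the task names explicitly. Resolving dependencies and tests (steps 2 and 3) usually needs an import graph or naming conventions specific to your repository.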

Strategy 2: Summarize old context

For long conversations or agent sessions, summarize older messages:

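def manage_conversation_context(messages, max_tokens=50000, keep_recent=10):
    total_tokens = count_tokens(messages)

    # Nothing to do if we fit, or if there are no older messages to summarize
    if total_tokens <= max_tokens or len(messages) <= keep_recent + 1:
        return messages

    # Keep system prompt + last 10 messages (assumes messages[0] is the system prompt)
    system = messages[0]
    recent = messages[-keep_recent:]
    old = messages[1:-keep_recent]

    # Summarize old messages as a readable transcript rather than a raw list repr
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary = call_llm(f"Summarize this conversation, keeping key decisions and code changes:\n{transcript}")

    # Note: some APIs (e.g. Anthropic's Messages API) take the system prompt as a
    # separate parameter; there, append the summary to that prompt instead.
    return [
        system,
        {"role": "system", "content": f"Previous conversation summary:\n{summary}"},
        *recent
    ]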
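The function above relies on a `count_tokens` helper that is not shown. Below is a minimal stand-in that uses the rough rule of thumb of about 4 characters per token for English text. This is an approximation that assumes each message's `content` is a plain string; for exact counts, use your provider's tokenizer or token-counting endpoint:

```python
def count_tokens(messages, chars_per_token=4):
    """Estimate the token count of a list of {"role", "content"} messages.

    Assumes ~4 characters per token, a heuristic rather than an exact count.
    """
    total_chars = sum(len(m["content"]) for m in messages)
    # Ceiling division so short messages never round down to zero tokens.
    return -(-total_chars // chars_per_token)


history = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Refactor the login handler to use async IO."},
]
print(count_tokens(history))
```

An estimate like this is good enough to decide when to summarize. Leave headroom below the model's hard limit, since the real count can differ, especially for code and non-English text.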

Strategy 3: Hierarchical context

Give the model a high-level overview first, then details only for the relevant section:

# Approximate sizes: overview ~500 tokens, file tree ~200 tokens,
# relevant files ~2,000-10,000 tokens, recent changes ~500 tokens.
# (Kept outside the string so the notes are not sent to the model.)
context = f"""
## Project overview (always included)
{project_readme}

## File tree (always included)
{file_tree}

## Relevant file contents (task-specific)
{relevant_files}

## Recent changes (last 5 commits)
{git_log}
"""

The model uses the overview to understand the project and the detailed files to do the actual work.
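One way to keep the task-specific section within its size range is to add files until a token budget runs out. The sketch below assumes a hypothetical `build_context` helper, a 4-characters-per-token estimate, and that `relevant_files` is a dict of path to contents ordered from most to least relevant:

```python
def estimate_tokens(text):
    """Rough estimate: ~4 characters per token (an assumption, not exact)."""
    return len(text) // 4


def build_context(project_readme, file_tree, relevant_files, git_log, budget=10_000):
    """Assemble overview, file tree, as many files as fit, and recent changes."""
    sections = [
        "## Project overview (always included)\n" + project_readme,
        "## File tree (always included)\n" + file_tree,
    ]
    recent = "## Recent changes (last 5 commits)\n" + git_log

    # The always-included sections are charged against the budget first.
    used = sum(estimate_tokens(s) for s in sections) + estimate_tokens(recent)

    file_parts = []
    for path, content in relevant_files.items():
        part = f"### {path}\n{content}"
        cost = estimate_tokens(part)
        if used + cost > budget:
            break  # stop at the first file that does not fit
        file_parts.append(part)
        used += cost

    if file_parts:
        sections.append(
            "## Relevant file contents (task-specific)\n" + "\n\n".join(file_parts)
        )
    sections.append(recent)
    return "\n\n".join(sections)
```

Stopping at the first file that does not fit keeps the most relevant files intact instead of truncating one in the middle. Skipping oversized files and continuing is a reasonable alternative.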

Strategy 4: Prompt caching

If your system prompt is large and shared across requests, use prompt caching to avoid re-processing it:

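# Anthropic prompt caching
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,  # required by the Messages API
    system=[{
        "type": "text",
        "text": large_system_prompt,
        "cache_control": {"type": "ephemeral"}  # cache everything up to this block
    }],
    messages=user_messages,
)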

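Cached tokens cost 90% less to read on subsequent requests, while the first request pays about 25% extra to write the cache. If your system prompt is 10K tokens and you make 100 requests while the cache stays warm, the prompt costs about $0.33 instead of $3.00 at $3 per million input tokens, a saving of roughly $2.67 per batch.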
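The savings arithmetic can be worked through as follows. This sketch assumes $3 per million input tokens, a cache write billed at 1.25x and cache reads at 0.10x of the base input price, and that every request arrives while the cache entry is still alive. `prompt_cost` is a hypothetical helper, and the multipliers should be checked against current pricing:

```python
def prompt_cost(prompt_tokens, requests, price_per_million_usd=3.0,
                cached=False, write_multiplier=1.25, read_multiplier=0.10):
    """Cost in USD of sending the same prompt prefix on every request."""
    if requests == 0:
        return 0.0
    per_request = prompt_tokens / 1_000_000 * price_per_million_usd
    if not cached:
        return per_request * requests
    # The first request writes the cache; the remaining ones read from it.
    return per_request * (write_multiplier + (requests - 1) * read_multiplier)


uncached = prompt_cost(10_000, 100)
cached = prompt_cost(10_000, 100, cached=True)
print(f"uncached ${uncached:.2f}, cached ${cached:.2f}, saved ${uncached - cached:.2f}")
```

The saving shrinks if requests are spaced so far apart that the cache expires between them, because each expiry triggers another full-price write.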

Strategy 5: RAG instead of context stuffing

Instead of putting all your documentation in the context, use RAG to retrieve only the relevant parts:

# Bad: stuff all docs in context (50K tokens)
context = load_all_documentation()

# Good: retrieve relevant docs (2K tokens)
relevant_docs = vector_db.search(user_query, top_k=3)
context = "\n\n".join(relevant_docs)

See our RAG guide and vector database comparison for implementation.

How much context is too much?

Research on long-context models shows that quality degrades as context grows, especially for information placed in the middle of the prompt (the “lost in the middle” problem). As a rough guide:

Context length	Quality	Cost
1-10K tokens	✅ Best	Low
10-50K tokens	✅ Good	Medium
50-100K tokens	⚠️ Starts degrading	High
100K+ tokens	⚠️ Middle content often ignored	Very high

Rule of thumb: Keep context under 50K tokens for best quality. Use RAG or summarization to stay within this range.
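The table and rule of thumb can be turned into a simple pre-flight check. This sketch uses the tier boundaries from the table above; the function names are hypothetical, and the thresholds are the article's rough guide rather than measured limits:

```python
def context_tier(tokens):
    """Map a token count to the quality tier from the table above."""
    if tokens <= 10_000:
        return "best"
    if tokens <= 50_000:
        return "good"
    if tokens <= 100_000:
        return "starts degrading"
    return "middle content often ignored"


def needs_trimming(tokens, limit=50_000):
    """True when the context exceeds the 50K rule of thumb."""
    return tokens > limit


for size in (8_000, 42_000, 75_000, 150_000):
    print(size, context_tier(size), needs_trimming(size))
```

Running a check like this before each call gives you a natural place to trigger summarization or retrieval when the context grows past the limit.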

The practical checklist

Before every LLM call, ask:

Is everything in the context actually needed for this task?
Can old conversation history be summarized?
Can I use RAG instead of including all documents?
Is the system prompt cached?
Am I including file contents that the model won’t use?

Cutting context by 50% saves about 50% on input-token costs, usually with no quality loss. Often quality improves because the model focuses on what matters.

Related: Context Engineering Explained · Prompt Caching Explained · What is RAG? · How to Reduce LLM API Costs · Agent Memory Patterns
