
Context Window Management: How to Fit More Into Your LLM's Memory


Every token in your context window costs money and competes for the model's attention. Stuffing everything in doesn't work, because quality degrades as context grows. Here's how to manage context effectively.

The problem

| Model | Context window | Cost at full context |
|---|---|---|
| Claude Sonnet | 200K tokens | ~$0.60/request |
| GPT-5 | 128K tokens | ~$0.64/request |
| Qwen 3.6 Plus | 1M tokens | Free (preview) |
| DeepSeek | 128K tokens | ~$0.07/request |

Filling a 200K context window with every request is expensive and often counterproductive. Models perform better with focused, relevant context than with everything dumped in.

Strategy 1: Include only what's relevant

The simplest optimization. Instead of sending your entire codebase, send only the files related to the task.

def get_relevant_context(task, codebase_path):
    # Bad: send everything
    # all_files = read_all_files(codebase_path)
    
    # Good: send only relevant files
    relevant = []
    
    # 1. Files mentioned in the task
    mentioned = extract_file_references(task)
    relevant.extend(mentioned)
    
    # 2. Files that import/depend on mentioned files
    deps = get_dependencies(mentioned)
    relevant.extend(deps)
    
    # 3. Relevant test files
    tests = get_test_files(mentioned)
    relevant.extend(tests)
    
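    # Drop duplicates (a file can be both a dependency and a test), keeping order
    unique = list(dict.fromkeys(relevant))
    return "\n\n".join(read_file(f) for f in unique)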

Depending on the codebase, this can cut context by 80-90% while still giving the model what it needs for the task.
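The helpers in the snippet above are left undefined. As a rough sketch of the first step, here is one way `extract_file_references` could work, assuming tasks mention files by path with a recognizable extension. The regex and the extension list are illustrative assumptions, not a complete solution.

```python
import re

# Hypothetical helper: matches path-like tokens that end in a known source extension.
# The extension list is an assumption; adjust it to your project.
FILE_PATTERN = re.compile(r"[\w./-]+\.(?:py|js|ts|go|rs|java|md|json|yaml|yml|toml)\b")

def extract_file_references(task: str) -> list[str]:
    """Return file paths mentioned in a task description, in order, without duplicates."""
    matches = FILE_PATTERN.findall(task)
    return list(dict.fromkeys(matches))

task = "Fix the login bug in auth/login.py and update tests/test_login.py. See auth/login.py line 40."
print(extract_file_references(task))
```

A real implementation would probably also resolve bare module or class names to files, which a regex alone cannot do.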

Strategy 2: Summarize old context

For long conversations or agent sessions, summarize older messages:

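def manage_conversation_context(messages, max_tokens=50000, keep_recent=10):
    total_tokens = count_tokens(messages)

    # Nothing to trim if we fit the budget or there is no older history to summarize
    if total_tokens <= max_tokens or len(messages) <= keep_recent + 1:
        return messages

    # Keep system prompt + last `keep_recent` messages
    system = messages[0]
    recent = messages[-keep_recent:]
    old = messages[1:-keep_recent]

    # Render old messages as plain text instead of a raw list of dicts
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary = call_llm(
        "Summarize this conversation, keeping key decisions and code changes:\n"
        + transcript
    )

    return [
        system,
        {"role": "system", "content": f"Previous conversation summary:\n{summary}"},
        *recent,
    ]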
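The function above relies on a `count_tokens` helper that is not shown. A minimal stand-in, assuming roughly 4 characters per token for English text, could look like the sketch below. That ratio is only a rule of thumb and real tokenizers vary by model, so use your provider's tokenizer when you need exact counts.

```python
import math

CHARS_PER_TOKEN = 4  # assumption: rough average for English text, not exact

def count_tokens(messages: list[dict]) -> int:
    """Estimate the token count of a chat history from its character length."""
    total_chars = sum(len(m["content"]) for m in messages)
    return math.ceil(total_chars / CHARS_PER_TOKEN)

history = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Refactor the payment module to use async I/O."},
]
print(count_tokens(history))
```

An estimate like this is usually good enough for deciding when to summarize, since the threshold itself is approximate.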

Strategy 3: Hierarchical context

Give the model a high-level overview first, then details only for the relevant section:

# Approximate sizes: overview ~500 tokens, file tree ~200 tokens,
# relevant files ~2,000-10,000 tokens, git log ~500 tokens.
# (Kept out of the f-string so the notes are not sent to the model.)
context = f"""
## Project overview (always included)
{project_readme}

## File tree (always included)
{file_tree}

## Relevant file contents (task-specific)
{relevant_files}

## Recent changes (last 5 commits)
{git_log}
"""

The model uses the overview to understand the project and the detailed files to do the actual work.
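The `file_tree` value has to come from somewhere. One possible way to build it with the standard library is sketched below. The function name, the hidden-file filter, and the `max_entries` cap are assumptions made for illustration.

```python
from pathlib import Path

def build_file_tree(root: str, max_entries: int = 200) -> str:
    """Render a compact, indented listing of the files and folders under `root`."""
    root_path = Path(root)
    lines = []
    for path in sorted(root_path.rglob("*")):
        rel = path.relative_to(root_path)
        # Skip hidden files and directories such as .git
        if any(part.startswith(".") for part in rel.parts):
            continue
        # Stop once the cap is reached so the tree stays small
        if len(lines) >= max_entries:
            lines.append("... (truncated)")
            break
        indent = "  " * (len(rel.parts) - 1)
        suffix = "/" if path.is_dir() else ""
        lines.append(f"{indent}{rel.name}{suffix}")
    return "\n".join(lines)
```

Calling `build_file_tree(".")` from the project root gives a string you can drop into the template above. The cap keeps the tree section within its small token budget even on large repositories.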

Strategy 4: Prompt caching

If your system prompt is large and shared across requests, use prompt caching to avoid re-processing it:

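# Anthropic prompt caching
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",  # alias; pin a dated snapshot in production
    max_tokens=1024,  # required by the Messages API
    system=[{
        "type": "text",
        "text": large_system_prompt,
        "cache_control": {"type": "ephemeral"}
    }],
    messages=user_messages,
)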

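Cached tokens cost 90% less to read on subsequent requests. At the $3 per million input tokens implied by the Claude Sonnet row above, a 10K-token system prompt costs about $0.03 per request, so over 100 requests caching saves roughly $2.70 (a little less once the one-time cache write is paid for).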
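To sanity-check savings for your own numbers, the arithmetic can be sketched as a small function. The defaults are assumptions: $3 per million input tokens (derived from the Claude Sonnet row in the table above), a 90% discount on cache reads, and a 25% surcharge on the first request that writes the cache. It also assumes every later request hits the cache before it expires. Check your provider's current pricing before relying on the result.

```python
def caching_savings(prompt_tokens: int, requests: int,
                    price_per_mtok: float = 3.00,
                    read_discount: float = 0.90,
                    write_surcharge: float = 0.25) -> float:
    """Dollars saved by caching a shared prompt across `requests` calls (requests >= 1)."""
    base = prompt_tokens / 1_000_000 * price_per_mtok  # uncached prompt cost per request
    uncached_total = base * requests
    # The first request writes the cache (surcharge); the rest read it at a discount.
    cached_total = base * (1 + write_surcharge) + base * (1 - read_discount) * (requests - 1)
    return uncached_total - cached_total

print(f"${caching_savings(10_000, 100):.2f}")  # prints $2.67
```

The savings grow with both prompt size and request count, so caching pays off most for large shared prompts on busy endpoints.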

Strategy 5: RAG instead of context stuffing

Instead of putting all your documentation in the context, use RAG to retrieve only the relevant parts:

# Bad: stuff all docs in context (50K tokens)
context = load_all_documentation()

# Good: retrieve relevant docs (2K tokens)
relevant_docs = vector_db.search(user_query, top_k=3)
context = "\n\n".join(relevant_docs)

See our RAG guide and vector database comparison for implementation.

How much context is too much?

Research shows model quality degrades with very long contexts, especially for information in the middle (the "lost in the middle" problem):

| Context length | Quality | Cost |
|---|---|---|
| 1-10K tokens | ✅ Best | Low |
| 10-50K tokens | ✅ Good | Medium |
| 50-100K tokens | ⚠️ Starts degrading | High |
| 100K+ tokens | ⚠️ Middle content often ignored | Very high |

Rule of thumb: Keep context under 50K tokens for best quality. Use RAG or summarization to stay within this range.
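One simple way to enforce a budget like this is to pack context chunks greedily, most relevant first, and skip anything that would overflow. The sketch below assumes the chunks arrive already sorted by relevance and estimates tokens at roughly 4 characters each. Both are simplifying assumptions, and the function name is made up for illustration.

```python
def pack_within_budget(chunks: list[str], budget_tokens: int = 50_000,
                       chars_per_token: int = 4) -> tuple[list[str], int]:
    """Keep chunks in the given order until the estimated token budget is spent."""
    kept, used = [], 0
    for chunk in chunks:
        cost = -(-len(chunk) // chars_per_token)  # ceiling division
        if used + cost > budget_tokens:
            continue  # too big to fit; a smaller chunk later on may still fit
        kept.append(chunk)
        used += cost
    return kept, used

chunks = ["a" * 40, "b" * 400, "c" * 20]  # estimated at 10, 100, and 5 tokens
kept, used = pack_within_budget(chunks, budget_tokens=20)
print(len(kept), used)  # prints 2 15
```

Skipping oversized chunks instead of stopping at the first one lets smaller, lower-ranked chunks fill the remaining space.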

The practical checklist

Before every LLM call, ask:

  1. Is everything in the context actually needed for this task?
  2. Can old conversation history be summarized?
  3. Can I use RAG instead of including all documents?
  4. Is the system prompt cached?
  5. Am I including file contents that the model won't use?

Cutting context by 50% typically cuts input-token costs by about 50%, usually with no quality loss. Quality often improves because the model focuses on what matters.

Related: Context Engineering Explained · Prompt Caching Explained · What is RAG? · How to Reduce LLM API Costs · Agent Memory Patterns