Reasonix Prefix Cache: How to Get 99% Cache Hits and Cut DeepSeek Costs 5x


Reasonix processed 435 million input tokens with a 99.82% cache hit rate. That turned a $61 bill into $12. The difference is not magic. It is deliberate prompt engineering that keeps DeepSeek’s prefix cache warm across an entire coding session.

This article explains exactly how prefix caching works, why most tools waste money by breaking it, and how Reasonix’s architecture keeps the cache stable. If you are using DeepSeek’s API for anything, understanding this mechanism will save you significant money regardless of which tool you use.

For the general overview of Reasonix, see our complete guide. For DeepSeek API setup, see the API guide.

What is prefix caching?

Prefix caching is a server-side optimization where the LLM provider stores the computed key-value (KV) cache for the beginning of your prompt. If your next request starts with the same token sequence, the provider skips recomputing those tokens and charges you a reduced rate.

DeepSeek’s implementation:

  • Cache hit price: ~$0.036/1M input tokens (V4 Pro) or ~$0.007/1M input tokens (V4 Flash)
  • Cache miss price: $0.435/1M input tokens (V4 Pro) or $0.07/1M input tokens (V4 Flash)
  • Ratio: a cache hit costs roughly 1/12th of a miss on V4 Pro and 1/10th on V4 Flash

The cache matches on exact token prefixes. If the first 100K tokens of your request match a cached prefix exactly, those 100K tokens are served from cache. The remaining tokens (the new part) are computed fresh.

Key constraint: any change to the prefix invalidates the entire cache from that point forward. Insert a single token at position 50K, and tokens 50K through 100K all become cache misses.
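The prefix rule can be sketched in a few lines. This is a simplified model, not DeepSeek's implementation: tokens are plain array elements rather than real tokenizer output, and the function names are made up for illustration.

```javascript
// Count how many leading tokens two requests share.
function sharedPrefixLength(previousTokens, nextTokens) {
  const limit = Math.min(previousTokens.length, nextTokens.length);
  let i = 0;
  while (i < limit && previousTokens[i] === nextTokens[i]) {
    i += 1;
  }
  return i;
}

// Split the next request into tokens that match the previous request's
// prefix (cache hits) and tokens that must be recomputed (cache misses).
function splitByCache(previousTokens, nextTokens) {
  const hitTokens = sharedPrefixLength(previousTokens, nextTokens);
  return { hitTokens, missTokens: nextTokens.length - hitTokens };
}

const turn1 = ["sys", "fileA", "fileB", "msg1"];
const appended = ["sys", "fileA", "fileB", "msg1", "reply1", "msg2"];
const inserted = ["sys", "fileA", "fileC", "fileB", "msg1", "reply1", "msg2"];

console.log(splitByCache(turn1, appended)); // { hitTokens: 4, missTokens: 2 }
console.log(splitByCache(turn1, inserted)); // { hitTokens: 2, missTokens: 5 }
```

The second call shows the key constraint: one inserted element turns everything after it into a miss, even though most of that content is unchanged.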

How DeepSeek implements it

DeepSeek’s prefix cache operates at the infrastructure level:

  1. Per-session affinity. Requests from the same session tend to route to the same GPU cluster, where the KV cache is already resident in memory.

  2. Prefix tree storage. Common prefixes are stored in a trie structure. Multiple sessions sharing the same system prompt benefit from shared cache entries.

  3. TTL-based eviction. Cached prefixes expire after a period of inactivity (exact TTL is not publicly documented, but empirically it survives minutes of idle time within a session).

  4. Granular matching. The cache matches at token boundaries, not character boundaries. This means tokenization consistency matters.

The practical implication: if you send the same system prompt + context in the same order every turn, you only pay full price for the new user message and the model’s previous response that gets appended to history.

Why most tools break the cache

Generic coding agents like Claude Code, Aider, and Cursor are designed for model flexibility. They structure prompts for correctness and clarity, not cache stability. Common cache-breaking patterns:

1. Context reordering

Many tools sort files by relevance each turn. Turn 1 might have files in order [A, B, C]. Turn 2 reorders to [B, A, C] because B became more relevant. This invalidates the cache from the first file onward.

2. Dynamic system prompts

Tools that inject timestamps, token counts, or dynamic metadata into the system prompt break the cache on every single turn. Even a one-character change at the beginning invalidates everything after it.

3. Inconsistent formatting

If tool call results are formatted slightly differently between turns (extra whitespace, different JSON key ordering, varying indentation), the cache breaks at the point of divergence.

4. Context window management

When tools hit context limits, they often summarize or truncate from the middle. This shifts everything after the truncation point, breaking the prefix match.

5. Interleaved insertions

Adding new context (a newly read file, a tool result) in the middle of existing context rather than appending it at the end breaks the cache for everything after the insertion point.

The 3 pillars of Reasonix’s cache stability

Reasonix achieves 99.82% cache hits through three architectural decisions:

Pillar 1: Fixed prefix ordering

Reasonix structures every request with a rigid prefix order:

[System prompt] [Memory] [Project context] [Conversation history] [New message]

The system prompt never changes mid-session. Memory is loaded once at session start and only appended to (never rewritten). Project context files are added in a fixed order and never reordered.

This means the first N tokens of every request are identical to the previous request. Only the new message at the end is different.
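One way to enforce a fixed order in your own code is to assemble every request from a single ordered list of sections. The sketch below assumes hypothetical section names that mirror the layout above; it does not show Reasonix's actual internals.

```javascript
// Hypothetical section names that mirror the order shown above.
const SECTION_ORDER = ["system", "memory", "projectContext", "history", "newMessage"];

// Walk the sections in one fixed order, no matter how the caller built the object.
// Missing sections are treated as empty.
function assembleRequest(sections) {
  return SECTION_ORDER.flatMap((name) => sections[name] ?? []);
}

const stable = {
  projectContext: ["README.md contents", "package.json contents"],
  memory: ["Project uses Node 20."],
  system: ["You are a coding assistant."],
};

const turn1 = assembleRequest({ ...stable, newMessage: ["Fix the failing test"] });
const turn2 = assembleRequest({
  newMessage: ["Now add a changelog entry"],
  history: ["Fix the failing test", "Done, the test passes now."],
  ...stable,
});

// Every item of turn 1 reappears at the same position in turn 2.
console.log(turn1.every((item, i) => item === turn2[i])); // true
```

Because the order lives in one constant, a caller cannot accidentally reorder sections by building the input object differently.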

Pillar 2: Append-only context growth

When Reasonix reads a new file or receives a tool result, it appends to the end of the context block. It never inserts content between existing context items.

Traditional approach (cache-breaking):

Turn 1: [System] [FileA] [FileB] [Message1]
Turn 2: [System] [FileA] [FileC] [FileB] [Message1] [Response1] [Message2]
                          ^^^ inserted, breaks cache from here

Reasonix approach (cache-stable):

Turn 1: [System] [FileA] [FileB] [Message1]
Turn 2: [System] [FileA] [FileB] [Message1] [Response1] [FileC] [Message2]
                                              ^^^ everything before is cached

The difference is subtle but the cost impact is massive. In the traditional approach, the inserted FileC, FileB, and everything after them become cache misses. In Reasonix’s approach, only the newly appended Response1, FileC, and Message2 are cache misses.
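The two diagrams can be turned into a small check. The class below is a hypothetical illustration of an append-only context, not Reasonix's code: it exposes no way to insert or reorder, so each turn's context is always an extension of the previous one.

```javascript
// A minimal append-only list of context items (hypothetical class name).
class AppendOnlyContext {
  #items = [];

  // The only way to add content is at the end.
  append(item) {
    this.#items.push(item);
    return this;
  }

  // Return a copy so callers cannot splice or reorder the internal list.
  snapshot() {
    return [...this.#items];
  }
}

// True when every element of `earlier` sits at the same index in `later`.
function isPrefixOf(earlier, later) {
  return earlier.length <= later.length && earlier.every((item, i) => item === later[i]);
}

const context = new AppendOnlyContext();
context.append("[System]").append("[FileA]").append("[FileB]").append("[Message1]");
const turn1 = context.snapshot();

context.append("[Response1]").append("[FileC]").append("[Message2]");
const turn2 = context.snapshot();

const insertedTurn2 = [
  "[System]", "[FileA]", "[FileC]", "[FileB]", "[Message1]", "[Response1]", "[Message2]",
];

console.log(isPrefixOf(turn1, turn2));         // true
console.log(isPrefixOf(turn1, insertedTurn2)); // false
```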

Pillar 3: Deterministic formatting

Every tool call, every file read, every response is formatted with byte-identical templates. No timestamps in headers. No variable whitespace. No JSON key reordering. The same file read twice produces the exact same token sequence.

This eliminates the subtle formatting variations that cause partial cache invalidation in other tools.

Real cost examples

Here is what the 99.82% cache hit rate means in practice, using V4 Flash pricing ($0.07/1M input, $0.007/1M cache hit):

Example 1: 30-minute coding session

  • Total input tokens across all turns: 500,000
  • Cache hits (99.82%): 499,100 tokens at $0.007/1M = $0.0035
  • Cache misses (0.18%): 900 tokens at $0.07/1M = $0.00006
  • Output tokens: 50,000 at $0.28/1M = $0.014
  • Total: $0.018

Without cache optimization (all misses):

  • 500,000 tokens at $0.07/1M = $0.035
  • Output: $0.014
  • Total: $0.049

Savings: about 64% on this short session.
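The arithmetic in Example 1 can be reproduced with a small calculator. The prices are the V4 Flash figures quoted in this article; the function and constant names are illustrative.

```javascript
// USD per 1M tokens, using the V4 Flash figures quoted in this article.
const FLASH = { miss: 0.07, hit: 0.007, output: 0.28 };

// Estimate session cost from total input tokens, cache hit rate (0 to 1),
// and output tokens.
function sessionCost(inputTokens, hitRate, outputTokens, price) {
  const hitTokens = inputTokens * hitRate;
  const missTokens = inputTokens - hitTokens;
  const micros = hitTokens * price.hit + missTokens * price.miss + outputTokens * price.output;
  return micros / 1e6;
}

const cached = sessionCost(500_000, 0.9982, 50_000, FLASH);
const uncached = sessionCost(500_000, 0, 50_000, FLASH);

console.log(cached.toFixed(4));   // 0.0176
console.log(uncached.toFixed(4)); // 0.0490
```

Swapping in the token counts from the longer examples gives their totals the same way.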

Example 2: 2-hour refactoring session

  • Total input tokens: 3,000,000
  • Cache hits (99.82%): 2,994,600 at $0.007/1M = $0.021
  • Cache misses: 5,400 at $0.07/1M = $0.0004
  • Output tokens: 300,000 at $0.28/1M = $0.084
  • Total: $0.105

Without cache optimization:

  • 3,000,000 at $0.07/1M = $0.21
  • Output: $0.084
  • Total: $0.294

Savings: 64%.

Example 3: Full day of coding (the 435M stat)

  • Total input tokens: 435,000,000
  • Cache hits (99.82%): 434,217,000 at $0.007/1M = $3.04
  • Cache misses: 783,000 at $0.07/1M = $0.055
  • Output tokens: ~30,000,000 at $0.28/1M = $8.40
  • Total: ~$11.50

Without cache optimization:

  • 435,000,000 at $0.07/1M = $30.45
  • Output: $8.40
  • Total: ~$38.85

Where does the $61 headline figure come from? The examples above use V4 Flash only. The $61 figure comes from a mixed session where ~15% of turns used V4 Pro for complex reasoning tasks. In that real-world case study:

  • 370M Flash input tokens ($0.07/1M) = $25.90
  • 65M Pro input tokens ($0.435/1M) = $28.28
  • Output tokens (Flash + Pro combined) ≈ $7.00
  • Total without cache: ~$61

With Reasonix’s cache keeping 99.82% of those input tokens cached at $0.007/1M (Flash) and $0.036/1M (Pro):

  • 369.3M cached Flash tokens ($0.007/1M) = $2.59
  • 64.9M cached Pro tokens ($0.036/1M) = $2.34
  • Cache misses (0.18%) + output costs ≈ $7.05
  • Total with cache: ~$12

The same 99.82% hit rate applies to both models, but the absolute savings per token are larger on Pro because its base rate is higher.

How to monitor cache hits

Reasonix shows cache statistics in the session footer. After each turn you see:

Tokens: 245K in (99.8% cached) / 3.2K out | Cost: $0.003 | Session: $0.047

For detailed analysis, check the session transcript:

reasonix replay --last --stats

This shows per-turn cache hit rates, token counts, and cumulative costs.

You can also check the DeepSeek API response directly. The usage object of each response includes prompt_cache_hit_tokens and prompt_cache_miss_tokens fields that show exactly how many input tokens were served from cache.

Tips for maximizing cache hits

These apply whether you use Reasonix or build your own DeepSeek integration:

1. Keep system prompts static

Never inject dynamic content (timestamps, random IDs, token counts) into your system prompt. Put dynamic content at the end of the user message instead.
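A minimal sketch of this tip, assuming a chat-style messages array. The prompt text, the helper name, and the timestamp format are all made up for illustration.

```javascript
const STATIC_SYSTEM_PROMPT = "You are a coding assistant. Follow the project conventions.";

// Build the messages for one turn. Volatile metadata (here, a timestamp) is
// attached to the newest user message instead of the system prompt.
function buildTurn(history, userText, now) {
  return [
    { role: "system", content: STATIC_SYSTEM_PROMPT },
    ...history,
    { role: "user", content: `${userText}\n\n[sent at ${now.toISOString()}]` },
  ];
}

const turn1 = buildTurn([], "List the failing tests", new Date("2025-01-01T10:00:00Z"));

// Earlier messages are carried into history verbatim, timestamp included,
// so they stay byte-identical on later turns.
const turn2 = buildTurn(
  [turn1[1], { role: "assistant", content: "Two tests fail in auth.test.js." }],
  "Fix the first one",
  new Date("2025-01-01T10:05:00Z"),
);

console.log(turn1[0].content === turn2[0].content); // true
console.log(turn1[1] === turn2[1]);                 // true
```

The timestamp still reaches the model, but it only ever appears in content that is new on the turn it is first sent.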

2. Append, never insert

When adding new context to a conversation, always append after existing content. Never insert between existing messages or context blocks.

3. Use consistent formatting

If you format tool results as JSON, always use the same key order, same indentation, same whitespace. Consider sorting JSON keys alphabetically to ensure determinism.
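Sorting keys is straightforward to do yourself. The helper below is a sketch that assumes JSON-compatible input (no undefined values, functions, or cyclic references); the function name is illustrative.

```javascript
// Serialize a JSON-compatible value with object keys sorted at every level,
// so logically equal objects always produce the same string.
function stableStringify(value) {
  if (Array.isArray(value)) {
    return `[${value.map((item) => stableStringify(item)).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const body = Object.keys(value)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${stableStringify(value[key])}`)
      .join(",");
    return `{${body}}`;
  }
  return JSON.stringify(value);
}

const first = { path: "src/app.js", ok: true, lines: [1, 2] };
const second = { lines: [1, 2], ok: true, path: "src/app.js" };

console.log(JSON.stringify(first) === JSON.stringify(second));   // false
console.log(stableStringify(first) === stableStringify(second)); // true
console.log(stableStringify(first)); // {"lines":[1,2],"ok":true,"path":"src/app.js"}
```

Plain JSON.stringify follows insertion order, which is why two tool results with the same data can still serialize differently.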

4. Maintain long sessions

Cache hits improve as sessions grow because the stable prefix gets longer. A 10-turn conversation has a much higher cache ratio than 10 separate 1-turn conversations.
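A rough model shows why the ratio climbs with session length. It assumes a cold cache on the first turn, a fixed prefix, and a constant number of new tokens per turn (user message plus the previous response). The token counts are invented for illustration.

```javascript
// Cumulative cache hit rate over a session under the simplified model above.
function sessionHitRate(prefixTokens, newTokensPerTurn, turns) {
  let hit = 0;
  let total = 0;
  for (let turn = 1; turn <= turns; turn += 1) {
    // Each turn resends the prefix plus everything added so far.
    const input = prefixTokens + turn * newTokensPerTurn;
    // Everything sent on the previous turn is eligible for a cache hit.
    const cached = turn === 1 ? 0 : prefixTokens + (turn - 1) * newTokensPerTurn;
    hit += cached;
    total += input;
  }
  return hit / total;
}

console.log(sessionHitRate(20_000, 500, 1).toFixed(3));   // 0.000
console.log(sessionHitRate(20_000, 500, 10).toFixed(3));  // 0.890
console.log(sessionHitRate(20_000, 500, 100).toFixed(3)); // 0.985
```

Under this model, ten separate one-turn sessions would each start cold, while one ten-turn session pays full price for the prefix only once.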

5. Front-load stable context

Put the most stable content (system prompt, project conventions, architecture docs) at the beginning. Put volatile content (current file being edited, recent changes) at the end.

6. Avoid unnecessary context clearing

Every time you clear context and start fresh, you lose the warm cache. Only clear when genuinely necessary (switching to an unrelated task, hitting context limits).

7. Use memory files

Reasonix’s memory file (.reasonix/memory.md) loads at a fixed position in the prefix. Because it is sent identically on every turn, it adds stable context that stays inside the cached prefix.

Comparison with other tools’ caching

Tool | Cache strategy | Typical hit rate | Cost impact
Reasonix | Prefix-stable architecture | 99.82% | 5x savings
Claude Code | Anthropic’s automatic caching | ~60-80% | 1.5-2x savings
Aider | No explicit cache optimization | ~30-50% | Minimal savings
Cursor | Server-side, opaque | Unknown | Unknown
OpenCode | No cache optimization | ~30-50% | Minimal savings

Anthropic’s prompt caching (used by Claude Code) is good but not as aggressive as DeepSeek’s, and Claude Code does not optimize prompt structure for cache stability the way Reasonix does.

Aider and OpenCode are model-agnostic tools that do not optimize for any provider’s caching. They get whatever cache hits happen naturally from conversation history, but actively break the cache through context reordering and dynamic formatting.

When cache optimization matters less

Cache optimization has diminishing returns in certain scenarios:

  • Very short sessions (1-3 turns): Not enough conversation history to build a meaningful cached prefix.
  • Output-heavy workloads: If your workload generates far more output than input, the cache savings on input are a smaller percentage of total cost.
  • Frequent context switches: If you are jumping between unrelated tasks constantly, the cache cannot build up.

For these cases, the model choice (Flash vs Pro) and the permanent pricing discount matter more than cache optimization.

Building your own cache-stable integration

If you are building a custom DeepSeek integration (not using Reasonix), apply these principles:

// Good: stable prefix, append-only growth
const stableMessages = [
  { role: "system", content: STATIC_SYSTEM_PROMPT }, // never changes
  { role: "user", content: STATIC_PROJECT_CONTEXT }, // loaded once
  ...conversationHistory, // grows by appending
  { role: "user", content: newMessage } // only new content
];

// Bad: dynamic prefix, reordering
const unstableMessages = [
  { role: "system", content: `${PROMPT} Updated: ${Date.now()}` }, // timestamp changes the first message, breaks cache every turn
  { role: "user", content: sortByRelevance(files) }, // order can change between turns, breaks cache
  ...conversationHistory,
  { role: "user", content: newMessage }
];

The difference between these two patterns is the difference between paying $12 and paying $61 for the same work.

For more on DeepSeek’s API capabilities, see our V4 API guide. For the cheapest possible DeepSeek usage, combine cache optimization with V4 Flash.

FAQ

Does prefix caching work with V4 Pro and V4 Flash equally?

Yes. Both models support prefix caching with the same mechanism. The cache hit discount is similar: roughly 1/12th of the standard input rate on Pro and 1/10th on Flash. The absolute savings are larger with Pro because its base input rate is higher ($0.435 vs $0.07 per 1M tokens).

How long does the cache stay warm?

DeepSeek does not publish exact TTL values. Empirically, the cache survives several minutes of inactivity within a session. If you step away for 5-10 minutes and come back, the cache is usually still warm. Extended breaks (30+ minutes) may result in cache eviction.

Can I share cache across multiple sessions?

Partially. If multiple sessions use the same system prompt, DeepSeek’s prefix tree may serve that shared prefix from cache. However, the conversation-specific context after the system prompt will not be shared. Reasonix’s memory file helps here because it creates a longer shared prefix across sessions on the same project.

Does cache optimization conflict with context window limits?

No. Cache optimization is about ordering and stability, not about using more or fewer tokens. You still need to manage context window limits, but you do so by trimming the oldest conversation turns while leaving the stable prefix untouched, rather than summarizing or rewriting content inside the prefix.

What happens when the context window fills up?

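Reasonix handles this by dropping the oldest conversation turns, which sit between the stable prefix and the more recent history. The system prompt, memory, and project context remain intact at the beginning, so that part of the prefix stays cached. Only the conversation history gets trimmed. The turns that remain shift position after a trim, so they no longer match the previous request at those positions and are recomputed on the next request.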
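A minimal sketch of this kind of trimming for a custom integration. The `tokens` field on each message is a hypothetical precomputed count, and the budget and sizes are invented; this is not Reasonix's implementation.

```javascript
// Drop the oldest history entries until the whole request fits the budget.
// The stable prefix and the new message are never touched.
function trimHistory(stablePrefix, history, newMessage, maxTokens) {
  const count = (messages) => messages.reduce((sum, m) => sum + m.tokens, 0);
  const fixed = count(stablePrefix) + newMessage.tokens;
  const kept = [...history];
  while (kept.length > 0 && fixed + count(kept) > maxTokens) {
    kept.shift(); // remove the oldest turn first
  }
  return [...stablePrefix, ...kept, newMessage];
}

const prefix = [
  { id: "system", tokens: 1000 },
  { id: "memory", tokens: 500 },
];
const history = [
  { id: "user1", tokens: 300 },
  { id: "assistant1", tokens: 400 },
  { id: "user2", tokens: 200 },
  { id: "assistant2", tokens: 600 },
];
const next = { id: "user3", tokens: 100 };

const request = trimHistory(prefix, history, next, 2500);
console.log(request.map((m) => m.id));
// [ 'system', 'memory', 'user2', 'assistant2', 'user3' ]
```

The prefix entries keep their exact positions, so they can still match the cache after a trim. The kept history now follows the prefix directly, so it is new content from the cache's point of view.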

Is 99.82% cache hit rate realistic for my use case?

For interactive coding sessions longer than 5-10 turns, yes. The rate improves as sessions grow. Short sessions (1-3 turns) will see lower rates (80-95%) because the new content is a larger proportion of total tokens. The 99.82% figure comes from extended multi-hour sessions where the stable prefix dominates.

Can I verify cache hits in the DeepSeek API response?

Yes. DeepSeek’s API responses include usage fields that break down cached vs non-cached input tokens. Look for prompt_cache_hit_tokens and prompt_cache_miss_tokens in the usage object of the response.
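Turning those two fields into a hit rate takes one small function. The sample usage object below uses made-up numbers, and the helper name is illustrative.

```javascript
// Compute the cache hit rate from the usage object of a response.
function cacheHitRate(usage) {
  const hit = usage.prompt_cache_hit_tokens ?? 0;
  const miss = usage.prompt_cache_miss_tokens ?? 0;
  const total = hit + miss;
  return total === 0 ? 0 : hit / total;
}

// Sample values are invented for illustration.
const sampleUsage = {
  prompt_tokens: 245_000,
  prompt_cache_hit_tokens: 244_510,
  prompt_cache_miss_tokens: 490,
  completion_tokens: 3_200,
};

console.log(`${(cacheHitRate(sampleUsage) * 100).toFixed(1)}% cached`); // 99.8% cached
```

Logging this value per turn makes it easy to spot the request where a prefix change caused the rate to drop.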