
Mistral Medium 3.5 Token Efficiency: How to Optimize Costs and Speed (2026)


Mistral Medium 3.5 is priced at $1.50 per million input tokens and $7.50 per million output tokens, half the price of Claude 3.5 Sonnet on both input and output. But raw per-token pricing only tells part of the story. The real cost depends on how you use the model, and Mistral gives you more levers to pull than most providers.

Those levers include configurable reasoning effort, prompt caching at a 90% discount, batch API processing, and EAGLE speculative decoding for self-hosted deployments. Used correctly, they can cut your effective costs by 40-70% compared to naive API usage.

This guide covers every optimization available, with concrete numbers and configuration examples.

For general model capabilities and setup, see the Mistral Medium 3.5 complete guide. For API integration details, see the Mistral Medium 3.5 API guide.

Pricing breakdown

Base API pricing

Token type | Price per million tokens
Input tokens | $1.50
Output tokens | $7.50
Cached input tokens | $0.15
Batch API input | $0.75
Batch API output | $3.75

Output tokens cost 5x more than input tokens. This is the single most important fact for cost optimization. Every strategy that reduces output token count has an outsized impact on your bill.
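
As a rough illustration, here is the arithmetic for a single hypothetical request with a 2,000-token prompt and a 500-token response at standard pricing:

# Back-of-the-envelope cost of one request at standard pricing
INPUT_PRICE = 1.50 / 1_000_000   # dollars per input token
OUTPUT_PRICE = 7.50 / 1_000_000  # dollars per output token

input_tokens = 2_000   # prompt
output_tokens = 500    # response

input_cost = input_tokens * INPUT_PRICE     # $0.00300
output_cost = output_tokens * OUTPUT_PRICE  # $0.00375

# The response is a quarter of the prompt's length yet costs more than the prompt.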

Monthly cost estimates by usage tier

Usage level | Input tokens/mo | Output tokens/mo | Base cost | Optimized cost
Light (individual dev) | 5M | 1M | $15 | $5-8
Medium (small team) | 50M | 10M | $150 | $50-80
Heavy (production app) | 500M | 100M | $1,500 | $400-700
Enterprise (high volume) | 5B | 1B | $15,000 | $4,000-7,000

The "optimized cost" column assumes you implement the strategies in this guide: reasoning effort tuning, prompt caching, batch processing where applicable, and prompt optimization. The range depends on your workload characteristics.

Reasoning effort optimization

This is the highest-impact optimization available. Mistral Medium 3.5 supports configurable reasoning effort through the reasoning_effort parameter, which controls how much compute the model spends on chain-of-thought reasoning before generating a response.

Available reasoning levels

Level | Behavior | Token overhead | Best for
none | No chain-of-thought reasoning | 0% | Classification, extraction, formatting
low | Minimal reasoning | ~20% | Simple Q&A, summarization
medium | Balanced reasoning (default) | ~50% | General tasks, content generation
high | Extended reasoning | ~100-200% | Complex coding, math, multi-step analysis

How reasoning effort affects costs

When reasoning effort is set to high, the model generates internal reasoning tokens before producing the visible output. These reasoning tokens count toward your output token usage. On complex tasks, reasoning tokens can be 2-3x the visible output length.

Setting reasoning effort to none for simple tasks eliminates this overhead entirely. The model skips chain-of-thought and generates the answer directly.

from mistralai import Mistral

client = Mistral(api_key="your-key")

# Simple classification β€” no reasoning needed
response = client.chat.complete(
    model="mistral-medium-3.5",
    messages=[{"role": "user", "content": "Classify this support ticket as billing, technical, or general: 'I can't log in to my account'"}],
    reasoning_effort="none"
)

# Complex code review β€” full reasoning
response = client.chat.complete(
    model="mistral-medium-3.5",
    messages=[{"role": "user", "content": "Review this PR for security vulnerabilities and suggest fixes: ..."}],
    reasoning_effort="high"
)

Practical savings from reasoning effort tuning

For a typical enterprise workload with a mix of task types:

  • 40% simple tasks (classification, extraction, formatting): Switch to none → save ~95% of reasoning tokens on these tasks
  • 30% moderate tasks (summarization, Q&A): Switch to low → save ~60% of reasoning tokens
  • 30% complex tasks (coding, analysis): Keep at high → no savings, but better quality

Net result: 40-60% reduction in total output tokens compared to running everything at medium or high.

The key is building a routing layer that classifies incoming requests and sets the appropriate reasoning effort. A simple keyword-based classifier or a small model (Mistral Small) can handle this routing at negligible cost.
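
A minimal sketch of such a router, assuming a crude keyword classifier (the keyword lists and the route_effort helper are illustrative, not part of Mistral's SDK):

from mistralai import Mistral

client = Mistral(api_key="your-key")

# Hypothetical keyword buckets -- tune these to your own request mix.
SIMPLE_HINTS = ("classify", "extract", "format", "label")
COMPLEX_HINTS = ("review", "debug", "refactor", "analyze", "prove")

def route_effort(prompt: str) -> str:
    """Pick a reasoning_effort level from crude keyword matching."""
    text = prompt.lower()
    if any(hint in text for hint in SIMPLE_HINTS):
        return "none"
    if any(hint in text for hint in COMPLEX_HINTS):
        return "high"
    return "low"  # default for everything in between

def complete(prompt: str):
    return client.chat.complete(
        model="mistral-medium-3.5",
        messages=[{"role": "user", "content": prompt}],
        reasoning_effort=route_effort(prompt),
    )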

Prompt caching strategies

Mistral's prompt caching gives you a 90% discount on input tokens that match a previously cached prompt prefix. This is transformative for workloads with repeated system prompts or shared context.

How prompt caching works

When you send a request, Mistral checks if the beginning of your prompt matches a cached prefix from a recent request. If it does, those tokens are charged at $0.15 per million instead of $1.50 per million.

Cache hits require:

  • Same model
  • Same account
  • Identical token sequence from the start of the prompt
  • Request within the cache TTL (typically 5-10 minutes)

Maximizing cache hit rates

Structure your prompts with the most stable content first:

[System prompt: same across all requests]      ← cached
[Shared context: same within a session]        ← cached
[Few-shot examples: same for task type]        ← cached
[User-specific input: varies per request]      ← not cached

The longer your cached prefix, the bigger the savings. A 10,000-token system prompt that hits cache on every request saves $13.50 per thousand requests compared to uncached pricing.
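
A short sketch of that ordering with the chat API (the system prompt, examples, and the build_messages helper are placeholders):

# Stable content first so the shared prefix can be cached across requests.
STATIC_SYSTEM_PROMPT = "You are a support assistant. Follow the policies below..."  # identical on every request
FEW_SHOT_EXAMPLES = [
    {"role": "user", "content": "Example ticket: 'Refund not received'"},
    {"role": "assistant", "content": "Category: billing"},
]

def build_messages(user_input: str) -> list[dict]:
    return (
        [{"role": "system", "content": STATIC_SYSTEM_PROMPT}]
        + FEW_SHOT_EXAMPLES                           # same for every request of this task type
        + [{"role": "user", "content": user_input}]   # only this part varies
    )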

Caching strategies by workload type

Chatbot/assistant: Place your system prompt and persona instructions first. These stay constant across all conversations. Cache hit rate: 80-95%.

Document processing pipeline: Batch documents that share the same extraction schema. Send the schema as the system prompt, then individual documents as user messages. Cache hit rate: 90%+.

Code review automation: Use a fixed system prompt with coding standards and review criteria. The prompt stays cached across all PRs. Cache hit rate: 85-95%.

RAG applications: This is trickier. Retrieved context changes per query, which breaks caching. Structure your prompt so the system instructions and few-shot examples come before the retrieved chunks. You will cache the instruction prefix but not the retrieved content.
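
A sketch of the same idea for a RAG prompt, where retrieve_chunks stands in for your own retriever:

RAG_INSTRUCTIONS = "Answer using only the provided context. Cite chunk numbers."  # stable, cacheable prefix

def build_rag_messages(query: str, retrieve_chunks) -> list[dict]:
    chunks = retrieve_chunks(query)  # varies per query, so caching stops here
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks))
    return [
        {"role": "system", "content": RAG_INSTRUCTIONS},  # cached across queries
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},  # not cached
    ]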

For more on caching strategies across providers, see prompt caching explained.

Batch API for non-real-time work

Mistral's batch API processes requests asynchronously at 50% of standard pricing. Requests are queued and processed within a 24-hour window.

When to use batch API

  • Nightly data processing pipelines
  • Bulk content generation
  • Dataset labeling and classification
  • Code analysis across large repositories
  • Any workload where latency is not critical

Batch API pricing

Token type | Standard | Batch | Savings
Input | $1.50/M | $0.75/M | 50%
Output | $7.50/M | $3.75/M | 50%

Implementation example

import json
from mistralai import Mistral

client = Mistral(api_key="your-key")

# Prepare batch requests ('documents' is your list of input texts)
requests = []
for i, document in enumerate(documents):
    requests.append({
        "custom_id": f"doc-{i}",
        "model": "mistral-medium-3.5",
        "messages": [
            {"role": "system", "content": "Extract key entities from this document."},
            {"role": "user", "content": document}
        ],
        "reasoning_effort": "none"
    })

# Write JSONL file
with open("batch_input.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# Upload and create batch job
batch_file = client.files.upload(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch_job = client.batch.create(input_file_id=batch_file.id, model="mistral-medium-3.5")

# Poll for completion
# Results available within 24 hours

Combining batch API with reasoning_effort="none" for simple extraction tasks gives you the maximum discount: 50% from batch pricing plus eliminated reasoning overhead.

EAGLE speculative decoding for self-hosted deployments

If you self-host Mistral Medium 3.5, EAGLE speculative decoding can increase inference speed by 2-3x without quality loss.

How EAGLE works

EAGLE uses a small draft model to predict multiple tokens ahead, then verifies them with the full model in a single forward pass. When predictions are correct (which happens frequently for common patterns), you get multiple tokens for the cost of one forward pass.

Mistral provides EAGLE draft model weights alongside the main model weights. Setup with vLLM:

vllm serve mistral-medium-3.5 \
    --speculative-model mistral-medium-3.5-eagle \
    --num-speculative-tokens 5 \
    --tensor-parallel-size 4

Performance impact

Metric | Without EAGLE | With EAGLE
Tokens/second | ~45 | ~110
Time to first token | ~200ms | ~250ms
Total throughput | 1x | 2.4x

EAGLE increases time-to-first-token slightly (the draft model adds a small overhead) but dramatically improves tokens-per-second for generation. For long outputs, the net effect is significantly faster responses.

This does not reduce cost per token; you are still using the same hardware. But it increases throughput per GPU, which means you need fewer GPUs to serve the same request volume. That translates to lower infrastructure costs.
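
As a rough illustration using the throughput numbers above (the peak load figure is made up):

import math

# Generation throughput per 4-GPU serving replica (tokens/second), from the table above.
BASELINE_TPS = 45
EAGLE_TPS = 110

peak_load_tps = 2_000  # hypothetical peak generation demand across all users

replicas_without_eagle = math.ceil(peak_load_tps / BASELINE_TPS)  # 45 replicas
replicas_with_eagle = math.ceil(peak_load_tps / EAGLE_TPS)        # 19 replicas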

API vs self-hosting break-even analysis

At what point does self-hosting become cheaper than the API?

Cost model

API cost: Directly proportional to token volume. No fixed costs.

Self-hosting cost: Fixed infrastructure cost (GPU rental or purchase) plus operational overhead (engineering time, monitoring, updates).

Break-even calculation

Assuming 4x A100 80GB on a cloud provider at ~$12/hour ($8,640/month):

Monthly token volume | API cost | Self-host cost | Winner
50M input + 10M output | $150 | $8,640 | API
500M input + 100M output | $1,500 | $8,640 | API
2B input + 400M output | $6,000 | $8,640 | API
5B input + 1B output | $15,000 | $8,640 | Self-host
10B input + 2B output | $30,000 | $8,640 | Self-host

The break-even point is roughly 3-4 billion input tokens + 500M-700M output tokens per month, depending on your GPU costs and utilization rate.

Below that volume, the API is cheaper and requires zero infrastructure management. Above it, self-hosting saves money, but you need engineering capacity to manage the deployment.
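
A quick sketch for checking the break-even against your own volumes, using the GPU cost assumed above:

# Compare monthly API spend with a fixed self-hosting budget.
INPUT_PRICE_PER_M = 1.50
OUTPUT_PRICE_PER_M = 7.50
SELF_HOST_MONTHLY = 8_640  # 4x A100 80GB at ~$12/hour; substitute your own rate

def api_cost(input_tokens_m: float, output_tokens_m: float) -> float:
    """Monthly API cost given token volumes in millions of tokens."""
    return input_tokens_m * INPUT_PRICE_PER_M + output_tokens_m * OUTPUT_PRICE_PER_M

# Example: 3B input + 600M output per month
monthly = api_cost(3_000, 600)  # $4,500 + $4,500 = $9,000
print("self-hosting wins" if monthly > SELF_HOST_MONTHLY else "API wins")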

For more on managing AI infrastructure costs, see AI agent cost management.

Tips for reducing token usage

Beyond the major optimizations above, these smaller techniques compound:

1. Compress system prompts

Most system prompts are verbose. A 2,000-token system prompt that could be 500 tokens wastes 1,500 tokens per request. At 1M requests/month, that is 1.5B wasted input tokens ($2,250 at standard pricing, $225 with caching).

Audit your system prompts. Remove redundant instructions. Use concise language. Test whether shorter prompts produce equivalent output quality.

2. Set max_tokens appropriately

If you know the expected output length, set max_tokens to a reasonable limit. This prevents the model from generating unnecessarily long responses. A classification task needs 10 tokens, not 1,000.
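
For example, capping a classification call (the exact limit is an assumption; size it to your own output format):

response = client.chat.complete(
    model="mistral-medium-3.5",
    messages=[{"role": "user", "content": "Classify this ticket as billing, technical, or general: ..."}],
    reasoning_effort="none",
    max_tokens=16,  # a single-word label never needs more than this
)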

3. Use structured output

Request JSON output with a defined schema. This produces shorter, more predictable responses than free-form text. Mistral Medium 3.5 supports JSON mode natively.

response = client.chat.complete(
    model="mistral-medium-3.5",
    messages=[...],
    response_format={"type": "json_object"}
)

4. Batch multiple documents into one request

Instead of sending 10 separate API calls for 10 documents, send one request with all 10 documents and ask for a combined response. This shares the system prompt cost across all documents and reduces per-request overhead.
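
A minimal sketch of that pattern (the documents list and prompt wording are placeholders):

documents = ["First document text...", "Second document text...", "Third document text..."]

# Number the documents so the model can return one result per document.
combined = "\n\n".join(f"Document {i + 1}:\n{doc}" for i, doc in enumerate(documents))

response = client.chat.complete(
    model="mistral-medium-3.5",
    messages=[
        {"role": "system", "content": "Extract key entities from each document. Return one JSON object per document."},
        {"role": "user", "content": combined},
    ],
    response_format={"type": "json_object"},
    reasoning_effort="none",
)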

5. Cache responses application-side

If the same query appears frequently, cache the response in your application layer (Redis, in-memory cache). This eliminates the API call entirely for repeated queries.
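
A minimal in-memory version (a production setup would more likely use Redis with a TTL):

import hashlib

_response_cache: dict[str, str] = {}

def cached_complete(prompt: str) -> str:
    """Return a cached answer for repeated prompts; otherwise call the API once."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in _response_cache:
        response = client.chat.complete(
            model="mistral-medium-3.5",
            messages=[{"role": "user", "content": prompt}],
        )
        _response_cache[key] = response.choices[0].message.content
    return _response_cache[key]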

For more strategies on reducing LLM costs across providers, see how to reduce LLM API costs.

Cost comparison with alternatives

How does optimized Mistral Medium 3.5 compare to other models?

Model | Input $/M | Output $/M | Optimized monthly cost (medium usage)
Mistral Medium 3.5 | $1.50 | $7.50 | $50-80
Claude 3.5 Sonnet | $3.00 | $15.00 | $120-180
GPT-4o | $2.50 | $10.00 | $100-150
Gemini 1.5 Pro | $1.25 | $5.00 | $45-70
DeepSeek V3 | $0.27 | $1.10 | $15-25
Llama 3.3 70B (self-hosted) | ~$0.50* | ~$2.00* | $25-40

*Self-hosted costs are approximate and depend on infrastructure.

Mistral Medium 3.5 is not the cheapest option. DeepSeek V3 and self-hosted Llama are significantly cheaper per token. But Mistral offers a better quality-to-cost ratio than Claude or GPT-4o, with the added benefits of EU data residency and open weights.

The right choice depends on your quality requirements. If Mistral Medium 3.5's benchmark performance meets your needs, it is one of the most cost-effective frontier models available, especially with the optimizations in this guide.

FAQ

How much can I realistically save with reasoning effort tuning?

40-60% on total output token costs for mixed workloads. The savings depend on your task distribution. If 80% of your requests are simple (classification, extraction, formatting), you can save more. If most requests are complex coding or analysis tasks that need full reasoning, savings will be lower. Start by auditing your request types and categorizing them by complexity.

Does prompt caching work across different users?

Yes, as long as the requests share the same prompt prefix and are on the same account. The cache matches on the exact token sequence from the start of the prompt. Different users sending requests with the same system prompt will benefit from caching. The variable user input at the end of the prompt does not affect caching of the shared prefix.

Is the batch API suitable for production workloads?

Only for non-latency-sensitive workloads. Batch requests are processed within a 24-hour window; you cannot guarantee when results will be ready. Use it for overnight processing, bulk analysis, dataset preparation, and similar tasks. For real-time user-facing applications, use the standard API.

When should I self-host instead of using the API?

When your monthly token volume exceeds roughly 3-4 billion input tokens plus 500-700 million output tokens. Below that, the API is cheaper and requires no infrastructure management. Above it, self-hosting on 4x A100/H100 GPUs becomes more cost-effective. Also consider self-hosting if you need data sovereignty guarantees beyond what the API provides, or if you need custom model modifications.

Does EAGLE speculative decoding affect output quality?

No. EAGLE is mathematically equivalent to standard decoding. The draft model proposes tokens, and the full model verifies them. Any incorrect predictions are rejected and regenerated by the full model. The output distribution is identical: you get the same quality at higher speed. The only tradeoff is slightly higher time-to-first-token (about 50ms more) due to the draft model overhead.

Can I combine multiple optimization strategies?

Yes, and you should. The strategies stack. Use reasoning effort tuning to reduce output tokens, prompt caching to reduce input token costs, batch API for non-real-time work, and prompt compression to reduce total token count. A workload using all four strategies can achieve 60-70% cost reduction compared to naive API usage. The key is building the routing and caching infrastructure to apply the right optimization to each request type.