Mistral Medium 3.5 Token Efficiency: How to Optimize Costs and Speed (2026)
Mistral Medium 3.5 is priced at $1.50 per million input tokens and $7.50 per million output tokens, half the price of Claude 3.5 Sonnet on both input and output. But raw per-token pricing only tells part of the story. The real cost depends on how you use the model, and Mistral gives you more levers to pull than most providers.
Those levers include configurable reasoning effort, prompt caching at a 90% discount, batch API processing, and EAGLE speculative decoding for self-hosted deployments. Used correctly, these features can cut your effective costs by 40-70% compared to naive API usage.
This guide covers every optimization available, with concrete numbers and configuration examples.
For general model capabilities and setup, see the Mistral Medium 3.5 complete guide. For API integration details, see the Mistral Medium 3.5 API guide.
Pricing breakdown
Base API pricing
| Token type | Price per million tokens |
|---|---|
| Input tokens | $1.50 |
| Output tokens | $7.50 |
| Cached input tokens | $0.15 |
| Batch API input | $0.75 |
| Batch API output | $3.75 |
Output tokens cost 5x more than input tokens. This is the single most important fact for cost optimization. Every strategy that reduces output token count has an outsized impact on your bill.
Monthly cost estimates by usage tier
| Usage level | Input tokens/mo | Output tokens/mo | Base cost | Optimized cost |
|---|---|---|---|---|
| Light (individual dev) | 5M | 1M | $15 | $5-8 |
| Medium (small team) | 50M | 10M | $150 | $50-80 |
| Heavy (production app) | 500M | 100M | $1,500 | $400-700 |
| Enterprise (high volume) | 5B | 1B | $15,000 | $4,000-7,000 |
The "optimized cost" column assumes you implement the strategies in this guide: reasoning effort tuning, prompt caching, batch processing where applicable, and prompt optimization. The range depends on your workload characteristics.
Reasoning effort optimization
This is the highest-impact optimization available. Mistral Medium 3.5 supports configurable reasoning effort through the reasoning_effort parameter, which controls how much compute the model spends on chain-of-thought reasoning before generating a response.
Available reasoning levels
| Level | Behavior | Token overhead | Best for |
|---|---|---|---|
| none | No chain-of-thought reasoning | 0% overhead | Classification, extraction, formatting |
| low | Minimal reasoning | ~20% overhead | Simple Q&A, summarization |
| medium | Balanced reasoning (default) | ~50% overhead | General tasks, content generation |
| high | Extended reasoning | ~100-200% overhead | Complex coding, math, multi-step analysis |
How reasoning effort affects costs
When reasoning effort is set to high, the model generates internal reasoning tokens before producing the visible output. These reasoning tokens count toward your output token usage. On complex tasks, reasoning tokens can be 2-3x the visible output length.
Setting reasoning effort to none for simple tasks eliminates this overhead entirely. The model skips chain-of-thought and generates the answer directly.
```python
from mistralai import Mistral

client = Mistral(api_key="your-key")

# Simple classification - no reasoning needed
response = client.chat.complete(
    model="mistral-medium-3.5",
    messages=[{"role": "user", "content": "Classify this support ticket as billing, technical, or general: 'I can't log in to my account'"}],
    reasoning_effort="none"
)

# Complex code review - full reasoning
response = client.chat.complete(
    model="mistral-medium-3.5",
    messages=[{"role": "user", "content": "Review this PR for security vulnerabilities and suggest fixes: ..."}],
    reasoning_effort="high"
)
```
Practical savings from reasoning effort tuning
For a typical enterprise workload with a mix of task types:
- 40% simple tasks (classification, extraction, formatting): switch to none to save ~95% of reasoning tokens on these tasks
- 30% moderate tasks (summarization, Q&A): switch to low to save ~60% of reasoning tokens
- 30% complex tasks (coding, analysis): keep at high; no savings, but better quality
Net result: 40-60% reduction in total output tokens compared to running everything at medium or high.
The key is building a routing layer that classifies incoming requests and sets the appropriate reasoning effort. A simple keyword-based classifier or a small model (Mistral Small) can handle this routing at negligible cost.
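A routing layer like this can be sketched with a plain keyword classifier. The task categories and keywords below are illustrative assumptions, not part of Mistral's API; tune them against your own traffic:

```python
# Minimal sketch of a reasoning-effort router. The keyword lists are
# illustrative assumptions - replace them with whatever signals your
# own request mix provides.
SIMPLE_KEYWORDS = ("classify", "extract", "format", "label", "categorize")
MODERATE_KEYWORDS = ("summarize", "explain", "answer", "translate")

def route_reasoning_effort(prompt: str) -> str:
    """Pick a reasoning_effort level from keywords in the request."""
    text = prompt.lower()
    if any(kw in text for kw in SIMPLE_KEYWORDS):
        return "none"    # classification/extraction: skip chain-of-thought
    if any(kw in text for kw in MODERATE_KEYWORDS):
        return "low"     # simple Q&A and summarization
    return "high"        # default to full reasoning for complex work
```

In production you might replace the keyword match with a call to a small model (such as Mistral Small) for requests the keywords do not catch; the routing cost stays negligible either way.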
Prompt caching strategies
Mistral's prompt caching gives you a 90% discount on input tokens that match a previously cached prompt prefix. This is transformative for workloads with repeated system prompts or shared context.
How prompt caching works
When you send a request, Mistral checks if the beginning of your prompt matches a cached prefix from a recent request. If it does, those tokens are charged at $0.15 per million instead of $1.50 per million.
Cache hits require:
- Same model
- Same account
- Identical token sequence from the start of the prompt
- Request within the cache TTL (typically 5-10 minutes)
Maximizing cache hit rates
Structure your prompts with the most stable content first:
```
[System prompt - same across all requests]    → cached
[Shared context - same within a session]      → cached
[Few-shot examples - same for task type]      → cached
[User-specific input - varies per request]    → not cached
```
The longer your cached prefix, the bigger the savings. A 10,000-token system prompt that hits cache on every request saves $13.50 per thousand requests compared to uncached pricing.
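The arithmetic behind that figure, using the cached and uncached input rates from the pricing table above:

```python
# Savings from a cached prompt prefix, at the rates in the pricing table.
UNCACHED_RATE = 1.50 / 1_000_000   # $ per input token
CACHED_RATE = 0.15 / 1_000_000     # $ per cached input token

def prefix_savings(prefix_tokens: int, requests: int) -> float:
    """Dollars saved when `prefix_tokens` hit cache on every request."""
    per_request = prefix_tokens * (UNCACHED_RATE - CACHED_RATE)
    return per_request * requests

# A 10,000-token prefix hitting cache across 1,000 requests
print(round(prefix_savings(10_000, 1_000), 2))   # → 13.5
```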
Caching strategies by workload type
Chatbot/assistant: Place your system prompt and persona instructions first. These stay constant across all conversations. Cache hit rate: 80-95%.
Document processing pipeline: Batch documents that share the same extraction schema. Send the schema as the system prompt, then individual documents as user messages. Cache hit rate: 90%+.
Code review automation: Use a fixed system prompt with coding standards and review criteria. The prompt stays cached across all PRs. Cache hit rate: 85-95%.
RAG applications: This is trickier. Retrieved context changes per query, which breaks caching. Structure your prompt so the system instructions and few-shot examples come before the retrieved chunks. You will cache the instruction prefix but not the retrieved content.
For more on caching strategies across providers, see prompt caching explained.
Batch API for non-real-time work
Mistral's batch API processes requests asynchronously at 50% of standard pricing. Requests are queued and processed within a 24-hour window.
When to use batch API
- Nightly data processing pipelines
- Bulk content generation
- Dataset labeling and classification
- Code analysis across large repositories
- Any workload where latency is not critical
Batch API pricing
| Token type | Standard | Batch | Savings |
|---|---|---|---|
| Input | $1.50/M | $0.75/M | 50% |
| Output | $7.50/M | $3.75/M | 50% |
Implementation example
```python
import json
from mistralai import Mistral

client = Mistral(api_key="your-key")

# Prepare batch requests
requests = []
for i, document in enumerate(documents):
    requests.append({
        "custom_id": f"doc-{i}",
        "model": "mistral-medium-3.5",
        "messages": [
            {"role": "system", "content": "Extract key entities from this document."},
            {"role": "user", "content": document}
        ],
        "reasoning_effort": "none"
    })

# Write JSONL file
with open("batch_input.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# Upload and create batch job
batch_file = client.files.upload(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch_job = client.batch.create(input_file_id=batch_file.id, model="mistral-medium-3.5")

# Poll the job status until it completes; results are available within 24 hours
```
Combining batch API with reasoning_effort="none" for simple extraction tasks gives you the maximum discount: 50% from batch pricing plus eliminated reasoning overhead.
EAGLE speculative decoding for self-hosted deployments
If you self-host Mistral Medium 3.5, EAGLE speculative decoding can increase inference speed by 2-3x without quality loss.
How EAGLE works
EAGLE uses a small draft model to predict multiple tokens ahead, then verifies them with the full model in a single forward pass. When predictions are correct (which happens frequently for common patterns), you get multiple tokens for the cost of one forward pass.
Mistral provides EAGLE draft model weights alongside the main model weights. Setup with vLLM:
```
vllm serve mistral-medium-3.5 \
  --speculative-model mistral-medium-3.5-eagle \
  --num-speculative-tokens 5 \
  --tensor-parallel-size 4
```
Performance impact
| Metric | Without EAGLE | With EAGLE |
|---|---|---|
| Tokens/second | ~45 | ~110 |
| Time to first token | ~200ms | ~250ms |
| Total throughput | 1x | 2.4x |
EAGLE increases time-to-first-token slightly (the draft model adds a small overhead) but dramatically improves tokens-per-second for generation. For long outputs, the net effect is significantly faster responses.
This does not reduce cost per token; you are still using the same hardware. But it increases throughput per GPU, which means you need fewer GPUs to serve the same request volume. That translates to lower infrastructure costs.
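A back-of-envelope calculation makes the capacity effect concrete. The ~45 and ~110 tokens/second figures come from the table above (per serving replica, i.e. the 4-GPU deployment); the demand number is an illustrative assumption:

```python
import math

# How EAGLE's throughput gain reduces the number of serving replicas
# needed. Throughput figures are per replica, from the table above;
# the 1,000 tokens/sec demand figure is an illustrative assumption.

def replicas_needed(demand_tps: float, per_replica_tps: float) -> int:
    """Serving replicas required to sustain a given token throughput."""
    return math.ceil(demand_tps / per_replica_tps)

demand = 1_000   # peak output tokens/sec across all requests
print(replicas_needed(demand, 45))    # without EAGLE → 23 replicas
print(replicas_needed(demand, 110))   # with EAGLE → 10 replicas
```

Under these assumptions the same hardware serves roughly 2.4x the traffic, which is where the infrastructure savings come from.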
API vs self-hosting break-even analysis
At what point does self-hosting become cheaper than the API?
Cost model
API cost: Directly proportional to token volume. No fixed costs.
Self-hosting cost: Fixed infrastructure cost (GPU rental or purchase) plus operational overhead (engineering time, monitoring, updates).
Break-even calculation
Assuming 4x A100 80GB on a cloud provider at ~$12/hour ($8,640/month):
| Monthly token volume | API cost | Self-host cost | Winner |
|---|---|---|---|
| 50M input + 10M output | $150 | $8,640 | API |
| 500M input + 100M output | $1,500 | $8,640 | API |
| 2B input + 400M output | $6,000 | $8,640 | API |
| 5B input + 1B output | $15,000 | $8,640 | Self-host |
| 10B input + 2B output | $30,000 | $8,640 | Self-host |
The break-even point is roughly 3-4 billion input tokens + 500M-700M output tokens per month, depending on your GPU costs and utilization rate.
Below that volume, the API is cheaper and requires zero infrastructure management. Above it, self-hosting saves money, but you need engineering capacity to manage the deployment.
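The break-even arithmetic from the table reduces to a two-line comparison:

```python
# Reproduce the break-even comparison from the table above.
INPUT_RATE = 1.50             # $ per million input tokens
OUTPUT_RATE = 7.50            # $ per million output tokens
SELF_HOST_MONTHLY = 8_640.0   # 4x A100 80GB at ~$12/hour, 720 hours

def api_cost(input_millions: float, output_millions: float) -> float:
    return input_millions * INPUT_RATE + output_millions * OUTPUT_RATE

def cheaper_option(input_millions: float, output_millions: float) -> str:
    cost = api_cost(input_millions, output_millions)
    return "self-host" if cost > SELF_HOST_MONTHLY else "API"

print(cheaper_option(500, 100))      # 500M in + 100M out → API
print(cheaper_option(5_000, 1_000))  # 5B in + 1B out → self-host
```

Plugging in your own GPU rate and utilization shifts the crossover point, but the structure of the comparison stays the same.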
For more on managing AI infrastructure costs, see AI agent cost management.
Tips for reducing token usage
Beyond the major optimizations above, these smaller techniques compound:
1. Compress system prompts
Most system prompts are verbose. A 2,000-token system prompt that could be 500 tokens wastes 1,500 tokens per request. At 1M requests/month, that is 1.5B wasted input tokens ($2,250 at standard pricing, $225 with caching).
Audit your system prompts. Remove redundant instructions. Use concise language. Test whether shorter prompts produce equivalent output quality.
2. Set max_tokens appropriately
If you know the expected output length, set max_tokens to a reasonable limit. This prevents the model from generating unnecessarily long responses. A classification task needs 10 tokens, not 1,000.
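For a classification call, the cap can be tight. The request below is a sketch; the max_tokens value of 10 is an illustrative choice to tune against your actual label lengths:

```python
# Capping output length for a classification request. The max_tokens
# value is an illustrative assumption - size it to your longest label.
request = {
    "model": "mistral-medium-3.5",
    "messages": [{"role": "user", "content": "Classify this ticket as billing, technical, or general: 'refund not received'"}],
    "reasoning_effort": "none",
    "max_tokens": 10,   # a one-word label never needs more
}
# response = client.chat.complete(**request)
```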
3. Use structured output
Request JSON output with a defined schema. This produces shorter, more predictable responses than free-form text. Mistral Medium 3.5 supports JSON mode natively.
```python
response = client.chat.complete(
    model="mistral-medium-3.5",
    messages=[...],
    response_format={"type": "json_object"}
)
```
4. Batch related requests
Instead of sending 10 separate API calls for 10 documents, send one request with all 10 documents and ask for a combined response. This shares the system prompt cost across all documents and reduces per-request overhead.
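One way to combine documents is a delimited user message with an instruction to key the results by document number. The delimiter format below is an illustrative convention, not anything Mistral requires:

```python
# Combine several documents into one request so the system prompt and
# per-request overhead are paid once. The delimiter is an illustrative
# convention; any unambiguous separator works.
def combine_documents(documents: list[str]) -> str:
    parts = []
    for i, doc in enumerate(documents, start=1):
        parts.append(f"--- Document {i} ---\n{doc}")
    parts.append("Return one JSON object per document, keyed by document number.")
    return "\n\n".join(parts)

docs = ["Invoice from Acme Corp...", "Shipping notice for order..."]
prompt = combine_documents(docs)
```

Keep the combined request within the model's context window and check that per-document quality holds; very large bundles can degrade accuracy on individual items.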
5. Cache responses application-side
If the same query appears frequently, cache the response in your application layer (Redis, in-memory cache). This eliminates the API call entirely for repeated queries.
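A minimal version keys the cache on a hash of the full prompt. In production you would use Redis with a TTL; a dict shows the mechanics:

```python
import hashlib

# Minimal application-side response cache keyed by a hash of the full
# prompt. Swap the dict for Redis (with a TTL) in production.
_cache: dict[str, str] = {}

def cached_complete(prompt: str, call_model) -> str:
    """Return a cached response if this exact prompt was seen before."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(prompt)   # only hit the API on a miss
    return _cache[key]

# call_model would wrap client.chat.complete; stubbed here to show the hit
calls = []
def fake_model(p):
    calls.append(p)
    return f"answer to: {p}"

cached_complete("What is your refund policy?", fake_model)
cached_complete("What is your refund policy?", fake_model)
print(len(calls))   # → 1: the second request never reached the "API"
```

Exact-match caching only helps when prompts repeat verbatim, so it works best for FAQ-style queries; anything with per-user content in the prompt will always miss.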
For more strategies on reducing LLM costs across providers, see how to reduce LLM API costs.
Cost comparison with alternatives
How does optimized Mistral Medium 3.5 compare to other models?
| Model | Input $/M | Output $/M | Optimized monthly cost (medium usage) |
|---|---|---|---|
| Mistral Medium 3.5 | $1.50 | $7.50 | $50-80 |
| Claude 3.5 Sonnet | $3.00 | $15.00 | $120-180 |
| GPT-4o | $2.50 | $10.00 | $100-150 |
| Gemini 1.5 Pro | $1.25 | $5.00 | $45-70 |
| DeepSeek V3 | $0.27 | $1.10 | $15-25 |
| Llama 3.3 70B (self-hosted) | ~$0.50* | ~$2.00* | $25-40 |
*Self-hosted costs are approximate and depend on infrastructure.
Mistral Medium 3.5 is not the cheapest option. DeepSeek V3 and self-hosted Llama are significantly cheaper per token. But Mistral offers a better quality-to-cost ratio than Claude or GPT-4o, with the added benefits of EU data residency and open weights.
The right choice depends on your quality requirements. If Mistral Medium 3.5's benchmark performance meets your needs, it is one of the most cost-effective frontier models available, especially with the optimizations in this guide.
FAQ
How much can I realistically save with reasoning effort tuning?
40-60% on total output token costs for mixed workloads. The savings depend on your task distribution. If 80% of your requests are simple (classification, extraction, formatting), you can save more. If most requests are complex coding or analysis tasks that need full reasoning, savings will be lower. Start by auditing your request types and categorizing them by complexity.
Does prompt caching work across different users?
Yes, as long as the requests share the same prompt prefix and are on the same account. The cache matches on the exact token sequence from the start of the prompt. Different users sending requests with the same system prompt will benefit from caching. The variable user input at the end of the prompt does not affect caching of the shared prefix.
Is the batch API suitable for production workloads?
Only for non-latency-sensitive workloads. Batch requests are processed within a 24-hour window; you cannot guarantee when results will be ready. Use it for overnight processing, bulk analysis, dataset preparation, and similar tasks. For real-time user-facing applications, use the standard API.
When should I self-host instead of using the API?
When your monthly token volume exceeds roughly 3-4 billion input tokens plus 500-700 million output tokens. Below that, the API is cheaper and requires no infrastructure management. Above it, self-hosting on 4x A100/H100 GPUs becomes more cost-effective. Also consider self-hosting if you need data sovereignty guarantees beyond what the API provides, or if you need custom model modifications.
Does EAGLE speculative decoding affect output quality?
No. EAGLE is mathematically equivalent to standard decoding. The draft model proposes tokens, and the full model verifies them. Any incorrect predictions are rejected and regenerated by the full model. The output distribution is identical β you get the same quality at higher speed. The only tradeoff is slightly higher time-to-first-token (about 50ms more) due to the draft model overhead.
Can I combine multiple optimization strategies?
Yes, and you should. The strategies stack. Use reasoning effort tuning to reduce output tokens, prompt caching to reduce input token costs, batch API for non-real-time work, and prompt compression to reduce total token count. A workload using all four strategies can achieve 60-70% cost reduction compared to naive API usage. The key is building the routing and caching infrastructure to apply the right optimization to each request type.