How to Run Mistral Medium 3.5 Locally: Hardware, Setup, and Quantization Guide (2026)
Some links in this article are affiliate links. We earn a commission at no extra cost to you when you purchase through them. Full disclosure.
Mistral Medium 3.5 is a 128B dense model with open weights, a 256K context window, and a 77.6% score on SWE-bench Verified. It is one of the strongest open-weight models available today, but running it locally requires serious hardware. This is not a model you pull on a laptop and start chatting with.
This guide covers exactly what you need: hardware requirements, quantization options, inference engine setup, and when renting cloud GPUs makes more sense than buying them. For a broader overview of the model's capabilities and benchmarks, see our Mistral Medium 3.5 complete guide.
Hardware Requirements
Mistral Medium 3.5 is a dense 128B parameter model. Every parameter is active on every forward pass; there is no MoE sparsity to save you. That means you need enough VRAM to hold the full model plus KV cache for your context window.
| Precision | Model Size | Minimum Hardware | Notes |
|---|---|---|---|
| FP16 (full) | ~256 GB | 4x H100 80GB or 8x A100 80GB | Maximum quality, production use |
| FP8 | ~128 GB | 2x H100 80GB or 4x A100 80GB | Recommended balance |
| Q4 (GGUF) | ~64 GB | 2x A100 80GB or creative offloading | Some quality loss on nuanced tasks |
Key points:
- This is not possible on consumer GPUs. Even at Q4 quantization, you need ~64 GB of VRAM. No combination of RTX 4090s (24 GB each) will comfortably run this model at usable speeds without heavy CPU offloading that kills throughput.
- NVLink matters. Multi-GPU setups need high-bandwidth interconnects. PCIe will bottleneck tensor-parallel inference significantly.
- Budget 256 GB+ system RAM for any configuration. Model loading, KV cache overflow, and OS overhead all eat memory.
- Fast NVMe storage (2 TB+) reduces model load times. The FP16 checkpoint is ~256 GB on disk.
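The sizes in the table above follow from simple arithmetic: parameter count times bytes per parameter. The sketch below reproduces them and adds the standard per-sequence KV cache estimate for a transformer with grouped-query attention. The layer count, KV-head count, head dimension, and the 10% overhead margin are placeholder values chosen for illustration. They are assumptions, not published specifications for Mistral Medium 3.5, so substitute the values from the model's config before relying on the result.

```python
def weight_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8


def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_elem: int = 2) -> float:
    """KV cache for one sequence: one key and one value vector
    per KV head, per layer, per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1e9


def fits(total_vram_gb: float, weights_gb: float, cache_gb: float,
         overhead_fraction: float = 0.10) -> bool:
    """True if weights plus KV cache fit after reserving a margin
    for activations and runtime overhead (margin is an assumption)."""
    usable = total_vram_gb * (1 - overhead_fraction)
    return weights_gb + cache_gb <= usable


# Weight sizes for a 128B dense model at the precisions in the table.
for label, bits in [("FP16", 16), ("FP8", 8), ("Q4", 4)]:
    print(f"{label}: ~{weight_gb(128, bits):.0f} GB")

# HYPOTHETICAL architecture values, for illustration only.
cache = kv_cache_gb(tokens=65536, layers=88, kv_heads=8, head_dim=128)
print(f"KV cache at 64K context: ~{cache:.1f} GB")
print("FP8 on 4x 80 GB:", fits(4 * 80, weight_gb(128, 8), cache))
```

The point of the KV cache term is that "the weights fit" is not the same as "the model runs": long contexts and concurrent requests each add their own cache on top of the weights.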
For context on VRAM planning across different models, see our GPU memory planning guide and how much VRAM do AI models need.
Quantization: GGUF via Unsloth
Community-quantized GGUF versions are available on Hugging Face, courtesy of Unsloth:
Repository: unsloth/Mistral-Medium-3.5-128B-GGUF
Available quantization levels:
| Quantization | Approx. Size | Quality Impact | Use Case |
|---|---|---|---|
| Q8_0 | ~128 GB | Minimal | Near-FP16 quality, needs the same hardware as FP8 |
| Q6_K | ~96 GB | Very slight | Good balance if you have 2x H100 |
| Q5_K_M | ~80 GB | Slight | Fits tighter multi-GPU setups |
| Q4_K_M | ~64 GB | Moderate | Minimum viable for testing |
| Q3_K_M | ~48 GB | Noticeable | Not recommended for production |
Q4_K_M at ~64 GB is the sweet spot for teams that want to experiment without a full H100 cluster. Quality holds up well for coding and general reasoning tasks, though you may notice degradation on highly nuanced instruction-following.
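Choosing a quantization level from the table comes down to a budget check. The sketch below picks the largest listed quantization that fits in a given amount of VRAM after setting aside room for KV cache and runtime overhead. The sizes are the approximate figures from the table; the `reserve_gb` default is an assumed placeholder, not a measured requirement.

```python
from typing import Optional

# Approximate file sizes from the table above, in GB.
QUANT_SIZES_GB = {
    "Q8_0": 128,
    "Q6_K": 96,
    "Q5_K_M": 80,
    "Q4_K_M": 64,
    "Q3_K_M": 48,
}


def largest_quant_that_fits(total_vram_gb: float,
                            reserve_gb: float = 16.0) -> Optional[str]:
    """Return the largest quantization whose weights fit in VRAM
    after reserving `reserve_gb` for KV cache and overhead
    (the reserve is an assumption; tune it for your context length)."""
    budget = total_vram_gb - reserve_gb
    candidates = [(size, name) for name, size in QUANT_SIZES_GB.items()
                  if size <= budget]
    return max(candidates)[1] if candidates else None


for vram in (48, 80, 96, 160):
    print(f"{vram} GB total VRAM -> {largest_quant_that_fits(vram)}")
```

A larger reserve is appropriate for long contexts or many concurrent requests, which pushes the choice toward smaller quantizations.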
Download a specific quantization:
# Install huggingface-cli if needed
pip install huggingface_hub
# Download Q4_K_M quantization
huggingface-cli download unsloth/Mistral-Medium-3.5-128B-GGUF \
--include "Mistral-Medium-3.5-128B-Q4_K_M.gguf" \
--local-dir ./models
EAGLE Speculative Decoding
Mistral provides an official EAGLE speculative decoding model: Mistral-Medium-3.5-128B-EAGLE. This is a small draft model that predicts multiple tokens ahead, which the main model then verifies in a single forward pass. The result is 1.5 to 2x faster token generation with no quality loss.
EAGLE works by running a lightweight draft head alongside the main model. It proposes candidate token sequences, and the main model accepts or rejects them in batch. Because the main model can check several proposed tokens in a single forward pass, which costs about the same as generating one token, you get more tokens per second without changing the output distribution.
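The draft-and-verify loop described above can be illustrated with a toy example. This is a sketch of generic greedy speculative decoding, not EAGLE's actual draft head or vLLM's implementation: the "models" here are hypothetical stand-in functions over integer tokens, and the sketch omits the bonus token real engines emit when every draft token is accepted. It shows the key property: the output matches plain greedy decoding while needing fewer target passes.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       n_new: int, k: int = 5) -> Tuple[List[int], int]:
    """Draft proposes up to k tokens; target verifies them together."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(seq) < goal:
        # 1. The cheap draft model proposes tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft(seq + proposal))
        # 2. The target checks every proposed position. A real engine
        #    does this in ONE batched forward pass; we count it as one.
        target_passes += 1
        for i, tok in enumerate(proposal):
            expected = target(seq + proposal[:i])
            if tok != expected:
                # Keep the agreed prefix, then the target's own token.
                seq.extend(proposal[:i])
                seq.append(expected)
                break
        else:
            seq.extend(proposal)  # every draft token was accepted
    return seq, target_passes


# Hypothetical toy models: the draft is wrong at every 4th position.
def toy_target(seq: List[int]) -> int:
    return (3 * seq[-1] + len(seq)) % 7


def toy_draft(seq: List[int]) -> int:
    guess = toy_target(seq)
    return (guess + 1) % 7 if len(seq) % 4 == 0 else guess


reference = greedy_decode(toy_target, [1], 12)
output, passes = speculative_decode(toy_target, toy_draft, [1], 12, k=5)
print("identical output:", output == reference)
print("target passes:", passes, "vs", 12, "for plain decoding")
```

The speedup depends entirely on how often the draft agrees with the target, which is why a draft model trained for a specific target tends to outperform a generic small model.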
This is supported natively in vLLM:
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-Medium-3.5-128B-Instruct \
--speculative-model mistralai/Mistral-Medium-3.5-128B-EAGLE \
--num-speculative-tokens 5 \
--tensor-parallel-size 4 \
--quantization fp8 \
--max-model-len 65536 \
--trust-remote-code \
--port 8000
The EAGLE model adds minimal VRAM overhead (~1-2 GB) since it shares the main model's weights and only adds a small prediction head. If you are running Medium 3.5 in production, there is no reason not to enable it.
vLLM Setup (Recommended)
vLLM is the recommended inference engine for Mistral Medium 3.5. It handles tensor parallelism, continuous batching, and PagedAttention out of the box.
Install the latest build, including pre-releases, for the newest model support:
pip install vllm --pre --upgrade
Basic serve command (4x A100 80GB, FP8):
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-Medium-3.5-128B-Instruct \
--tensor-parallel-size 4 \
--quantization fp8 \
--max-model-len 65536 \
--trust-remote-code \
--port 8000
With EAGLE speculative decoding (recommended):
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-Medium-3.5-128B-Instruct \
--speculative-model mistralai/Mistral-Medium-3.5-128B-EAGLE \
--num-speculative-tokens 5 \
--tensor-parallel-size 4 \
--quantization fp8 \
--max-model-len 65536 \
--trust-remote-code \
--port 8000
8-GPU setup for higher throughput or longer context:
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-Medium-3.5-128B-Instruct \
--speculative-model mistralai/Mistral-Medium-3.5-128B-EAGLE \
--num-speculative-tokens 5 \
--tensor-parallel-size 8 \
--dtype float16 \
--max-model-len 131072 \
--trust-remote-code \
--port 8000
Test the endpoint:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mistral-Medium-3.5-128B-Instruct",
"messages": [{"role": "user", "content": "Write a Python async web scraper with rate limiting."}],
"max_tokens": 1024
}'
For a deeper comparison of inference engines, see our Ollama vs llama.cpp vs vLLM breakdown.
SGLang Setup
SGLang is a strong alternative to vLLM, especially if you need structured generation or RadixAttention caching for repeated prompt prefixes.
Using Docker (easiest):
docker run --gpus all -p 8100:8100 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
lmsysorg/sglang:latest \
python -m sglang.launch_server \
--model mistralai/Mistral-Medium-3.5-128B-Instruct \
--tp 4 \
--quantization fp8 \
--context-length 65536 \
--host 0.0.0.0 \
--port 8100
Using pip:
pip install "sglang[all]" --upgrade
python -m sglang.launch_server \
--model mistralai/Mistral-Medium-3.5-128B-Instruct \
--tp 4 \
--quantization fp8 \
--context-length 65536 \
--port 8100
SGLang exposes an OpenAI-compatible API:
curl http://localhost:8100/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mistral-Medium-3.5-128B-Instruct",
"messages": [{"role": "user", "content": "Explain the CAP theorem with examples."}],
"max_tokens": 512
}'
Ollama
Mistral Medium 3.5 is available through Ollama. This is the simplest way to get started if you have the hardware, though Ollama is better suited for smaller models. For a 128B model, vLLM or SGLang will give you better throughput and more control.
ollama pull mistral-medium-3.5
ollama run mistral-medium-3.5
Ollama uses llama.cpp under the hood. It pulls whatever quantization the model tag specifies (the default tag is typically a 4-bit build) and automatically splits layers between GPU and CPU based on available memory. For more on running Mistral models with Ollama, see our how to run Mistral models locally guide.
Cloud GPU Alternatives
Most developers won't have 4x A100s at home. Cloud GPU rental is the practical path.
RunPod is one of the most accessible options. A100 80GB instances run approximately $1.50-2.00/hr per GPU, and new accounts get $5 in free credits, enough for a short first test on a 4x A100 pod (roughly 40 to 50 minutes at these rates). Spin up a 4x A100 pod, install vLLM, and you are running in under 30 minutes.
| Provider | GPU | Approx. Cost/hr | Setup Complexity |
|---|---|---|---|
| RunPod | A100 80GB | $1.50-2.00 | Low (templates available) |
| Lambda Cloud | A100 80GB | $1.50-2.50 | Low |
| CoreWeave | H100 80GB | $3.00-4.00 | Medium |
| AWS (p5 instances) | H100 80GB | $4.00+ | High |
For a 4x A100 setup on RunPod, expect to pay roughly $6-8/hr. That is expensive for always-on use, but very reasonable for development, testing, and batch processing workloads.
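The pod pricing above is per-GPU rate times GPU count, and the same arithmetic tells you how far a fixed credit goes. A minimal sketch using the RunPod figures quoted in this section (the function names are illustrative, not part of any provider SDK):

```python
def pod_cost_per_hour(gpu_count: int, price_per_gpu_hour: float) -> float:
    """Hourly cost of a multi-GPU pod."""
    return gpu_count * price_per_gpu_hour


def minutes_of_credit(credit_usd: float, hourly_cost: float) -> float:
    """How many minutes a prepaid credit covers at a given hourly cost."""
    return credit_usd * 60 / hourly_cost


for rate in (1.50, 2.00):
    hourly = pod_cost_per_hour(4, rate)
    minutes = minutes_of_credit(5.00, hourly)
    print(f"4x A100 at ${rate:.2f}/GPU-hr: ${hourly:.2f}/hr, "
          f"$5 credit lasts {minutes:.1f} min")
```

Keep in mind that billing usually starts when the pod does, so time spent downloading weights counts against the credit.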
For a full comparison, see our best cloud GPU providers 2026 guide.
Performance Expectations
Throughput varies significantly based on hardware, quantization, and whether EAGLE speculative decoding is enabled.
| Setup | Precision | EAGLE | Approx. tok/s (single user) |
|---|---|---|---|
| 4x A100 80GB | FP8 | No | 15-25 tok/s |
| 4x A100 80GB | FP8 | Yes | 25-40 tok/s |
| 2x H100 80GB | FP8 | No | 25-35 tok/s |
| 2x H100 80GB | FP8 | Yes | 40-55 tok/s |
| 8x A100 80GB | FP16 | Yes | 35-50 tok/s |
| 4x H100 80GB | FP16 | Yes | 55-75 tok/s |
These are single-user, single-request estimates. With continuous batching, total throughput scales well: vLLM can handle 10+ concurrent users on a 4x A100 setup, though per-user latency increases.
For interactive coding use, 20+ tok/s feels responsive. Anything below 10 tok/s starts to feel sluggish for real-time work.
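To translate tokens per second into waiting time, divide the response length by the decode rate. The sketch below does this for a 1024-token response (the `max_tokens` value used in the vLLM test request) at a few rates from the discussion above. It ignores prompt processing time, which is an additional delay before the first token.

```python
def seconds_to_generate(num_tokens: int, tokens_per_second: float) -> float:
    """Decode time only; excludes prompt prefill and network latency."""
    return num_tokens / tokens_per_second


for rate in (10, 20, 40, 75):
    wait = seconds_to_generate(1024, rate)
    print(f"{rate:>3} tok/s -> {wait:.1f} s for a 1024-token response")
```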
When to Self-Host vs Use the API
Mistral offers Medium 3.5 via API at $1.50/M input tokens and $7.50/M output tokens. Here is how to think about the decision:
Use the API when:
- You process fewer than ~50M tokens/month
- You need zero infrastructure overhead
- You want instant access without GPU procurement
- Your workload is bursty (heavy some days, idle others)
Self-host when:
- You process on the order of a billion tokens per month or more (roughly the break-even point against a dedicated 4x A100 instance)
- Data privacy or compliance requires on-premises inference
- You need guaranteed latency without rate limits
- You want to customize serving (batching, caching, routing)
Quick cost comparison:
A 4x A100 80GB cloud instance at ~$7/hr costs roughly $5,000/month running 24/7. At API rates, $5,000 buys you about 3.3 billion input tokens or 667 million output tokens. If your monthly volume exceeds that, self-hosting starts to make financial sense, and the gap widens quickly at scale.
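The break-even arithmetic can be checked directly. This sketch uses the prices quoted in this section ($7/hr for the instance, $1.50 and $7.50 per million input and output tokens) and assumes a 30-day month; the function names are illustrative.

```python
def monthly_instance_cost(hourly_usd: float, hours_per_day: float = 24,
                          days: int = 30) -> float:
    """Cost of keeping an instance running for a month."""
    return hourly_usd * hours_per_day * days


def tokens_for_budget(budget_usd: float, price_per_million_usd: float) -> float:
    """Number of tokens a budget buys at a per-million-token price."""
    return budget_usd / price_per_million_usd * 1_000_000


def api_cost(input_tokens: float, output_tokens: float,
             input_price: float = 1.50, output_price: float = 7.50) -> float:
    """API bill in USD for a given monthly token mix."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


print(f"Instance, 24/7: ${monthly_instance_cost(7.0):,.0f}/month")
print(f"$5,000 of input tokens:  {tokens_for_budget(5000, 1.50) / 1e9:.2f} billion")
print(f"$5,000 of output tokens: {tokens_for_budget(5000, 7.50) / 1e6:.0f} million")
# Example mix: 1B input and 100M output tokens in a month.
print(f"API cost for 1B in / 100M out: ${api_cost(1e9, 1e8):,.0f}")
```

Real workloads mix input and output tokens, so the `api_cost` helper with your own monthly mix gives a more useful comparison than either extreme alone.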
For teams processing hundreds of millions of tokens daily for coding agents, CI pipelines, or batch analysis, self-hosting can cut costs several-fold compared to API pricing, provided the GPUs stay busy.
FAQ
Can I run Mistral Medium 3.5 on a single GPU?
Not practically. At Q4 quantization (~64 GB), the weights exceed the VRAM of any consumer GPU, and on a single 80 GB A100 or H100 they leave only about 16 GB for KV cache and runtime overhead. The minimum practical setup is 2x A100 80GB at Q4, or 2x H100 80GB at FP8. For consumer hardware, look at smaller Mistral models like Codestral or Devstral Small.
What is the cheapest way to try Medium 3.5 locally?
Sign up for RunPod and use the $5 free credits on a 4x A100 80GB pod. Install vLLM, pull the model, and run a first test. At roughly $6-8/hr the credits cover about 40 to 50 minutes of pod time, so expect to add funds for a longer session.
How does EAGLE speculative decoding affect quality?
It does not. EAGLE only changes the speed of generation, not the output distribution. The draft model proposes tokens, and the main model verifies them; rejected tokens are regenerated normally. You get the same outputs faster.
Should I use vLLM or SGLang?
For most users, vLLM is the safer choice: it has broader community support, more documentation, and native EAGLE support. SGLang can outperform vLLM on workloads with repeated prompt prefixes thanks to RadixAttention. If you are unsure, start with vLLM. See our Ollama vs llama.cpp vs vLLM comparison.
How does Medium 3.5 compare to DeepSeek V4-Flash for self-hosting?
V4-Flash (284B MoE, 13B active) is much easier to self-host: it fits on a single H200 or 2x A100 setup. Medium 3.5 (128B dense) needs 4x A100 or 2x H100 at FP8. However, Medium 3.5 scores higher on SWE-bench (77.6% vs ~76%) and has a simpler architecture that is easier to optimize. Choose V4-Flash if hardware is constrained; choose Medium 3.5 if you have the GPUs and want a dense model with no expert-routing complexity.
Is the Q4 GGUF quantization good enough for coding tasks?
For most coding tasks (code generation, refactoring, debugging), Q4_K_M performs well. You will see some degradation on tasks requiring very precise instruction following or subtle reasoning, but for day-to-day development work, the quality difference is minor. Start with Q4 to test, then move to FP8 for production if quality matters.
Can I use Medium 3.5 with coding agents like Aider or Claude Code?
Yes. Once you have vLLM or SGLang running with an OpenAI-compatible endpoint, any tool that supports custom OpenAI-compatible APIs can connect to it. Point your coding agent at http://localhost:8000/v1 and set the model name accordingly.
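For tools you write yourself, the same endpoint can be called with nothing but the Python standard library. The sketch below mirrors the curl request from the vLLM section, assuming a server is listening on `http://localhost:8000/v1` and serving the model name shown; the helper names `build_chat_request` and `chat` are illustrative, not part of any SDK.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumes the vLLM server from this guide
MODEL = "mistralai/Mistral-Medium-3.5-128B-Instruct"


def build_chat_request(prompt: str, max_tokens: int = 256,
                       base_url: str = BASE_URL,
                       model: str = MODEL) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the request and return the assistant's reply text.
    Requires a running server; raises URLError if none is reachable."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(chat("Explain the CAP theorem in two sentences."))` returns the model's reply. For SGLang, pass `base_url="http://localhost:8100/v1"` instead.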