
How to Run Mistral Medium 3.5 Locally: Hardware, Setup, and Quantization Guide (2026)


Some links in this article are affiliate links. We earn a commission at no extra cost to you when you purchase through them. Full disclosure.

Mistral Medium 3.5 is a 128B dense model: open weights, 256K context, 77.6% on SWE-bench Verified. It is one of the strongest open-weight models available today, but running it locally requires serious hardware. This is not a model you pull on a laptop and start chatting with.

This guide covers exactly what you need: hardware requirements, quantization options, inference engine setup, and when renting cloud GPUs makes more sense than buying them. For a broader overview of the model's capabilities and benchmarks, see our Mistral Medium 3.5 complete guide.

Hardware Requirements

Mistral Medium 3.5 is a dense 128B parameter model. Every parameter is active on every forward pass, and there is no MoE sparsity to save you. That means you need enough VRAM to hold the full model plus KV cache for your context window.

| Precision | Model Size | Minimum Hardware | Notes |
| --- | --- | --- | --- |
| FP16 (full) | ~256 GB | 4x H100 80GB or 8x A100 80GB | Maximum quality, production use |
| FP8 | ~128 GB | 2x H100 80GB or 4x A100 80GB | Recommended balance |
| Q4 (GGUF) | ~64 GB | 2x A100 80GB or creative offloading | Some quality loss on nuanced tasks |
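
The sizes in the table follow from simple arithmetic: parameter count times bits per parameter. The sketch below is a rough estimate only, assuming decimal gigabytes and ignoring KV cache, activations, and per-format metadata (real checkpoints and GGUF files tend to run somewhat larger). The function name is illustrative.

```python
def weight_size_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate weight footprint in decimal GB (1 GB = 1e9 bytes)."""
    # params_billion * 1e9 params * (bits / 8) bytes, divided by 1e9 bytes per GB
    return params_billion * bits_per_param / 8


# Nominal bit widths; real quantized formats also store scales and metadata.
for name, bits in [("FP16", 16), ("FP8", 8), ("Q4", 4)]:
    print(f"{name}: ~{weight_size_gb(128, bits):.0f} GB")
```

Whatever VRAM is left after the weights is what you have for KV cache, so a configuration that only just holds the weights will be limited to short contexts.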

Key points:

  • This is not practical on consumer GPUs. Even at Q4 quantization, you need ~64 GB of VRAM for the weights alone. An RTX 4090 has 24 GB, so it takes at least three cards just to hold the model, linked over PCIe and with almost nothing left for KV cache. In practice that means heavy CPU offloading, which kills throughput.
  • NVLink matters. Multi-GPU setups need high-bandwidth interconnects. PCIe will bottleneck tensor-parallel inference significantly.
  • Budget 256 GB+ system RAM for any configuration. Model loading, KV cache overflow, and OS overhead all eat memory.
  • Fast NVMe storage (2 TB+) reduces model load times. The FP16 checkpoint is ~256 GB on disk.

For context on VRAM planning across different models, see our GPU memory planning guide and how much VRAM do AI models need.

Quantization: GGUF via Unsloth

Community-quantized GGUF versions are available on Hugging Face, courtesy of Unsloth:

Repository: unsloth/Mistral-Medium-3.5-128B-GGUF

Available quantization levels:

| Quantization | Approx. Size | Quality Impact | Use Case |
| --- | --- | --- | --- |
| Q8_0 | ~128 GB | Minimal | Near-FP16 quality, needs the same hardware as FP8 |
| Q6_K | ~96 GB | Very slight | Good balance if you have 2x H100 |
| Q5_K_M | ~80 GB | Slight | Fits tighter multi-GPU setups |
| Q4_K_M | ~64 GB | Moderate | Minimum viable for testing |
| Q3_K_M | ~48 GB | Noticeable | Not recommended for production |

Q4_K_M at ~64 GB is the sweet spot for teams that want to experiment without a full H100 cluster. Quality holds up well for coding and general reasoning tasks, though you may notice degradation on highly nuanced instruction-following.
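
One way to choose a level from the table is to take the largest file that still leaves headroom for KV cache. The sketch below uses the approximate sizes listed above. The 15% headroom default and the function name are assumptions for illustration, not measured requirements.

```python
from typing import Optional

# Approximate GGUF sizes in GB, copied from the table above (largest first).
QUANT_SIZES_GB = {
    "Q8_0": 128,
    "Q6_K": 96,
    "Q5_K_M": 80,
    "Q4_K_M": 64,
    "Q3_K_M": 48,
}


def pick_quant(total_vram_gb: float, headroom: float = 0.15) -> Optional[str]:
    """Return the largest quantization whose weights fit in the VRAM budget.

    headroom is the fraction of VRAM reserved for KV cache and activations
    (an assumed value; tune it for your context length and batch size).
    """
    budget = total_vram_gb * (1 - headroom)
    for name, size_gb in QUANT_SIZES_GB.items():
        if size_gb <= budget:
            return name
    return None  # nothing fits without CPU offloading


print(pick_quant(160))  # two 80 GB GPUs
print(pick_quant(80))   # one 80 GB GPU
print(pick_quant(24))   # one 24 GB consumer GPU
```

Long contexts need far more than 15% headroom, so treat the result as an upper bound on quality rather than a recommendation.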

Download a specific quantization:

# Install huggingface-cli if needed
pip install huggingface_hub

# Download Q4_K_M quantization
huggingface-cli download unsloth/Mistral-Medium-3.5-128B-GGUF \
  --include "Mistral-Medium-3.5-128B-Q4_K_M.gguf" \
  --local-dir ./models

EAGLE Speculative Decoding

Mistral provides an official EAGLE speculative decoding model: Mistral-Medium-3.5-128B-EAGLE. This is a small draft model that predicts multiple tokens ahead, which the main model then verifies in a single forward pass. The result is 1.5 to 2x faster token generation with no quality loss.

EAGLE works by running a lightweight draft head alongside the main model. It proposes candidate token sequences, and the main model accepts or rejects them in batch. Because checking several proposed tokens in one forward pass costs about the same as generating a single token, you get more tokens per second without changing the output distribution.

This is supported natively in vLLM:

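# FP8 is selected with --quantization; --dtype only accepts 16/32-bit float types.
# Newer vLLM releases replace the two speculative flags below with a single
# --speculative-config JSON argument; check --help for your installed version.
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-Medium-3.5-128B-Instruct \
  --speculative-model mistralai/Mistral-Medium-3.5-128B-EAGLE \
  --num-speculative-tokens 5 \
  --tensor-parallel-size 4 \
  --quantization fp8 \
  --max-model-len 65536 \
  --trust-remote-code \
  --port 8000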

The EAGLE model adds minimal VRAM overhead (~1-2 GB) since it shares the main model's weights and only adds a small prediction head. If you are running Medium 3.5 in production, there is no reason not to enable it.
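
The propose-then-verify loop can be illustrated with a toy greedy version. This is a simplified sketch of speculative decoding in general, not EAGLE's actual draft head or vLLM's implementation: the two "models" are made-up deterministic functions, verification is written as a loop where a real engine would use one batched forward pass, and the bonus token that real implementations append after a fully accepted draft is omitted.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def toy_target(seq: List[int]) -> int:
    """Stand-in for the large model: a deterministic next-token rule."""
    return (seq[-1] * 3 + len(seq)) % 7


def toy_draft(seq: List[int]) -> int:
    """Stand-in for the draft model: agrees with the target except at every 4th position."""
    guess = toy_target(seq)
    return guess if len(seq) % 4 else (guess + 1) % 7


def greedy_decode(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       n_new: int, k: int = 5) -> Tuple[List[int], int]:
    seq = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(seq) < goal:
        # 1. The draft proposes up to k tokens, one after another.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft(seq + proposal))
        # 2. The target checks the proposal (counted as one verification pass).
        target_passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(seq + accepted)
            accepted.append(expected)
            if tok != expected:
                break  # first mismatch: keep the target's token, drop the rest
        seq.extend(accepted)
    return seq, target_passes


out, passes = speculative_decode(toy_target, toy_draft, [1], n_new=12, k=5)
assert out == greedy_decode(toy_target, [1], 12)  # identical output
print(f"{passes} verification passes for 12 new tokens")
```

Every token that ends up in the sequence is one the target would have chosen itself, which is why the output is unchanged. The speedup depends entirely on how often the draft agrees with the target.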

vLLM Setup

vLLM is the recommended inference engine for Mistral Medium 3.5. It handles tensor parallelism, continuous batching, and PagedAttention out of the box.

Install the latest release, including pre-releases, for the newest model support:

pip install vllm --pre --upgrade

Basic serve command (4x A100 80GB, FP8):

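# FP8 is selected with --quantization, not --dtype. A100s have no native FP8
# compute, so expect vLLM to run FP8 there as weight-only quantization.
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-Medium-3.5-128B-Instruct \
  --tensor-parallel-size 4 \
  --quantization fp8 \
  --max-model-len 65536 \
  --trust-remote-code \
  --port 8000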

8-GPU setup for higher throughput or longer context:

python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-Medium-3.5-128B-Instruct \
  --speculative-model mistralai/Mistral-Medium-3.5-128B-EAGLE \
  --num-speculative-tokens 5 \
  --tensor-parallel-size 8 \
  --dtype float16 \
  --max-model-len 131072 \
  --trust-remote-code \
  --port 8000

Test the endpoint:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-Medium-3.5-128B-Instruct",
    "messages": [{"role": "user", "content": "Write a Python async web scraper with rate limiting."}],
    "max_tokens": 1024
  }'
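
The same request can be sent from Python using only the standard library. This is a minimal sketch that assumes the vLLM server above is listening on localhost:8000 and returns the usual OpenAI-style response shape; the helper names are illustrative.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    base_url: str = "http://localhost:8000/v1",
    model: str = "mistralai/Mistral-Medium-3.5-128B-Instruct",
    max_tokens: int = 1024,
) -> urllib.request.Request:
    """Build (but do not send) a chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the request and return the first choice's text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=300) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `ask("Write a Python async web scraper with rate limiting.")` mirrors the curl command. Changing `base_url` to port 8100 would point the same helper at an SGLang server.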

For a deeper comparison of inference engines, see our Ollama vs llama.cpp vs vLLM breakdown.

SGLang Setup

SGLang is a strong alternative to vLLM, especially if you need structured generation or RadixAttention caching for repeated prompt prefixes.

Using Docker (easiest):

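# --host 0.0.0.0 makes the server reachable through the published port;
# --shm-size gives multi-GPU communication enough shared memory.
docker run --gpus all -p 8100:8100 \
  --shm-size 32g \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  lmsysorg/sglang:latest \
  python -m sglang.launch_server \
    --model mistralai/Mistral-Medium-3.5-128B-Instruct \
    --tp 4 \
    --quantization fp8 \
    --context-length 65536 \
    --host 0.0.0.0 \
    --port 8100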

Using pip:

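# Quote the extras so shells like zsh do not expand the brackets
pip install "sglang[all]" --upgrade

# FP8 is selected with --quantization, not --dtype
python -m sglang.launch_server \
  --model mistralai/Mistral-Medium-3.5-128B-Instruct \
  --tp 4 \
  --quantization fp8 \
  --context-length 65536 \
  --port 8100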

SGLang exposes an OpenAI-compatible API:

curl http://localhost:8100/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-Medium-3.5-128B-Instruct",
    "messages": [{"role": "user", "content": "Explain the CAP theorem with examples."}],
    "max_tokens": 512
  }'

Ollama

Mistral Medium 3.5 is available through Ollama. This is the simplest way to get started if you have the hardware, though Ollama is better suited for smaller models. For a 128B model, vLLM or SGLang will give you better throughput and more control.

ollama pull mistral-medium-3.5
ollama run mistral-medium-3.5

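Ollama uses llama.cpp under the hood. It pulls whichever quantization is attached to the tag you request (the default tag is typically a 4-bit build) and automatically splits layers between GPU and CPU when the model does not fit in VRAM. For more on running Mistral models with Ollama, see our how to run Mistral models locally guide.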

Cloud GPU Alternatives

Most developers won't have 4x A100s at home. Cloud GPU rental is the practical path.

RunPod is one of the most accessible options. A100 80GB instances run approximately $1.50–2.00/hr, and new accounts get $5 in free credits. On a 4x A100 pod that covers well under an hour (roughly 40 to 50 minutes), enough for a short smoke test of Medium 3.5. Spin up a 4x A100 pod, install vLLM, and you are running in under 30 minutes.

| Provider | GPU | Approx. Cost/hr | Setup Complexity |
| --- | --- | --- | --- |
| RunPod | A100 80GB | $1.50–2.00 | Low (templates available) |
| Lambda Cloud | A100 80GB | $1.50–2.50 | Low |
| CoreWeave | H100 80GB | $3.00–4.00 | Medium |
| AWS (p5 instances) | H100 80GB | $4.00+ | High |

For a 4x A100 setup on RunPod, expect to pay roughly $6–8/hr. That is expensive for always-on use, but very reasonable for development, testing, and batch processing workloads.
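
The pod price is just the per-GPU rate times the GPU count, and the same arithmetic shows how far a fixed credit goes. A small sketch using the approximate RunPod figures quoted above; the function names are illustrative, and actual prices vary.

```python
def pod_cost_per_hour(gpu_count: int, price_per_gpu_hour: float) -> float:
    """Hourly cost of a multi-GPU pod."""
    return gpu_count * price_per_gpu_hour


def credit_minutes(credit_usd: float, cost_per_hour: float) -> float:
    """How many minutes a credit balance lasts at a given hourly cost."""
    return credit_usd / cost_per_hour * 60


low = pod_cost_per_hour(4, 1.50)   # 4x A100 at the low end of the range
high = pod_cost_per_hour(4, 2.00)  # 4x A100 at the high end of the range
print(f"4x A100 pod: ${low:.2f} to ${high:.2f} per hour")
print(f"$5 of credit: {credit_minutes(5, high):.0f} to {credit_minutes(5, low):.0f} minutes")
```

Remember that the clock runs while the weights download, so model load time comes out of the same budget.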

For a full comparison, see our best cloud GPU providers 2026 guide.

Performance Expectations

Throughput varies significantly based on hardware, quantization, and whether EAGLE speculative decoding is enabled.

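| Setup | Precision | EAGLE | Approx. tok/s (single user) |
| --- | --- | --- | --- |
| 4x A100 80GB | FP8 | No | 15–25 tok/s |
| 4x A100 80GB | FP8 | Yes | 25–40 tok/s |
| 2x H100 80GB | FP8 | No | 25–35 tok/s |
| 2x H100 80GB | FP8 | Yes | 40–55 tok/s |
| 8x A100 80GB | FP16 | Yes | 35–50 tok/s |
| 4x H100 80GB | FP16 | Yes | 55–75 tok/s |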

These are single-user, single-request estimates. With continuous batching, total throughput scales well: vLLM can handle 10+ concurrent users on a 4x A100 setup, though per-user latency increases.

For interactive coding use, 20+ tok/s feels responsive. Anything below 10 tok/s starts to feel sluggish for real-time work.
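
To translate tokens per second into wait time, divide the response length by the decode rate. The sketch below ignores prompt processing (prefill) time, which adds to the total for long inputs; the rates are sample points from the estimates above, not measurements.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to stream a response at a steady decode rate (prefill not included)."""
    return output_tokens / tokens_per_second


# Sample decode rates taken from the ranges discussed above.
for label, rate in [("sluggish", 10), ("responsive", 20), ("4x H100 + EAGLE, high end", 75)]:
    secs = generation_seconds(1024, rate)
    print(f"{label:>26}: 1024 tokens in {secs:.1f} s at {rate} tok/s")
```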

When to Self-Host vs Use the API

Mistral offers Medium 3.5 via API at $1.50/M input tokens and $7.50/M output tokens. Here is how to think about the decision:

Use the API when:

  • You process fewer than ~50M tokens/month
  • You need zero infrastructure overhead
  • You want instant access without GPU procurement
  • Your workload is bursty (heavy some days, idle others)

Self-host when:

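  • You process on the order of a billion or more tokens per month (at $1.50/$7.50 per million tokens, a ~$5,000/month 4x A100 rental breaks even at roughly 667M output tokens or 3.3B input tokens)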
  • Data privacy or compliance requires on-premises inference
  • You need guaranteed latency without rate limits
  • You want to customize serving (batching, caching, routing)

Quick cost comparison:

A 4x A100 80GB cloud instance at ~$7/hr costs roughly $5,000/month running 24/7. At API rates, $5,000 buys you about 3.3B input tokens or 667M output tokens. If you are processing more than that monthly, self-hosting starts to make financial sense, and the gap widens quickly at scale.
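
The break-even volume depends on the mix of input and output tokens, since output costs five times as much. A rough sketch using the prices quoted in this section ($7/hr rental, $1.50 and $7.50 per million tokens) and assuming a 30-day month; the function names and the 80/20 mix are illustrative.

```python
HOURS_PER_MONTH = 24 * 30  # assumes a 30-day month of 24/7 use


def monthly_rental_usd(cost_per_hour: float) -> float:
    return cost_per_hour * HOURS_PER_MONTH


def breakeven_tokens_millions(
    monthly_budget_usd: float,
    input_share: float,
    input_price_per_m: float = 1.50,
    output_price_per_m: float = 7.50,
) -> float:
    """Total tokens (in millions) at which API spend equals the monthly budget.

    input_share is the fraction of all tokens that are input tokens.
    """
    blended_price = (input_share * input_price_per_m
                     + (1 - input_share) * output_price_per_m)
    return monthly_budget_usd / blended_price


rental = monthly_rental_usd(7.0)
print(f"Rental: ${rental:,.0f}/month")
for share in (1.0, 0.8, 0.0):
    tokens_m = breakeven_tokens_millions(rental, share)
    print(f"{share:.0%} input tokens: break-even at ~{tokens_m:,.0f}M tokens/month")
```

This only compares list prices. It does not check whether a single 4x A100 node can actually serve that many tokens in a month, and it leaves out engineering time.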

For teams processing hundreds of millions of tokens daily for coding agents, CI pipelines, or batch analysis, self-hosting can cut costs severalfold compared to API pricing, provided the hardware can sustain that volume.

FAQ

Can I run Mistral Medium 3.5 on a single GPU?

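Not on consumer hardware, and not at FP8 or FP16 on an 80 GB card. At Q4 quantization (~64 GB) the weights do fit within a single 80 GB data-center GPU such as an A100 or H100, or more comfortably on an H200 (141 GB), but an 80 GB card leaves only about 16 GB for KV cache and activations, so context length and concurrency are tightly limited. For FP8, the practical minimum is 2x H100 80GB or 4x A100 80GB. For consumer hardware, look at smaller Mistral models like Codestral or Devstral Small.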

What is the cheapest way to try Medium 3.5 locally?

Sign up for RunPod and use the $5 free credits on a 4x A100 80GB pod. Install vLLM, pull the model, and run a short test. At roughly $6–8/hr for the pod, the credits cover well under an hour, so expect to add a few dollars for a longer session.

How does EAGLE speculative decoding affect quality?

It does not. EAGLE only changes the speed of generation, not the output distribution. The draft model proposes tokens, and the main model verifies them; rejected tokens are regenerated normally. You get the same outputs faster.

Should I use vLLM or SGLang?

For most users, vLLM is the safer choice: it has broader community support, more documentation, and native EAGLE support. SGLang can outperform vLLM on workloads with repeated prompt prefixes thanks to RadixAttention. If you are unsure, start with vLLM. See our Ollama vs llama.cpp vs vLLM comparison.

How does Medium 3.5 compare to DeepSeek V4-Flash for self-hosting?

V4-Flash (284B MoE, 13B active) is much easier to self-host: it fits on a single H200 or 2x A100 setup. Medium 3.5 (128B dense) needs 4x A100 minimum at FP8. However, Medium 3.5 scores higher on SWE-bench (77.6% vs ~76%) and has a simpler architecture that is easier to optimize. Choose V4-Flash if hardware is constrained; choose Medium 3.5 if you have the GPUs and want a dense model with no expert-routing complexity.

Is the Q4 GGUF quantization good enough for coding tasks?

For most coding tasks (code generation, refactoring, debugging), Q4_K_M performs well. You will see some degradation on tasks requiring very precise instruction following or subtle reasoning, but for day-to-day development work, the quality difference is minor. Start with Q4 to test, then move to FP8 for production if quality matters.

Can I use Medium 3.5 with coding agents like Aider or Claude Code?

Yes. Once you have vLLM or SGLang running with an OpenAI-compatible endpoint, any tool that supports custom OpenAI-compatible APIs can connect to it. Point your coding agent at http://localhost:8000/v1 and set the model name accordingly.
