Qwen 3.6-35B-A3B is one of the most capable models you can run on consumer hardware right now. It scores 73.4% on SWE-bench Verified โ on par with frontier API models โ yet it runs on a MacBook Pro or a single GPU thanks to its Mixture-of-Experts architecture: 35 billion total parameters, but only 3 billion active at inference time.
Update (April 23, 2026): Alibaba released Qwen 3.6-27B, a 27B dense model that runs on a Mac with 22GB VRAM. See our 27B local setup guide for hardware requirements and Ollama/vLLM setup.
Itโs Apache 2.0 licensed, supports 262K context (extensible to 1M via YaRN), handles vision (image + video), and has thinking mode enabled by default. Simon Willison has been running it on his M5 MacBook via LM Studio. Thereโs no reason you canโt do the same.
This guide covers three ways to get it running: Ollama (fastest setup), LM Studio (GUI), and vLLM (production serving). For a deeper look at the model itself, see our Qwen 3.6-35B-A3B complete guide.
Hardware Requirements
You donโt need a data center. The MoE architecture keeps memory usage surprisingly low.
| Tier | Hardware | Quantization | Experience |
|---|---|---|---|
| Minimum | 16GB RAM (M-series Mac) or 16GB VRAM GPU | Q3_K_S (~14GB) | Usable, some quality loss. Short context only. |
| Recommended | 24GB+ RAM (Mac) or 24GB VRAM GPU (RTX 3090/4090) | Q4_K_S (~21GB) | Good quality, solid speed. Best bang for buck. |
| Ideal | 32GB+ RAM (Mac) or 48GB+ VRAM | Q4_K_M (~24GB) or higher | Near-full quality, long context, fast inference. |
CPU-only inference works but is slow โ expect 1-3 tokens/second. If your hardware doesnโt meet the recommended tier, cloud GPU providers let you rent RTX 4090s or A100s by the hour for a few dollars. For more on VRAM planning, see How Much VRAM Do You Need for AI?
Which Quantization to Pick
All GGUF quantizations below are available from Unsloth on Hugging Face.
| Quantization | File Size | Quality | Best For |
|---|---|---|---|
| Q3_K_S | ~14GB | Lower โ noticeable degradation | 16GB machines, testing only |
| Q4_K_S | ~21GB | Good โ minimal quality loss | 24GB VRAM / 24GB+ Mac RAM (recommended) |
| Q4_K_M | ~24GB | Better โ slightly sharper reasoning | 32GB+ Mac RAM / 24GB VRAM with tight fit |
| Q5_K_M | ~28GB | Near-original | 48GB+ VRAM or 36GB+ Mac RAM |
For most people: Q4_K_S is the sweet spot. It fits comfortably in 24GB and retains strong coding and reasoning performance.
Method 1: Ollama (Easiest)
Ollama handles downloading, quantization selection, and serving in one step. Three commands and youโre running.
1. Install Ollama (if you havenโt already):
curl -fsSL https://ollama.com/install.sh | sh
2. Pull and run the model:
ollama run qwen3.6:35b-a3b
Thatโs it. Ollama auto-selects a quantization that fits your hardware. To force a specific quant:
ollama run qwen3.6:35b-a3b-q4_K_S
3. Verify itโs working:
The model will drop you into an interactive chat. Thinking mode is on by default โ youโll see the modelโs reasoning before its answer. To disable thinking for faster responses, add /no_think to your prompt.
Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1 automatically, so you can point any tool at it immediately.
If you hit memory issues, check our Ollama out of memory fix guide.
Method 2: LM Studio (GUI)
LM Studio gives you a visual interface for downloading, configuring, and chatting with local models. Great if you prefer not to use the terminal.
- Download and install LM Studio for your platform.
- Open the app and click the Search bar.
- Search for โQwen3.6-35B-A3Bโ.
- Look for the Unsloth GGUF uploads. Select Q4_K_S (or Q3_K_S if youโre on 16GB).
- Click Download and wait for it to finish (~21GB for Q4_K_S).
- Go to the Chat tab, select the downloaded model, and start chatting.
To use it as a local API server: go to the Server tab, load the model, and start the server. It exposes an OpenAI-compatible endpoint at http://localhost:1234/v1.
Simon Willison confirmed this works well on Apple Silicon โ he ran it on his M5 MacBook with solid performance.
Method 3: vLLM (Production Serving)
For serving Qwen 3.6 to multiple users or integrating into a pipeline, vLLM gives you high-throughput inference with continuous batching.
1. Install vLLM:
pip install vllm
2. Start the server:
vllm serve Qwen/Qwen3.6-35B-A3B \
--tensor-parallel-size 1 \
--max-model-len 262144
This loads the full-precision model from Hugging Face. For multi-GPU setups, increase --tensor-parallel-size. The server exposes an OpenAI-compatible API at http://localhost:8000/v1.
Note: vLLM requires a CUDA GPU. For Mac users, stick with Ollama or LM Studio. SGLang is also supported as an alternative serving backend.
Recommended Sampling Settings
These settings work well for coding and reasoning tasks with Qwen 3.6:
| Parameter | Coding / Reasoning | Creative / Chat |
|---|---|---|
| Temperature | 0.6 | 0.8 |
| Top-P | 0.95 | 0.95 |
| Top-K | 20 | 40 |
| Min-P | 0.0 | 0.0 |
| Thinking | On (default) | Off for speed |
Keep thinking mode on for complex tasks โ it significantly improves accuracy. Disable it with /no_think in Ollama or via the enable_thinking=False parameter in the API for simple Q&A where speed matters more.
Using With Coding Tools
Qwen 3.6 works as a drop-in backend for popular local coding tools. Point them at your Ollama or LM Studio API:
- Aider:
aider --model ollama/qwen3.6:35b-a3bโ works out of the box. The 73.4% SWE-bench score means it handles real-world code edits well. - Continue.dev: Add an Ollama provider in your VS Code settings, select
qwen3.6:35b-a3bas the model. - Open WebUI: Connect to
http://localhost:11434and the model appears automatically.
For more local coding model options, see Best AI Models for Coding Locally in 2026.
Troubleshooting
Out of memory: Switch to a smaller quantization (Q3_K_S at ~14GB). Close other apps. On Mac, check Activity Monitor for memory pressure. See our Ollama out of memory fix for detailed solutions.
Slow inference (< 2 tok/s): Youโre likely running on CPU. Ensure Ollama detects your GPU with ollama ps. On Mac, make sure youโre on Apple Silicon โ Intel Macs will be painfully slow. Disable thinking mode for 2-3x speed improvement on simple tasks.
Context too long errors: The model supports 262K natively, but your available RAM limits effective context. Reduce --max-model-len in vLLM or use num_ctx in Ollamaโs Modelfile. For most local use, 8K-32K context is practical.
Model not found in Ollama: Make sure youโre on the latest version: ollama update. The Qwen 3.6 tags were added recently.
Related Links
- Qwen 3.6-35B-A3B Complete Guide โ benchmarks, architecture, and comparisons
- Qwen 3.6 Complete Guide โ full model family overview
- Ollama Complete Guide 2026 โ everything about Ollama
- How Much VRAM Do You Need for AI? โ VRAM calculator and planning
- Best AI Models for Coding Locally 2026 โ model comparisons
- How to Run Qwen 3.5 Locally โ previous generation guide