๐Ÿ“ Tutorials
ยท 5 min read

How to Run Qwen 3.6 Locally โ€” Ollama, LM Studio & vLLM (2026)


Qwen 3.6-35B-A3B is one of the most capable models you can run on consumer hardware right now. It scores 73.4% on SWE-bench Verified โ€” on par with frontier API models โ€” yet it runs on a MacBook Pro or a single GPU thanks to its Mixture-of-Experts architecture: 35 billion total parameters, but only 3 billion active at inference time.

Update (April 23, 2026): Alibaba released Qwen 3.6-27B, a 27B dense model that runs on a Mac with 22GB VRAM. See our 27B local setup guide for hardware requirements and Ollama/vLLM setup.

Itโ€™s Apache 2.0 licensed, supports 262K context (extensible to 1M via YaRN), handles vision (image + video), and has thinking mode enabled by default. Simon Willison has been running it on his M5 MacBook via LM Studio. Thereโ€™s no reason you canโ€™t do the same.

This guide covers three ways to get it running: Ollama (fastest setup), LM Studio (GUI), and vLLM (production serving). For a deeper look at the model itself, see our Qwen 3.6-35B-A3B complete guide.

Hardware Requirements

You donโ€™t need a data center. The MoE architecture keeps memory usage surprisingly low.

Tier Hardware Quantization Experience
Minimum 16GB RAM (M-series Mac) or 16GB VRAM GPU Q3_K_S (~14GB) Usable, some quality loss. Short context only.
Recommended 24GB+ RAM (Mac) or 24GB VRAM GPU (RTX 3090/4090) Q4_K_S (~21GB) Good quality, solid speed. Best bang for buck.
Ideal 32GB+ RAM (Mac) or 48GB+ VRAM Q4_K_M (~24GB) or higher Near-full quality, long context, fast inference.

CPU-only inference works but is slow โ€” expect 1-3 tokens/second. If your hardware doesnโ€™t meet the recommended tier, cloud GPU providers let you rent RTX 4090s or A100s by the hour for a few dollars. For more on VRAM planning, see How Much VRAM Do You Need for AI?

Which Quantization to Pick

All GGUF quantizations below are available from Unsloth on Hugging Face.

Quantization File Size Quality Best For
Q3_K_S ~14GB Lower โ€” noticeable degradation 16GB machines, testing only
Q4_K_S ~21GB Good โ€” minimal quality loss 24GB VRAM / 24GB+ Mac RAM (recommended)
Q4_K_M ~24GB Better โ€” slightly sharper reasoning 32GB+ Mac RAM / 24GB VRAM with tight fit
Q5_K_M ~28GB Near-original 48GB+ VRAM or 36GB+ Mac RAM

For most people: Q4_K_S is the sweet spot. It fits comfortably in 24GB and retains strong coding and reasoning performance.

Method 1: Ollama (Easiest)

Ollama handles downloading, quantization selection, and serving in one step. Three commands and youโ€™re running.

1. Install Ollama (if you havenโ€™t already):

curl -fsSL https://ollama.com/install.sh | sh

2. Pull and run the model:

ollama run qwen3.6:35b-a3b

Thatโ€™s it. Ollama auto-selects a quantization that fits your hardware. To force a specific quant:

ollama run qwen3.6:35b-a3b-q4_K_S

3. Verify itโ€™s working:

The model will drop you into an interactive chat. Thinking mode is on by default โ€” youโ€™ll see the modelโ€™s reasoning before its answer. To disable thinking for faster responses, add /no_think to your prompt.

Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1 automatically, so you can point any tool at it immediately.

If you hit memory issues, check our Ollama out of memory fix guide.

Method 2: LM Studio (GUI)

LM Studio gives you a visual interface for downloading, configuring, and chatting with local models. Great if you prefer not to use the terminal.

  1. Download and install LM Studio for your platform.
  2. Open the app and click the Search bar.
  3. Search for โ€œQwen3.6-35B-A3Bโ€.
  4. Look for the Unsloth GGUF uploads. Select Q4_K_S (or Q3_K_S if youโ€™re on 16GB).
  5. Click Download and wait for it to finish (~21GB for Q4_K_S).
  6. Go to the Chat tab, select the downloaded model, and start chatting.

To use it as a local API server: go to the Server tab, load the model, and start the server. It exposes an OpenAI-compatible endpoint at http://localhost:1234/v1.

Simon Willison confirmed this works well on Apple Silicon โ€” he ran it on his M5 MacBook with solid performance.

Method 3: vLLM (Production Serving)

For serving Qwen 3.6 to multiple users or integrating into a pipeline, vLLM gives you high-throughput inference with continuous batching.

1. Install vLLM:

pip install vllm

2. Start the server:

vllm serve Qwen/Qwen3.6-35B-A3B \
  --tensor-parallel-size 1 \
  --max-model-len 262144

This loads the full-precision model from Hugging Face. For multi-GPU setups, increase --tensor-parallel-size. The server exposes an OpenAI-compatible API at http://localhost:8000/v1.

Note: vLLM requires a CUDA GPU. For Mac users, stick with Ollama or LM Studio. SGLang is also supported as an alternative serving backend.

These settings work well for coding and reasoning tasks with Qwen 3.6:

Parameter Coding / Reasoning Creative / Chat
Temperature 0.6 0.8
Top-P 0.95 0.95
Top-K 20 40
Min-P 0.0 0.0
Thinking On (default) Off for speed

Keep thinking mode on for complex tasks โ€” it significantly improves accuracy. Disable it with /no_think in Ollama or via the enable_thinking=False parameter in the API for simple Q&A where speed matters more.

Using With Coding Tools

Qwen 3.6 works as a drop-in backend for popular local coding tools. Point them at your Ollama or LM Studio API:

  • Aider: aider --model ollama/qwen3.6:35b-a3b โ€” works out of the box. The 73.4% SWE-bench score means it handles real-world code edits well.
  • Continue.dev: Add an Ollama provider in your VS Code settings, select qwen3.6:35b-a3b as the model.
  • Open WebUI: Connect to http://localhost:11434 and the model appears automatically.

For more local coding model options, see Best AI Models for Coding Locally in 2026.

Troubleshooting

Out of memory: Switch to a smaller quantization (Q3_K_S at ~14GB). Close other apps. On Mac, check Activity Monitor for memory pressure. See our Ollama out of memory fix for detailed solutions.

Slow inference (< 2 tok/s): Youโ€™re likely running on CPU. Ensure Ollama detects your GPU with ollama ps. On Mac, make sure youโ€™re on Apple Silicon โ€” Intel Macs will be painfully slow. Disable thinking mode for 2-3x speed improvement on simple tasks.

Context too long errors: The model supports 262K natively, but your available RAM limits effective context. Reduce --max-model-len in vLLM or use num_ctx in Ollamaโ€™s Modelfile. For most local use, 8K-32K context is practical.

Model not found in Ollama: Make sure youโ€™re on the latest version: ollama update. The Qwen 3.6 tags were added recently.