Apr 30, 2026 · 5 min read

How to Run Qwen 3.6 Locally — Ollama, LM Studio & vLLM (2026)

Qwen 3.6-35B-A3B is one of the most capable models you can run on consumer hardware right now. It scores 73.4% on SWE-bench Verified — on par with frontier API models — yet it runs on a MacBook Pro or a single GPU thanks to its Mixture-of-Experts architecture: 35 billion total parameters, but only 3 billion active at inference time.

Update (April 23, 2026): Alibaba released Qwen 3.6-27B, a 27B dense model that runs on a Mac with 22GB VRAM. See our 27B local setup guide for hardware requirements and Ollama/vLLM setup.

It’s Apache 2.0 licensed, supports 262K context (extensible to 1M via YaRN), handles vision (image + video), and has thinking mode enabled by default. Simon Willison has been running it on his M5 MacBook via LM Studio. There’s no reason you can’t do the same.

This guide covers three ways to get it running: Ollama (fastest setup), LM Studio (GUI), and vLLM (production serving). For a deeper look at the model itself, see our Qwen 3.6-35B-A3B complete guide.

Hardware Requirements

You don’t need a data center. The MoE architecture keeps memory usage surprisingly low.

Tier	Hardware	Quantization	Experience
Minimum	16GB RAM (M-series Mac) or 16GB VRAM GPU	Q3_K_S (~14GB)	Usable, some quality loss. Short context only.
Recommended	24GB+ RAM (Mac) or 24GB VRAM GPU (RTX 3090/4090)	Q4_K_S (~21GB)	Good quality, solid speed. Best bang for buck.
Ideal	32GB+ RAM (Mac) or 48GB+ VRAM	Q4_K_M (~24GB) or higher	Near-full quality, long context, fast inference.

CPU-only inference works but is slow — expect 1-3 tokens/second. If your hardware doesn’t meet the recommended tier, cloud GPU providers let you rent RTX 4090s or A100s by the hour for a few dollars. For more on VRAM planning, see How Much VRAM Do You Need for AI?

Which Quantization to Pick

All GGUF quantizations below are available from Unsloth on Hugging Face.

Quantization	File Size	Quality	Best For
Q3_K_S	~14GB	Lower — noticeable degradation	16GB machines, testing only
Q4_K_S	~21GB	Good — minimal quality loss	24GB VRAM / 24GB+ Mac RAM (recommended)
Q4_K_M	~24GB	Better — slightly sharper reasoning	32GB+ Mac RAM / 24GB VRAM with tight fit
Q5_K_M	~28GB	Near-original	48GB+ VRAM or 36GB+ Mac RAM

For most people: Q4_K_S is the sweet spot. It fits comfortably in 24GB and retains strong coding and reasoning performance.

Method 1: Ollama (Easiest)

Ollama handles downloading, quantization selection, and serving in one step. Three commands and you’re running.

1. Install Ollama (if you haven’t already):

curl -fsSL https://ollama.com/install.sh | sh

2. Pull and run the model:

ollama run qwen3.6:35b-a3b

That’s it. Ollama auto-selects a quantization that fits your hardware. To force a specific quant:

ollama run qwen3.6:35b-a3b-q4_K_S

3. Verify it’s working:

The model will drop you into an interactive chat. Thinking mode is on by default — you’ll see the model’s reasoning before its answer. To disable thinking for faster responses, add /no_think to your prompt.

Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1 automatically, so you can point any tool at it immediately.

If you hit memory issues, check our Ollama out of memory fix guide.

Method 2: LM Studio (GUI)

LM Studio gives you a visual interface for downloading, configuring, and chatting with local models. Great if you prefer not to use the terminal.

Download and install LM Studio for your platform.
Open the app and click the Search bar.
Search for “Qwen3.6-35B-A3B”.
Look for the Unsloth GGUF uploads. Select Q4_K_S (or Q3_K_S if you’re on 16GB).
Click Download and wait for it to finish (~21GB for Q4_K_S).
Go to the Chat tab, select the downloaded model, and start chatting.

To use it as a local API server: go to the Server tab, load the model, and start the server. It exposes an OpenAI-compatible endpoint at http://localhost:1234/v1.

Simon Willison confirmed this works well on Apple Silicon — he ran it on his M5 MacBook with solid performance.

Method 3: vLLM (Production Serving)

For serving Qwen 3.6 to multiple users or integrating into a pipeline, vLLM gives you high-throughput inference with continuous batching.

1. Install vLLM:

pip install vllm

2. Start the server:

vllm serve Qwen/Qwen3.6-35B-A3B \
  --tensor-parallel-size 1 \
  --max-model-len 262144

This loads the full-precision model from Hugging Face. For multi-GPU setups, increase --tensor-parallel-size. The server exposes an OpenAI-compatible API at http://localhost:8000/v1.

Note: vLLM requires a CUDA GPU. For Mac users, stick with Ollama or LM Studio. SGLang is also supported as an alternative serving backend.

Recommended Sampling Settings

These settings work well for coding and reasoning tasks with Qwen 3.6:

Parameter	Coding / Reasoning	Creative / Chat
Temperature	0.6	0.8
Top-P	0.95	0.95
Top-K	20	40
Min-P	0.0	0.0
Thinking	On (default)	Off for speed

Keep thinking mode on for complex tasks — it significantly improves accuracy. Disable it with /no_think in Ollama or via the enable_thinking=False parameter in the API for simple Q&A where speed matters more.

Using With Coding Tools

Qwen 3.6 works as a drop-in backend for popular local coding tools. Point them at your Ollama or LM Studio API:

Aider: aider --model ollama/qwen3.6:35b-a3b — works out of the box. The 73.4% SWE-bench score means it handles real-world code edits well.
Continue.dev: Add an Ollama provider in your VS Code settings, select qwen3.6:35b-a3b as the model.
Open WebUI: Connect to http://localhost:11434 and the model appears automatically.

For more local coding model options, see Best AI Models for Coding Locally in 2026.

Troubleshooting

Out of memory: Switch to a smaller quantization (Q3_K_S at ~14GB). Close other apps. On Mac, check Activity Monitor for memory pressure. See our Ollama out of memory fix for detailed solutions.

Slow inference (< 2 tok/s): You’re likely running on CPU. Ensure Ollama detects your GPU with ollama ps. On Mac, make sure you’re on Apple Silicon — Intel Macs will be painfully slow. Disable thinking mode for 2-3x speed improvement on simple tasks.

Context too long errors: The model supports 262K natively, but your available RAM limits effective context. Reduce --max-model-len in vLLM or use num_ctx in Ollama’s Modelfile. For most local use, 8K-32K context is practical.

Model not found in Ollama: Make sure you’re on the latest version: ollama update. The Qwen 3.6 tags were added recently.

Kimi K3 Complete Guide — 2.8T open-weight frontier model
Llama 4 Complete Guide — Meta’s open-weight model family
How to Run Kimi K2.7 Locally — self-hosting Kimi models
Claude Sonnet 5 Complete Guide — Anthropic’s latest model
Qwen 3.6-35B-A3B Complete Guide — benchmarks, architecture, and comparisons
Qwen 3.6 Complete Guide — full model family overview
Ollama Complete Guide 2026 — everything about Ollama
How Much VRAM Do You Need for AI? — VRAM calculator and planning

How to Run Qwen 3.6 Locally — Ollama, LM Studio & vLLM (2026)

Hardware Requirements

Which Quantization to Pick

Method 1: Ollama (Easiest)

Method 2: LM Studio (GUI)

Method 3: vLLM (Production Serving)

Recommended Sampling Settings

Using With Coding Tools

Troubleshooting

Related Links

📬 AI Dev Weekly

You might also like

Build an AI-Powered Cron Job Monitor That Explains Failures

Build a Local AI Image Describer — Vision Models + Ollama

Build an AI Expense Tracker That Reads Your Bank CSV Files

Build a CLI That Generates README Files From Your Code