Qwen 3.5 is fully open-source under Apache 2.0. Every model in the family — from the 0.8B edge model to the 397B flagship — can be downloaded and run on your own hardware for free. Here’s how.
## Pick your model size
| Model | VRAM/RAM needed | Best for |
|---|---|---|
| Qwen3.5-0.8B | 2GB | Edge, mobile, IoT |
| Qwen3.5-2B | 4GB | Any laptop |
| Qwen3.5-4B | 6GB | Light tasks on old hardware |
| Qwen3.5-9B | 8GB | Best quality-per-GB ratio |
| Qwen3.5-27B (Q4) | 20GB | Strong all-rounder |
| Qwen3.5-35B-A3B | 8GB | 35B knowledge at 3B speed |
| Qwen3.5-122B-A10B | ~40GB | Near-frontier |
| Qwen3.5-397B (Q4) | ~214GB | Full frontier (needs serious hardware) |
The 9B model is the sweet spot for most people. It matches GPT-OSS-120B on multiple benchmarks while running on a 16GB laptop.
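As a rough illustration, the table above can be collapsed into a small helper that maps available memory to a recommended Ollama tag. The thresholds come from the table; the function itself and the tag names beyond `qwen3.5:0.8b`, `qwen3.5:9b`, `qwen3.5:27b`, and `qwen3.5` are our own guesses at Ollama's naming, not official tooling:

```python
def recommend_model(mem_gb: float) -> str:
    """Pick a Qwen 3.5 Ollama tag from available VRAM/RAM in GB.

    Thresholds follow the table above (dense models only; the MoE
    variants are omitted for simplicity). Tag names for sizes not shown
    in this article are assumptions.
    """
    tiers = [
        (214, "qwen3.5"),        # 397B flagship (Q4)
        (40, "qwen3.5:122b"),    # near-frontier (assumed tag)
        (20, "qwen3.5:27b"),     # strong all-rounder (Q4)
        (8, "qwen3.5:9b"),       # best quality-per-GB
        (6, "qwen3.5:4b"),       # assumed tag
        (4, "qwen3.5:2b"),       # assumed tag
        (2, "qwen3.5:0.8b"),     # edge/mobile
    ]
    for threshold, tag in tiers:
        if mem_gb >= threshold:
            return tag
    raise ValueError("Need at least 2GB of memory for the 0.8B model")

print(recommend_model(16))  # a 16GB laptop lands on qwen3.5:9b
```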
## Method 1: Ollama (easiest)
```shell
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Run any size
ollama run qwen3.5:0.8b   # Tiny — runs on anything
ollama run qwen3.5:9b     # Sweet spot
ollama run qwen3.5:27b    # Needs 24GB GPU
ollama run qwen3.5        # Flagship — needs 214GB+
```
Ollama downloads the model automatically on first run. It exposes an OpenAI-compatible API at http://localhost:11434/v1.
Use it from Python:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="qwen3.5:9b",
    messages=[{"role": "user", "content": "Explain Docker networking in 3 sentences"}],
)
print(response.choices[0].message.content)
```
## Method 2: llama.cpp (more control)
For users who want to tune quantization, context size, and threading:
```shell
# Download a quantized model from HuggingFace
huggingface-cli download unsloth/Qwen3.5-9B-GGUF \
  --include "Qwen3.5-9B-Q4_K_M.gguf" \
  --local-dir ./models

# Start the server
llama-server \
  --model ./models/Qwen3.5-9B-Q4_K_M.gguf \
  --ctx-size 8192 \
  --threads 8 \
  --port 8080
```
llama.cpp gives you fine-grained control over context size, batch size, and quantization. It’s the foundation that Ollama is built on.
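llama-server speaks the same OpenAI-compatible protocol, so you can query it from Python with nothing but the standard library. A minimal sketch, assuming the server started above is running on port 8080 (the `model` field is illustrative, since llama-server answers with whichever GGUF it loaded):

```python
import json
import urllib.request

def build_payload(prompt: str) -> dict:
    # The model name here is illustrative: llama-server serves the one
    # GGUF it was started with, regardless of this field.
    return {
        "model": "Qwen3.5-9B-Q4_K_M",
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(prompt: str, base_url: str = "http://localhost:8080") -> str:
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Requires the llama-server above to be running:
# print(chat("Explain quantization in two sentences"))
```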
## Method 3: Hugging Face Transformers (for Python developers)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-9B",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-9B")

messages = [{"role": "user", "content": "Write a Python web scraper"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
This method gives you full access to the model weights for fine-tuning, custom pipelines, and research.
## Connect to your IDE
Once Ollama is running, connect it to VS Code:
- Install the Continue extension
- Open Continue settings
- Add Ollama as a provider with model `qwen3.5:9b`
- Start coding with free, private AI assistance
Works the same way with Cursor, Windsurf, and other AI-enabled editors.
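For reference, a Continue model entry for a local Ollama instance typically looks something like the following. This is a sketch of Continue's `config.json` `models` format; the field names may change between versions, so check the extension's documentation for the current schema:

```json
{
  "models": [
    {
      "title": "Qwen3.5 9B (local)",
      "provider": "ollama",
      "model": "qwen3.5:9b",
      "apiBase": "http://localhost:11434"
    }
  ]
}
```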
## Thinking modes
Qwen 3.5 supports thinking and non-thinking modes locally:
```text
# In Ollama chat, prefix with /think for deep reasoning
/think Prove that the square root of 2 is irrational

# Or /no_think for fast responses
/no_think What's the capital of France?
```
Use thinking mode for math, complex coding, and hard reasoning. Use fast mode for simple tasks.
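The same tags can be used over the local API. In Qwen 3 these worked as "soft switches" placed directly in the user message; assuming Qwen 3.5 keeps that behavior, a small helper (the function name is ours) can toggle the mode per request:

```python
def tagged_message(prompt: str, think: bool) -> dict:
    """Build a chat message with the mode tag prepended.

    Assumes the /think and /no_think soft-switch convention carries
    over from Qwen 3 to Qwen 3.5.
    """
    tag = "/think" if think else "/no_think"
    return {"role": "user", "content": f"{tag} {prompt}"}

# Drop the result into the messages list of any API call above, e.g.:
# messages=[tagged_message("Prove that sqrt(2) is irrational", think=True)]
```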
## Tips for best performance
- Quantization matters. Q4_K_M is the best balance of quality and speed for most models. Q8 is higher quality but needs more VRAM.
- Context size affects VRAM. A 9B model at 8K context uses less memory than the same model at 32K. Start small and increase if needed.
- Apple Silicon is great for this. Unified memory means your Mac can use all its RAM for the model. A 32GB M-series Mac runs 27B models smoothly.
- GPU > CPU. Always use GPU inference if available. CPU inference works but is 5-10x slower.
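The context-size point can be made concrete with a back-of-the-envelope KV-cache estimate. The formula (2 tensors for K and V × layers × KV heads × head dimension × context length × bytes per element) is standard for dense transformer inference; the architecture numbers below are purely illustrative placeholders, not Qwen 3.5's actual configuration:

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 ctx_len: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV-cache size in GiB for a dense transformer.

    The leading 2 accounts for storing both K and V;
    bytes_per_elem=2 assumes fp16/bf16 cache entries.
    """
    total = 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem
    return total / (1024 ** 3)

# Made-up 9B-class architecture: 40 layers, 8 KV heads (GQA), head dim 128.
# Quadrupling the context from 8K to 32K quadruples the cache:
print(round(kv_cache_gib(40, 8, 128, 8192), 2))   # 1.25 GiB at 8K context
print(round(kv_cache_gib(40, 8, 128, 32768), 2))  # 5.0 GiB at 32K context
```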