Qwen 3.5 is fully open-source under Apache 2.0. Every model in the family — from the 0.8B edge model to the 397B flagship — can be downloaded and run on your own hardware for free. Here’s how.
## Pick your model size
| Model | VRAM/RAM needed | Best for |
|---|---|---|
| Qwen3.5-0.8B | 2GB | Edge, mobile, IoT |
| Qwen3.5-2B | 4GB | Any laptop |
| Qwen3.5-4B | 6GB | Light tasks on old hardware |
| Qwen3.5-9B | 8GB | Best quality-per-GB ratio |
| Qwen3.5-27B (Q4) | 20GB | Strong all-rounder |
| Qwen3.5-35B-A3B | 8GB | 35B knowledge at 3B speed |
| Qwen3.5-122B-A10B | ~40GB | Near-frontier |
| Qwen3.5-397B (Q4) | ~214GB | Full frontier (needs serious hardware) |
The 9B model is the sweet spot for most people. It matches GPT-OSS-120B on multiple benchmarks while running on a 16GB laptop.
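As a rough illustration, the table above can be collapsed into a small helper that maps available memory to a recommended Ollama tag. The thresholds come from the table; the function itself and the tag names beyond `qwen3.5:0.8b`, `qwen3.5:9b`, `qwen3.5:27b`, and `qwen3.5` are our own guesses at Ollama's naming, not official tooling:

```python
def recommend_model(mem_gb: float) -> str:
    """Pick a Qwen 3.5 Ollama tag from available VRAM/RAM in GB.

    Thresholds follow the table above (dense models only; the MoE
    variants are omitted for simplicity). Tag names for sizes not shown
    in this article are assumptions.
    """
    tiers = [
        (214, "qwen3.5"),        # 397B flagship (Q4)
        (40, "qwen3.5:122b"),    # near-frontier (assumed tag)
        (20, "qwen3.5:27b"),     # strong all-rounder (Q4)
        (8, "qwen3.5:9b"),       # best quality-per-GB
        (6, "qwen3.5:4b"),       # assumed tag
        (4, "qwen3.5:2b"),       # assumed tag
        (2, "qwen3.5:0.8b"),     # edge/mobile
    ]
    for threshold, tag in tiers:
        if mem_gb >= threshold:
            return tag
    raise ValueError("Need at least 2GB of memory for the 0.8B model")

print(recommend_model(16))  # a 16GB laptop lands on qwen3.5:9b
```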
## Method 1: Ollama (easiest)
```shell
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Run any size
ollama run qwen3.5:0.8b   # Tiny — runs on anything
ollama run qwen3.5:9b     # Sweet spot
ollama run qwen3.5:27b    # Needs 24GB GPU
ollama run qwen3.5        # Flagship — needs 214GB+
```
Ollama downloads the model automatically on first run. It exposes an OpenAI-compatible API at http://localhost:11434/v1.
Use it from Python:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="qwen3.5:9b",
    messages=[{"role": "user", "content": "Explain Docker networking in 3 sentences"}],
)
print(response.choices[0].message.content)
```
## Method 2: llama.cpp (more control)
For users who want to tune quantization, context size, and threading:
```shell
# Download a quantized model from HuggingFace
huggingface-cli download unsloth/Qwen3.5-9B-GGUF \
  --include "Qwen3.5-9B-Q4_K_M.gguf" \
  --local-dir ./models

# Start the server
llama-server \
  --model ./models/Qwen3.5-9B-Q4_K_M.gguf \
  --ctx-size 8192 \
  --threads 8 \
  --port 8080
```
llama.cpp gives you fine-grained control over context size, batch size, and quantization. It’s the foundation that Ollama is built on.
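llama-server speaks the same OpenAI-compatible protocol, so you can query it from Python with nothing but the standard library. A minimal sketch, assuming the server started above is running on port 8080 (the `model` field is illustrative, since llama-server answers with whichever GGUF it loaded):

```python
import json
import urllib.request

def build_payload(prompt: str) -> dict:
    # The model name here is illustrative: llama-server serves the one
    # GGUF it was started with, regardless of this field.
    return {
        "model": "Qwen3.5-9B-Q4_K_M",
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(prompt: str, base_url: str = "http://localhost:8080") -> str:
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Requires the llama-server above to be running:
# print(chat("Explain quantization in two sentences"))
```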
## Method 3: Hugging Face Transformers (for Python developers)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-9B",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-9B")

messages = [{"role": "user", "content": "Write a Python web scraper"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
This method gives you full access to the model weights for fine-tuning, custom pipelines, and research.
## Connect to your IDE
Once Ollama is running, connect it to VS Code:
- Install the Continue extension
- Open Continue settings
- Add Ollama as a provider with model `qwen3.5:9b`
- Start coding with free, private AI assistance
Works the same way with Cursor, Windsurf, and other AI-enabled editors.
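For reference, a Continue model entry for a local Ollama instance typically looks something like the following. This is a sketch of Continue's `config.json` `models` format; the field names may change between versions, so check the extension's documentation for the current schema:

```json
{
  "models": [
    {
      "title": "Qwen3.5 9B (local)",
      "provider": "ollama",
      "model": "qwen3.5:9b",
      "apiBase": "http://localhost:11434"
    }
  ]
}
```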
## Thinking modes
Qwen 3.5 supports thinking and non-thinking modes locally:
```text
# In Ollama chat, prefix with /think for deep reasoning
/think Prove that the square root of 2 is irrational

# Or /no_think for fast responses
/no_think What's the capital of France?
```
Use thinking mode for math, complex coding, and hard reasoning. Use fast mode for simple tasks.
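The same tags can be used over the local API. In Qwen 3 these worked as "soft switches" placed directly in the user message; assuming Qwen 3.5 keeps that behavior, a small helper (the function name is ours) can toggle the mode per request:

```python
def tagged_message(prompt: str, think: bool) -> dict:
    """Build a chat message with the mode tag prepended.

    Assumes the /think and /no_think soft-switch convention carries
    over from Qwen 3 to Qwen 3.5.
    """
    tag = "/think" if think else "/no_think"
    return {"role": "user", "content": f"{tag} {prompt}"}

# Drop the result into the messages list of any API call above, e.g.:
# messages=[tagged_message("Prove that sqrt(2) is irrational", think=True)]
```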
## Tips for best performance
- Quantization matters. Q4_K_M is the best balance of quality and speed for most models. Q8 is higher quality but needs more VRAM.
- Context size affects VRAM. A 9B model at 8K context uses less memory than the same model at 32K. Start small and increase if needed.
- Apple Silicon is great for this. Unified memory means your Mac can use all its RAM for the model. A 32GB M-series Mac runs 27B models smoothly.
- GPU > CPU. Always use GPU inference if available. CPU inference works but is 5-10x slower.
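The context-size point can be made concrete with a back-of-the-envelope KV-cache estimate. The formula (2 tensors for K and V × layers × KV heads × head dimension × context length × bytes per element) is standard for dense transformer inference; the architecture numbers below are purely illustrative placeholders, not Qwen 3.5's actual configuration:

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 ctx_len: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV-cache size in GiB for a dense transformer.

    The leading 2 accounts for storing both K and V;
    bytes_per_elem=2 assumes fp16/bf16 cache entries.
    """
    total = 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem
    return total / (1024 ** 3)

# Made-up 9B-class architecture: 40 layers, 8 KV heads (GQA), head dim 128.
# Quadrupling the context from 8K to 32K quadruples the cache:
print(round(kv_cache_gib(40, 8, 128, 8192), 2))   # 1.25 GiB at 8K context
print(round(kv_cache_gib(40, 8, 128, 32768), 2))  # 5.0 GiB at 32K context
```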