
How to Run Qwen 3.5 Locally — Setup Guide for Any Hardware


Qwen 3.5 is fully open-source under Apache 2.0. Every model in the family — from the 0.8B edge model to the 397B flagship — can be downloaded and run on your own hardware for free. Here’s how.

Pick your model size

| Model | VRAM/RAM needed | Best for |
|---|---|---|
| Qwen3.5-0.8B | 2GB | Edge, mobile, IoT |
| Qwen3.5-2B | 4GB | Any laptop |
| Qwen3.5-4B | 6GB | Light tasks on old hardware |
| Qwen3.5-9B | 8GB | Best quality-per-GB ratio |
| Qwen3.5-27B (Q4) | 20GB | Strong all-rounder |
| Qwen3.5-35B-A3B | 8GB | 35B knowledge at 3B speed |
| Qwen3.5-122B-A10B | ~40GB | Near-frontier |
| Qwen3.5-397B (Q4) | ~214GB | Full frontier (needs serious hardware) |

The 9B model is the sweet spot for most people. It matches GPT-OSS-120B on multiple benchmarks while running on a 16GB laptop.
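A quick rule of thumb behind the table above: a quantized model needs roughly parameters × bits-per-weight ÷ 8 bytes on disk, plus headroom for the KV cache and runtime buffers. A minimal sketch (the 20% overhead factor is an assumption, not a measured figure):

```python
def approx_model_memory_gb(params_b, bits_per_weight=4, overhead=1.2):
    """Rough memory estimate: parameter count times bytes per weight,
    plus ~20% headroom for KV cache and runtime buffers (assumed)."""
    bytes_total = params_b * 1e9 * (bits_per_weight / 8) * overhead
    return bytes_total / 1e9  # decimal GB

# Q4 quantization of the 9B model
print(f"{approx_model_memory_gb(9):.1f} GB")                     # -> 5.4 GB
# Q8 roughly doubles the footprint
print(f"{approx_model_memory_gb(9, bits_per_weight=8):.1f} GB")  # -> 10.8 GB
```

These estimates line up with the table: a Q4 9B model fits comfortably in 8GB, while Q8 pushes you toward a 16GB machine.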

Method 1: Ollama (easiest)

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Run any size
ollama run qwen3.5:0.8b   # Tiny — runs on anything
ollama run qwen3.5:9b     # Sweet spot
ollama run qwen3.5:27b    # Needs 24GB GPU
ollama run qwen3.5         # Flagship — needs 214GB+

Ollama downloads the model automatically on first run. It exposes an OpenAI-compatible API at http://localhost:11434/v1.

Use it from Python:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="qwen3.5:9b",
    messages=[{"role": "user", "content": "Explain Docker networking in 3 sentences"}]
)
print(response.choices[0].message.content)

Method 2: llama.cpp (more control)

For users who want to tune quantization, context size, and threading:

# Download a quantized model from HuggingFace
huggingface-cli download unsloth/Qwen3.5-9B-GGUF \
  --include "Qwen3.5-9B-Q4_K_M.gguf" \
  --local-dir ./models

# Start the server
llama-server \
  --model ./models/Qwen3.5-9B-Q4_K_M.gguf \
  --ctx-size 8192 \
  --threads 8 \
  --port 8080

llama.cpp gives you fine-grained control over context size, batch size, and quantization. It’s the foundation that Ollama is built on.
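The VRAM cost of `--ctx-size` can be estimated directly: the KV cache stores two tensors (keys and values) per layer for every token in the context window. A sketch with illustrative architecture numbers (the layer and head counts below are placeholders, not Qwen3.5-9B's actual config):

```python
def kv_cache_gb(ctx_size, n_layers=36, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """KV cache = 2 tensors (K and V) * layers * kv heads * head dim
    * context length, at fp16 (2 bytes per value) by default."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_size * bytes_per_value / 1e9

print(f"{kv_cache_gb(8192):.2f} GB")   # -> 1.21 GB at 8K context
print(f"{kv_cache_gb(32768):.2f} GB")  # -> 4.83 GB at 32K context
```

The cache grows linearly with context, which is why starting at 8K and increasing only when needed saves real memory.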

Method 3: Hugging Face Transformers (for Python developers)

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-9B",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-9B")

messages = [{"role": "user", "content": "Write a Python web scraper"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0], skip_special_tokens=True))

This method gives you full access to the model weights for fine-tuning, custom pipelines, and research.

Connect to your IDE

Once Ollama is running, connect it to VS Code:

  1. Install the Continue extension
  2. Open Continue settings
  3. Add Ollama as a provider with model qwen3.5:9b
  4. Start coding with free, private AI assistance

Works the same way with Cursor, Windsurf, and other AI-enabled editors.
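For reference, a Continue `config.json` entry for step 3 might look like this. The exact schema varies between Continue versions (newer releases use `config.yaml`), so treat the field names as an approximation and check the extension's docs:

```json
{
  "models": [
    {
      "title": "Qwen3.5 9B (local)",
      "provider": "ollama",
      "model": "qwen3.5:9b",
      "apiBase": "http://localhost:11434"
    }
  ]
}
```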

Thinking modes

Qwen 3.5 supports thinking and non-thinking modes locally:

# In Ollama chat, prefix with /think for deep reasoning
/think Prove that the square root of 2 is irrational

# Or /no_think for fast responses
/no_think What's the capital of France?

Use thinking mode for math, complex coding, and hard reasoning. Use fast mode for simple tasks.

Tips for best performance

  • Quantization matters. Q4_K_M is the best balance of quality and speed for most models. Q8 is higher quality but needs more VRAM.
  • Context size affects VRAM. A 9B model at 8K context uses less memory than the same model at 32K. Start small and increase if needed.
  • Apple Silicon is great for this. Unified memory means your Mac can use all its RAM for the model. A 32GB M-series Mac runs 27B models smoothly.
  • GPU > CPU. Always use GPU inference if available. CPU inference works but is 5-10x slower.
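The first two tips can be combined into a rough quantization picker: parameter count × bits-per-weight ÷ 8, plus headroom. The bits-per-weight values and the 20% headroom factor below are rules of thumb, not exact GGUF file sizes:

```python
def pick_quant(params_b, vram_gb):
    """Return the highest-quality quantization that fits in vram_gb,
    assuming ~20% headroom for KV cache and buffers (rule of thumb)."""
    for name, bits_per_weight in [("Q8_0", 8.5), ("Q5_K_M", 5.5), ("Q4_K_M", 4.5)]:
        needed_gb = params_b * bits_per_weight / 8 * 1.2
        if needed_gb <= vram_gb:
            return name
    return None  # nothing fits; pick a smaller model

print(pick_quant(9, 16))   # Q8_0 fits a 9B model in 16GB
print(pick_quant(27, 20))  # only Q4_K_M fits 27B in 20GB
```

The second result matches the table above, where the 27B entry is listed at Q4 for 20GB.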