
How to Run Qwen 3.7 Locally: What's Available and What's Coming


You cannot run Qwen 3.7 locally right now. Both Qwen3.7-Max and Qwen3.7-Plus are closed-weight, API-only models. There are no GGUF files, no Ollama models, and no Hugging Face weights. Not yet.

But based on Alibaba’s release pattern, open-weight variants are coming. Here’s what you can do today, what to expect, and how to prepare.

Why Qwen 3.7 isn’t available locally

Alibaba follows a consistent pattern: ship the API first, release open weights later.

  • Qwen 3.6: API launched late March 2026. Open-weight 35B-A3B released April 17 (about 3 weeks later). 27B dense released April 23.
  • Qwen 3.7: API launched May 20-21, 2026. Open weights: TBD.

This is a monetization strategy. The API generates revenue while open weights drive community adoption and ecosystem growth. Both serve Alibaba’s interests, just on different timelines.

What you CAN run locally right now

If you need a Qwen model running on your own hardware today, use the Qwen 3.6 open-weight models:

Qwen 3.6 35B-A3B (MoE)

The Qwen 3.6 35B-A3B is a 35B-parameter MoE model with only 3B active parameters. It’s fast, efficient, and Apache 2.0 licensed.

  • Parameters: 35B total, 3B active
  • License: Apache 2.0
  • SWE-bench Verified: 73.4%
  • VRAM needed: ~21 GB (Q4 quantized)
  • Runs on: Mac M-series with 32GB+ RAM, or any GPU with 24GB+ VRAM

Qwen 3.6 27B (dense)

The Qwen 3.6 27B is a dense model that scores 77.2% on SWE-bench Verified, actually beating the larger flagship.

  • Parameters: 27B dense
  • License: Apache 2.0
  • SWE-bench Verified: 77.2%
  • VRAM needed: ~18 GB (Q4 quantized)
  • Runs on: Mac M-series with 32GB+ RAM, or any GPU with 24GB+ VRAM

Setting up Ollama with Qwen 3.6

Ollama is the easiest way to run Qwen models locally:

Install Ollama

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

Pull and run Qwen 3.6 35B-A3B

# Pull the model
ollama pull qwen3.6:35b-a3b

# Run interactively
ollama run qwen3.6:35b-a3b

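# Or start the server to expose the API on port 11434
# (skip this if Ollama is already running as a background service)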
ollama serve
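Before pointing a client at the server, it can help to confirm that Ollama is reachable and that the model was actually pulled. The following is a minimal sketch, assuming the default Ollama address (port 11434) and its `/api/tags` endpoint, which lists locally available models. The helper names (`model_names`, `has_model`, `fetch_tags`) are illustrative and not part of any library, and the sample payload is hand-written.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default address


def model_names(tags_payload: dict) -> list:
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in tags_payload.get("models", [])]


def has_model(tags_payload: dict, wanted: str) -> bool:
    """True if `wanted` (e.g. 'qwen3.6:35b-a3b') is among the pulled models."""
    return wanted in model_names(tags_payload)


def fetch_tags(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> dict:
    """Query a running Ollama server; raises URLError if it is not reachable."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


# Offline illustration with a hand-written payload in the same shape
sample = {"models": [{"name": "qwen3.6:35b-a3b"}, {"name": "qwen3.6:27b"}]}
print(has_model(sample, "qwen3.6:35b-a3b"))
```

Against a live server, `has_model(fetch_tags(), "qwen3.6:35b-a3b")` performs the same check on real data.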

Use as an OpenAI-compatible API

from openai import OpenAI

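# Ollama ignores the API key, but the client requires a non-empty value
client = OpenAI(
    api_key="ollama",
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
)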

response = client.chat.completions.create(
    model="qwen3.6:35b-a3b",
    messages=[
        {"role": "user", "content": "Refactor this function to use async/await."}
    ]
)

print(response.choices[0].message.content)

Pull and run Qwen 3.6 27B

ollama pull qwen3.6:27b
ollama run qwen3.6:27b

Setting up vLLM with Qwen 3.6

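For production-grade local inference with higher throughput, use vLLM. The command below loads the unquantized weights from Hugging Face, so budget roughly 54 GB of VRAM for the 27B model's weights alone: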

# Install vLLM
pip install vllm

# Serve Qwen 3.6 27B
vllm serve Qwen/Qwen3.6-27B \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --port 8000

Then query it like any OpenAI-compatible API:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.6-27B",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
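The same request can be built from Python without extra dependencies. This is a rough sketch, assuming the two default base URLs used in this guide (Ollama on port 11434, vLLM on port 8000); `BACKENDS`, `build_chat_request`, and `send` are hypothetical names chosen for illustration.

```python
import json
import urllib.request

# Base URLs from the setup steps in this guide (defaults for each server)
BACKENDS = {
    "ollama": "http://localhost:11434/v1",
    "vllm": "http://localhost:8000/v1",
}


def build_chat_request(backend: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the backend's /chat/completions route."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{BACKENDS[backend]}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]


# Building the request needs no running server; sending it does
req = build_chat_request("vllm", "Qwen/Qwen3.6-27B", "Hello")
print(req.full_url)
```

With the vLLM server running, `print(send(req))` returns the model's reply. Switching to Ollama only changes the backend key and the model name, for example `build_chat_request("ollama", "qwen3.6:27b", "Hello")`.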

What to expect from Qwen 3.7 open weights

Based on the 3.6 pattern, here’s what’s likely coming:

Expected timeline

  • API launch: May 20-21, 2026 (done)
  • First open-weight model: Likely June 2026 (3-4 weeks after API)
  • Dense variant: Likely June-July 2026
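To turn "3-4 weeks after API" into calendar dates, the arithmetic can be sketched as follows. It assumes May 21, 2026 (the later of the two cited launch days) as the starting point, and the result is a projection from the 3.6 pattern, not an announced date.

```python
from datetime import date, timedelta

api_launch = date(2026, 5, 21)  # later of the two launch days cited above

# "3-4 weeks after API" expressed as a date window
earliest = api_launch + timedelta(weeks=3)
latest = api_launch + timedelta(weeks=4)

print(f"Projected window: {earliest:%B %d} to {latest:%B %d}, {latest.year}")
```

Under these assumptions the window runs from June 11 to June 18, 2026.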

Expected model sizes

Qwen 3.6 released these open-weight variants:

  • 35B-A3B (MoE, 3B active)
  • 27B (dense)

Qwen 3.7 will likely follow a similar pattern. Expect:

  • A MoE variant (possibly larger than 35B given the capability jump)
  • A dense variant in the 27B-70B range
  • Apache 2.0 licensing (Alibaba’s standard for open weights)

Hardware requirements (estimated)

When Qwen 3.7 open weights drop, you’ll likely need:

| Model (estimated) | Q4 VRAM | Q8 VRAM | FP16 VRAM |
|---|---|---|---|
| ~35B MoE (3B active) | ~21 GB | ~35 GB | ~70 GB |
| ~27B dense | ~18 GB | ~27 GB | ~54 GB |
| ~70B dense (if released) | ~45 GB | ~70 GB | ~140 GB |

For the MoE variant, a Mac with 32GB unified memory or an RTX 4090 (24GB) should work with Q4 quantization. The dense 27B variant has similar requirements.
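The Q8 and FP16 figures follow from simple arithmetic: parameter count times bytes per weight. Below is a back-of-the-envelope sketch of that calculation. It covers weights only, and `weight_memory_gb` is an illustrative helper, not a library function. Note that a MoE model needs all of its parameters in memory even though only a fraction is active per token.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes).

    Excludes KV cache, activations, and runtime overhead.
    """
    return params_billion * bits_per_weight / 8


for name, params in [("~35B MoE", 35), ("~27B dense", 27), ("~70B dense", 70)]:
    q4_floor = weight_memory_gb(params, 4)  # pure 4-bit lower bound
    q8 = weight_memory_gb(params, 8)
    fp16 = weight_memory_gb(params, 16)
    print(f"{name}: Q4 >= {q4_floor:.1f} GB, Q8 ~{q8:.0f} GB, FP16 ~{fp16:.0f} GB")
```

The Q4 estimates quoted above sit higher than the pure 4-bit floor, presumably because practical Q4 formats store more than 4 bits per weight on average and the estimates leave some headroom. Treat all of these as rough planning numbers.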

How to get notified when 3.7 open weights release

  1. Watch the Hugging Face org: all open-weight releases are published at huggingface.co/Qwen
  2. Follow @Alibaba_Qwen on X/Twitter: They announce releases there first
  3. Check Ollama library: New models appear at ollama.com/library within days of release
  4. Monitor this blog: We’ll publish a setup guide as soon as weights drop
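The first check in the list can be automated. Here is a minimal sketch that queries the public Hugging Face Hub model listing for the Qwen org. The `Qwen/Qwen3.7` repository prefix is an assumption based on how earlier releases were named, the second id in the sample data is made up for illustration, and the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

HF_MODELS_API = "https://huggingface.co/api/models"


def search_url(author: str = "Qwen", search: str = "Qwen3.7") -> str:
    """URL that lists the org's models whose name matches `search`."""
    query = urllib.parse.urlencode({"author": author, "search": search})
    return f"{HF_MODELS_API}?{query}"


def matching_repos(models: list, prefix: str = "Qwen/Qwen3.7") -> list:
    """Keep repo ids that start with the (assumed) 3.7 naming prefix."""
    return sorted(m["id"] for m in models if m.get("id", "").startswith(prefix))


def fetch_models(url: str, timeout: float = 10.0) -> list:
    """Download the model listing as a list of dicts (requires network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


# Offline illustration; the second id is hypothetical
sample = [{"id": "Qwen/Qwen3.6-27B"}, {"id": "Qwen/Qwen3.7-35B-A3B"}]
print(matching_repos(sample))
```

Running `matching_repos(fetch_models(search_url()))` from a daily scheduled job and alerting on a non-empty result would cover the first item in the list.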

Qwen 3.7 API as a bridge

While waiting for open weights, you can use the Qwen 3.7 API at $2.50/1M input tokens. This gives you access to the full 3.7 capabilities without local hardware requirements.
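To get a feel for what that price means in practice, here is a small cost sketch. It models input tokens only, since output-token pricing is not covered here, and the workload numbers are hypothetical.

```python
INPUT_PRICE_PER_MILLION = 2.50  # USD per 1M input tokens, as quoted above


def input_cost_usd(input_tokens: int) -> float:
    """Input-side cost only; output-token pricing is not modeled here."""
    return input_tokens / 1_000_000 * INPUT_PRICE_PER_MILLION


# Hypothetical workload: 200 requests a day, 4,000 input tokens each
daily_tokens = 200 * 4_000
print(f"{daily_tokens:,} input tokens/day -> ${input_cost_usd(daily_tokens):.2f}/day")
```

That hypothetical workload comes to 800,000 input tokens, or about $2.00 per day on the input side.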

For a complete overview of what Qwen 3.7 offers, see our complete guide.

Performance Tips

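Getting the best performance from local Qwen models requires tuning a few key parameters. The flags below are llama.cpp options (llama-cli and llama-server); Ollama and vLLM expose similar settings under different names.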

  1. Use quantization wisely. Q4_K_M offers the best balance of quality and speed for most hardware. Only use Q8 or FP16 if you have VRAM to spare and need maximum accuracy.

  2. Reduce context length for faster inference. If you don’t need the full 32K context, set --ctx-size 8192 or --ctx-size 4096. Shorter context means less memory usage and faster token generation.

  3. Tune batch size to your GPU. Larger batch sizes (--batch-size 512 or 1024) improve throughput on high-VRAM GPUs. On constrained hardware, reduce to 256 or 128 to avoid OOM errors.

  4. Monitor GPU memory and offload layers. Use --n-gpu-layers to control how many layers run on GPU vs CPU. Start high and reduce if you hit memory limits. Partial offloading is better than swapping.

  5. Disable KV cache quantization for quality-sensitive tasks. KV cache quantization (Q8 or Q4) saves memory but can degrade output quality on long generations. Keep it off for coding tasks where precision matters.
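When running through Ollama instead of llama.cpp directly, the closest equivalents to the context, batch, and GPU-layer settings are the `num_ctx`, `num_batch`, and `num_gpu` request options on the native `/api/chat` endpoint. The sketch below only builds the request body. `ollama_chat_payload` is an illustrative helper, and the specific values (4096, 256, 30) are placeholders to tune for your hardware, not recommendations.

```python
import json
from typing import Optional


def ollama_chat_payload(model: str, prompt: str, *, num_ctx: int = 8192,
                        num_batch: int = 256,
                        num_gpu: Optional[int] = None) -> dict:
    """Request body for Ollama's native /api/chat with per-request tuning options."""
    options = {
        "num_ctx": num_ctx,      # context length (tip 2)
        "num_batch": num_batch,  # prompt batch size (tip 3)
    }
    if num_gpu is not None:
        options["num_gpu"] = num_gpu  # layers offloaded to the GPU (tip 4)
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # return one JSON object instead of a stream
        "options": options,
    }


# Placeholder values; adjust to your hardware
payload = ollama_chat_payload("qwen3.6:35b-a3b", "Hello", num_ctx=4096, num_gpu=30)
print(json.dumps(payload, indent=2))
```

POSTing this body to `http://localhost:11434/api/chat` applies the options to that request only. Leaving `num_gpu` unset lets Ollama decide how many layers to offload.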

FAQ

When will Qwen 3.7 open weights be released?

No official date. Based on the 3.6 pattern (API first, open weights 3-4 weeks later), expect sometime in June 2026. Alibaba hasn’t confirmed this.

Can I run Qwen 3.7 on my GPU?

Not yet. When open weights release, a Q4-quantized MoE variant should fit on 24GB VRAM (RTX 4090, RTX 3090). A dense variant around 27B would need similar VRAM.

Can I run Qwen 3.7 on a Mac?

Not yet. When open weights release, a Mac with 32GB+ unified memory should handle the quantized MoE variant. M4 Pro/Max with 48GB+ would be comfortable.

What’s the best Qwen model I can run locally today?

Qwen 3.6 27B for raw capability (77.2% SWE-bench). Qwen 3.6 35B-A3B for efficiency (3B active parameters, faster inference).

Will Qwen 3.7 open weights be Apache 2.0?

Likely yes. Alibaba has consistently used Apache 2.0 for their open-weight releases (Qwen 3.5, 3.6 35B-A3B, 3.6 27B). There’s no reason to expect a change.

Should I wait for 3.7 or use 3.6 now?

Use 3.6 now. It’s available, it works, and it’s good. When 3.7 open weights drop, you can switch. The Ollama/vLLM setup will be nearly identical, just a different model name.

How does local Qwen 3.6 compare to Qwen 3.7 API?

The API version (3.7 Max) is significantly stronger: 50.8% vs ~44% on Terminal-Bench Hard, a 1M-token context window vs roughly 32K of usable context on local hardware, and better tool calling. But running locally gives you privacy, no per-token costs, and offline access. Which matters more depends on your priorities.