GLM-5.1 is the best open-source coding model available, but at 754 billion parameters, running it locally is not trivial. Here’s what you actually need and how to set it up.
Hardware reality check
Let’s be honest upfront: GLM-5.1 at full precision requires ~1.5TB of memory. That’s not a typo. Even quantized, you need serious hardware.
| Precision | Memory needed | Hardware example |
|---|---|---|
| FP16 (full) | ~1.5TB | 20x A100 80GB or 16x H100 NVL (multi-node) |
| INT8 | ~750GB | 10x A100 80GB (multi-node) |
| 4-bit (GPTQ/AWQ) | ~380GB | 6x A100 80GB (8x for headroom) |
| 2-bit (aggressive) | ~190GB | 3x A100 80GB, or 2x + system RAM |
For context, a single NVIDIA H100 has 80GB of VRAM. You need multiple GPUs no matter what.
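These figures follow from simple arithmetic: weight memory ≈ parameter count × bits per weight ÷ 8, before KV cache and runtime overhead. A quick sketch:

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes): params * bits / 8."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

PARAMS_B = 754  # GLM-5.1 parameter count

for label, bits in [("FP16", 16), ("INT8", 8), ("4-bit", 4), ("2-bit", 2)]:
    print(f"{label:6s} ~{weight_memory_gb(PARAMS_B, bits):.0f} GB")
```

Real quantized checkpoints mix precisions (embeddings and some layers often stay higher-bit), so actual file sizes run somewhat larger than the pure per-bit math.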
Can I run it on consumer hardware?
Not the full model. But there are options:
- Apple Silicon Mac with 192GB unified memory (Mac Studio Ultra): Can run 4-bit quantized with offloading, but expect slow inference (~2-5 tokens/sec)
- Multi-GPU desktop (4x RTX 4090 = 96GB VRAM): Not enough for even 4-bit. You’d need CPU offloading which kills performance
- Cloud GPU rental: Most practical option for most people
If you have consumer hardware, consider GLM-5-Turbo or the smaller GLM-4.7 instead. Or use the GLM Coding Plan at $3/month.
Option 1: vLLM (recommended for GPU servers)
vLLM is the fastest way to serve GLM-5.1 if you have the hardware.
Install vLLM
```bash
pip install vllm
```
Download the model
```bash
huggingface-cli download zai-org/GLM-5.1 --local-dir ./glm-5.1
```
Warning: this is a large download (hundreds of GB). Make sure you have enough disk space.
Start the server
```bash
python -m vllm.entrypoints.openai.api_server \
  --model ./glm-5.1 \
  --served-model-name glm-5.1 \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --port 8000
```

Adjust `--tensor-parallel-size` to match your GPU count. `--served-model-name` sets the model name clients pass in requests (otherwise vLLM uses the local path).
Test it
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-5.1",
    "messages": [{"role": "user", "content": "Write a Python function to merge two sorted arrays"}]
  }'
```
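The same endpoint can be scripted. A minimal stdlib-only Python client (a sketch, assuming the vLLM server above is running on port 8000 and serving the model under the name `glm-5.1`):

```python
import json
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:8000/v1"  # the vLLM server started above

def build_request(prompt: str, model: str = "glm-5.1") -> dict:
    """Assemble an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,  # low temperature suits code generation
    }

def ask(prompt: str) -> str:
    """POST the request and return the assistant's reply text."""
    req = Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req, timeout=120) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Example (requires the server to be running):
# print(ask("Write a Python function to merge two sorted arrays"))
```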
Option 2: Quantized with llama.cpp
For lower memory requirements, use a quantized GGUF version:
```bash
# Install llama.cpp (the old Makefile build is deprecated; use CMake)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release -j

# Download quantized model (check HuggingFace for available quants)
huggingface-cli download zai-org/GLM-5.1-GGUF --include "*.gguf" --local-dir ./glm-5.1-gguf

# Run server
./build/bin/llama-server -m ./glm-5.1-gguf/glm-5.1-Q4_K_M.gguf \
  -c 8192 \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 99
```
The -ngl 99 flag offloads as many layers as possible to GPU. Adjust based on your VRAM.
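If the full model doesn't fit, you can estimate a starting `-ngl` value from your VRAM. A rough sketch; the layer count (92) and file size here are placeholder assumptions, so check your actual GGUF metadata:

```python
def layers_that_fit(vram_gb: float, model_file_gb: float, n_layers: int,
                    reserve_gb: float = 2.0) -> int:
    """Rough -ngl value: how many equally-sized layers fit in VRAM,
    keeping some headroom for the KV cache and scratch buffers."""
    per_layer_gb = model_file_gb / n_layers
    usable = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable / per_layer_gb))

# e.g. one 80GB A100 with a hypothetical 380GB Q4 file split over 92 layers
print(layers_that_fit(80, 380, 92))
```

Layers are not actually equal-sized, so treat the result as a starting point and tune down if you hit out-of-memory errors.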
Option 3: Cloud GPU rental
The most practical option for most developers:
| Provider | GPU | Cost/hour | Can run GLM-5.1? |
|---|---|---|---|
| Lambda Labs | 8x A100 (640GB) | ~$10/hr | Yes (4-bit) |
| RunPod | 4x A100 (320GB) | ~$6/hr | Yes (2-bit) |
| Vast.ai | 2x A100 (160GB) | ~$3/hr | Only with heavy offloading |
| AWS p5 | 8x H100 (640GB) | ~$30/hr | Yes (4-bit) |

Full precision (~1.5TB) and INT8 (~750GB) need multi-node clusters on any of these providers.
For occasional use, renting a cloud GPU for a few hours is much cheaper than buying hardware. For regular use, the GLM Coding Plan at $3-10/month is more economical.
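The breakeven is easy to sanity-check. Using the cheapest rate in the table above (~$3/hr) against the $10/month plan tier (both figures are from this article, not quotes from the providers):

```python
def breakeven_hours(plan_per_month: float, rate_per_hour: float) -> float:
    """Hours of GPU rental per month that cost the same as the subscription."""
    return plan_per_month / rate_per_hour

# Rental only wins if you need fewer than ~3.3 hours of GPU time per month
print(f"~{breakeven_hours(10, 3):.1f} hours/month")
```

In other words, even light regular use tips the economics toward the subscription; rental makes sense for one-off experiments or benchmarking runs.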
Option 4: Use a smaller GLM model locally
If you want to stay in the GLM ecosystem but have limited hardware:
| Model | Size | Min VRAM | Runs on |
|---|---|---|---|
| GLM-4.5-Air | Small | 8GB | RTX 3060+ |
| GLM-4.7 | Medium | 16GB | RTX 4070+ |
| GLM-5-Turbo | Large | 24GB+ | RTX 4090 |
These won’t match GLM-5.1’s performance, but they’re practical for local development.
Connecting to Claude Code
Once your local server is running, point Claude Code at it:

```bash
export ANTHROPIC_BASE_URL="http://localhost:8000/v1"
export ANTHROPIC_API_KEY="dummy"
claude
```

Note that Claude Code speaks the Anthropic Messages API, while vLLM and llama.cpp expose OpenAI-compatible endpoints, so you may need a translation proxy (e.g. LiteLLM) in between.
See our full Claude Code setup guide for details.
Performance expectations
Running GLM-5.1 locally, expect:
- 4-bit on 8x A100: ~30-50 tokens/sec (good for interactive use)
- 2-bit on 3-4x A100: ~15-25 tokens/sec (usable)
- Heavily quantized on Mac Studio Ultra 192GB (with offloading): ~2-5 tokens/sec (slow but functional)
- CPU offloading: <1 token/sec (not practical)
For comparison, Z.ai’s API typically delivers 20-40 tokens/sec with no hardware investment.
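To translate those rates into wall-clock feel, divide response length by decode speed. The rates below are rough mid-points of the ranges above:

```python
def response_seconds(n_tokens: int, tokens_per_sec: float) -> float:
    """Wall-clock time to generate n_tokens at a given decode rate."""
    return n_tokens / tokens_per_sec

# A typical ~500-token coding answer at each setup's approximate rate
for setup, tps in [("multi-GPU server", 40.0), ("Mac Studio", 3.0), ("CPU offload", 0.8)]:
    print(f"{setup}: ~{response_seconds(500, tps):.0f}s")
```

At ~40 tokens/sec a full answer lands in seconds; at Mac Studio rates you wait minutes, and CPU offloading stretches a single response past ten minutes.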
Bottom line
Running GLM-5.1 locally is possible but requires enterprise-grade hardware. For most developers, the practical options are:
- GLM Coding Plan ($3/month) — Best value
- Cloud GPU rental (~$3-10/hr) — For occasional heavy use
- Smaller GLM models locally — For daily development on consumer hardware
- Self-hosted on owned hardware — For teams with existing GPU infrastructure
The MIT license means you have full freedom to deploy however you want. The question is whether the hardware cost makes sense vs the API pricing.
Related: GLM-5.1 Complete Guide · Best GPU for AI Locally · Best AI Models for Mac