
How to Run GLM-5.1 Locally — Hardware, Setup, and Quantization Guide


GLM-5.1 is one of the strongest open-source coding models available, but at 754 billion parameters, running it locally is not trivial. Here’s what you actually need and how to set it up.

Hardware reality check

Let’s be honest upfront: GLM-5.1 at full precision requires ~1.5TB of memory. That’s not a typo. Even quantized, you need serious hardware.

| Precision | Memory needed | Hardware example |
| --- | --- | --- |
| FP16 (full) | ~1.5TB | 2-3 nodes of 8x A100/H100 80GB |
| INT8 | ~750GB | 8x H200 141GB, or two 8x A100 80GB nodes |
| 4-bit (GPTQ/AWQ) | ~380GB | 6x A100 80GB |
| 2-bit (aggressive) | ~190GB | 3x A100 80GB, or 1x A100 + heavy system-RAM offload |

For context, a single NVIDIA H100 has 80GB of VRAM. You need multiple GPUs no matter what.
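The table above is straightforward arithmetic: parameter count times bits per weight. A quick sketch (weights only — KV cache, activations, and quantization scale factors add more on top):

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Weight footprint in GB: params x bits, converted to bytes.

    Ignores KV cache, activations, and quantization overhead
    (scales/zero-points), so real footprints run somewhat higher.
    """
    return n_params * bits_per_param / 8 / 1e9

N = 754e9  # GLM-5.1 parameter count
for name, bits in [("FP16", 16), ("INT8", 8), ("4-bit", 4), ("2-bit", 2)]:
    print(f"{name}: ~{weight_memory_gb(N, bits):,.0f} GB")
```

This is why even aggressive 2-bit quantization still lands near 190GB — beyond any single consumer GPU.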

Can I run it on consumer hardware?

Not the full model. But there are options:

  • Apple Silicon Mac with 192GB unified memory (Mac Studio Ultra): can run an aggressively quantized (roughly 2-bit) build with some offloading to disk, but expect slow inference (~2-5 tokens/sec)
  • Multi-GPU desktop (4x RTX 4090 = 96GB VRAM): not enough for even 4-bit; you’d need CPU offloading, which kills performance
  • Cloud GPU rental: the most practical option for most people

If you have consumer hardware, consider GLM-5-Turbo or the smaller GLM-4.7 instead. Or use the GLM Coding Plan at $3/month.

Option 1: Full precision with vLLM

vLLM is the fastest way to serve GLM-5.1 if you have the hardware.

Install vLLM

pip install vllm

Download the model

huggingface-cli download zai-org/GLM-5.1 --local-dir ./glm-5.1

Warning: this is a large download (hundreds of GB). Make sure you have enough disk space.

Start the server

python -m vllm.entrypoints.openai.api_server \
  --model ./glm-5.1 \
  --served-model-name glm-5.1 \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --port 8000

Adjust --tensor-parallel-size to match your GPU count.

Test it

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-5.1",
    "messages": [{"role": "user", "content": "Write a Python function to merge two sorted arrays"}]
  }'
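If you prefer Python over curl, the same endpoint can be reached with nothing but the standard library. The `chat` helper below is a hypothetical convenience wrapper, not part of vLLM; it assumes the serve command above is running on port 8000:

```python
import json
from urllib import request

def build_payload(prompt: str, model: str = "glm-5.1") -> dict:
    """Shape of an OpenAI-style chat completion request body."""
    return {"model": model,
            "messages": [{"role": "user", "content": prompt}]}

def chat(prompt: str, base_url: str = "http://localhost:8000/v1") -> str:
    """POST to the local vLLM server and return the assistant's reply."""
    req = request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Usage: `print(chat("Write a Python function to merge two sorted arrays"))` — the response format is the standard OpenAI chat-completions schema.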

Option 2: Quantized with llama.cpp

For lower memory requirements, use a quantized GGUF version:

# Install llama.cpp (the old Makefile build is deprecated; use cmake)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release -j

# Download quantized model (check HuggingFace for available quants)
huggingface-cli download zai-org/GLM-5.1-GGUF --include "*.gguf" --local-dir ./glm-5.1-gguf

# Run server
./build/bin/llama-server -m ./glm-5.1-gguf/glm-5.1-Q4_K_M.gguf \
  -c 8192 \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 99

The -ngl 99 flag offloads as many layers as possible to GPU. Adjust based on your VRAM.
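Picking a lower -ngl value when the model doesn’t fully fit is guesswork unless you do the division. A rough helper — it assumes weights are spread evenly across layers, and the layer count and file size in the example are placeholders; read the real values from the GGUF metadata or the llama-server startup log:

```python
def layers_that_fit(model_size_gb: float, n_layers: int, vram_gb: float) -> int:
    """Estimate how many transformer layers fit in VRAM for -ngl.

    Assumes an even per-layer split of the weights; real layers vary
    slightly in size, and the KV cache needs VRAM too, so round down.
    """
    per_layer_gb = model_size_gb / n_layers
    return min(n_layers, int(vram_gb / per_layer_gb))
```

For example, a 200GB GGUF with a (hypothetical) 92 layers and 80GB of VRAM gives `layers_that_fit(200, 92, 80)` → 36, so `-ngl 36` is a sane starting point — then back off a few layers to leave room for the KV cache.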

Option 3: Cloud GPU rental

The most practical option for most developers:

| Provider | GPU | Cost/hour | Can run GLM-5.1? |
| --- | --- | --- | --- |
| Lambda Labs | 8x A100 80GB (640GB) | ~$10/hr | Yes (4-bit, with headroom) |
| RunPod | 4x A100 80GB (320GB) | ~$6/hr | Marginal (2-bit, or 4-bit with offload) |
| Vast.ai | 2x A100 80GB (160GB) | ~$3/hr | No — even 2-bit weights don’t fit |
| AWS p5 | 8x H100 80GB (640GB) | ~$30/hr | Yes (4-bit; INT8 or above needs two nodes) |

For occasional use, renting a cloud GPU for a few hours is much cheaper than buying hardware. For regular use, the GLM Coding Plan at $3-10/month is more economical.
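To sanity-check the economics, convert an hourly rental price into a single-stream cost per million generated tokens. The throughput figure is an assumption for illustration, and batching many concurrent requests improves the number dramatically — this models one user hammering the box continuously:

```python
def usd_per_million_tokens(usd_per_hour: float, tokens_per_sec: float) -> float:
    """Single-stream generation cost: dollars per 1M output tokens."""
    tokens_per_hour = tokens_per_sec * 3600
    return usd_per_hour / tokens_per_hour * 1e6

# Assumed: 4x A100 rental at $6/hr sustaining ~20 tok/s single-stream
print(f"${usd_per_million_tokens(6, 20):,.0f} per 1M tokens")  # ~$83
```

Single-stream self-hosting is far pricier per token than a batched API, which is why rental only makes sense for occasional heavy use, bulk batch jobs, or privacy requirements.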

Option 4: Use a smaller GLM model locally

If you want to stay in the GLM ecosystem but have limited hardware:

| Model | Size | Min VRAM | Runs on |
| --- | --- | --- | --- |
| GLM-4.5-Air | Small | 8GB | RTX 3060+ |
| GLM-4.7 | Medium | 16GB | RTX 4070+ |
| GLM-5-Turbo | Large | 24GB+ | RTX 4090 |

These won’t match GLM-5.1’s performance, but they’re practical for local development.

Connecting to Claude Code

Once your local server is running, point Claude Code at it:

export ANTHROPIC_BASE_URL="http://localhost:8000/v1"
export ANTHROPIC_API_KEY="dummy"
claude

See our full Claude Code setup guide for details.

Performance expectations

Running GLM-5.1 locally, expect:

  • Full precision on a multi-node cluster (16+ A100/H100): ~30-50 tokens/sec (good for interactive use)
  • 4-bit across 6x A100: ~15-25 tokens/sec (usable)
  • ~2-bit on Mac Studio Ultra 192GB: ~2-5 tokens/sec (slow but functional)
  • CPU offloading: <1 token/sec (not practical)

For comparison, Z.ai’s API typically delivers 20-40 tokens/sec with no hardware investment.

Bottom line

Running GLM-5.1 locally is possible but requires enterprise-grade hardware. For most developers, the practical options are:

  1. GLM Coding Plan ($3/month) — Best value
  2. Cloud GPU rental (~$3-10/hr) — For occasional heavy use
  3. Smaller GLM models locally — For daily development on consumer hardware
  4. Self-hosted on owned hardware — For teams with existing GPU infrastructure

The MIT license means you have full freedom to deploy however you want. The question is whether the hardware cost makes sense vs the API pricing.

Related: GLM-5.1 Complete Guide · Best GPU for AI Locally · Best AI Models for Mac