πŸ“ Tutorials
Β· 8 min read

How to Run openPangu 2.0 Locally: Ascend and GPU Setup Guide (2026)


Running openPangu 2.0 locally means no API costs, no data leaving your network, and full control over the model. The question is whether your hardware can handle it. With two versions available β€” Pro at 505B total parameters and Flash at 92B total β€” the answer depends heavily on which version you target and what hardware you have.

Flash with its 6B active parameters is the realistic local deployment target for most developers. Pro requires serious infrastructure. This guide covers both paths: the native Ascend NPU setup and the NVIDIA GPU path that most Western developers will follow.

For a full overview of what openPangu 2.0 is and why it matters, see our complete guide. For help choosing between Pro and Flash, check the Pro vs Flash comparison.

Hardware requirements summary

Before you start downloading weights, make sure your hardware qualifies.

openPangu 2.0 Flash (92B total, 6B active):

PrecisionMemory RequiredMinimum Setup
FP16~184GB3x 80GB GPUs or 3x Ascend 910B
INT8~92GB2x 80GB GPUs or 2x Ascend 910B
INT4 (GPTQ/AWQ)~46GB1x 80GB GPU or 2x 24GB GPUs
GGUF Q4_K_M~50-55GB2x 24GB GPUs + CPU offload

openPangu 2.0 Pro (505B total, 18B active):

PrecisionMemory RequiredMinimum Setup
FP16~1TB16x 80GB GPUs or 8x Ascend 910B (128GB)
INT8~505GB8x 80GB GPUs
INT4~253GB4x 80GB GPUs

Pro is a datacenter model. Flash is the one you can realistically run at home or on a single workstation. The rest of this guide focuses primarily on Flash, with Pro notes where relevant.

Path 1: Ascend NPU setup (native)

If you have access to Huawei Ascend hardware, this is the optimal path. The model was trained on Ascend and the software stack is optimized for it.

Prerequisites:

  • Ascend 910B NPU(s) with sufficient memory
  • CANN (Compute Architecture for Neural Networks) 8.0+
  • MindSpore 2.4+ or PyTorch with Ascend backend
  • Ubuntu 20.04/22.04 or EulerOS

Step 1: Install CANN toolkit

# Download CANN from Huawei Ascend community
# https://www.hiascend.com/software/cann
chmod +x Ascend-cann-toolkit_8.0.RC1_linux-x86_64.run
./Ascend-cann-toolkit_8.0.RC1_linux-x86_64.run --install

# Set environment variables
source /usr/local/Ascend/ascend-toolkit/set_env.sh

Step 2: Install MindSpore or PyTorch-Ascend

# Option A: MindSpore (native Huawei framework)
pip install mindspore-ascend==2.4.0

# Option B: PyTorch with Ascend backend
pip install torch==2.3.0
pip install torch_npu==2.3.0

Step 3: Download model weights

# From Huawei's model repository
# Flash model
huggingface-cli download huawei/openPangu-2.0-Flash --local-dir ./openpangu-flash

# Or from Huawei's own ModelScope-equivalent
# modelarts-cli download openpangu-2.0-flash

Step 4: Run inference

from mindspore import context
from openpangu import PanguModel, PanguTokenizer

context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")

tokenizer = PanguTokenizer.from_pretrained("./openpangu-flash")
model = PanguModel.from_pretrained("./openpangu-flash", device_map="auto")

inputs = tokenizer("Explain the MoE architecture in openPangu 2.0:", return_tensors="ms")
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0]))

The native Ascend path gives you the best performance per watt and is the intended deployment target. If you are building sovereign infrastructure, this is the path you want.

Path 2: NVIDIA GPU setup

Most developers have NVIDIA GPUs. While openPangu 2.0 was trained on Ascend, the model weights can be converted to standard formats for NVIDIA inference. Community conversions to safetensors and GGUF formats are expected within days of release.

Prerequisites:

  • NVIDIA GPU(s) with sufficient VRAM (see table above)
  • CUDA 12.x + cuDNN
  • Python 3.10+
  • PyTorch 2.3+ or vLLM/llama.cpp

Step 1: Environment setup

# Create conda environment
conda create -n openpangu python=3.11
conda activate openpangu

# Install PyTorch with CUDA
pip install torch==2.3.0 --index-url https://download.pytorch.org/whl/cu121

# Install inference framework (pick one)
pip install vllm>=0.5.0          # For production serving
pip install transformers>=4.44    # For experimentation

Step 2: Download weights (safetensors format)

# Community conversion (expected within days of launch)
huggingface-cli download huawei-community/openPangu-2.0-Flash \
  --local-dir ./openpangu-flash-safetensors

Step 3: Run with vLLM (recommended for production)

# Serve with vLLM
vllm serve ./openpangu-flash-safetensors \
  --tensor-parallel-size 2 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.95 \
  --dtype auto

vLLM handles the MoE routing efficiently and gives you an OpenAI-compatible API out of the box. For a comparison of serving frameworks, see our vLLM vs Ollama vs llama.cpp vs TGI guide.

Step 4: Run with transformers (for experimentation)

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_path = "./openpangu-flash-safetensors"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True  # MoE routing may need custom code
)

prompt = "Write a Python function that implements binary search with error handling:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=1024, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Path 3: llama.cpp with GGUF (consumer hardware)

For running Flash on consumer GPUs (2x RTX 4090, or even partial CPU offload), GGUF quantization through llama.cpp is the most accessible path.

Prerequisites:

  • llama.cpp compiled with CUDA support
  • 48GB+ combined GPU VRAM (for Q4 quantization)
  • 64GB+ system RAM (for partial offload)
# Build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Download GGUF quantized model (community upload expected)
huggingface-cli download huawei-community/openPangu-2.0-Flash-GGUF \
  openPangu-Flash-Q4_K_M.gguf --local-dir ./models

# Run with GPU offload
./build/bin/llama-server \
  -m ./models/openPangu-Flash-Q4_K_M.gguf \
  -ngl 99 \
  -c 16384 \
  --host 0.0.0.0 --port 8080

With Q4_K_M quantization, Flash should fit in approximately 50-55GB. Two RTX 4090s (48GB combined VRAM) plus some CPU offload should handle it. Quality degradation from INT4 quantization is typically 1-3% on benchmarks β€” acceptable for most applications.

For context on the VRAM math, check our how much VRAM for AI models guide.

Optimizing inference performance

Once you have the model running, here are key optimizations:

Batch size tuning: MoE models benefit from batching because different tokens in a batch may route to different experts, distributing compute more evenly across the GPU. Start with batch size 8 and increase until memory is saturated.

Context length management: 512K is the maximum, but you do not need to allocate for it on every request. Set --max-model-len in vLLM to the maximum you actually need. Lower context allocation means more concurrent requests.

Expert offloading (for memory-constrained setups): Some inference frameworks support keeping inactive experts in CPU RAM and loading them on-demand. This trades latency for memory efficiency β€” useful if you are just barely fitting the model.

KV cache quantization: For long-context requests, the KV cache can consume significant memory. Enable KV cache quantization (INT8) to reduce memory pressure without meaningful quality loss.

# vLLM with KV cache quantization
vllm serve ./openpangu-flash \
  --kv-cache-dtype fp8 \
  --max-model-len 262144

Pro deployment (datacenter scale)

If you are deploying Pro, you are likely working with a cluster. Key considerations:

  • Tensor parallelism: Split across 4-8 GPUs within a node
  • Pipeline parallelism: Split across nodes for very large batch inference
  • Expert parallelism: MoE-aware sharding that distributes experts across devices
# vLLM Pro with 8-GPU tensor parallelism
vllm serve ./openpangu-pro \
  --tensor-parallel-size 8 \
  --max-model-len 65536 \
  --dtype bfloat16

For enterprise deployment patterns, see our self-hosted AI enterprise guide. For cloud GPU options if you do not have on-premise hardware, check best cloud GPU providers 2026.

Troubleshooting common issues

Out of memory errors:

  • Reduce --max-model-len
  • Enable quantization (INT8 or INT4)
  • Add --gpu-memory-utilization 0.9 (leave headroom)
  • Enable CPU offload for some layers

Slow generation:

  • Check GPU utilization β€” if below 80%, increase batch size
  • Ensure tensor parallelism matches your GPU count
  • Verify you are using the correct dtype (bfloat16, not float32)

MoE routing errors:

  • Ensure trust_remote_code=True is set
  • Check that all expert weights loaded correctly
  • Verify framework version supports the specific MoE implementation

CANN/Ascend driver issues:

  • Match CANN version to your driver version exactly
  • Run npu-smi info to verify hardware is detected
  • Check Ascend community forums for known compatibility issues

Performance benchmarks (self-hosted)

Expected throughput on common hardware configurations (Flash, Q4 quantization):

HardwareTokens/sec (generation)Concurrent users
2x RTX 4090 (24GB each)~30-401-2
1x A100 80GB~60-804-6
2x Ascend 910B~80-1006-8
4x H100 80GB~150-20012-16

These are estimates based on the 6B active parameter count and typical MoE overhead. Actual performance will vary based on request length, batch size, and optimization level.

When to self-host vs use the API

Not every situation calls for local deployment. Consider the API route via ModelArts if:

  • You do not have suitable hardware
  • Your volume is low (under 1M tokens/day)
  • You want zero maintenance overhead
  • You need instant scaling

Self-host when:

  • Data sovereignty is non-negotiable
  • Volume justifies hardware investment
  • Latency requirements demand local inference
  • You need fine-tuned versions

For a framework to make this decision, see when to switch from API to self-hosted.

FAQ

Can I run openPangu 2.0 Flash on a single RTX 4090?

Not at full precision (needs ~184GB), but with INT4 quantization (~46GB), you would need approximately two RTX 4090s (48GB combined). A single 4090 with aggressive quantization and significant CPU offload might work but with very slow generation speeds. Two 4090s is the realistic minimum for usable performance.

Does openPangu 2.0 work with Ollama?

Ollama support depends on GGUF format availability. Once community members convert the weights to GGUF (expected within days of launch), Ollama should support Flash through its standard model import mechanism. Pro is too large for typical Ollama use cases.

Is the Ascend path faster than NVIDIA for inference?

For openPangu specifically, yes β€” the model architecture is optimized for Ascend’s compute patterns. The CANN stack handles the MoE routing natively. On NVIDIA, there may be slight overhead from framework translation. In practice, the difference is probably 10-20% in favor of Ascend for this specific model.

How much disk space do I need for the model files?

Flash in FP16: approximately 184GB. Flash in Q4 GGUF: approximately 50-55GB. Pro in FP16: approximately 1TB. Pro in Q4: approximately 250GB. Budget additional space for tokenizer files, configuration, and temporary files during conversion.

Can I fine-tune openPangu 2.0 Flash locally?

LoRA fine-tuning of Flash is feasible on 2-4 GPUs with 80GB each. Full fine-tuning requires significantly more resources. QLoRA (quantized LoRA) could work on more modest setups. The open-source license explicitly permits fine-tuning and derivative works.

What about Apple Silicon (M-series Macs)?

MLX or llama.cpp with Metal backend should support Flash once GGUF conversions are available. An M4 Ultra with 192GB unified memory could theoretically run Flash at FP16. For most M-series Macs, quantized versions via llama.cpp with Metal acceleration would be the path. Expect 15-25 tokens/second on high-end Apple Silicon.