
openPangu 2.0 API Guide: Access Huawei's Model via ModelArts (2026)


The fastest way to use openPangu 2.0 is through Huawei Cloud’s ModelArts platform. No hardware to provision, no weights to download, no inference server to manage. You get an API endpoint, send prompts, and receive completions. Both the Pro (505B/18B active) and Flash (92B/6B active) variants are accessible this way.

This guide walks through the complete setup: creating your Huawei Cloud account, getting API credentials, making your first request, and integrating openPangu 2.0 into production applications.

For background on the model itself (architecture, context window, and what makes it unique), start with our openPangu 2.0 complete guide. For help choosing between Pro and Flash, see the version comparison.

Prerequisites

Before you start:

  • Valid email address for Huawei Cloud account registration
  • Payment method (credit card or enterprise billing; some free-tier access may be available)
  • Basic familiarity with REST APIs
  • Python 3.8+ (for code examples) or any HTTP client

Step 1: Create a Huawei Cloud account

  1. Navigate to huaweicloud.com and click “Register”
  2. Choose individual or enterprise account (enterprise gets higher rate limits)
  3. Complete identity verification (email + phone for individuals, business documentation for enterprise)
  4. Enable ModelArts service in your console

Regional availability note: Huawei Cloud has regions in China, Asia-Pacific, Europe, Latin America, and Africa. Choose a region close to your users for lowest latency. ModelArts AI Gallery availability may vary by region at launch.

Step 2: Get API credentials

Huawei Cloud uses Access Key (AK) and Secret Key (SK) pairs for authentication, similar to AWS IAM credentials.

  1. Go to My Credentials → Access Keys in the Huawei Cloud console
  2. Click Create Access Key
  3. Download and securely store the AK/SK pair
  4. Note your project ID and region (needed for API endpoints)
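
One way to keep the AK/SK pair out of source code is to read it from environment variables. The sketch below assumes the variable names HUAWEI_AK, HUAWEI_SK, HUAWEI_PROJECT_ID, and HUAWEI_REGION (the same names the cURL example in Step 4 exports). The names HuaweiCredentials and load_credentials are illustrative and not part of any Huawei SDK.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class HuaweiCredentials:
    ak: str
    sk: str
    project_id: str
    region: str


def load_credentials(env=None):
    """Read AK/SK, project ID, and region from environment variables."""
    env = os.environ if env is None else env
    required = ["HUAWEI_AK", "HUAWEI_SK", "HUAWEI_PROJECT_ID"]
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return HuaweiCredentials(
        ak=env["HUAWEI_AK"],
        sk=env["HUAWEI_SK"],
        project_id=env["HUAWEI_PROJECT_ID"],
        # Default region mirrors the examples in this guide (assumption).
        region=env.get("HUAWEI_REGION") or "cn-north-4",
    )
```

Failing early with the list of missing variables tends to be easier to debug than a 401 from the IAM endpoint later on.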

Alternatively, for temporary credentials:

  1. Create an IAM user with ModelArts permissions
  2. Use token-based authentication via the IAM API
  3. Tokens expire after 24 hours, which limits the damage if one leaks (better for production security)

Step 3: Subscribe to openPangu 2.0 in AI Gallery

  1. Navigate to ModelArts → AI Gallery in the console
  2. Search for “openPangu 2.0”
  3. Select either openPangu 2.0 Pro or openPangu 2.0 Flash
  4. Subscribe to the model (accept terms of service)
  5. Note the endpoint URL provided after subscription
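
If you work with several regions or both variants, a small helper can compose the request URL. This is a minimal sketch that assumes the URL pattern used in the examples in this guide. The endpoint shown in your console after subscription is authoritative and may differ. The function name build_chat_endpoint is illustrative.

```python
def build_chat_endpoint(region, variant):
    """Compose the chat completions URL for a region and model variant.

    Assumes the URL pattern used in this guide's examples.
    """
    if variant not in ("pro", "flash"):
        raise ValueError("variant must be 'pro' or 'flash'")
    return (
        f"https://modelarts.{region}.myhuaweicloud.com"
        f"/v1/infers/openpangu-2-{variant}/chat/completions"
    )
```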

Step 4: Make your first API request

Using cURL:

# Set your credentials
export HUAWEI_AK="your-access-key"
export HUAWEI_SK="your-secret-key"
export HUAWEI_PROJECT_ID="your-project-id"
export HUAWEI_REGION="cn-north-4"  # or your selected region

# Get IAM token
TOKEN=$(curl -s -X POST "https://iam.${HUAWEI_REGION}.myhuaweicloud.com/v3/auth/tokens" \
  -H "Content-Type: application/json" \
  -d '{
    "auth": {
      "identity": {
        "methods": ["hw_ak_sk"],
        "hw_ak_sk": {
          "access": {"key": "'${HUAWEI_AK}'"},
          "secret": {"key": "'${HUAWEI_SK}'"}
        }
      },
      "scope": {
        "project": {"id": "'${HUAWEI_PROJECT_ID}'"}
      }
    }
  }' -i | grep -i "X-Subject-Token" | awk '{print $2}' | tr -d '\r')  # -i: header case can vary (e.g. lowercase over HTTP/2)

# Call openPangu 2.0 Flash
curl -X POST "https://modelarts.${HUAWEI_REGION}.myhuaweicloud.com/v1/infers/openpangu-2-flash/chat/completions" \
  -H "Content-Type: application/json" \
  -H "X-Auth-Token: ${TOKEN}" \
  -d '{
    "model": "openpangu-2.0-flash",
    "messages": [
      {"role": "system", "content": "You are a helpful coding assistant."},
      {"role": "user", "content": "Write a Python function to calculate the nth Fibonacci number using memoization."}
    ],
    "max_tokens": 1024,
    "temperature": 0.7
  }'

Using Python:

import requests
import json

class OpenPanguClient:
    def __init__(self, ak, sk, project_id, region="cn-north-4"):
        self.ak = ak
        self.sk = sk
        self.project_id = project_id
        self.region = region
        self.base_url = f"https://modelarts.{region}.myhuaweicloud.com/v1/infers"
        self.token = self._get_token()
    
    def _get_token(self):
        """Get IAM authentication token."""
        url = f"https://iam.{self.region}.myhuaweicloud.com/v3/auth/tokens"
        payload = {
            "auth": {
                "identity": {
                    "methods": ["hw_ak_sk"],
                    "hw_ak_sk": {
                        "access": {"key": self.ak},
                        "secret": {"key": self.sk}
                    }
                },
                "scope": {
                    "project": {"id": self.project_id}
                }
            }
        }
        response = requests.post(url, json=payload, timeout=30)
        response.raise_for_status()  # fail loudly on bad credentials instead of returning None
        return response.headers.get("X-Subject-Token")
    
    def chat(self, messages, model="openpangu-2.0-flash", max_tokens=1024, 
             temperature=0.7, stream=False):
        """Send a chat completion request."""
        url = f"{self.base_url}/openpangu-2-{'flash' if 'flash' in model else 'pro'}/chat/completions"
        headers = {
            "Content-Type": "application/json",
            "X-Auth-Token": self.token
        }
        payload = {
            "model": model,
            "messages": messages,
            "max_tokens": max_tokens,
            "temperature": temperature,
            "stream": stream
        }
        response = requests.post(url, headers=headers, json=payload)
        response.raise_for_status()
        return response.json()


# Usage
client = OpenPanguClient(
    ak="your-access-key",
    sk="your-secret-key",
    project_id="your-project-id"
)

response = client.chat([
    {"role": "system", "content": "You are a senior software engineer."},
    {"role": "user", "content": "Explain the MoE architecture in openPangu 2.0 and why it matters for inference cost."}
])

print(response["choices"][0]["message"]["content"])

Step 5: Streaming responses

For real-time applications (chatbots, interactive tools), use streaming:

import requests
import json

def stream_pangu(messages, token, model="openpangu-2.0-flash", region="cn-north-4"):
    """Stream responses token by token. `token` is the IAM token from Step 4."""
    # Pick the endpoint that matches the requested model variant
    variant = "flash" if "flash" in model else "pro"
    url = f"https://modelarts.{region}.myhuaweicloud.com/v1/infers/openpangu-2-{variant}/chat/completions"
    headers = {
        "Content-Type": "application/json",
        "X-Auth-Token": token,
        "Accept": "text/event-stream"
    }
    payload = {
        "model": model,
        "messages": messages,
        "max_tokens": 2048,
        "temperature": 0.7,
        "stream": True
    }

    with requests.post(url, headers=headers, json=payload, stream=True, timeout=60) as response:
        response.raise_for_status()  # surface 401/429 before reading the stream
        for line in response.iter_lines():
            if line:
                line = line.decode("utf-8")
                if line.startswith("data: "):
                    data = line[6:]
                    if data == "[DONE]":
                        break
                    chunk = json.loads(data)
                    content = chunk["choices"][0]["delta"].get("content", "")
                    if content:
                        print(content, end="", flush=True)


# Usage: pass the IAM token, e.g. client.token from the OpenPanguClient in Step 4
stream_pangu(
    [{"role": "user", "content": "Write a comprehensive README for a FastAPI microservice."}],
    token=client.token,
)
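
To test the parsing logic without a network call, the SSE handling can be separated into a generator that works on any iterable of lines. This sketch assumes the OpenAI-style chunk shape used in the streaming example above (choices[0].delta.content and a final [DONE] marker). The name iter_sse_content is illustrative.

```python
import json


def iter_sse_content(lines):
    """Yield content fragments from an iterable of SSE lines (bytes or str)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines and non-data fields
        data = line[len("data: "):]
        if data == "[DONE]":
            return  # end-of-stream marker
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue  # tolerate chunks that carry no choices
        content = choices[0].get("delta", {}).get("content")
        if content:
            yield content


# Offline check with hand-written sample lines
sample = [
    b'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    b"",
    b'data: {"choices": [{"delta": {"content": "lo"}}]}',
    b"data: [DONE]",
]
print("".join(iter_sse_content(sample)))
```

With a live request, the same generator can be fed response.iter_lines() directly.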

API parameters reference

Parameter | Type | Default | Description
model | string | required | openpangu-2.0-pro or openpangu-2.0-flash
messages | array | required | Conversation history (system/user/assistant roles)
max_tokens | integer | 1024 | Maximum output tokens (up to model limit)
temperature | float | 0.7 | Randomness (0.0 = deterministic, 1.0 = creative)
top_p | float | 0.9 | Nucleus sampling threshold
stream | boolean | false | Enable server-sent events streaming
stop | array | null | Stop sequences
frequency_penalty | float | 0.0 | Reduce repetition (0.0-2.0)
presence_penalty | float | 0.0 | Encourage topic diversity (0.0-2.0)
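
Validating parameters on the client side can catch mistakes before they cost a round trip. The sketch below checks values against the defaults and ranges in the table above. The function name build_payload is illustrative, and the server remains the final authority on what it accepts.

```python
ALLOWED_MODELS = {"openpangu-2.0-pro", "openpangu-2.0-flash"}
ALLOWED_ROLES = {"system", "user", "assistant"}


def build_payload(model, messages, max_tokens=1024, temperature=0.7,
                  top_p=0.9, stream=False, stop=None,
                  frequency_penalty=0.0, presence_penalty=0.0):
    """Validate parameters against the reference table and build the JSON body."""
    if model not in ALLOWED_MODELS:
        raise ValueError(f"unknown model: {model}")
    if not messages:
        raise ValueError("messages must not be empty")
    for message in messages:
        if message.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"invalid role: {message.get('role')}")
    if max_tokens < 1:
        raise ValueError("max_tokens must be positive")
    if temperature < 0.0:
        raise ValueError("temperature must not be negative")
    if not 0.0 <= top_p <= 1.0:
        raise ValueError("top_p must be between 0.0 and 1.0")
    for name, value in (("frequency_penalty", frequency_penalty),
                        ("presence_penalty", presence_penalty)):
        if not 0.0 <= value <= 2.0:
            raise ValueError(f"{name} must be between 0.0 and 2.0")
    payload = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "stream": stream,
        "frequency_penalty": frequency_penalty,
        "presence_penalty": presence_penalty,
    }
    if stop is not None:
        payload["stop"] = list(stop)  # only send stop sequences when provided
    return payload
```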

Long-context usage (512K tokens)

One of openPangu 2.0’s standout features is the 512K token context window. Here is how to use it effectively via API:

def process_long_document(document_text, question):
    """Process a very long document using openPangu's 512K context."""
    messages = [
        {
            "role": "system", 
            "content": "You are a document analysis expert. Answer questions based on the provided document."
        },
        {
            "role": "user",
            "content": f"Document:\n\n{document_text}\n\n---\n\nQuestion: {question}"
        }
    ]
    
    response = client.chat(
        messages=messages,
        model="openpangu-2.0-pro",  # Use Pro for complex analysis
        max_tokens=4096,
        temperature=0.3  # Lower temperature for factual extraction
    )
    return response["choices"][0]["message"]["content"]


# Read a long document (e.g., a full codebase or legal contract)
with open("large_document.txt", "r") as f:
    document = f.read()  # Can be hundreds of thousands of words

answer = process_long_document(document, "Summarize the key obligations in sections 4 through 12.")

Note that very long contexts increase latency and cost proportionally. Flash is recommended for long-context workloads where the task complexity does not require Pro’s 18B active parameters.
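
Before sending a very large document, it can help to estimate whether it fits the context budget. The sketch below is a rough pre-flight check under two loudly stated assumptions: that "512K" means 512,000 tokens, and that one token is about four characters. The second is a crude heuristic for English text, not the model's tokenizer, and it will be off for code or CJK text. All names here are illustrative.

```python
CONTEXT_WINDOW_TOKENS = 512_000  # assumption: "512K" read as 512,000 tokens
CHARS_PER_TOKEN = 4              # assumption: rough heuristic, not the real tokenizer


def estimate_tokens(text):
    """Rough token estimate from character count (rounded up)."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_context(document_text, question, max_output_tokens=4096,
                    overhead_tokens=200):
    """Return (fits, estimated_total) for a document-plus-question request.

    overhead_tokens is a guessed allowance for the system prompt and
    message framing.
    """
    needed = (
        estimate_tokens(document_text)
        + estimate_tokens(question)
        + max_output_tokens
        + overhead_tokens
    )
    return needed <= CONTEXT_WINDOW_TOKENS, needed
```

If the check fails, splitting the document or lowering max_tokens are the obvious options. If it passes only narrowly, treat the estimate with suspicion and leave headroom.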

Integration with existing frameworks

LangChain integration:

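# Hypothetical sketch: ChatHuaweiPangu is an assumed class name. Check that your
# installed langchain_community version actually ships such an integration first.
from langchain_community.chat_models import ChatHuaweiPangu
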
llm = ChatHuaweiPangu(
    model="openpangu-2.0-flash",
    huawei_ak="your-ak",
    huawei_sk="your-sk",
    project_id="your-project-id",
    region="cn-north-4",
    temperature=0.7
)

response = llm.invoke("Explain microservice patterns for event sourcing.")

OpenAI-compatible wrapper:

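If Huawei provides an OpenAI-compatible endpoint (increasingly standard), you can use the OpenAI Python SDK directly. Note that the SDK sends api_key as an Authorization: Bearer header, so this only works if the endpoint accepts the token in that form rather than in X-Auth-Token: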

from openai import OpenAI

client = OpenAI(
    api_key="your-huawei-token",
    base_url="https://modelarts.cn-north-4.myhuaweicloud.com/v1/infers/openpangu-2-flash"
)

response = client.chat.completions.create(
    model="openpangu-2.0-flash",
    messages=[
        {"role": "user", "content": "Generate a Docker Compose file for a Python API with Redis and PostgreSQL."}
    ]
)

This pattern lets you switch between openPangu, DeepSeek V4 Pro, Qwen 3.7, or any other OpenAI-compatible API by changing the base URL and model name.

Rate limits and quotas

Expected rate limits (subject to your subscription tier):

Tier | Requests/minute | Tokens/minute | Concurrent
Free/trial | 10 | 50K | 2
Standard | 60 | 500K | 10
Enterprise | 300+ | 5M+ | 50+
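
Staying under the requests-per-minute limit in the first place is usually cheaper than recovering from 429 responses. Below is a minimal sliding-window limiter, sketched under the assumption that the limit is counted over a rolling 60-second window (the table does not say how the service counts). The class name is illustrative, and the class is not thread-safe as written.

```python
import time
from collections import deque


class RequestRateLimiter:
    """Sliding-window limiter: at most max_requests per window_seconds."""

    def __init__(self, max_requests, window_seconds=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock    # injectable for testing
        self._sleep = sleep    # injectable for testing
        self._sent = deque()   # timestamps of recent requests, oldest first

    def acquire(self):
        """Block until a request slot is free, then record the request."""
        while True:
            now = self._clock()
            # Drop timestamps that have left the window.
            while self._sent and now - self._sent[0] >= self.window_seconds:
                self._sent.popleft()
            if len(self._sent) < self.max_requests:
                self._sent.append(now)
                return
            # Window is full: wait until the oldest request expires.
            self._sleep(self.window_seconds - (now - self._sent[0]))
```

Calling limiter.acquire() before each API request is enough for a single-threaded worker. The Standard tier in the table would correspond to RequestRateLimiter(60).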

If you hit rate limits, implement exponential backoff:

import random
import time

import requests  # needed for requests.exceptions.HTTPError below

def call_with_retry(func, max_retries=5):
    """Retry API calls with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return func()
        except requests.exceptions.HTTPError as e:
            if e.response.status_code == 429:  # Rate limited
                wait = (2 ** attempt) + random.random()
                time.sleep(wait)
            else:
                raise
    raise Exception("Max retries exceeded")
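
A possible refinement is to cap the backoff and to honor a Retry-After header when the server sends one. Retry-After is a standard HTTP header, but whether ModelArts returns it on 429 responses is an assumption here. The function name backoff_delay and its parameters are illustrative.

```python
import random


def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0,
                  rng=random.random):
    """Seconds to wait before the next retry; `attempt` is 0-based.

    Prefers a numeric Retry-After value; otherwise uses capped
    exponential backoff plus up to one second of jitter.
    """
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # header held an HTTP date or junk: fall back to backoff
    return min(base * (2 ** attempt), cap) + rng()
```

Inside a retry loop this would replace the wait calculation, for example time.sleep(backoff_delay(attempt, e.response.headers.get("Retry-After"))).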

Pricing expectations

Huawei has not fully disclosed per-token pricing at launch. Based on the market positioning:

  • Flash is likely priced competitively with DeepSeek V4 Pro ($0.44/$0.87 per M tokens) or lower
  • Pro will be priced higher, reflecting the larger model
  • Free tier likely available for evaluation
  • Enterprise volume discounts expected
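
Until official pricing is published, a budget can be sketched with a reference price. The example below uses the DeepSeek V4 Pro figures quoted above purely as a stand-in, and assumes $0.44 is the input price and $0.87 the output price per million tokens. Actual openPangu pricing may differ in both level and structure.

```python
def estimate_cost_usd(input_tokens, output_tokens,
                      input_price_per_m, output_price_per_m):
    """Estimate request cost in USD from per-million-token prices."""
    input_cost = input_tokens / 1_000_000 * input_price_per_m
    output_cost = output_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost


# Stand-in prices only (assumption): one long-context request with
# 200,000 input tokens and 4,000 output tokens.
cost = estimate_cost_usd(200_000, 4_000, 0.44, 0.87)
print(f"${cost:.4f}")
```

At these stand-in prices the input side dominates the cost of long-context requests, which is worth keeping in mind when deciding how much of a document to send.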

For self-hosting as an alternative to API costs, see our guide on how to run openPangu 2.0 locally. For cost comparison context, see our best cloud GPU providers 2026 guide.

Error handling best practices

import requests
from enum import Enum

class PanguError(Exception):
    pass

class ErrorType(Enum):
    AUTH_FAILED = 401
    RATE_LIMITED = 429
    CONTEXT_TOO_LONG = 400
    SERVER_ERROR = 500

def handle_pangu_response(response):
    """Handle API response with proper error reporting."""
    if response.status_code == 200:
        return response.json()
    elif response.status_code == 401:
        raise PanguError("Authentication failed. Refresh your IAM token.")
    elif response.status_code == 429:
        retry_after = response.headers.get("Retry-After", "60")
        raise PanguError(f"Rate limited. Retry after {retry_after} seconds.")
    elif response.status_code == 400:
        error_detail = response.json().get("error", {}).get("message", "Unknown")
        raise PanguError(f"Bad request: {error_detail}")
    else:
        # Covers 5xx server errors and any other status not handled above
        raise PanguError(f"Unexpected status {response.status_code}: {response.text}")

Production deployment tips

Token management: IAM tokens expire after 24 hours. Implement automatic token refresh in production:

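import threading
import time

import requests

class TokenManager:
    def __init__(self, ak, sk, project_id, region):
        self.ak = ak
        self.sk = sk
        self.project_id = project_id
        self.region = region
        self.token = None
        self._refresh_token()
        self._start_refresh_loop()

    def _get_iam_token(self):
        # Same IAM token request as in the Step 4 example
        url = f"https://iam.{self.region}.myhuaweicloud.com/v3/auth/tokens"
        payload = {"auth": {
            "identity": {
                "methods": ["hw_ak_sk"],
                "hw_ak_sk": {"access": {"key": self.ak}, "secret": {"key": self.sk}},
            },
            "scope": {"project": {"id": self.project_id}},
        }}
        response = requests.post(url, json=payload, timeout=30)
        response.raise_for_status()
        return response.headers["X-Subject-Token"]

    def _refresh_token(self):
        self.token = self._get_iam_token()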
    
    def _start_refresh_loop(self):
        def refresh():
            while True:
                time.sleep(23 * 3600)  # Refresh every 23 hours
                self._refresh_token()
        thread = threading.Thread(target=refresh, daemon=True)
        thread.start()
    
    def get_token(self):
        return self.token

Request logging: Log all API calls for cost tracking and debugging. Include request/response token counts, latency, and model version.
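
A thin wrapper around the chat call is one way to capture these fields. This sketch assumes the response carries an OpenAI-style usage block with prompt_tokens and completion_tokens, which the source does not confirm for ModelArts. Missing fields are logged as "n/a". The name logged_chat and the logger name are illustrative.

```python
import logging
import time

logger = logging.getLogger("pangu.requests")


def logged_chat(chat_fn, messages, model, **kwargs):
    """Call chat_fn and log model, latency, and token usage."""
    start = time.perf_counter()
    response = chat_fn(messages, model=model, **kwargs)
    latency_ms = (time.perf_counter() - start) * 1000
    usage = response.get("usage") or {}  # assumption: OpenAI-style usage block
    logger.info(
        "model=%s latency_ms=%.0f prompt_tokens=%s completion_tokens=%s",
        response.get("model", model),
        latency_ms,
        usage.get("prompt_tokens", "n/a"),
        usage.get("completion_tokens", "n/a"),
    )
    return response
```

With the client from Step 4 this would be called as logged_chat(client.chat, messages, "openpangu-2.0-flash").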

Fallback routing: If ModelArts is unavailable, route to self-hosted openPangu or an alternative model. Never let a single API dependency take down your application.
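
Fallback routing can be as simple as trying an ordered list of providers. In this minimal sketch each provider is a (name, callable) pair where the callable takes the messages list and returns a response. The function name and the provider names in the usage note are illustrative.

```python
def chat_with_fallback(providers, messages):
    """Try each (name, chat_fn) in order; return (name, response) from the first success."""
    errors = []
    for name, chat_fn in providers:
        try:
            return name, chat_fn(messages)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(f"{name}: {exc}")
    raise RuntimeError("All providers failed: " + "; ".join(errors))
```

A typical call would pass something like [("modelarts", client.chat), ("self-hosted", local_chat)], where local_chat is a hypothetical function wrapping your own deployment. Returning the provider name alongside the response makes it easy to log how often the fallback is used.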

FAQ

Is the openPangu 2.0 API compatible with OpenAI’s format?

The API follows a similar chat completions structure (messages array with roles, streaming via SSE). Exact compatibility depends on whether Huawei implements the full OpenAI-compatible schema. Community wrappers and LiteLLM integration will likely provide drop-in compatibility regardless.

Can I use the API from outside China?

Yes. Huawei Cloud has international regions (Singapore, Hong Kong, Bangkok, Frankfurt, etc.). Model availability in each region may vary at launch. Start with the region closest to you and check ModelArts AI Gallery availability.

What is the maximum output length?

This depends on the model configuration on ModelArts. Expect maximum output of 8K-16K tokens per request, with the remaining context budget available for input. The 512K window is primarily for input context. Exact limits will be documented in the ModelArts service terms.

How does pricing compare to DeepSeek V4 Pro API?

DeepSeek V4 Pro costs $0.44/$0.87 per million tokens. Huawei has not disclosed openPangu pricing at launch, but given the competitive landscape and Huawei’s strategy of driving cloud platform adoption, expect pricing in a similar range for Flash. Pro may be priced higher reflecting its larger compute footprint.

Can I fine-tune openPangu 2.0 through the API?

ModelArts supports model fine-tuning workflows. Whether openPangu 2.0 fine-tuning is available at launch through the platform is not confirmed. Self-hosted fine-tuning (downloading weights and running your own training) is explicitly supported by the open-source license.

Is there an SLA for the ModelArts API?

Huawei Cloud provides enterprise SLAs for their managed services. Standard Huawei Cloud SLA terms (typically 99.9% availability for managed AI services) should apply. Check your specific service agreement for guarantees.