InclusionAI Ling models are open-source, which means you have multiple ways to access them via API. You can use InclusionAI’s own Ling Chat interface, route through aggregators like ZenMux, deploy through HuggingFace Inference Endpoints, or self-host with vLLM and expose your own OpenAI-compatible API. Each approach has different tradeoffs in terms of cost, latency, privacy, and control.
This guide covers every API access method for Ling models: the available endpoints, authentication setup, code examples in Python and JavaScript, integration with coding tools, and cost optimization strategies.
API access options overview
| Method | Models available | Cost | Privacy | Setup effort |
|---|---|---|---|---|
| Ling Chat | Ling 2.6, Flash, Ring 1T | Free / usage-based | Data goes to InclusionAI | None |
| ZenMux | Ling family | Pay-per-token | Data goes to provider | Low |
| HuggingFace Inference | All Ling variants | Pay-per-second | HuggingFace infrastructure | Medium |
| Self-hosted (vLLM) | All Ling variants | Your compute cost | Full privacy | High |
| OpenRouter | Varies by availability | Pay-per-token | Data goes to provider | Low |
Ling Chat — InclusionAI’s native interface
InclusionAI provides Ling Chat as a direct interface to their models. This is the simplest way to try Ling models without any setup.
Getting started
- Visit the Ling Chat platform
- Create an account or sign in
- Select your model (Ling 2.6, Flash, or Ring 1T)
- Start chatting
Ling Chat provides a web interface similar to ChatGPT or Claude, but powered by InclusionAI’s models. For API access through Ling Chat, you will need to generate an API key from your account settings.
API usage with Ling Chat
from openai import OpenAI
client = OpenAI(
    base_url="https://api.ling.inclusionai.com/v1",  # Check InclusionAI docs for current URL
    api_key="your-ling-api-key"
)
response = client.chat.completions.create(
    model="ling-2.6-flash",
    messages=[
        {"role": "system", "content": "You are a senior software engineer. Write clean, production-ready code."},
        {"role": "user", "content": "Write a Python decorator that retries failed async functions with exponential backoff."}
    ],
    temperature=0.1,
    max_tokens=2048
)
print(response.choices[0].message.content)
The API follows the OpenAI-compatible format, so any tool or library that works with OpenAI’s API works with Ling Chat’s API endpoint.
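A quick way to sanity-check any OpenAI-compatible endpoint before wiring it into tools is to list the model IDs it exposes. A minimal sketch, assuming the same base URL and key as above (the exact model IDs returned depend on the provider):
from openai import OpenAI
client = OpenAI(
    base_url="https://api.ling.inclusionai.com/v1",  # assumed URL; check InclusionAI docs
    api_key="your-ling-api-key"
)
# List the model IDs the endpoint exposes; use one of these in the `model` field
for model in client.models.list():
    print(model.id)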
ZenMux — API aggregator access
ZenMux is an API aggregator that provides access to multiple AI models through a single endpoint, including InclusionAI Ling models. It handles load balancing, failover, and unified billing.
Setup
from openai import OpenAI
client = OpenAI(
    base_url="https://api.zenmux.ai/v1",  # Check ZenMux docs for current URL
    api_key="your-zenmux-api-key"
)
response = client.chat.completions.create(
    model="inclusionai/ling-2.6-flash",
    messages=[
        {"role": "user", "content": "Refactor this Express.js route to use proper error handling and input validation."}
    ],
    temperature=0.1
)
print(response.choices[0].message.content)
ZenMux is useful when you want to switch between models easily or need fallback routing — if Ling is unavailable, ZenMux can automatically route to an alternative model.
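If you would rather control the fallback yourself, the same idea is easy to implement client-side. A minimal sketch (the fallback model name is a placeholder; substitute whatever alternative your ZenMux plan includes):
from openai import OpenAI, APIError
client = OpenAI(base_url="https://api.zenmux.ai/v1", api_key="your-zenmux-api-key")
def complete_with_fallback(messages, models=("inclusionai/ling-2.6-flash", "another-provider/fallback-model")):
    last_error = None
    for model in models:
        try:
            # Try each model in order; return the first successful completion
            response = client.chat.completions.create(model=model, messages=messages, temperature=0.1)
            return response.choices[0].message.content
        except APIError as e:
            last_error = e
    raise last_error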
HuggingFace Inference Endpoints
HuggingFace lets you deploy Ling models as dedicated inference endpoints. You get your own instance running on HuggingFace’s infrastructure, with pay-per-second billing.
Deploying Ling Flash on HuggingFace
- Go to huggingface.co/inclusionai/Ling-2.6-Flash
- Click “Deploy” → “Inference Endpoints”
- Select your GPU (A100 recommended for Flash)
- Configure scaling (min/max replicas)
- Deploy
Using the endpoint
import requests
API_URL = "https://your-endpoint-id.us-east-1.aws.endpoints.huggingface.cloud"
headers = {"Authorization": "Bearer your-hf-token"}
def query(payload):
    response = requests.post(
        f"{API_URL}/v1/chat/completions",
        headers=headers,
        json=payload
    )
    return response.json()
result = query({
    "model": "inclusionai/Ling-2.6-Flash",
    "messages": [
        {"role": "user", "content": "Write a TypeScript generic function that deep-merges two objects."}
    ],
    "max_tokens": 1024,
    "temperature": 0.1
})
print(result["choices"][0]["message"]["content"])
HuggingFace Inference Endpoints are a good middle ground: you get dedicated compute without managing infrastructure, and the model runs on your own endpoint rather than a shared API.
Self-hosted API with vLLM
For maximum control and privacy, self-host Ling models using vLLM. This gives you an OpenAI-compatible API running on your own infrastructure.
Setup
# Install vLLM
pip install vllm
# Start the API server
python -m vllm.entrypoints.openai.api_server \
--model inclusionai/Ling-2.6-Flash \
--max-model-len 16384 \
--trust-remote-code \
--host 0.0.0.0 \
--port 8000
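Once the server reports it is ready, you can confirm which model it is serving before pointing any tools at it. A minimal check, assuming the default host and port above:
import requests
# vLLM's OpenAI-compatible server lists the models it is serving at /v1/models
resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])  # e.g. ['inclusionai/Ling-2.6-Flash']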
Using the self-hosted API
from openai import OpenAI
# Point to your self-hosted server
client = OpenAI(
    base_url="http://your-server:8000/v1",
    api_key="not-needed"  # No auth needed for local server
)
# Chat completion
response = client.chat.completions.create(
    model="inclusionai/Ling-2.6-Flash",
    messages=[
        {"role": "system", "content": "You are a coding assistant specializing in Python and TypeScript."},
        {"role": "user", "content": "Write a FastAPI middleware that logs request/response times and adds correlation IDs."}
    ],
    temperature=0.1,
    max_tokens=2048
)
print(response.choices[0].message.content)
JavaScript/TypeScript client
import OpenAI from 'openai';
const client = new OpenAI({
  baseURL: 'http://your-server:8000/v1',
  apiKey: 'not-needed',
});
async function generateCode(prompt: string): Promise<string> {
  const response = await client.chat.completions.create({
    model: 'inclusionai/Ling-2.6-Flash',
    messages: [
      { role: 'system', content: 'You are a coding assistant. Return only code, no explanations.' },
      { role: 'user', content: prompt },
    ],
    temperature: 0.1,
    max_tokens: 2048,
  });
  return response.choices[0].message.content ?? '';
}
// Usage
const code = await generateCode(
  'Write a Zod schema for a user registration form with email, password, and optional phone number.'
);
console.log(code);
Streaming responses
For interactive coding tools, streaming provides a better user experience — you see tokens as they are generated rather than waiting for the full response.
Python streaming
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)
stream = client.chat.completions.create(
    model="inclusionai/Ling-2.6-Flash",
    messages=[
        {"role": "user", "content": "Write a comprehensive test suite for a shopping cart module in Python using pytest."}
    ],
    temperature=0.1,
    max_tokens=4096,
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
JavaScript streaming
const stream = await client.chat.completions.create({
  model: 'inclusionai/Ling-2.6-Flash',
  messages: [
    { role: 'user', content: 'Write a React hook for infinite scrolling with intersection observer.' },
  ],
  temperature: 0.1,
  stream: true,
});
for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) {
    process.stdout.write(content);
  }
}
Function calling and tool use
Ling models support function calling for agentic workflows. This lets the model decide when to call external tools and how to use their results.
from openai import OpenAI
import json
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)
tools = [
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read the contents of a file",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {
                        "type": "string",
                        "description": "The file path to read"
                    }
                },
                "required": ["path"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "write_file",
            "description": "Write content to a file",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {
                        "type": "string",
                        "description": "The file path to write"
                    },
                    "content": {
                        "type": "string",
                        "description": "The content to write"
                    }
                },
                "required": ["path", "content"]
            }
        }
    }
]
response = client.chat.completions.create(
    model="inclusionai/Ling-2.6-Flash",
    messages=[
        {"role": "user", "content": "Read the file src/utils.ts and add a debounce utility function to it."}
    ],
    tools=tools,
    tool_choice="auto"
)
# The model will respond with a tool call to read_file first
if response.choices[0].message.tool_calls:
    for tool_call in response.choices[0].message.tool_calls:
        print(f"Tool: {tool_call.function.name}")
        print(f"Args: {tool_call.function.arguments}")
Integration with coding tools
Aider
# Using any Ling API endpoint
export OPENAI_API_BASE=http://localhost:8000/v1
export OPENAI_API_KEY=not-needed
aider --model openai/inclusionai/Ling-2.6-Flash
Continue (VS Code)
{
  "models": [
    {
      "title": "Ling Flash",
      "provider": "openai",
      "model": "inclusionai/Ling-2.6-Flash",
      "apiBase": "http://localhost:8000/v1",
      "apiKey": "not-needed"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Ling Flash Autocomplete",
    "provider": "openai",
    "model": "inclusionai/Ling-2.6-Flash",
    "apiBase": "http://localhost:8000/v1",
    "apiKey": "not-needed"
  }
}
OpenCode
export OPENAI_API_BASE=http://localhost:8000/v1
export OPENAI_API_KEY=not-needed
opencode --model inclusionai/Ling-2.6-Flash
cURL (for testing and scripts)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "inclusionai/Ling-2.6-Flash",
    "messages": [
      {"role": "system", "content": "You are a coding assistant."},
      {"role": "user", "content": "Write a SQL query to find duplicate emails in a users table."}
    ],
    "temperature": 0.1,
    "max_tokens": 512
  }'
Cost optimization
Self-hosted cost calculation
For self-hosted deployments, your cost is the GPU compute:
- Cloud GPU (e.g., A100 on RunPod): ~$1-3/hour depending on provider
- Local GPU (RTX 4090): Electricity cost only (~$0.05-0.10/hour)
- Mac (M-series): Electricity cost only (~$0.02-0.05/hour)
Compare this to API pricing from commercial providers, which typically charge $0.50-5.00 per million tokens. The break-even point depends on your hardware: a rented cloud GPU at $1-3/hour only pays off at heavy, sustained volume (on the order of tens of millions of tokens per day), while a local GPU or Mac you already own costs only electricity, so even a few hundred thousand tokens per day comes out cheaper than per-token billing.
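A rough back-of-the-envelope comparison (illustrative numbers only; plug in your own GPU rate, API price, and daily volume):
# Hypothetical numbers for illustration; adjust to your setup
api_price_per_million = 2.00   # USD per 1M tokens from a commercial provider
gpu_cost_per_hour = 1.50       # USD/hour for a rented cloud GPU, running 24/7
daily_tokens = 5_000_000       # tokens you generate per day
api_cost_per_day = daily_tokens / 1_000_000 * api_price_per_million
gpu_cost_per_day = gpu_cost_per_hour * 24
break_even_tokens = gpu_cost_per_day / api_price_per_million * 1_000_000
print(f"API: ${api_cost_per_day:.2f}/day, always-on GPU: ${gpu_cost_per_day:.2f}/day")
print(f"Break-even at ~{break_even_tokens / 1_000_000:.0f}M tokens/day for these numbers")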
Reducing token usage
Ling models are already optimized for token efficiency, but you can reduce costs further:
- Use system prompts wisely. A short instruction like “You are a coding assistant. Return only code.” keeps prompt tokens low and cuts output tokens by suppressing explanations, compared to verbose instructions.
- Set max_tokens appropriately. Do not set max_tokens to 4096 if you expect a 200-token response. Lower limits prevent runaway generation.
- Use temperature 0.0 for deterministic tasks. Low temperature makes output reproducible and keeps the model focused, which tends to reduce rambling.
- Cache common responses. If you frequently ask the same types of questions, implement response caching; a minimal sketch follows this list.
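A minimal in-process cache keyed on the request payload, assuming an OpenAI-style client like the ones above (a sketch only; production use would likely want Redis or disk-backed storage plus an expiry policy):
import hashlib
import json
_cache = {}
def cached_completion(client, model, messages, **kwargs):
    # Key the cache on everything that affects the output
    payload = json.dumps({"model": model, "messages": messages, **kwargs}, sort_keys=True)
    key = hashlib.sha256(payload.encode()).hexdigest()
    if key not in _cache:
        response = client.chat.completions.create(model=model, messages=messages, **kwargs)
        _cache[key] = response.choices[0].message.content
    return _cache[key]
This only pays off for repeated, identical requests, so it works best with temperature 0 and stable prompts.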
For more strategies on reducing API costs, see our guide on how to reduce LLM API costs.
Error handling
Robust API integration requires proper error handling:
from openai import OpenAI, APIError, APIConnectionError, RateLimitError
import time
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)
def generate_with_retry(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="inclusionai/Ling-2.6-Flash",
                messages=messages,
                temperature=0.1,
                max_tokens=2048,
                timeout=60
            )
            return response.choices[0].message.content
        except APIConnectionError:
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)
                continue
            raise
        except RateLimitError:
            time.sleep(5 * (attempt + 1))
            continue
        except APIError as e:
            print(f"API error: {e}")
            raise
    return None
Choosing the right model for API use
| Model | Best for | API latency | Cost |
|---|---|---|---|
| Ling-Lite (2.75B active) | Fast completions, autocomplete | Very low | Very low |
| Ling Flash (7.4B active) | General coding, balanced speed/quality | Low | Low |
| Ling-Plus (28.8B active) | Complex tasks, production quality | Medium | Medium |
| Ling 2.6 (1T total) | Maximum quality, complex reasoning | High | High |
| Ring 1T | Extended reasoning, debugging | High | High |
For most API use cases, Ling Flash provides the best balance of quality, speed, and cost. Use Ling-Lite for autocomplete and fast completions where latency matters more than quality. Use Ling-Plus or Ling 2.6 when you need the highest quality output and can tolerate higher latency.
For the full model specifications, see our Ling 2.6 complete guide. For an overview of the InclusionAI ecosystem, see What is InclusionAI.
FAQ
Is the InclusionAI Ling API free?
The models themselves are free and open-source. If you self-host with vLLM, the only cost is your compute. Third-party API providers (ZenMux, HuggingFace Inference Endpoints) charge for compute and/or per-token usage. InclusionAI’s own Ling Chat may offer free tiers with usage limits — check their current pricing.
Is the Ling API compatible with OpenAI’s API format?
Yes. When served through vLLM, Ling models expose an OpenAI-compatible API. This means any tool, library, or framework that works with OpenAI’s API works with Ling — including the official OpenAI Python and JavaScript SDKs, Aider, Continue, OpenCode, and others.
What is the rate limit for the Ling API?
For self-hosted deployments, there is no rate limit — you are limited only by your hardware’s throughput. For third-party providers, rate limits depend on the provider and your plan. InclusionAI’s own API rate limits are documented in their developer portal.
Can I use Ling models with LangChain or LlamaIndex?
Yes. Both LangChain and LlamaIndex support OpenAI-compatible endpoints. Point them to your vLLM server or any Ling API endpoint, and they work out of the box. No special adapters or plugins needed.
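For example, with LangChain’s OpenAI integration you only need to override the base URL. A sketch assuming the langchain-openai package and the self-hosted vLLM server from earlier sections:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
    model="inclusionai/Ling-2.6-Flash",
    base_url="http://localhost:8000/v1",  # your vLLM or other OpenAI-compatible endpoint
    api_key="not-needed",
    temperature=0.1,
)
print(llm.invoke("Write a Python function that flattens a nested list.").content)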
How do I switch between Ling models in the same application?
If you are self-hosting with vLLM, run one server instance per model, each on its own port, and switch by changing the base URL and model name in your client; a single vLLM server serves one base model at a time. If you are using an API provider, simply change the model name in your API call. The API format is identical across all Ling variants.
Does the Ling API support batch processing?
vLLM supports continuous batching automatically — multiple concurrent requests are batched together for efficient GPU utilization. For offline batch processing, you can send multiple requests concurrently using async Python or parallel HTTP requests. There is no dedicated batch API endpoint, but the standard chat completions endpoint handles concurrent requests efficiently.
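A minimal sketch of concurrent batch processing with the async OpenAI client against a self-hosted endpoint (the prompt list and concurrency limit are illustrative):
import asyncio
from openai import AsyncOpenAI
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
semaphore = asyncio.Semaphore(8)  # cap in-flight requests; vLLM batches them on the GPU
async def complete(prompt):
    async with semaphore:
        response = await client.chat.completions.create(
            model="inclusionai/Ling-2.6-Flash",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.1,
            max_tokens=512,
        )
        return response.choices[0].message.content
async def main():
    prompts = ["Write a SQL query to find duplicate emails.", "Write a Python LRU cache."]
    results = await asyncio.gather(*(complete(p) for p in prompts))
    for result in results:
        print(result)
asyncio.run(main())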