InclusionAI Ling models are open-source, which means you have multiple ways to access them via API. You can use InclusionAI’s own Ling Chat interface, route through aggregators like ZenMux, deploy through HuggingFace Inference Endpoints, or self-host with vLLM and expose your own OpenAI-compatible API. Each approach has different tradeoffs in terms of cost, latency, privacy, and control.
This guide covers every API access method for Ling models: the available endpoints, authentication setup, code examples in Python and JavaScript, integration with coding tools, and cost optimization strategies.
API access options overview
| Method | Models available | Cost | Privacy | Setup effort |
|---|---|---|---|---|
| Ling Chat | Ling 2.6, Flash, Ring 1T | Free / usage-based | Data goes to InclusionAI | None |
| ZenMux | Ling family | Pay-per-token | Data goes to provider | Low |
| HuggingFace Inference | All Ling variants | Pay-per-second | HuggingFace infrastructure | Medium |
| Self-hosted (vLLM) | All Ling variants | Your compute cost | Full privacy | High |
| OpenRouter | Varies by availability | Pay-per-token | Data goes to provider | Low |
Ling Chat — InclusionAI’s native interface
InclusionAI provides Ling Chat as a direct interface to their models. This is the simplest way to try Ling models without any setup.
Getting started
- Visit the Ling Chat platform
- Create an account or sign in
- Select your model (Ling 2.6, Flash, or Ring 1T)
- Start chatting
Ling Chat provides a web interface similar to ChatGPT or Claude, but powered by InclusionAI’s models. For API access through Ling Chat, you will need to generate an API key from your account settings.
API usage with Ling Chat
from openai import OpenAI
client = OpenAI(
    base_url="https://api.ling.inclusionai.com/v1",  # Check InclusionAI docs for current URL
    api_key="your-ling-api-key"
)
response = client.chat.completions.create(
    model="ling-2.6-flash",
    messages=[
        {"role": "system", "content": "You are a senior software engineer. Write clean, production-ready code."},
        {"role": "user", "content": "Write a Python decorator that retries failed async functions with exponential backoff."}
    ],
    temperature=0.1,
    max_tokens=2048
)
print(response.choices[0].message.content)
The API follows the OpenAI-compatible format, so any tool or library that works with OpenAI’s API works with Ling Chat’s API endpoint.
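A quick way to sanity-check any OpenAI-compatible endpoint before wiring it into tools is to list the model IDs it exposes. A minimal sketch, assuming the same base URL and key as above (the exact model IDs returned depend on the provider):
from openai import OpenAI
client = OpenAI(
    base_url="https://api.ling.inclusionai.com/v1",  # assumed URL; check InclusionAI docs
    api_key="your-ling-api-key"
)
# List the model IDs the endpoint exposes; use one of these in the `model` field
for model in client.models.list():
    print(model.id)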
ZenMux — API aggregator access
ZenMux is an API aggregator that provides access to multiple AI models through a single endpoint, including InclusionAI Ling models. It handles load balancing, failover, and unified billing.
Setup
from openai import OpenAI
client = OpenAI(
    base_url="https://api.zenmux.ai/v1",  # Check ZenMux docs for current URL
    api_key="your-zenmux-api-key"
)
response = client.chat.completions.create(
    model="inclusionai/ling-2.6-flash",
    messages=[
        {"role": "user", "content": "Refactor this Express.js route to use proper error handling and input validation."}
    ],
    temperature=0.1
)
print(response.choices[0].message.content)
ZenMux is useful when you want to switch between models easily or need fallback routing — if Ling is unavailable, ZenMux can automatically route to an alternative model.
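If you would rather control the fallback yourself, the same idea is easy to implement client-side. A minimal sketch (the fallback model name is a placeholder; substitute whatever alternative your ZenMux plan includes):
from openai import OpenAI, APIError
client = OpenAI(base_url="https://api.zenmux.ai/v1", api_key="your-zenmux-api-key")
def complete_with_fallback(messages, models=("inclusionai/ling-2.6-flash", "another-provider/fallback-model")):
    last_error = None
    for model in models:
        try:
            # Try each model in order; return the first successful completion
            response = client.chat.completions.create(model=model, messages=messages, temperature=0.1)
            return response.choices[0].message.content
        except APIError as e:
            last_error = e
    raise last_error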
HuggingFace Inference Endpoints
HuggingFace lets you deploy Ling models as dedicated inference endpoints. You get your own instance running on HuggingFace’s infrastructure, with pay-per-second billing.
Deploying Ling Flash on HuggingFace
- Go to huggingface.co/inclusionai/Ling-2.6-Flash
- Click “Deploy” → “Inference Endpoints”
- Select your GPU (A100 recommended for Flash)
- Configure scaling (min/max replicas)
- Deploy
Using the endpoint
import requests
API_URL = "https://your-endpoint-id.us-east-1.aws.endpoints.huggingface.cloud"
headers = {"Authorization": "Bearer your-hf-token"}
def query(payload):
    response = requests.post(
        f"{API_URL}/v1/chat/completions",
        headers=headers,
        json=payload
    )
    return response.json()
result = query({
    "model": "inclusionai/Ling-2.6-Flash",
    "messages": [
        {"role": "user", "content": "Write a TypeScript generic function that deep-merges two objects."}
    ],
    "max_tokens": 1024,
    "temperature": 0.1
})
print(result["choices"][0]["message"]["content"])
HuggingFace Inference Endpoints are a good middle ground: you get dedicated compute without managing infrastructure, and the model runs on your own endpoint rather than a shared API.
Self-hosted API with vLLM
For maximum control and privacy, self-host Ling models using vLLM. This gives you an OpenAI-compatible API running on your own infrastructure.
Setup
# Install vLLM
pip install vllm
# Start the API server
python -m vllm.entrypoints.openai.api_server \
--model inclusionai/Ling-2.6-Flash \
--max-model-len 16384 \
--trust-remote-code \
--host 0.0.0.0 \
--port 8000
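Once the server reports it is ready, you can confirm which model it is serving before pointing any tools at it. A minimal check, assuming the default host and port above:
import requests
# vLLM's OpenAI-compatible server lists the models it is serving at /v1/models
resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])  # e.g. ['inclusionai/Ling-2.6-Flash']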
Using the self-hosted API
from openai import OpenAI
# Point to your self-hosted server
client = OpenAI(
    base_url="http://your-server:8000/v1",
    api_key="not-needed"  # No auth needed for local server
)
# Chat completion
response = client.chat.completions.create(
    model="inclusionai/Ling-2.6-Flash",
    messages=[
        {"role": "system", "content": "You are a coding assistant specializing in Python and TypeScript."},
        {"role": "user", "content": "Write a FastAPI middleware that logs request/response times and adds correlation IDs."}
    ],
    temperature=0.1,
    max_tokens=2048
)
print(response.choices[0].message.content)
JavaScript/TypeScript client
import OpenAI from 'openai';
const client = new OpenAI({
  baseURL: 'http://your-server:8000/v1',
  apiKey: 'not-needed',
});
async function generateCode(prompt: string): Promise<string> {
  const response = await client.chat.completions.create({
    model: 'inclusionai/Ling-2.6-Flash',
    messages: [
      { role: 'system', content: 'You are a coding assistant. Return only code, no explanations.' },
      { role: 'user', content: prompt },
    ],
    temperature: 0.1,
    max_tokens: 2048,
  });
  return response.choices[0].message.content ?? '';
}
// Usage
const code = await generateCode(
  'Write a Zod schema for a user registration form with email, password, and optional phone number.'
);
console.log(code);
Streaming responses
For interactive coding tools, streaming provides a better user experience — you see tokens as they are generated rather than waiting for the full response.
Python streaming
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)
stream = client.chat.completions.create(
    model="inclusionai/Ling-2.6-Flash",
    messages=[
        {"role": "user", "content": "Write a comprehensive test suite for a shopping cart module in Python using pytest."}
    ],
    temperature=0.1,
    max_tokens=4096,
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
JavaScript streaming
const stream = await client.chat.completions.create({
  model: 'inclusionai/Ling-2.6-Flash',
  messages: [
    { role: 'user', content: 'Write a React hook for infinite scrolling with intersection observer.' },
  ],
  temperature: 0.1,
  stream: true,
});
for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) {
    process.stdout.write(content);
  }
}
Function calling and tool use
Ling models support function calling for agentic workflows. This lets the model decide when to call external tools and how to use their results.
from openai import OpenAI
import json
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)
tools = [
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read the contents of a file",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {
                        "type": "string",
                        "description": "The file path to read"
                    }
                },
                "required": ["path"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "write_file",
            "description": "Write content to a file",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {
                        "type": "string",
                        "description": "The file path to write"
                    },
                    "content": {
                        "type": "string",
                        "description": "The content to write"
                    }
                },
                "required": ["path", "content"]
            }
        }
    }
]
response = client.chat.completions.create(
    model="inclusionai/Ling-2.6-Flash",
    messages=[
        {"role": "user", "content": "Read the file src/utils.ts and add a debounce utility function to it."}
    ],
    tools=tools,
    tool_choice="auto"
)
# The model will respond with a tool call to read_file first
if response.choices[0].message.tool_calls:
    for tool_call in response.choices[0].message.tool_calls:
        print(f"Tool: {tool_call.function.name}")
        print(f"Args: {tool_call.function.arguments}")
Integration with coding tools
Aider
# Using any Ling API endpoint
export OPENAI_API_BASE=http://localhost:8000/v1
export OPENAI_API_KEY=not-needed
aider --model openai/inclusionai/Ling-2.6-Flash
Continue (VS Code)
{
  "models": [
    {
      "title": "Ling Flash",
      "provider": "openai",
      "model": "inclusionai/Ling-2.6-Flash",
      "apiBase": "http://localhost:8000/v1",
      "apiKey": "not-needed"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Ling Flash Autocomplete",
    "provider": "openai",
    "model": "inclusionai/Ling-2.6-Flash",
    "apiBase": "http://localhost:8000/v1",
    "apiKey": "not-needed"
  }
}
OpenCode
export OPENAI_API_BASE=http://localhost:8000/v1
export OPENAI_API_KEY=not-needed
opencode --model inclusionai/Ling-2.6-Flash
cURL (for testing and scripts)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "inclusionai/Ling-2.6-Flash",
    "messages": [
      {"role": "system", "content": "You are a coding assistant."},
      {"role": "user", "content": "Write a SQL query to find duplicate emails in a users table."}
    ],
    "temperature": 0.1,
    "max_tokens": 512
  }'
Cost optimization
Self-hosted cost calculation
For self-hosted deployments, your cost is the GPU compute:
- Cloud GPU (e.g., A100 on RunPod): ~$1-3/hour depending on provider
- Local GPU (RTX 4090): Electricity cost only (~$0.05-0.10/hour)
- Mac (M-series): Electricity cost only (~$0.02-0.05/hour)
Compare this to API pricing from commercial providers, which typically charge $0.50-5.00 per million tokens. The break-even point depends on your hardware: a rented cloud GPU at $1-3/hour only pays off at heavy, sustained volume (on the order of tens of millions of tokens per day), while a local GPU or Mac you already own costs only electricity, so even a few hundred thousand tokens per day comes out cheaper than per-token billing.
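A rough back-of-the-envelope comparison (illustrative numbers only; plug in your own GPU rate, API price, and daily volume):
# Hypothetical numbers for illustration; adjust to your setup
api_price_per_million = 2.00   # USD per 1M tokens from a commercial provider
gpu_cost_per_hour = 1.50       # USD/hour for a rented cloud GPU, running 24/7
daily_tokens = 5_000_000       # tokens you generate per day
api_cost_per_day = daily_tokens / 1_000_000 * api_price_per_million
gpu_cost_per_day = gpu_cost_per_hour * 24
break_even_tokens = gpu_cost_per_day / api_price_per_million * 1_000_000
print(f"API: ${api_cost_per_day:.2f}/day, always-on GPU: ${gpu_cost_per_day:.2f}/day")
print(f"Break-even at ~{break_even_tokens / 1_000_000:.0f}M tokens/day for these numbers")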
Reducing token usage
Ling models are already optimized for token efficiency, but you can reduce costs further:
- Use system prompts wisely. A short instruction like “You are a coding assistant. Return only code.” keeps prompt tokens low and cuts output tokens by suppressing explanations, compared to verbose instructions.
- Set max_tokens appropriately. Do not set max_tokens to 4096 if you expect a 200-token response. Lower limits prevent runaway generation.
- Use temperature 0.0 for deterministic tasks. Low temperature makes output reproducible and keeps the model focused, which tends to reduce rambling.
- Cache common responses. If you frequently ask the same types of questions, implement response caching; a minimal sketch follows this list.
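A minimal in-process cache keyed on the request payload, assuming an OpenAI-style client like the ones above (a sketch only; production use would likely want Redis or disk-backed storage plus an expiry policy):
import hashlib
import json
_cache = {}
def cached_completion(client, model, messages, **kwargs):
    # Key the cache on everything that affects the output
    payload = json.dumps({"model": model, "messages": messages, **kwargs}, sort_keys=True)
    key = hashlib.sha256(payload.encode()).hexdigest()
    if key not in _cache:
        response = client.chat.completions.create(model=model, messages=messages, **kwargs)
        _cache[key] = response.choices[0].message.content
    return _cache[key]
This only pays off for repeated, identical requests, so it works best with temperature 0 and stable prompts.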
For more strategies on reducing API costs, see our guide on how to reduce LLM API costs.
Error handling
Robust API integration requires proper error handling:
from openai import OpenAI, APIError, APIConnectionError, RateLimitError
import time
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)
def generate_with_retry(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="inclusionai/Ling-2.6-Flash",
                messages=messages,
                temperature=0.1,
                max_tokens=2048,
                timeout=60
            )
            return response.choices[0].message.content
        except APIConnectionError:
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)
                continue
            raise
        except RateLimitError:
            time.sleep(5 * (attempt + 1))
            continue
        except APIError as e:
            print(f"API error: {e}")
            raise
    return None
Choosing the right model for API use
| Model | Best for | API latency | Cost |
|---|---|---|---|
| Ling-Lite (2.75B active) | Fast completions, autocomplete | Very low | Very low |
| Ling Flash (7.4B active) | General coding, balanced speed/quality | Low | Low |
| Ling-Plus (28.8B active) | Complex tasks, production quality | Medium | Medium |
| Ling 2.6 (1T total) | Maximum quality, complex reasoning | High | High |
| Ring 1T | Extended reasoning, debugging | High | High |
For most API use cases, Ling Flash provides the best balance of quality, speed, and cost. Use Ling-Lite for autocomplete and fast completions where latency matters more than quality. Use Ling-Plus or Ling 2.6 when you need the highest quality output and can tolerate higher latency.
For the full model specifications, see our Ling 2.6 complete guide. For an overview of the InclusionAI ecosystem, see What is InclusionAI.
FAQ
Is the InclusionAI Ling API free?
The models themselves are free and open-source. If you self-host with vLLM, the only cost is your compute. Third-party API providers (ZenMux, HuggingFace Inference Endpoints) charge for compute and/or per-token usage. InclusionAI’s own Ling Chat may offer free tiers with usage limits — check their current pricing.
Is the Ling API compatible with OpenAI’s API format?
Yes. When served through vLLM, Ling models expose an OpenAI-compatible API. This means any tool, library, or framework that works with OpenAI’s API works with Ling — including the official OpenAI Python and JavaScript SDKs, Aider, Continue, OpenCode, and others.
What is the rate limit for the Ling API?
For self-hosted deployments, there is no rate limit — you are limited only by your hardware’s throughput. For third-party providers, rate limits depend on the provider and your plan. InclusionAI’s own API rate limits are documented in their developer portal.
Can I use Ling models with LangChain or LlamaIndex?
Yes. Both LangChain and LlamaIndex support OpenAI-compatible endpoints. Point them to your vLLM server or any Ling API endpoint, and they work out of the box. No special adapters or plugins needed.
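For example, with LangChain’s OpenAI integration you only need to override the base URL. A sketch assuming the langchain-openai package and the self-hosted vLLM server from earlier sections:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
    model="inclusionai/Ling-2.6-Flash",
    base_url="http://localhost:8000/v1",  # your vLLM or other OpenAI-compatible endpoint
    api_key="not-needed",
    temperature=0.1,
)
print(llm.invoke("Write a Python function that flattens a nested list.").content)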
How do I switch between Ling models in the same application?
If you are self-hosting with vLLM, run one server instance per model, each on its own port, and switch by changing the base URL and model name in your client; a single vLLM server serves one base model at a time. If you are using an API provider, simply change the model name in your API call. The API format is identical across all Ling variants.
Does the Ling API support batch processing?
vLLM supports continuous batching automatically — multiple concurrent requests are batched together for efficient GPU utilization. For offline batch processing, you can send multiple requests concurrently using async Python or parallel HTTP requests. There is no dedicated batch API endpoint, but the standard chat completions endpoint handles concurrent requests efficiently.
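A minimal sketch of concurrent batch processing with the async OpenAI client against a self-hosted endpoint (the prompt list and concurrency limit are illustrative):
import asyncio
from openai import AsyncOpenAI
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
semaphore = asyncio.Semaphore(8)  # cap in-flight requests; vLLM batches them on the GPU
async def complete(prompt):
    async with semaphore:
        response = await client.chat.completions.create(
            model="inclusionai/Ling-2.6-Flash",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.1,
            max_tokens=512,
        )
        return response.choices[0].message.content
async def main():
    prompts = ["Write a SQL query to find duplicate emails.", "Write a Python LRU cache."]
    results = await asyncio.gather(*(complete(p) for p in prompts))
    for result in results:
        print(result)
asyncio.run(main())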