
Kimi K2.7 Code API Guide: Setup, Pricing, and First Request


Moonshot’s Kimi K2.7 Code is a coding-focused model with 1 trillion total parameters (32 billion active via MoE), a 256K-token context window, and interleaved thinking. The best part? You can access it through a standard OpenAI-compatible API, which means you can drop it into almost any tool or workflow you’re already using.

In this guide, I’ll walk you through setting up API access, making your first request, enabling thinking mode, using tool calling, and streaming responses. If you’re wondering whether to use the API or self-host locally, this article will help you decide.

Setting Up Your Moonshot Account

First, you need access to the Moonshot platform:

  1. Go to platform.moonshot.ai
  2. Create an account (email or GitHub login)
  3. Navigate to API Keys in your dashboard
  4. Generate a new API key
  5. Save it somewhere secure — you won’t see it again

The platform offers both free tier credits for testing and pay-as-you-go pricing for production use.
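
Since the key is only shown once, it helps to keep it out of source files from the start. Here is a minimal sketch that reads it from an environment variable, assuming you exported it as MOONSHOT_API_KEY (the name the curl example in this guide uses). The helper name load_api_key is my own and not part of any SDK.

```python
import os


def load_api_key(env_var: str = "MOONSHOT_API_KEY") -> str:
    """Return the API key from the environment, failing loudly if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set. Export it in your shell before running this script."
        )
    return key
```

You could then pass api_key=load_api_key() wherever the later examples show a hard-coded placeholder key.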

API Basics

Moonshot provides two compatible endpoint formats:

  • OpenAI-compatible: Drop-in replacement for OpenAI’s API format
  • Anthropic-compatible: For tools that expect Claude-style message formatting

The base URL for both:

https://api.moonshot.ai/v1

The model ID for K2.7 Code:

kimi-k2.7-code

Your First API Request

Let’s start with a simple curl request to verify everything works:

curl https://api.moonshot.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $MOONSHOT_API_KEY" \
  -d '{
    "model": "kimi-k2.7-code",
    "messages": [
      {"role": "system", "content": "You are a senior software engineer. Write clean, well-documented code."},
      {"role": "user", "content": "Write a TypeScript function that debounces an async function and cancels pending calls"}
    ],
    "max_tokens": 2048,
    "temperature": 0.7
  }'

If you get a valid response with generated code, you’re good to go.
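
If you save the curl output and want just the generated code, the reply should sit under choices[0].message.content, assuming the response follows the usual OpenAI chat completions shape. The sketch below parses a hand-written sample body (not real API output); the error check assumes failures come back in an OpenAI-style "error" object, which I have not verified against Moonshot’s responses.

```python
import json


def extract_text(raw_body: str) -> str:
    """Pull the assistant's reply out of an OpenAI-style chat completion body."""
    body = json.loads(raw_body)
    if "error" in body:
        # Assumption: failures are reported in an OpenAI-style "error" object.
        raise RuntimeError(f"API returned an error: {body['error']}")
    return body["choices"][0]["message"]["content"]


# Hand-written sample in the OpenAI response shape, for illustration only.
sample = json.dumps({
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "function debounce() {}"}}
    ]
})
print(extract_text(sample))
```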

Python SDK Setup

For most developers, using Python with the OpenAI SDK is the most ergonomic approach:

pip install openai

from openai import OpenAI

client = OpenAI(
    api_key="your-moonshot-api-key",
    base_url="https://api.moonshot.ai/v1"
)

response = client.chat.completions.create(
    model="kimi-k2.7-code",
    messages=[
        {"role": "system", "content": "You are an expert Python developer."},
        {"role": "user", "content": "Implement a thread-safe LRU cache with TTL expiry"}
    ],
    max_tokens=4096,
    temperature=0.7
)

print(response.choices[0].message.content)

That’s it. Because the API is OpenAI-compatible, you can use the official OpenAI Python SDK with just a different base_url and API key.

Enabling Thinking Mode

Kimi K2.7 Code supports “preserve thinking” — the model’s reasoning is retained across turns, giving it better coherence in multi-turn conversations. This is especially powerful for complex coding tasks where the model needs to track state across multiple interactions.

response = client.chat.completions.create(
    model="kimi-k2.7-code",
    messages=[
        {"role": "user", "content": "Design a pub/sub system in Go that supports wildcard topic matching"}
    ],
    max_tokens=8192,
    temperature=0.7,
    extra_body={
        "thinking": {
            "enabled": True,
            "preserve": True
        }
    }
)

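# Non-standard response fields show up as extra attributes on the SDK's
# message object, so read the reasoning defensively in case it is absent.
message = response.choices[0].message
reasoning = getattr(message, "thinking", None)
if reasoning:
    print("Reasoning:", reasoning)
print("Response:", message.content)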

With preserve thinking enabled, follow-up messages in the same conversation will benefit from the model’s previous reasoning — it doesn’t “forget” its analysis between turns.

Streaming Responses

For interactive applications or coding tools, streaming gives you token-by-token output:

stream = client.chat.completions.create(
    model="kimi-k2.7-code",
    messages=[
        {"role": "user", "content": "Write a Rust HTTP server with graceful shutdown handling"}
    ],
    max_tokens=4096,
    temperature=0.7,
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Streaming works exactly like OpenAI’s streaming API — same SSE format, same delta structure.

Tool Calling

K2.7 Code supports interleaved thinking and multi-step tool calling, which makes it excellent for agentic workflows. Here’s how to define and use tools:

tools = [
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read the contents of a file",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {
                        "type": "string",
                        "description": "The file path to read"
                    }
                },
                "required": ["path"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "write_file",
            "description": "Write content to a file",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "The file path"},
                    "content": {"type": "string", "description": "Content to write"}
                },
                "required": ["path", "content"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="kimi-k2.7-code",
    messages=[
        {"role": "user", "content": "Read the config.yaml file and add a new database connection pool setting"}
    ],
    tools=tools,
    tool_choice="auto",
    max_tokens=4096
)

message = response.choices[0].message
if message.tool_calls:
    for tool_call in message.tool_calls:
        print(f"Tool: {tool_call.function.name}")
        print(f"Args: {tool_call.function.arguments}")

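The multi-step capability means K2.7 can chain multiple tool calls while working on a single request (read a file, analyze it, then write a modified version) without needing separate user messages in between. Keep in mind that the model only requests each call: your code runs the tool and sends the result back as a tool message before the model continues. For a deeper understanding of how tool calling works, see our tool calling guide.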
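
The snippet above only prints what the model asked for. Below is a hedged sketch of the other half: running a requested call locally and packaging the result. It assumes the standard OpenAI tool-message shape (role "tool" plus tool_call_id); the handler functions and helper names are illustrative and mirror the read_file and write_file definitions from the tools list.

```python
import json
from pathlib import Path


def read_file(path: str) -> str:
    return Path(path).read_text(encoding="utf-8")


def write_file(path: str, content: str) -> str:
    Path(path).write_text(content, encoding="utf-8")
    return f"Wrote {len(content)} characters to {path}"


# Maps the tool names declared in `tools` to local Python functions.
TOOL_HANDLERS = {"read_file": read_file, "write_file": write_file}


def run_tool_call(name: str, arguments: str) -> str:
    """Execute one tool call; `arguments` is the JSON string the model produced."""
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        return f"Error: unknown tool {name!r}"
    try:
        return handler(**json.loads(arguments))
    except (OSError, TypeError, ValueError) as exc:
        # Return the failure as text so the model can react to it.
        return f"Error: {exc}"


def tool_result_message(tool_call_id: str, result: str) -> dict:
    """Build the `tool` role message that reports a result back to the model."""
    return {"role": "tool", "tool_call_id": tool_call_id, "content": result}
```

In a full loop you would append the assistant message containing tool_calls to the history, add one tool_result_message per call (using tool_call.id, tool_call.function.name, and tool_call.function.arguments), and call the API again until the reply has no tool_calls left. Letting a model write files is risky, so in real use restrict the paths these handlers accept.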

Multi-Turn Conversation Example

Here’s a more realistic example showing a multi-turn coding session:

from openai import OpenAI

client = OpenAI(
    api_key="your-moonshot-api-key",
    base_url="https://api.moonshot.ai/v1"
)

messages = [
    {"role": "system", "content": "You are a senior full-stack developer helping with a FastAPI project."}
]

def chat(user_message):
    messages.append({"role": "user", "content": user_message})
    
    response = client.chat.completions.create(
        model="kimi-k2.7-code",
        messages=messages,
        max_tokens=4096,
        temperature=0.7,
        extra_body={"thinking": {"enabled": True, "preserve": True}}
    )
    
    assistant_message = response.choices[0].message.content
    messages.append({"role": "assistant", "content": assistant_message})
    return assistant_message

print(chat("I need to add WebSocket support to my FastAPI app for real-time notifications"))
print(chat("Now add authentication to the WebSocket connections using JWT"))
print(chat("Add a connection manager that handles disconnections gracefully"))

With preserve thinking, each response builds on the model’s accumulated understanding of your project structure.
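
One weakness of the example above is the module-level messages list: every caller shares it, and a failed request leaves an unanswered user turn in the history. Here is a sketch of one way to tidy that up. The Conversation class and its complete callback are my own illustrative names; the callback is whatever function sends the history to the API and returns the reply text.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


class Conversation:
    """Keeps one session's message history separate from any other session."""

    def __init__(self, system_prompt: str, complete: Callable[[List[Message]], str]):
        # `complete` takes the full history and returns the assistant's reply text.
        self.messages: List[Message] = [{"role": "system", "content": system_prompt}]
        self._complete = complete

    def ask(self, user_message: str) -> str:
        self.messages.append({"role": "user", "content": user_message})
        try:
            reply = self._complete(self.messages)
        except Exception:
            # Drop the unanswered user turn so a retry does not duplicate it.
            self.messages.pop()
            raise
        self.messages.append({"role": "assistant", "content": reply})
        return reply
```

To use it with the SDK, pass a small function that calls client.chat.completions.create with the given history and returns response.choices[0].message.content.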

Error Handling and Retries

Production code needs proper error handling:

import time

from openai import OpenAI, APIError, RateLimitError, APITimeoutError

client = OpenAI(
    api_key="your-moonshot-api-key",
    base_url="https://api.moonshot.ai/v1",
    timeout=120.0,
    # The SDK retries rate limits and timeouts on its own; disable that here
    # so the loop below is the only retry layer and attempts don't multiply.
    max_retries=0
)

def robust_completion(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="kimi-k2.7-code",
                messages=messages,
                max_tokens=4096,
                temperature=0.7
            )
            return response.choices[0].message.content
        except RateLimitError:
            # Exponential backoff: 1s, 2s, 4s, ...
            wait_time = 2 ** attempt
            print(f"Rate limited. Waiting {wait_time}s...")
            time.sleep(wait_time)
        except APITimeoutError:
            print(f"Timeout on attempt {attempt + 1}")
            continue
        except APIError as e:
            # Anything else (bad request, auth failure, server error) goes to the caller.
            print(f"API error: {e}")
            raise

    raise RuntimeError("Max retries exceeded")
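
The loop above waits exactly 2 ** attempt seconds after a rate limit. When many workers hit the limit at the same moment, identical waits make them all retry together. A common refinement is to add random jitter. Below is a minimal sketch of the "full jitter" variant; the function name and the default base and cap values are assumptions of mine, not Moonshot recommendations.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full-jitter backoff: a random delay between 0 and min(cap, base * 2**attempt)."""
    upper_bound = min(cap, base * (2 ** attempt))
    return random.uniform(0, upper_bound)
```

You could swap this in by replacing wait_time = 2 ** attempt with wait_time = backoff_delay(attempt).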

API vs Self-Hosting: When to Choose What

| Factor | API | Self-Hosted |
| --- | --- | --- |
| Setup time | 5 minutes | 2-4 hours |
| Maintenance | Zero | Ongoing |
| Privacy | Data sent to Moonshot | Fully private |
| Cost (low volume) | Pay per token | High fixed cost |
| Cost (high volume) | Expensive | Predictable |
| Rate limits | Platform-imposed | None |
| Latency | Network + inference | Inference only |

For most individual developers and small teams, the API is the right choice. You get started instantly and only pay for what you use. If you’re processing sensitive code at scale, self-hosting makes more sense.

Using K2.7 with Coding Tools

The OpenAI-compatible API means K2.7 works with the entire ecosystem of coding tools:

  • Aider: Set K2.7 as your model with a custom API base
  • OpenCode: Configure as an OpenAI-compatible provider
  • Kimi Code CLI: Purpose-built for K2.7, best experience
  • Continue, Cursor, etc.: Any tool supporting custom OpenAI endpoints

For detailed setup instructions with these tools, check our K2.7 integration guide.

Comparison with K2.6 API

If you’ve been using the K2.6 API, here’s what’s different in K2.7 Code:

  • Specialized for code: K2.7 Code is fine-tuned specifically for programming tasks
  • Better tool calling: Multi-step tool calls in a single turn
  • Preserve thinking: Reasoning state maintained across conversation turns
  • MoonViT vision: Can process screenshots, diagrams, and images of code
  • Same API format: Drop-in replacement — just change the model ID

FAQ

What’s the model ID for Kimi K2.7 Code in API calls?

The model ID is kimi-k2.7-code. Use this in the model field of your API requests. The base URL is https://api.moonshot.ai/v1 and the API format is OpenAI-compatible.

Does Kimi K2.7 Code API support function/tool calling?

Yes. K2.7 Code supports full OpenAI-style tool calling with multi-step execution. The model can chain multiple tool calls in a single response turn, making it excellent for agentic coding workflows. It also interleaves thinking with tool calls for better reasoning.

How does “preserve thinking” work in the API?

Preserve thinking tells the model to retain its reasoning chain across conversation turns. Enable it with extra_body={"thinking": {"enabled": True, "preserve": True}}. The model’s analysis from turn 1 then informs its responses in turns 2, 3, and beyond, without you needing to repeat context.

Can I use the OpenAI Python SDK with Moonshot’s API?

Absolutely. Just set base_url="https://api.moonshot.ai/v1" and use your Moonshot API key. The API is fully OpenAI-compatible — chat completions, streaming, tool calling, all work identically. There’s also an Anthropic-compatible endpoint if your tools expect that format.

What’s the context window limit for K2.7 Code via API?

256K tokens. This is the full context window available through the API, so you can send long codebases, full file contents, and lengthy conversation histories. For budgeting, a common rule of thumb is roughly 4 characters per token; the real ratio depends on the tokenizer and on the content, so treat it as an estimate rather than an exact count.
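
To sanity-check a large prompt before sending it, you can turn the 4-characters-per-token rule of thumb into a quick estimate. This is a rough sketch, not a real tokenizer: it assumes "256K" means 256,000 tokens and that the reply’s max_tokens counts against the same window, and both helper names are illustrative.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count; not a substitute for a tokenizer."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(
    text: str,
    reserved_for_output: int = 4096,
    context_window: int = 256_000,  # assumption: 256K read as 256,000 tokens
) -> bool:
    """Check whether a prompt likely leaves room for the reply."""
    return estimate_tokens(text) + reserved_for_output <= context_window
```

If the estimate lands close to the limit, leave extra headroom, since the heuristic can undercount for some inputs.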

Next Steps

Now that you have API access working, consider: