Kimi K2.7 Code Complete Guide: 1T Coding Agent That Beats Opus on Tool Use (2026)
Moonshot AI just dropped Kimi K2.7 Code today, and it’s a big deal. This is a 1 trillion parameter open-source coding model that actually beats Claude Opus 4.8 on tool use benchmarks. Let that sink in for a second — an open-source model outperforming one of the most expensive closed models on agentic coding tasks.
I’ve been tracking the Kimi line since K2.5 and through the impressive K2.6 release, and K2.7 Code represents a focused evolution: less about being everything to everyone, more about being the best open-source coding agent you can run.
Let me break down everything you need to know.
What Is Kimi K2.7 Code?
Kimi K2.7 Code is Moonshot AI’s latest large language model, released June 12, 2026 under a Modified MIT license. It’s built specifically for coding and agentic tasks — think code generation, debugging, tool use, and multi-step programming workflows.
The key stats:
- 1 trillion total parameters (Mixture of Experts)
- 32 billion activated per token (efficient inference)
- 256K token context window
- 384 experts, 8 selected + 1 shared per forward pass
- MoonViT vision encoder (400M params) for multimodal input
- 61 layers with MLA attention and SwiGLU activation
It’s available on HuggingFace, the Moonshot API, and ModelScope. You can run it locally with vLLM, SGLang, or Docker Model Runner.
Architecture Deep Dive
K2.7 Code uses a Mixture of Experts (MoE) architecture, which is why it can have 1T total parameters while only activating 32B per token. This gives you frontier-level intelligence at a fraction of the compute cost of a dense model with equivalent capabilities.
Here’s how the expert routing works:
- 384 total experts across the model
- 8 experts selected per token based on learned routing
- 1 shared expert always active (handles common patterns)
- MLA (Multi-Latent Attention) for efficient KV-cache compression
- SwiGLU activation function (smoother gradients than ReLU)
- 61 transformer layers deep
The MLA attention mechanism is particularly clever — it compresses the key-value cache using learned latent projections, which means you can actually fit that 256K context window in memory without needing an absurd amount of VRAM.
For those wanting to run it locally, there’s a native INT4 quantization available that significantly reduces memory requirements while maintaining most of the model’s capability.
What Changed from K2.6
If you’ve been using K2.6, here’s what’s different:
30% Fewer Thinking Tokens
K2.7 Code uses 30% fewer thinking tokens than K2.6 to reach the same conclusions. That’s not a marginal improvement — it means faster responses, lower costs, and less wasted compute on reasoning overhead.
Coding-Focused Agentic Fine-Tuning
While K2.6 was a generalist with agent swarm capabilities, K2.7 Code is laser-focused on coding tasks. The fine-tuning pipeline specifically targeted:
- Code generation and completion
- Multi-file editing workflows
- Tool calling and MCP integration
- Debugging and refactoring patterns
Preserve Thinking Mode
This is a genuinely novel feature. In “Preserve Thinking” mode, K2.7 Code maintains its reasoning chain across multiple conversation turns. Most models reset their internal reasoning with each new message — K2.7 keeps the thread, which means it doesn’t lose context about why it made certain decisions in complex multi-step coding tasks.
+21.8% on Kimi Code Bench
The improvement over K2.6 on Moonshot’s own coding benchmark is dramatic: from 50.9 to 62.0, a 21.8% jump. That’s not incremental — that’s a generational leap in coding capability.
Benchmark Performance
Here’s how K2.7 Code stacks up against the competition:
| Benchmark | K2.6 | K2.7 Code | GPT-5.5 | Opus 4.8 |
|---|---|---|---|---|
| Kimi Code Bench v2 | 50.9 | 62.0 | 69.0 | 67.4 |
| Program Bench | 48.3 | 53.6 | 69.1 | 63.8 |
| MLS Bench Lite | 26.7 | 35.1 | 35.4 | 81.3 |
| MCP Mark Verified | 72.8 | 81.1 | 92.9 | 76.4 |
The headline number: K2.7 Code scores 81.1% on MCPMark Verified, beating Claude Opus 4.8’s 76.4%. That means it’s better at using tools via MCP than a model that costs $5/$25 per million tokens.
On Kimi Code Bench v2, the gap to GPT-5.5 narrowed from 18 points (K2.6 era) to just 7 points. Open-source is catching up fast.
Where it still lags: MLS Bench Lite (inventing novel ML methods) shows Opus 4.8 at 81.3% vs K2.7’s 35.1%. For pure research creativity, the closed models still dominate.
How to Access Kimi K2.7 Code
Moonshot API
The easiest path. Sign up at kimi.com, grab an API key, and you’re running. Pricing is similar to K2.6 — the Moderato plan runs about $19/month, or you can pay per token.
HuggingFace
The full model weights are at moonshotai/Kimi-K2.7-Code. Download and run locally with your preferred framework.
Kimi Code CLI
K2.7 Code works best with the Kimi Code CLI available at kimi.com/code. This gives you the full agentic coding experience with native MCP tool integration.
Self-Hosting Options
- vLLM: Full support with tensor parallelism
- SGLang: Optimized for high-throughput serving
- Docker Model Runner: Containerized deployment
For local setup guidance, check our how to run Kimi locally guide — the process is similar for K2.7.
Pricing
K2.7 Code follows the same pricing structure as K2.6:
- Moderato Plan: ~$19/month for generous usage
- Pay-per-token: Competitive rates similar to other open-source model APIs
- Self-hosted: Free (you pay for compute only)
Compare that to Claude Opus 4.8 at $5/$25 per million tokens or GPT-5.5 at ~$5/$15. For coding-focused work where K2.7 matches or beats these models, the economics are compelling.
Who Is Kimi K2.7 Code For?
Perfect for:
- Developers who want an open-source coding agent they can self-host
- Teams building MCP-integrated coding workflows
- Anyone who needs strong tool use without paying Opus/GPT prices
- Developers working within agentic coding pipelines
Maybe not ideal for:
- Pure ML research tasks (MLS Bench shows weakness here)
- General-purpose chat (K2.6 is still better for that)
- Teams that need 1M context (stuck at 256K vs Opus’s 1M)
- Production environments needing the absolute best code quality regardless of cost
The Preserve Thinking Innovation
Let me expand on this because it’s genuinely interesting. Traditional LLMs treat each turn independently — the model’s chain-of-thought doesn’t carry over. K2.7 Code’s Preserve Thinking mode forces the reasoning to persist.
In practice, this means:
- Turn 1: You describe a complex refactoring task
- Turn 2: You ask for the next file — the model remembers its reasoning about the architecture decisions it made
- Turn 3: You point out an edge case — it can reason about how that affects its prior decisions
This is huge for multi-step coding workflows where context about why matters as much as what.
Running K2.7 Code Locally
With native INT4 quantization, you can run K2.7 Code on hardware that would be impractical for the full-precision model. The 32B activated parameters mean inference is comparable to running a 32B dense model — manageable on high-end consumer GPUs or small clusters.
For deployment guides, see our local running guide and API setup walkthrough.
Frequently Asked Questions
Is Kimi K2.7 Code really open-source?
Yes, under a Modified MIT license. You can download the weights from HuggingFace, self-host, fine-tune, and use commercially. The “modified” part typically involves attribution requirements and some usage restrictions, but for most developer use cases it’s functionally open-source.
How does K2.7 Code compare to DeepSeek V4 Pro?
Both are open-source MoE models from Chinese AI labs. DeepSeek V4 Pro has a higher SWE-bench score (~85%) and is cheaper per token ($0.44/$0.87 per M), but K2.7 Code excels on tool use (MCPMark) and has a larger context window (256K vs 128K). See our full comparison.
Can I run K2.7 Code on my local machine?
With the INT4 quantized version, yes — if you have sufficient VRAM. The 32B activated parameters make inference similar to dense 32B models. You’ll want at least 24GB VRAM for comfortable inference, or multiple GPUs for the full-precision model. vLLM, SGLang, and Docker Model Runner are all supported.
Should I upgrade from K2.6 to K2.7 Code?
If your primary use case is coding, absolutely. The 21.8% improvement on coding benchmarks and 30% reduction in thinking tokens make it a clear upgrade. If you use K2.6 for general-purpose tasks, multimodal work, or agent swarms, keep K2.6 for those — K2.7 is a coding specialist.
What is Preserve Thinking mode?
It’s K2.7 Code’s approach to maintaining reasoning chains across multiple conversation turns. Instead of resetting the internal chain-of-thought with each message, the model preserves its reasoning context, leading to more coherent multi-step coding workflows.
How does K2.7 Code beat Opus 4.8 on tool use but score lower overall?
MCPMark specifically tests the model’s ability to correctly invoke tools via the Model Context Protocol. K2.7 Code’s agentic fine-tuning focused heavily on tool calling patterns. On other benchmarks that test raw code generation or novel problem-solving, Opus 4.8’s larger effective compute still wins. Different benchmarks test different capabilities.
Bottom Line
Kimi K2.7 Code is the best open-source coding agent available today for tool-integrated development workflows. It won’t beat Claude Fable 5 on raw SWE-bench scores, and GPT-5.5 still edges it on pure code generation. But for the price (free to self-host, ~$19/mo for API) and the openness (Modified MIT), it’s an incredible value proposition.
The fact that an open-source model now beats Opus 4.8 on MCP tool use is a watershed moment. The gap between open and closed is narrowing faster than anyone expected.