Jun 9, 2026

What Is Apple Core AI: On-Device LLMs Without API Costs (2026)

Core AI is Apple’s new framework for running large language models and other generative AI models directly on Apple silicon. Announced at WWDC 2026, it gives developers a first-party way to deploy LLMs on iPhone, iPad, Mac, and Vision Pro without sending data to a server, without paying per-token API fees, and without depending on third-party inference tools.

If you’ve been running models locally using Ollama or llama.cpp on your Mac, think of Core AI as Apple’s official answer: tighter hardware integration, better power efficiency, and a Swift-native API that fits into the rest of your app’s codebase.

Why Core AI Exists

The AI development landscape has split into two camps: cloud APIs and local inference. Cloud APIs (OpenAI, Anthropic, Google) give you powerful models but cost money per request and require sending user data off-device. Local inference gives you privacy and zero marginal cost, but has historically required cobbling together open-source tools without official platform support.

Core AI bridges this gap for Apple developers. You get:

Zero API costs — inference runs on the device’s hardware
Complete privacy — data never leaves the device
Native performance — optimized for Apple silicon’s unified memory architecture
First-party tooling — integrated into Xcode with debugging and profiling support

For context on why on-device inference on Apple silicon works so well: the unified memory architecture means the CPU, GPU, and Neural Engine share the same memory pool, so tensors never have to be copied between system RAM and separate VRAM. Core AI exploits this with zero-copy data paths that avoid the transfer overhead you’d see on systems with a discrete GPU.

How Core AI Works

The framework has four main components:

1. Swift API for Inference

The runtime API. You load a model, pass input, and get output — all in Swift.

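import CoreAI

// Load the converted model by name.
let model = try CoreAIModel(named: "MyCustomLLM")

// generate(prompt:parameters:) is async and throwing, so call it with try await.
let response = try await model.generate(
    prompt: "Explain quantum computing in one paragraph",
    parameters: .init(maxTokens: 256, temperature: 0.7)  // cap output length; moderate randomness
)
print(response.text)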

The API supports streaming (token-by-token output), batch inference, and multimodal inputs (text + images). It runs on the device’s Neural Engine and GPU, with automatic fallback to CPU for operations the accelerators can’t handle.
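The snippet above does not show the streaming call itself, so the following is only a sketch of how token-by-token output is typically consumed in Swift. It uses the standard library's AsyncStream, with a hypothetical `mockTokenStream` function standing in for whatever Core AI actually exposes. None of the names below come from the Core AI API.

```swift
// Stand-in for a streaming model call. `mockTokenStream` is hypothetical
// and is NOT part of Core AI; it simply emits a fixed list of tokens.
func mockTokenStream(_ tokens: [String]) -> AsyncStream<String> {
    AsyncStream { continuation in
        for token in tokens {
            continuation.yield(token)
        }
        continuation.finish()   // signals the end of generation
    }
}

// Consume tokens one at a time as they arrive.
func collect(_ stream: AsyncStream<String>) async -> String {
    var text = ""
    for await token in stream {
        text += token   // in an app, append to the visible text here
    }
    return text
}

let streamed = await collect(mockTokenStream(["On", "-device", " inference"]))
print(streamed)   // prints "On-device inference"
```

The `for await` loop is the part that carries over: whatever type the real API returns, the UI can be updated inside the loop body rather than after the full response is ready.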

2. coreai-torch: PyTorch Model Conversion

You don’t train models in Core AI format. You train in PyTorch (or any framework that exports to PyTorch), then convert using the coreai-torch command-line tool.

coreai-torch convert \
    --model ./my-model.pt \
    --architecture transformer \
    --output ./MyCustomLLM.coreaimodel

The converter handles:

Weight format transformation
Architecture validation
Operator mapping (PyTorch ops → Metal 4 kernels)
Metadata embedding for runtime optimization

Supported architectures include standard transformers, Mixture-of-Experts, and diffusion model variants. If your model uses custom operators not in the supported set, you’ll need to decompose them into supported primitives.

3. coreai-optimization: Quantization and Compression

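Running a full-precision LLM on a phone isn’t practical: a 3B-parameter model at FP16 needs about 6 GB for weights alone, which is most of an 8 GB iPhone’s memory. coreai-optimization handles model compression: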

coreai-optimization quantize \
    --model ./MyCustomLLM.coreaimodel \
    --precision int4 \
    --calibration-data ./calibration-set.json \
    --output ./MyCustomLLM-q4.coreaimodel

Supported quantization formats:

INT8: ~2x compression relative to FP16, minimal quality loss
INT4: ~4x compression relative to FP16, acceptable quality for most tasks
Mixed precision: keeps attention layers at higher precision while quantizing feed-forward layers

If you’re familiar with the GGUF format or the GPTQ and AWQ quantization methods from the open-source ecosystem, Core AI’s quantization is conceptually similar but outputs Apple’s proprietary .coreaimodel format optimized for Metal 4.
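To make the idea of quantization concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Swift. This is a generic textbook scheme chosen for illustration; it is an assumption, not a description of what coreai-optimization does internally, and the function names are hypothetical.

```swift
// Symmetric per-tensor INT8 quantization (illustrative only).
// Each Float weight is stored as an Int8 plus one shared Float scale.
func quantizeInt8(_ weights: [Float]) -> (values: [Int8], scale: Float) {
    let maxAbs = weights.map { abs($0) }.max() ?? 0
    // Map the largest magnitude to 127; guard against an all-zero tensor.
    let scale: Float = maxAbs > 0 ? maxAbs / 127 : 1
    let values = weights.map { w -> Int8 in
        let q = (w / scale).rounded()
        return Int8(max(-127, min(127, q)))   // clamp to the INT8 range
    }
    return (values, scale)
}

// Recover approximate Float weights from the quantized values.
func dequantize(_ values: [Int8], scale: Float) -> [Float] {
    values.map { Float($0) * scale }
}

let weights: [Float] = [0.5, -1.0, 0.25, 0.0]
let (q, scale) = quantizeInt8(weights)
let restored = dequantize(q, scale: scale)
print(q)
print(restored)
```

Each restored weight differs from the original by at most about half of `scale`, which is the quality loss the compression figures above trade against. Real toolchains usually quantize per channel or per group and use calibration data to pick the scales.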

4. Core AI Debugger

Integrated into Xcode 27, the debugger lets you:

Profile inference latency per layer
Monitor memory consumption during generation
Visualize attention patterns
Compare output quality across quantization levels
Set breakpoints on specific token generation steps

This is something the open-source local AI ecosystem doesn’t have. When running models through Ollama, debugging performance issues means reading logs and guessing. Core AI Debugger gives you Instruments-level visibility into what’s happening inside the model.

Hardware Requirements

Core AI runs on any device with Apple silicon, but performance varies dramatically:

Device	RAM	Practical Model Size	Tokens/sec (approx)
iPhone 16 Pro	8 GB	Up to 3B (INT4)	15-25
iPad Pro M4	16 GB	Up to 7B (INT4)	30-50
MacBook Air M4	24 GB	Up to 13B (INT4)	40-60
Mac Studio M4 Ultra	192 GB	Up to 70B+ (INT4)	60-100
Vision Pro	16 GB	Up to 7B (INT4)	25-40

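The key constraint is unified memory. A 7B-parameter model at INT4 quantization needs about 3.5 GB for the raw weights (7 billion parameters × 0.5 bytes), or roughly 4 GB once quantization scales and metadata are included, plus working memory for the KV cache and activations. On an 8 GB iPhone, that leaves minimal headroom for the rest of the system, which is why the table above caps the iPhone at 3B.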
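The memory arithmetic can be sketched as a small calculator. The model shape below (32 layers, 8 key/value heads, head dimension 128, 4096-token context) is an illustrative assumption for a LLaMA-style 7B model, not a figure from Apple, and the estimate ignores quantization scales and activations.

```swift
import Foundation

// Rough memory estimate for a decoder-only transformer.
// All model-shape numbers used below are illustrative assumptions.
struct ModelShape {
    var parameters: Double   // total parameter count
    var layers: Int
    var kvHeads: Int         // number of key/value heads
    var headDim: Int
}

/// Bytes needed for the weights at a given bit width.
func weightBytes(_ shape: ModelShape, bitsPerWeight: Double) -> Double {
    shape.parameters * bitsPerWeight / 8
}

/// Bytes for the KV cache: two tensors (K and V) per layer, per token.
func kvCacheBytes(_ shape: ModelShape, contextTokens: Int, bytesPerValue: Double = 2) -> Double {
    2 * Double(shape.layers * shape.kvHeads * shape.headDim * contextTokens) * bytesPerValue
}

let shape = ModelShape(parameters: 7e9, layers: 32, kvHeads: 8, headDim: 128)
let weightsGB = weightBytes(shape, bitsPerWeight: 4) / 1e9
let cacheGB = kvCacheBytes(shape, contextTokens: 4096) / 1e9

print(String(format: "weights:  %.2f GB", weightsGB))            // weights:  3.50 GB
print(String(format: "KV cache: %.2f GB", cacheGB))              // KV cache: 0.54 GB
print(String(format: "total:    %.2f GB", weightsGB + cacheGB))  // total:    4.04 GB
```

The KV cache grows linearly with context length, so doubling the context to 8192 tokens would double the cache figure while the weights stay fixed.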

For Mac users running AI locally, Core AI provides better performance than general-purpose solutions because it uses Metal 4 kernels specifically optimized for transformer operations. The ahead-of-time compilation means the first inference is fast too — no JIT warmup.

What Models Can You Run?

Core AI supports any model you can convert from PyTorch, but Apple clearly designed it for specific architectures:

Officially supported:

Decoder-only transformers (GPT-style, LLaMA-style)
Mixture-of-Experts transformers
Diffusion models (image generation)
Encoder-decoder transformers (translation, summarization)

What this means in practice:

Fine-tuned LLaMA variants: yes
Whisper (speech-to-text): yes
Stable Diffusion variants: yes
Custom small models for classification: use Core ML instead (more efficient for this use case)

Apple’s own AFM Core (~3B) and AFM Core Advanced (20B sparse, 1-4B active MoE) run on Core AI internally. Third-party developers can create models of similar or smaller scale for on-device deployment.

Core AI vs. Running Models with Ollama

If you’re already running models locally on your Mac, here’s how Core AI compares to the Ollama workflow:

Aspect	Core AI	Ollama on Mac
Platform	iOS, iPadOS, macOS, visionOS	macOS (and Linux/Windows)
Language	Swift	Any (HTTP API)
Model format	.coreaimodel	GGUF
GPU backend	Metal 4 (optimized)	Metal (generic)
Distribution	App Store	Self-managed
Debugging	Xcode integrated	Logs
Mobile deployment	Yes	No
Model ecosystem	Convert your own	Thousands available

The biggest difference: Core AI deploys to phones. If you’re building a Mac-only tool and want access to the broadest model library, Ollama remains excellent. If you’re shipping an iOS/iPadOS app that runs your own model on-device, Core AI is the first-party way to do it.

Who Should Use Core AI

Use Core AI if:

You’re building a native Apple app (Swift/SwiftUI)
You need on-device inference on iPhones or iPads
Privacy requirements prevent sending data to external APIs
You want to avoid per-token API costs
You need Xcode-integrated debugging and profiling
You’re distributing through the App Store

Consider alternatives if:

You need cross-platform support (Android + iOS)
You want access to frontier-scale models (GPT-5.5, Claude Opus) that won’t fit on-device
You don’t have a PyTorch model to convert
Your app already uses cloud AI APIs and the cost is acceptable

The Foundation Models Framework Connection

Core AI handles the low-level inference. The Foundation Models framework sits above it, providing the high-level API for building AI features:

Your App
    ↓
Foundation Models API (high-level: skills, profiles, routing)
    ↓
Core AI (low-level: model loading, inference, memory management)
    ↓
Apple Silicon (Neural Engine, GPU, Metal 4)

You can use Core AI directly for full control, or use Foundation Models for convenience. The Foundation Models framework automatically chooses between on-device (Core AI) and server (Private Cloud Compute) based on model requirements and device capability.

Getting Started

1. Install Xcode 27 beta (requires Apple Developer Program membership and Apple silicon Mac)
2. Prepare your model: train or fine-tune in PyTorch
3. Convert: use coreai-torch to produce a .coreaimodel file
4. Optimize: run coreai-optimization for your target devices
5. Integrate: import CoreAI in your Swift project and load the model
6. Test: use Device Hub to push to a physical device and profile

Apple provides sample models and conversion scripts in the developer documentation. The WWDC 2026 session videos cover end-to-end workflows.

Practical Considerations

Model size limits for App Store distribution: Apple hasn’t published hard limits, but On-Demand Resources allow downloading models post-install. Expect apps to bundle a small model (under 200 MB) and download larger models on first launch.

Battery impact: The Neural Engine is significantly more power-efficient than the GPU for supported operations. Core AI routes to the Neural Engine where possible, falling back to GPU for unsupported ops. Expect 20-30% better battery life compared to running the same model through generic Metal compute.

Latency: Ahead-of-time compilation means there is no JIT-compilation penalty on the first inference, although the model still has to be loaded into memory. First-token latency for a 3B model on iPhone 16 Pro is under 200 ms. For comparison, a cloud API typically adds 500-2000 ms of network latency before you see the first token.
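First-token latency is only part of the picture. As a back-of-the-envelope sketch, total response time can be modeled as time to first token plus decode time, using the iPhone 16 Pro figures quoted in this article (about 200 ms to first token, 15-25 tokens/sec). The function name and the simple linear model are assumptions; the model ignores prompt length and thermal throttling.

```swift
import Foundation

// Estimated wall-clock time for a full response:
// time to first token plus time to decode the remaining output.
func responseSeconds(firstTokenMs: Double, tokens: Int, tokensPerSecond: Double) -> Double {
    firstTokenMs / 1000 + Double(tokens) / tokensPerSecond
}

// A 256-token answer at the slow and fast ends of the quoted range.
let slow = responseSeconds(firstTokenMs: 200, tokens: 256, tokensPerSecond: 15)
let fast = responseSeconds(firstTokenMs: 200, tokens: 256, tokensPerSecond: 25)

print(String(format: "256 tokens: %.2f s to %.2f s", fast, slow))
// prints "256 tokens: 10.44 s to 17.27 s"
```

For long outputs the decode rate dominates, so the on-device advantage in first-token latency matters most for short, interactive responses and for streamed output.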

Memory pressure: On devices with limited RAM (8 GB iPhones), Core AI aggressively manages memory. If the system is under pressure, inference may be throttled or suspended. Design your app to handle this gracefully.

FAQ

Can I use Core AI without Xcode 27?

No. The Core AI SDK, conversion tools, and debugger all require Xcode 27. The runtime ships with iOS 27, iPadOS 27, macOS 27, and visionOS 27.

Does Core AI support training or fine-tuning on device?

No. Core AI is inference-only. You train models externally (PyTorch on a GPU workstation or cloud) and deploy the converted model to devices.

What happens when a model is too large for the device?

If you’re using the Foundation Models framework, it automatically routes to Private Cloud Compute. If you’re using Core AI directly, model loading will fail with an out-of-memory error — you need to provide a smaller or more aggressively quantized variant.
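One way to avoid hitting that failure is to ship several quantized variants and pick one at launch. The sketch below is an assumption about how an app might do this, not part of Core AI: the `ModelVariant` type, the variant names and sizes, and the 50% memory budget are all hypothetical. The only platform call used is Foundation's `ProcessInfo.processInfo.physicalMemory`.

```swift
import Foundation

// A model variant the app can bundle or download. Names and sizes are hypothetical.
struct ModelVariant {
    let name: String
    let requiredBytes: UInt64   // weights plus working memory, estimated by you
}

/// Picks the largest variant that fits within a fraction of physical memory.
/// Returns nil when nothing fits, so the caller can fall back (for example, to a server).
func pickVariant(from variants: [ModelVariant],
                 physicalMemory: UInt64,
                 usableFraction: Double = 0.5) -> ModelVariant? {
    let budget = UInt64(Double(physicalMemory) * usableFraction)
    return variants
        .filter { $0.requiredBytes <= budget }
        .max { $0.requiredBytes < $1.requiredBytes }
}

let variants = [
    ModelVariant(name: "MyCustomLLM-q8", requiredBytes: 4_500_000_000),
    ModelVariant(name: "MyCustomLLM-q4", requiredBytes: 2_500_000_000),
]

// Query the installed RAM of the current device and choose accordingly.
let chosen = pickVariant(from: variants,
                         physicalMemory: ProcessInfo.processInfo.physicalMemory)
print(chosen?.name ?? "no on-device variant fits")
```

With these numbers, an 8 GB device gets the q4 variant and a 16 GB device gets q8. The per-app memory limit on iOS is lower than physical RAM, so the usable fraction should be tuned by testing on real devices.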

Can I run open-source models like LLaMA or Mistral through Core AI?

Yes, provided you convert them using coreai-torch. Any PyTorch model with a supported architecture can be converted. You’re responsible for licensing compliance with whatever model you deploy.

Is Core AI available for Objective-C projects?

The API is Swift-only. You can bridge to Objective-C code in your project, but the Core AI calls themselves must be in Swift.

How does this compare to Core ML for running transformers?

Core ML can technically run transformer models, but it’s optimized for smaller, traditional ML workloads. Core AI has specific optimizations for generative AI: KV cache management, autoregressive decoding, and Metal 4 kernels purpose-built for attention operations. For LLMs and diffusion models, Core AI will be significantly faster. For a detailed comparison, see Core AI vs Core ML.