What is Apple Core AI: On-Device LLMs Without API Costs (2026)


Core AI is Apple’s new framework for running large language models and generative AI models directly on Apple silicon. Announced at WWDC 2026, it gives developers a first-party way to deploy LLMs on iPhone, iPad, Mac, and Vision Pro — without sending data to a server, without paying per-token API fees, and without depending on third-party inference tools.

If you’ve been running models locally using Ollama or llama.cpp on your Mac, think of Core AI as Apple’s official answer: tighter hardware integration, better power efficiency, and a Swift-native API that fits into the rest of your app’s codebase.

Why Core AI Exists

The AI development landscape has split into two camps: cloud APIs and local inference. Cloud APIs (OpenAI, Anthropic, Google) give you powerful models but cost money per request and require sending user data off-device. Local inference gives you privacy and zero marginal cost, but has historically required cobbling together open-source tools without official platform support.

Core AI bridges this gap for Apple developers. You get:

  • Zero API costs — inference runs on the device’s hardware
  • Complete privacy — data never leaves the device
  • Native performance — optimized for Apple silicon’s unified memory architecture
  • First-party tooling — integrated into Xcode with debugging and profiling support

For context on why on-device inference on Apple silicon works so well: the unified memory architecture means the GPU, CPU, and Neural Engine share the same memory pool. No copying tensors between system RAM and VRAM. Core AI exploits this with zero-copy data paths that eliminate the overhead you’d see on traditional GPU architectures.

How Core AI Works

The framework has four main components:

1. Swift API for Inference

The runtime API. You load a model, pass input, and get output — all in Swift.

import CoreAI

let model = try CoreAIModel(named: "MyCustomLLM")
let response = try await model.generate(
    prompt: "Explain quantum computing in one paragraph",
    parameters: .init(maxTokens: 256, temperature: 0.7)
)
print(response.text)

The API supports streaming (token-by-token output), batch inference, and multimodal inputs (text + images). It runs on the device’s Neural Engine and GPU, with automatic fallback to CPU for operations the accelerators can’t handle.
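The snippet above shows only the one-shot generate call, and the article does not name the streaming entry point. As a rough illustration of how token-by-token output is usually consumed in Swift, the sketch below uses a stand-in AsyncStream. Both fakeTokenStream and collect are hypothetical helpers written for this example; they are not part of the Core AI API.

```swift
// Stand-in for a streaming generation call. NOT the Core AI API: it replays
// a fixed token list so the consumer side of the pattern is visible.
func fakeTokenStream(for prompt: String) -> AsyncStream<String> {
    let tokens = ["On-device ", "inference ", "keeps ", "data ", "local."]
    return AsyncStream { continuation in
        for token in tokens {
            continuation.yield(token)
        }
        continuation.finish()
    }
}

// Consume tokens as they arrive; a real app would update the UI per token
// instead of only appending to a string.
func collect(_ stream: AsyncStream<String>) async -> (text: String, tokenCount: Int) {
    var text = ""
    var tokenCount = 0
    for await token in stream {
        text += token
        tokenCount += 1
    }
    return (text, tokenCount)
}

let streamed = await collect(fakeTokenStream(for: "Why run models on-device?"))
print(streamed.text)  // On-device inference keeps data local.
```

Whatever shape the real streaming call takes, the consumer side is likely to look similar: iterate with for await and render each token as it lands rather than waiting for the full response.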

2. coreai-torch: PyTorch Model Conversion

You don’t train models in Core AI format. You train in PyTorch (or in any framework whose models can be exported to PyTorch), then convert using the coreai-torch command-line tool.

coreai-torch convert \
    --model ./my-model.pt \
    --architecture transformer \
    --output ./MyCustomLLM.coreaimodel

The converter handles:

  • Weight format transformation
  • Architecture validation
  • Operator mapping (PyTorch ops → Metal 4 kernels)
  • Metadata embedding for runtime optimization

Supported architectures include standard transformers, Mixture-of-Experts, and diffusion model variants. If your model uses custom operators not in the supported set, you’ll need to decompose them into supported primitives.

3. coreai-optimization: Quantization and Compression

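Running a full-precision model on a phone isn’t practical. At FP16, a 7B-parameter model needs about 14 GB for weights alone, well beyond the 8 GB of an iPhone 16 Pro. coreai-optimization handles model compression: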

coreai-optimization quantize \
    --model ./MyCustomLLM.coreaimodel \
    --precision int4 \
    --calibration-data ./calibration-set.json \
    --output ./MyCustomLLM-q4.coreaimodel

Supported quantization formats:

  • INT8: roughly 2x smaller than FP16, minimal quality loss
  • INT4: roughly 4x smaller than FP16, acceptable quality for most tasks
  • Mixed precision: keeps attention layers at higher precision while quantizing feed-forward layers

If you’re familiar with GGUF, GPTQ, and AWQ formats from the open-source ecosystem, Core AI’s quantization is conceptually similar but outputs Apple’s proprietary .coreaimodel format optimized for Metal 4.
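The article does not describe the algorithm coreai-optimization uses internally. To make the idea of quantization concrete, the sketch below shows plain symmetric per-tensor INT8 quantization, a generic textbook scheme and not Apple's implementation. The function names are made up for this example.

```swift
/// Symmetric per-tensor INT8 quantization: a single scale maps the largest
/// weight magnitude to 127, and every weight is rounded to the nearest step.
func quantizeInt8(_ weights: [Float]) -> (values: [Int8], scale: Float) {
    let maxAbs = weights.map { abs($0) }.max() ?? 0
    // Avoid dividing by zero for an empty or all-zero tensor.
    let scale: Float = maxAbs > 0 ? maxAbs / 127 : 1
    let values = weights.map { w -> Int8 in
        let q = (w / scale).rounded()
        return Int8(max(-127, min(127, q)))
    }
    return (values, scale)
}

/// Recover approximate float weights from the stored integers.
func dequantize(_ values: [Int8], scale: Float) -> [Float] {
    values.map { Float($0) * scale }
}

let original: [Float] = [0.5, -1.27, 0.03, 1.0]
let (q, scale) = quantizeInt8(original)
let restored = dequantize(q, scale: scale)
print(q)  // [50, -127, 3, 100]
```

Each weight now occupies one byte instead of the two bytes FP16 needs, which is where a 2x figure for INT8 comes from. The price is a rounding error of at most half a scale step per weight. Production quantizers typically use per-channel or per-group scales and calibration data to keep that error small.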

4. Core AI Debugger

Integrated into Xcode 27, the debugger lets you:

  • Profile inference latency per layer
  • Monitor memory consumption during generation
  • Visualize attention patterns
  • Compare output quality across quantization levels
  • Set breakpoints on specific token generation steps

This is something the open-source local AI ecosystem doesn’t have. When running models through Ollama, debugging performance issues means reading logs and guessing. Core AI Debugger gives you Instruments-level visibility into what’s happening inside the model.

Hardware Requirements

Core AI runs on any device with Apple silicon, but performance varies dramatically:

Device | RAM | Practical Model Size | Tokens/sec (approx)
--- | --- | --- | ---
iPhone 16 Pro | 8 GB | Up to 3B (INT4) | 15-25
iPad Pro M4 | 16 GB | Up to 7B (INT4) | 30-50
MacBook Air M4 | 24 GB | Up to 13B (INT4) | 40-60
Mac Studio M4 Ultra | 192 GB | Up to 70B+ (INT4) | 60-100
Vision Pro | 16 GB | Up to 7B (INT4) | 25-40

The key constraint is unified memory. A 7B parameter model at INT4 quantization needs roughly 4 GB of memory for weights alone, plus working memory for KV cache and activations. On an 8 GB iPhone, that leaves minimal headroom for the rest of the system.
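The arithmetic behind that estimate is easy to check. The sketch below is a back-of-the-envelope calculation, not a Core AI API. The layer count, KV head count, head dimension, and context length are assumed values for a generic 7B-class model with an FP16 KV cache and no grouped-query attention; real models will differ.

```swift
import Foundation

/// Bytes needed to store the weights alone.
func weightBytes(parameters: Double, bitsPerWeight: Double) -> Double {
    parameters * bitsPerWeight / 8.0
}

/// Bytes needed for the KV cache: one key and one value vector (the factor
/// of 2) for every layer, KV head, and cached token.
func kvCacheBytes(layers: Int, kvHeads: Int, headDim: Int,
                  contextTokens: Int, bytesPerValue: Int) -> Double {
    2.0 * Double(layers) * Double(kvHeads) * Double(headDim)
        * Double(contextTokens) * Double(bytesPerValue)
}

let gb = 1_000_000_000.0
let weights = weightBytes(parameters: 7e9, bitsPerWeight: 4)
// Assumed shape: 32 layers, 32 KV heads of dimension 128, 4096-token context, FP16 cache.
let kvCache = kvCacheBytes(layers: 32, kvHeads: 32, headDim: 128,
                           contextTokens: 4096, bytesPerValue: 2)

print(String(format: "weights:  %.2f GB", weights / gb))             // weights:  3.50 GB
print(String(format: "KV cache: %.2f GB", kvCache / gb))             // KV cache: 2.15 GB
print(String(format: "total:    %.2f GB", (weights + kvCache) / gb)) // total:    5.65 GB
```

The weights come to 3.5 GB before any per-layer metadata, which rounds up to the roughly 4 GB quoted above. Under these assumptions a full 4096-token cache adds about 2 GB more, which is why shorter contexts or models with fewer KV heads matter so much on 8 GB devices.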

For Mac users running AI locally, Core AI provides better performance than general-purpose solutions because it uses Metal 4 kernels specifically optimized for transformer operations. The ahead-of-time compilation means the first inference is fast too — no JIT warmup.

What Models Can You Run?

Core AI supports any model you can convert from PyTorch, but Apple clearly designed it for specific architectures:

Officially supported:

  • Decoder-only transformers (GPT-style, LLaMA-style)
  • Mixture-of-Experts transformers
  • Diffusion models (image generation)
  • Encoder-decoder transformers (translation, summarization)

What this means in practice:

  • Fine-tuned LLaMA variants: yes
  • Whisper (speech-to-text): yes
  • Stable Diffusion variants: yes
  • Custom small models for classification: use Core ML instead (more efficient for this use case)

Apple’s own AFM Core (~3B) and AFM Core Advanced (20B sparse, 1-4B active MoE) run on Core AI internally. Third-party developers can create models of similar or smaller scale for on-device deployment.

Core AI vs. Running Models with Ollama

If you’re already running models locally on your Mac, here’s how Core AI compares to the Ollama workflow:

Aspect | Core AI | Ollama on Mac
--- | --- | ---
Platform | iOS, iPadOS, macOS, visionOS | macOS (and Linux/Windows)
Language | Swift | Any (HTTP API)
Model format | .coreaimodel | GGUF
GPU backend | Metal 4 (optimized) | Metal (generic)
Distribution | App Store | Self-managed
Debugging | Xcode integrated | Logs
Mobile deployment | Yes | No
Model ecosystem | Convert your own | Thousands available

The biggest difference: Core AI deploys to phones. If you’re building a Mac-only tool and want access to the broadest model library, Ollama remains excellent. If you’re shipping an iOS/iPadOS app with on-device AI, Core AI is the only first-party option.

Who Should Use Core AI

Use Core AI if:

  • You’re building a native Apple app (Swift/SwiftUI)
  • You need on-device inference on iPhones or iPads
  • Privacy requirements prevent sending data to external APIs
  • You want to avoid per-token API costs
  • You need Xcode-integrated debugging and profiling
  • You’re distributing through the App Store

Consider alternatives if:

  • You need cross-platform support (Android + iOS)
  • You want access to frontier-scale models (GPT-5.5, Claude Opus) that won’t fit on-device
  • You don’t have a PyTorch model to convert
  • Your app already uses cloud AI APIs and the cost is acceptable

The Foundation Models Framework Connection

Core AI handles the low-level inference. The Foundation Models framework sits above it, providing the high-level API for building AI features:

Your App
    ↓
Foundation Models API (high-level: skills, profiles, routing)
    ↓
Core AI (low-level: model loading, inference, memory management)
    ↓
Apple Silicon (Neural Engine, GPU, Metal 4)

You can use Core AI directly for full control, or use Foundation Models for convenience. The Foundation Models framework automatically chooses between on-device (Core AI) and server (Private Cloud Compute) based on model requirements and device capability.

Getting Started

  1. Install Xcode 27 beta (requires Apple Developer Program membership and Apple silicon Mac)
  2. Prepare your model — train or fine-tune in PyTorch
  3. Convert — use coreai-torch to produce a .coreaimodel file
  4. Optimize — run coreai-optimization for your target devices
  5. Integrate — import CoreAI in your Swift project and load the model
  6. Test — use Device Hub to push to a physical device and profile

Apple provides sample models and conversion scripts in the developer documentation. The WWDC 2026 session videos cover end-to-end workflows.

Practical Considerations

Model size limits for App Store distribution: Apple hasn’t published hard limits, but On Demand Resources allow downloading models post-install. Expect apps to ship a small model bundled (under 200 MB) with larger models downloaded on first launch.

Battery impact: The Neural Engine is significantly more power-efficient than the GPU for supported operations. Core AI routes to the Neural Engine where possible, falling back to GPU for unsupported ops. Expect 20-30% better battery life compared to running the same model through generic Metal compute.

Latency: Ahead-of-time compilation means no cold-start penalty. First token latency for a 3B model on iPhone 16 Pro is under 200ms. For comparison, hitting a cloud API typically adds 500-2000ms of network latency before you see the first token.
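First-token latency is only part of what the user waits for. As a rough model (an assumption of this sketch, not a measured result), total response time is the first-token latency plus the remaining tokens divided by the decode rate. The figures plugged in below are the article's own: 200 ms to first token, 15-25 tokens/sec on iPhone 16 Pro, and the 256-token limit from the earlier example.

```swift
/// Rough end-to-end time: wait for the first token, then decode the
/// remaining tokens at a steady rate.
func responseSeconds(firstTokenLatency: Double, tokens: Int, tokensPerSecond: Double) -> Double {
    firstTokenLatency + Double(max(tokens - 1, 0)) / tokensPerSecond
}

let slow = responseSeconds(firstTokenLatency: 0.2, tokens: 256, tokensPerSecond: 15)
let fast = responseSeconds(firstTokenLatency: 0.2, tokens: 256, tokensPerSecond: 25)
print(slow, fast)  // roughly 17.2 and 10.4 seconds
```

For long outputs, decode throughput dominates, so the on-device advantage in first-token latency matters most for short replies and for streamed UIs where the user starts reading immediately.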

Memory pressure: On devices with limited RAM (8 GB iPhones), Core AI aggressively manages memory. If the system is under pressure, inference may be throttled or suspended. Design your app to handle this gracefully.

FAQ

Can I use Core AI without Xcode 27?

No. The Core AI SDK, conversion tools, and debugger all require Xcode 27. The runtime ships with iOS 27, iPadOS 27, macOS 27, and visionOS 27.

Does Core AI support training or fine-tuning on device?

No. Core AI is inference-only. You train models externally (PyTorch on a GPU workstation or cloud) and deploy the converted model to devices.

What happens when a model is too large for the device?

If you’re using the Foundation Models framework, it automatically routes to Private Cloud Compute. If you’re using Core AI directly, model loading will fail with an out-of-memory error — you need to provide a smaller or more aggressively quantized variant.
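One way to handle that failure when using Core AI directly is to ship several variants and try them from largest to smallest. The sketch below shows the control flow only. ModelVariant, LoadError, and loadFirstThatFits are hypothetical names, and the load closure stands in for the real loading call, whose actual error type the article does not specify.

```swift
/// Hypothetical description of one packaged variant of a model.
struct ModelVariant {
    let name: String
    let requiredBytes: Int
}

/// Stand-in errors; the real framework's error type is not documented here.
enum LoadError: Error {
    case outOfMemory
    case noVariantFits
}

/// Try variants from largest to smallest and return the first that loads.
func loadFirstThatFits<Model>(
    _ variants: [ModelVariant],
    load: (ModelVariant) throws -> Model
) throws -> (variant: ModelVariant, model: Model) {
    for variant in variants.sorted(by: { $0.requiredBytes > $1.requiredBytes }) {
        do {
            let model = try load(variant)
            return (variant, model)
        } catch LoadError.outOfMemory {
            continue  // too large for this device, try the next smaller variant
        }
    }
    throw LoadError.noVariantFits
}

// Simulated loader: anything above a 4 GB budget "fails" with out-of-memory.
let budget = 4_000_000_000
let variants = [
    ModelVariant(name: "7B-int4", requiredBytes: 5_600_000_000),
    ModelVariant(name: "3B-int4", requiredBytes: 2_500_000_000),
]
let chosen = try loadFirstThatFits(variants) { variant -> String in
    guard variant.requiredBytes <= budget else { throw LoadError.outOfMemory }
    return "loaded \(variant.name)"
}
print(chosen.variant.name)  // 3B-int4
```

Errors other than the out-of-memory case propagate to the caller unchanged, so a corrupt or missing model file is not silently masked by the fallback.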

Can I run open-source models like LLaMA or Mistral through Core AI?

Yes, provided you convert them using coreai-torch. Any PyTorch model with a supported architecture can be converted. You’re responsible for licensing compliance with whatever model you deploy.

Is Core AI available for Objective-C projects?

The API is Swift-only. You can bridge to Objective-C code in your project, but the Core AI calls themselves must be in Swift.

How does this compare to Core ML for running transformers?

Core ML can technically run transformer models, but it’s optimized for smaller, traditional ML workloads. Core AI has specific optimizations for generative AI: KV cache management, autoregressive decoding, and Metal 4 kernels purpose-built for attention operations. For LLMs and diffusion models, Core AI will be significantly faster. For a detailed comparison, see Core AI vs Core ML.