Jun 9, 2026 · 9 min read

Core AI vs Core ML: Which Apple Framework Should You Use in 2026?

Apple now has two machine learning frameworks, and the naming doesn’t make the distinction obvious. Core ML has been around since 2017. Core AI launched at WWDC 2026. They serve fundamentally different purposes, target different model architectures, and optimize for different workloads.

Here’s the practical breakdown: when to use which, what each excels at, and whether you need to migrate anything.

The One-Sentence Difference

Core ML = traditional ML models (classification, object detection, regression, small transformers). Optimized for fast, lightweight inference on structured tasks.

Core AI = generative AI models (LLMs, diffusion models, large transformers). Optimized for autoregressive generation, attention-heavy architectures, and models with billions of parameters.

If your model produces a label, a bounding box, or a number — use Core ML. If your model generates text, images, or code — use Core AI.
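That rule of thumb can be written down as a small Swift helper. This is only an illustration: `ModelOutput`, `Framework`, and `recommendedFramework` are made-up names for this post, not part of either framework.

```swift
/// What the model emits. The cases mirror the rule of thumb above.
enum ModelOutput {
    case label, boundingBox, number   // structured predictions
    case text, image, code            // generated content
}

enum Framework: String {
    case coreML = "Core ML"
    case coreAI = "Core AI"
}

/// Maps an output kind to the framework this article recommends for it.
func recommendedFramework(for output: ModelOutput) -> Framework {
    switch output {
    case .label, .boundingBox, .number:
        return .coreML
    case .text, .image, .code:
        return .coreAI
    }
}

print(recommendedFramework(for: .boundingBox).rawValue) // Core ML
print(recommendedFramework(for: .text).rawValue)        // Core AI
```

The switch is keyed on the output, not on model size, because the workload pattern is what separates the two frameworks.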

Comparison Table

Feature	Core ML	Core AI
Primary use case	Classification, detection, regression	Text generation, image generation, reasoning
Typical model size	10 MB – 500 MB	1 GB – 50 GB+
Parameter count	Thousands to hundreds of millions	Billions
Architecture focus	CNNs, RNNs, small transformers, tree models	Decoder transformers, MoE, diffusion
Output type	Labels, scores, coordinates	Token sequences, images, embeddings
Model format	.mlmodel / .mlpackage	.coreaimodel
Conversion source	PyTorch, TensorFlow, ONNX, scikit-learn	PyTorch (via coreai-torch)
Quantization	Via coremltools (FP16, INT8, palettization)	coreai-optimization (INT4, INT8, mixed)
GPU backend	Metal (generic compute)	Metal 4 (transformer-optimized kernels)
Neural Engine	Yes	Yes (priority routing)
KV cache management	No	Yes (built-in)
Streaming output	No	Yes (token-by-token)
Memory management	Simple (load/unload)	Advanced (paged attention, memory pressure handling)
Minimum iOS	iOS 11+	iOS 20+
Xcode integration	Model preview, performance reports	Core AI Debugger, layer profiling
App Store distribution	Embedded in app bundle	On Demand Resources recommended

When to Use Core ML

Core ML is not deprecated. It remains the right choice for:

Image classification and object detection

If you’re detecting objects in photos, classifying images, or running pose estimation — Core ML is purpose-built for this. Models are small, inference is fast (often under 10ms), and the ecosystem has years of optimized models available.

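import CoreML
import Vision

// MyImageClassifier is the Xcode-generated class for your bundled model.
let classifier = try MyImageClassifier(configuration: MLModelConfiguration())
let model = try VNCoreMLModel(for: classifier.model)
let request = VNCoreMLRequest(model: model) { request, error in
    guard let results = request.results as? [VNClassificationObservation],
          let top = results.first else { return }
    print(top.identifier, top.confidence) // identifier is "cat", "dog", etc.
}
// To run it on a CGImage: try VNImageRequestHandler(cgImage: image).perform([request])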

Tabular data and structured prediction

Gradient-boosted trees from XGBoost, plus random forests and linear models converted from scikit-learn. These models are tiny and fast, and they fit Core ML’s execution model well.

Audio classification

Sound detection, music genre classification, keyword spotting — small models that process audio frames and return labels.

Real-time video processing

Camera effects, background removal, style transfer: models that need to run at 30-60 fps on every video frame. Core ML’s low overhead makes this feasible, whereas Core AI’s startup cost and memory footprint would be too heavy for per-frame work.

Recommendations and embeddings (small models)

If you’re computing item similarity or user preferences with a model under 100M parameters, Core ML handles this cleanly.

When to Use Core AI

Core AI exists because Core ML wasn’t designed for generative workloads:

Text generation (LLMs)

Any model that produces text token by token: chatbots, writing assistants, code generation, summarization. Core AI handles the autoregressive decode loop, the KV cache, and sampling, all of which you would have to implement by hand (and likely less efficiently) on Core ML.
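To see why the KV cache deserves framework-level management, it helps to size it. The sketch below uses the standard formula for a decoder-only transformer. The model shape (32 layers, 32 KV heads, 128-dimensional heads, FP16 cache) is an assumed 7B-class configuration chosen for illustration, and `kvCacheBytes` is a hypothetical helper, not a Core AI call.

```swift
/// Bytes of KV cache for a decoder-only transformer:
/// 2 (keys and values) x layers x KV heads x head dimension x tokens x bytes per element.
func kvCacheBytes(layers: Int, kvHeads: Int, headDim: Int,
                  tokens: Int, bytesPerElement: Int) -> Int {
    2 * layers * kvHeads * headDim * tokens * bytesPerElement
}

// Assumed 7B-class shape with an FP16 cache (2 bytes per element).
let perToken = kvCacheBytes(layers: 32, kvHeads: 32, headDim: 128,
                            tokens: 1, bytesPerElement: 2)
let fullContext = kvCacheBytes(layers: 32, kvHeads: 32, headDim: 128,
                               tokens: 4096, bytesPerElement: 2)

print(perToken)                     // 524288 bytes, i.e. 0.5 MiB per token
print(fullContext / (1024 * 1024))  // 2048 MiB for a 4096-token context
```

Under these assumptions the cache alone reaches 2 GiB at a 4096-token context, on top of the weights. Grouped query attention shrinks it by lowering `kvHeads`.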

Image generation (diffusion models)

Stable Diffusion and similar architectures. Core AI manages the iterative denoising loop and the large memory footprint these models require.

Complex reasoning

Models that need multiple passes, chain-of-thought generation, or tool use. Core AI’s streaming interface and memory management handle long-running generation without blocking the UI thread.

Models over ~500M parameters

Once you cross into billion-parameter territory, Core ML’s load-everything memory model falls short. Core AI’s paged attention and disk-backed caching let it run models that exceed available RAM by streaming layers in from disk.

Mixture-of-Experts architectures

Apple’s own AFM Core Advanced (20B sparse, 1-4B active) uses MoE. Core AI has specific routing logic for sparse models, loading only the active experts for each token. The same sparse design is common in other efficient LLMs, including models from DeepSeek and Google’s Gemini family.
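The idea behind "only the active experts" is top-k routing: a gate scores every expert for the current token, and only the k best are run. Here is a minimal sketch of that selection step. It is a generic illustration, since the article does not describe Core AI's actual routing logic, and `topKExperts` is a hypothetical name.

```swift
/// Returns the indices of the k highest-scoring experts, best first.
func topKExperts(gateScores: [Double], k: Int) -> [Int] {
    let ranked = gateScores.enumerated().sorted { $0.element > $1.element }
    return ranked.prefix(max(0, k)).map { $0.offset }
}

// Four experts, two active per token: only experts 1 and 3 need their weights resident.
let active = topKExperts(gateScores: [0.10, 0.70, 0.05, 0.15], k: 2)
print(active) // [1, 3]
```

Because the selection changes per token, a runtime that can page expert weights in and out only has to keep a fraction of the full parameter count in memory at once.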

Architecture Differences Under the Hood

The frameworks aren’t just different APIs over the same engine. They make fundamentally different decisions:

Memory model

Core ML: Loads the entire model into memory at once. Works fine for a 50 MB image classifier. Fails catastrophically for a 4 GB LLM on a device with 8 GB total RAM.

Core AI: Uses progressive loading and paged attention. Only the active layers and KV cache pages reside in memory. Weight pages are demand-loaded from disk. This is why Core AI can run a 7B model on a device where Core ML would OOM trying to load it.
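The arithmetic behind that claim is simple: weight size is parameter count times bits per weight. A short sketch follows, with `weightGigabytes` as a hypothetical helper. It counts weights only and ignores the KV cache, activations, and runtime overhead.

```swift
/// Approximate size of the weights alone, in GB (10^9 bytes).
func weightGigabytes(parameters: Double, bitsPerWeight: Double) -> Double {
    parameters * bitsPerWeight / 8 / 1_000_000_000
}

let sevenB = 7_000_000_000.0
print(weightGigabytes(parameters: sevenB, bitsPerWeight: 16)) // 14.0
print(weightGigabytes(parameters: sevenB, bitsPerWeight: 8))  // 7.0
print(weightGigabytes(parameters: sevenB, bitsPerWeight: 4))  // 3.5
```

A 7B model at FP16 needs 14 GB for weights alone, which cannot be fully resident on an 8 GB device. At INT4 it drops to 3.5 GB, and paging means not even all of that has to be in memory at once.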

Execution model

Core ML: Single forward pass. Input goes in, output comes out. Optimized for minimal latency on a single inference call.

Core AI: Iterative generation. For an LLM, this means running the model hundreds of times (once per output token), managing state between iterations, and handling early stopping. The framework owns this loop — you don’t write it manually.
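For a sense of what "owning the loop" replaces, here is roughly what a hand-written greedy decode loop looks like. It is a toy sketch: `generate` and `toyModel` are hypothetical, and the closure stands in for a real forward pass. Nothing here is Core AI or Core ML API.

```swift
/// Greedy autoregressive decoding: one model call per generated token.
/// `nextTokenLogits` stands in for a real forward pass over the tokens so far.
func generate(prompt: [Int], maxNewTokens: Int, eosToken: Int,
              nextTokenLogits: ([Int]) -> [Double]) -> [Int] {
    var tokens = prompt
    for _ in 0..<max(0, maxNewTokens) {
        let logits = nextTokenLogits(tokens)
        // Greedy sampling: pick the index of the largest logit.
        guard let next = logits.indices.max(by: { logits[$0] < logits[$1] }) else { break }
        tokens.append(next)
        if next == eosToken { break } // early stopping
    }
    return tokens
}

// Toy "model" over a 4-token vocabulary: always favors (last token + 1) mod 4.
let toyModel: ([Int]) -> [Double] = { tokens in
    var logits = [Double](repeating: 0, count: 4)
    logits[((tokens.last ?? 0) + 1) % 4] = 1
    return logits
}

print(generate(prompt: [0], maxNewTokens: 10, eosToken: 3, nextTokenLogits: toyModel)) // [0, 1, 2, 3]
```

Note that this naive version hands the whole token history to the model on every step. Avoiding that repeated work is exactly what a KV cache is for.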

GPU utilization

Core ML: Uses Metal compute shaders that handle general matrix operations. Fine for convolutions and small attention blocks.

Core AI: Uses Metal 4 kernels specifically designed for multi-head attention, rotary positional embeddings, and grouped query attention. These aren’t general-purpose — they’re transformer-specific and significantly faster for those operations.

Quantization approach

Core ML: Quantization is applied through coremltools, mostly as weight compression (FP16, INT8, palettization). Adequate for small models where precision matters less than size.

Core AI: Supports weight and activation quantization (INT4, INT8, mixed precision). The coreai-optimization tool includes calibration-based quantization that preserves quality for specific tasks. Similar in concept to AWQ and GPTQ approaches but targeting Metal 4’s instruction set.
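As a baseline for what any of these tools do to a weight tensor, here is plain symmetric round-to-nearest quantization. This is a generic sketch, not the calibration-based method mentioned above and not AWQ or GPTQ. `quantize` and `dequantize` are hypothetical names.

```swift
/// Symmetric per-tensor quantization to a signed integer grid of the given bit width.
/// Returns the integer codes and the scale needed to map them back to floats.
func quantize(_ weights: [Double], bits: Int) -> (codes: [Int], scale: Double) {
    precondition(bits >= 2 && bits <= 16, "bit width out of range for this sketch")
    let qmax = Double((1 << (bits - 1)) - 1)          // 127 for INT8, 7 for INT4
    let maxAbs = weights.map { abs($0) }.max() ?? 0
    let scale = maxAbs > 0 ? maxAbs / qmax : 1
    let codes = weights.map { Int(min(qmax, max(-qmax, ($0 / scale).rounded()))) }
    return (codes, scale)
}

func dequantize(_ codes: [Int], scale: Double) -> [Double] {
    codes.map { Double($0) * scale }
}

let weights = [-0.9, -0.2, 0.0, 0.31, 0.7]
let (codes, scale) = quantize(weights, bits: 4)
let restored = dequantize(codes, scale: scale)
// Each restored weight lies within half a quantization step of the original.
let worstError = zip(weights, restored).map { abs($0 - $1) }.max() ?? 0
print(codes, worstError <= scale / 2 + 1e-12) // [-7, -2, 0, 2, 5] true
```

At 4 bits there are only 15 usable levels, so a single outlier weight stretches the scale for the whole tensor. That is the problem calibration-based and mixed-precision schemes try to address.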

Can They Work Together?

Yes. A single app can use both frameworks. Common patterns:

Core ML for preprocessing, Core AI for generation — Use a Core ML vision model to extract features from an image, then pass those features to a Core AI multimodal LLM for description or reasoning.
Core ML for classification, Core AI for explanation — Detect an object with Core ML (fast, cheap), then use Core AI to generate a natural language explanation of what was detected.
Core ML for real-time, Core AI for background — Run Core ML models on the camera feed at 30fps for live annotations. When the user asks a question about what they’re seeing, trigger a Core AI generation in the background.

The frameworks share the same Neural Engine and Metal stack, so they can coexist in one app without fighting over resource allocation. Both still draw on the same unified memory, so budget for their combined footprint.

Migration Guide: Moving Transformer Models from Core ML to Core AI

If you have transformer models currently running through Core ML (which was technically possible but suboptimal), here’s the migration path:

Step 1: Start from the PyTorch source

If you converted from PyTorch originally, use your source model. If you only have the .mlmodel, you’ll need to recover the original PyTorch weights first, since Core ML has no export path back to PyTorch.

Step 2: Convert with coreai-torch

coreai-torch convert \
    --model ./transformer-model.pt \
    --architecture decoder-only \
    --output ./model.coreaimodel

Step 3: Optimize for target devices

coreai-optimization quantize \
    --model ./model.coreaimodel \
    --precision int4 \
    --target-device iphone \
    --output ./model-phone.coreaimodel

Step 4: Update Swift code

Replace Core ML inference calls with Core AI’s streaming generation API. The biggest change: Core AI delivers output incrementally through async/await, while a Core ML prediction returns one complete result per call.
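The shape of that change can be sketched with Swift's standard `AsyncStream`. This article does not show Core AI's actual API, so nothing below is a Core AI call: `streamTokens` is a hypothetical stand-in that emits canned tokens, and `collect` plays the role of UI code appending text as it arrives.

```swift
import Foundation

/// Hypothetical stand-in for a streaming generator: emits canned tokens one at a time.
func streamTokens(_ tokens: [String]) -> AsyncStream<String> {
    AsyncStream { continuation in
        let task = Task {
            for token in tokens {
                if Task.isCancelled { break }
                continuation.yield(token)
            }
            continuation.finish()
        }
        // Stop producing if the consumer goes away (for example, the view is dismissed).
        continuation.onTermination = { _ in task.cancel() }
    }
}

/// Consumes the stream incrementally, the way a UI would append text as it arrives.
func collect(_ stream: AsyncStream<String>) async -> String {
    var text = ""
    for await token in stream {
        text += token
    }
    return text
}

// Usage from an async context:
// let text = await collect(streamTokens(["Core", " AI", " streams", " tokens"]))
```

Call sites that used to receive one complete result now need to sit in an async context and handle partial output and cancellation.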

Performance Benchmarks

Apple hasn’t published official benchmarks comparing the same model on both frameworks. The figures below are rough estimates based on the WWDC sessions and developer testing:

Metric	Core ML (7B transformer)	Core AI (7B transformer)
Model load time	8-12 seconds	2-4 seconds
Time to first token	1200-2000ms	150-300ms
Tokens per second	5-8	30-50
Peak memory	6+ GB (full model)	3-4 GB (paged)
Battery drain (5min gen)	High (generic Metal)	Moderate (optimized kernels)

The performance gap is dramatic because Core AI was designed for this workload. Core ML running a transformer is like using a spreadsheet app to edit photos — it technically works but isn’t what the tool is built for.
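To turn the table into something a user would feel, combine time to first token with decode speed. The sketch below does that for a 200-token reply, using the table's best-case and worst-case figures. `generationSeconds` is a hypothetical helper, and the model is deliberately simple: it treats all 200 tokens as decoded at the steady-state rate.

```swift
/// Rough end-to-end time: time to first token plus steady-state decoding.
func generationSeconds(timeToFirstTokenMs: Double, tokensPerSecond: Double, tokens: Int) -> Double {
    timeToFirstTokenMs / 1000 + Double(tokens) / tokensPerSecond
}

// Best-case and worst-case figures from the table above, for a 200-token reply.
let coreMLRange = (generationSeconds(timeToFirstTokenMs: 1200, tokensPerSecond: 8, tokens: 200),
                   generationSeconds(timeToFirstTokenMs: 2000, tokensPerSecond: 5, tokens: 200))
let coreAIRange = (generationSeconds(timeToFirstTokenMs: 150, tokensPerSecond: 50, tokens: 200),
                   generationSeconds(timeToFirstTokenMs: 300, tokensPerSecond: 30, tokens: 200))
print(coreMLRange) // roughly 26 to 42 seconds
print(coreAIRange) // roughly 4 to 7 seconds
```

On these estimates, a reply that takes around half a minute on Core ML arrives in a few seconds on Core AI.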

What About the GPU vs CPU Question?

Both frameworks use the GPU (via Metal) and Neural Engine. Neither does CPU-only inference by default. The difference is that Core AI’s Metal 4 kernels are specifically optimized for transformer attention patterns, while Core ML uses more general compute shaders.

On Apple silicon, the unified memory architecture means there’s no VRAM constraint in the traditional sense — both frameworks access the same memory pool. The constraint is total system memory, and Core AI manages it more efficiently for large models through paging.

The Future: Will Core ML Be Deprecated?

Almost certainly not in the near term. Core ML serves a massive installed base of apps doing traditional ML tasks. These apps work fine and don’t need generative AI. Apple isn’t going to break millions of apps to force a migration.

The likely path:

Core ML continues for classification, detection, and small model inference
Core AI handles all generative AI development going forward
New model types and architectures get Core AI support only
Core ML receives maintenance updates but no major new features

If you’re starting a new project in 2026 and it involves any form of text/image generation, start with Core AI. If you’re maintaining existing Core ML integrations for classification tasks, there’s no urgent reason to migrate.

Comparison to Cross-Platform Alternatives

If you’re not Apple-exclusive, here’s how both compare to platform-agnostic options:

Solution	Platform	LLM Support	Ecosystem
Core AI	Apple only	Excellent	Apple models + converted PyTorch
Core ML	Apple only	Poor	Broad ML model support
Ollama	Mac, Linux, Windows	Excellent	Thousands of GGUF models
llama.cpp	All platforms	Excellent	Open source, community-driven
ONNX Runtime	All platforms	Moderate	Enterprise focused

For Apple-native apps shipping to the App Store, Core AI is the clear winner for generative tasks. For developer tools, servers, or cross-platform apps, Ollama and similar tools remain more flexible.

FAQ

Can I use Core AI for image classification?

Technically yes (you could run a vision transformer), but Core ML is more efficient and easier for classification tasks. Core AI’s overhead (memory management, streaming, KV cache) adds complexity you don’t need for a simple classification forward pass.

Do both frameworks support the Neural Engine?

Yes. Both route compatible operations to the Neural Engine for power efficiency. Core AI additionally uses the Neural Engine for specific transformer operations that Core ML would send to the GPU.

Can I convert a Core ML model directly to Core AI format?

No direct converter exists. You need to go back to the PyTorch source and convert using coreai-torch. The internal representations are fundamentally different.

Which framework does Apple’s Foundation Models API use internally?

Foundation Models uses Core AI for on-device generative inference and Core ML for supporting tasks like embedding computation. The routing is automatic — you don’t choose at the Foundation Models layer.

What if my model is 500M parameters — which framework?

It depends on the task. If it’s a 500M classification model (like a ViT), use Core ML. If it’s a 500M generative model (small LLM or diffusion model), use Core AI. The parameter count alone doesn’t determine the choice — the workload pattern does.

Is there a performance penalty for using Core ML with transformers?

Yes, a significant one. Core ML doesn’t have KV caching, paged attention, or transformer-specific Metal kernels. A transformer model running through Core ML will be roughly 4-8x slower for generation tasks than the same model running through Core AI. For a single forward pass (not generation), the gap is smaller, but Core AI still wins thanks to its optimized attention kernels.