Apple now has two machine learning frameworks, and the naming doesn’t make the distinction obvious. Core ML has been around since 2017. Core AI launched at WWDC 2026. They serve fundamentally different purposes, target different model architectures, and optimize for different workloads.
Here’s the practical breakdown: when to use which, what each excels at, and whether you need to migrate anything.
The One-Sentence Difference
Core ML = traditional ML models (classification, object detection, regression, small transformers). Optimized for fast, lightweight inference on structured tasks.
Core AI = generative AI models (LLMs, diffusion models, large transformers). Optimized for autoregressive generation, attention-heavy architectures, and models with billions of parameters.
If your model produces a label, a bounding box, or a number — use Core ML. If your model generates text, images, or code — use Core AI.
Comparison Table
| Feature | Core ML | Core AI |
|---|---|---|
| Primary use case | Classification, detection, regression | Text generation, image generation, reasoning |
| Typical model size | 10 MB – 500 MB | 1 GB – 50 GB+ |
| Parameter count | Thousands to hundreds of millions | Billions |
| Architecture focus | CNNs, RNNs, small transformers, tree models | Decoder transformers, MoE, diffusion |
| Output type | Labels, scores, coordinates | Token sequences, images, embeddings |
| Model format | .mlmodel / .mlpackage | .coreaimodel |
| Conversion source | PyTorch, TensorFlow, ONNX, scikit-learn | PyTorch (via coreai-torch) |
| Quantization | coremltools (FP16, INT8, lower-bit palettization) | coreai-optimization (INT4, INT8, mixed) |
| GPU backend | Metal (generic compute) | Metal 4 (transformer-optimized kernels) |
| Neural Engine | Yes | Yes (priority routing) |
| KV cache management | No | Yes (built-in) |
| Streaming output | No | Yes (token-by-token) |
| Memory management | Simple (load/unload) | Advanced (paged attention, memory pressure handling) |
| Minimum iOS | iOS 11+ | iOS 20+ |
| Xcode integration | Model preview, performance reports | Core AI Debugger, layer profiling |
| App Store distribution | Embedded in app bundle | On Demand Resources recommended |
When to Use Core ML
Core ML is not deprecated. It remains the right choice for:
Image classification and object detection
If you’re detecting objects in photos, classifying images, or running pose estimation — Core ML is purpose-built for this. Models are small, inference is fast (often under 10ms), and the ecosystem has years of optimized models available.
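```swift
import CoreML
import Vision  // VNCoreMLModel and VNCoreMLRequest live in Vision, not CoreML

// MyImageClassifier is the class Xcode generates from your .mlmodel file.
let model = try VNCoreMLModel(for: MyImageClassifier(configuration: MLModelConfiguration()).model)
let request = VNCoreMLRequest(model: model) { request, error in
    guard let results = request.results as? [VNClassificationObservation],
          let top = results.first else { return }
    print(top.identifier, top.confidence) // e.g. "cat" with its confidence score
}
// Run the request; `cgImage` is assumed to be a CGImage you already have.
try VNImageRequestHandler(cgImage: cgImage).perform([request])
```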
Tabular data and structured prediction
Gradient-boosted trees (XGBoost), random forests, and linear models converted from scikit-learn all fit here. These are tiny, fast, and perfect for Core ML’s execution model.
Audio classification
Sound detection, music genre classification, keyword spotting — small models that process audio frames and return labels.
Real-time video processing
Camera effects, background removal, style transfer: models that need to run at 30-60 fps on every video frame. Core ML’s low per-call overhead makes this feasible, whereas Core AI’s startup cost and memory footprint are too heavy for per-frame work.
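To make the frame-rate constraint concrete, here is a small sketch of the per-frame time budget. The function names and the 5 ms overhead allowance are assumptions chosen for illustration, not measured values.

```swift
/// Time available to process one frame at a given frame rate, in milliseconds.
func frameBudgetMs(fps: Double) -> Double {
    return 1000.0 / fps
}

/// True if a model with the given per-frame inference latency can keep up.
/// `overheadMs` is a hypothetical allowance for capture, pre/post-processing, and rendering.
func fitsRealTime(inferenceMs: Double, fps: Double, overheadMs: Double = 5.0) -> Bool {
    return inferenceMs + overheadMs <= frameBudgetMs(fps: fps)
}

print(frameBudgetMs(fps: 30)) // about 33.3 ms per frame
print(frameBudgetMs(fps: 60)) // about 16.7 ms per frame
print(fitsRealTime(inferenceMs: 10, fps: 60)) // true:  10 + 5 fits in 16.7
print(fitsRealTime(inferenceMs: 25, fps: 60)) // false: 25 + 5 exceeds 16.7
```

A model that answers in under 10 ms leaves headroom even at 60 fps, which is why small single-pass models suit this workload.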
Recommendations and embeddings (small models)
If you’re computing item similarity or user preferences with a model under 100M parameters, Core ML handles this cleanly.
When to Use Core AI
Core AI exists because Core ML wasn’t designed for generative workloads:
Text generation (LLMs)
Any model that produces text token by token: chatbots, writing assistants, code generation, summarization. Core AI handles the autoregressive decode loop, the KV cache, and sampling, all of which you would have to implement by hand on Core ML, usually with worse results.
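To see why the KV cache deserves framework-level management, it helps to estimate its size. The sketch below uses the standard formula (keys plus values, for every layer, KV head, and token). The `DecoderShape` type and the example dimensions are assumptions picked for illustration; they do not describe any particular Core AI model.

```swift
/// Illustrative decoder shape; the values used below are assumptions.
struct DecoderShape {
    let layers: Int
    let kvHeads: Int          // smaller than the query head count under grouped query attention
    let headDim: Int
    let bytesPerElement: Int  // 2 for FP16
}

/// Bytes of KV cache for one sequence: keys and values for every layer, KV head, and token.
func kvCacheBytes(_ shape: DecoderShape, tokens: Int) -> Int {
    let perToken = 2 * shape.layers * shape.kvHeads * shape.headDim * shape.bytesPerElement
    return perToken * tokens
}

let shape = DecoderShape(layers: 32, kvHeads: 32, headDim: 128, bytesPerElement: 2)
print(kvCacheBytes(shape, tokens: 1))                            // 524288 bytes (0.5 MiB) per token
print(Double(kvCacheBytes(shape, tokens: 4096)) / 1_073_741_824) // 2.0 GiB at a 4096-token context
```

At these dimensions the cache alone reaches 2 GiB for a full context, so it cannot be treated as an afterthought on a phone.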
Image generation (diffusion models)
Stable Diffusion and similar architectures. Core AI manages the iterative denoising loop and the large memory footprint these models require.
Complex reasoning
Models that need multiple passes, chain-of-thought generation, or tool use. Core AI’s streaming interface and memory management handle long-running generation without blocking the UI thread.
Models over ~500M parameters
Once you cross into billion-parameter territory, Core ML’s memory management falls short. Core AI’s paged attention and disk-backed caching let it run models larger than available RAM by streaming layers in from disk as they are needed.
Mixture-of-Experts architectures
Apple’s own AFM Core Advanced (20B sparse, 1-4B active) uses MoE. Core AI has specific routing logic for sparse models and loads only the active experts for each token. The architecture is common in efficient LLMs, including models from the DeepSeek and Gemini families.
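The idea of loading only the active experts can be sketched with plain top-k routing. This is a conceptual illustration, not Core AI's routing code: `topKExperts`, `ExpertCache`, and the router scores are all hypothetical.

```swift
/// Returns the indices of the `k` highest-scoring experts for one token.
/// `routerScores` stands in for the output of a learned router layer.
func topKExperts(routerScores: [Double], k: Int) -> [Int] {
    let ranked = routerScores.indices.sorted { routerScores[$0] > routerScores[$1] }
    return Array(ranked.prefix(k))
}

/// Tracks which experts have had their weights loaded so far.
struct ExpertCache {
    private(set) var loaded: Set<Int> = []
    mutating func ensureLoaded(_ experts: [Int]) {
        loaded.formUnion(experts) // a real runtime would read expert weights from disk here
    }
}

var cache = ExpertCache()
let scoresPerToken: [[Double]] = [
    [0.1, 2.3, -0.5, 1.7, 0.0, 0.2, -1.0, 0.9], // token 1
    [0.3, 1.9, -0.2, 2.1, 0.1, 0.0, -0.7, 0.4], // token 2
]
for scores in scoresPerToken {
    let active = topKExperts(routerScores: scores, k: 2)
    cache.ensureLoaded(active)
    print(active) // [1, 3] for token 1, then [3, 1] for token 2
}
print(cache.loaded.sorted()) // [1, 3]: only 2 of the 8 experts were ever touched
```

With 2 of 8 experts active per token, most expert weights never need to be resident, which is what makes a sparse 20B model practical on a device.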
Architecture Differences Under the Hood
The frameworks aren’t just different APIs over the same engine. They make fundamentally different decisions:
Memory model
Core ML: Loads the entire model into memory at once. Works fine for a 50 MB image classifier. Fails catastrophically for a 4 GB LLM on a device with 8 GB total RAM.
Core AI: Uses progressive loading and paged attention. Only the active layers and KV cache pages reside in memory. Weight pages are demand-loaded from disk. This is why Core AI can run a 7B model on a device where Core ML would OOM trying to load it.
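Demand-loading weight pages behaves much like a least-recently-used cache in front of disk. The sketch below shows that general technique under simplifying assumptions; `WeightPageCache` and `loadPage` are hypothetical names, and Core AI's actual paging policy is not documented here.

```swift
/// Minimal LRU cache for weight pages. `loadPage` stands in for a disk read.
struct WeightPageCache {
    let capacity: Int
    private var order: [Int] = []          // least recently used first
    private var pages: [Int: [Float]] = [:]
    private(set) var diskReads = 0

    init(capacity: Int) { self.capacity = capacity }

    mutating func page(_ id: Int, loadPage: (Int) -> [Float]) -> [Float] {
        if let hit = pages[id] {
            // Cache hit: mark the page as most recently used.
            order.removeAll { $0 == id }
            order.append(id)
            return hit
        }
        // Cache full: evict the least recently used page before loading.
        if pages.count >= capacity, let victim = order.first {
            order.removeFirst()
            pages[victim] = nil
        }
        diskReads += 1
        let data = loadPage(id)
        pages[id] = data
        order.append(id)
        return data
    }

    var residentPages: [Int] { order }
}

var cache = WeightPageCache(capacity: 2)
let fakeDisk: (Int) -> [Float] = { id in [Float](repeating: Float(id), count: 4) }
for layer in [0, 1, 0, 2, 1] {
    _ = cache.page(layer, loadPage: fakeDisk)
}
print(cache.diskReads)     // 4: pages 0, 1, 2, then 1 again after it was evicted
print(cache.residentPages) // [2, 1]
```

Only `capacity` pages are ever resident, so peak memory is bounded by the cache size rather than the model size, at the cost of extra disk reads.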
Execution model
Core ML: Single forward pass. Input goes in, output comes out. Optimized for minimal latency on a single inference call.
Core AI: Iterative generation. For an LLM, this means running the model hundreds of times (once per output token), managing state between iterations, and handling early stopping. The framework owns this loop — you don’t write it manually.
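For a sense of what that loop involves, here is a minimal greedy decoding sketch over a toy model. `NextTokenModel`, `CountingModel`, and `generate` are hypothetical names used for illustration. A real runtime would also reuse cached keys and values between steps and apply a sampling strategy.

```swift
/// Stand-in for a decoder: maps the tokens so far to a score per vocabulary entry.
protocol NextTokenModel {
    func logits(context: [Int]) -> [Double]
}

/// Toy model over a 5-token vocabulary that always prefers (last token + 1).
struct CountingModel: NextTokenModel {
    let vocabSize = 5
    func logits(context: [Int]) -> [Double] {
        let next = ((context.last ?? -1) + 1) % vocabSize
        return (0..<vocabSize).map { $0 == next ? 1.0 : 0.0 }
    }
}

/// Greedy autoregressive decoding: one model call per generated token,
/// stopping early at `eosToken` or after `maxNewTokens`.
func generate(model: NextTokenModel, prompt: [Int], eosToken: Int, maxNewTokens: Int) -> [Int] {
    var context = prompt
    var output: [Int] = []
    for _ in 0..<maxNewTokens {
        let scores = model.logits(context: context)
        // Greedy choice: the index with the highest score.
        guard let next = scores.indices.max(by: { scores[$0] < scores[$1] }) else { break }
        if next == eosToken { break } // early stopping
        output.append(next)
        context.append(next)          // state carried into the next iteration
    }
    return output
}

print(generate(model: CountingModel(), prompt: [0], eosToken: 4, maxNewTokens: 10)) // [1, 2, 3]
```

Even this toy version has to track context, stopping conditions, and per-step selection, which is the bookkeeping the framework takes over.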
GPU utilization
Core ML: Uses Metal compute shaders that handle general matrix operations. Fine for convolutions and small attention blocks.
Core AI: Uses Metal 4 kernels specifically designed for multi-head attention, rotary positional embeddings, and grouped query attention. These aren’t general-purpose — they’re transformer-specific and significantly faster for those operations.
Quantization approach
Core ML: Quantization is mostly weight-focused (FP16 and INT8, plus lower-bit palettization through coremltools). Adequate for small models where precision matters less than size.
Core AI: Supports weight and activation quantization (INT4, INT8, mixed precision). The coreai-optimization tool includes calibration-based quantization that preserves quality for specific tasks. Similar in concept to AWQ and GPTQ approaches but targeting Metal 4’s instruction set.
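As a baseline for what weight quantization does, the sketch below implements plain symmetric round-to-nearest quantization. This is deliberately simpler than the calibration-based methods mentioned above (it is not AWQ, GPTQ, or the coreai-optimization algorithm), and the function names are hypothetical.

```swift
/// Symmetric per-tensor quantization to `bits`-wide signed integers.
/// Codes are stored in Int8 for simplicity, even when `bits` is 4.
func quantize(_ weights: [Float], bits: Int) -> (codes: [Int8], scale: Float) {
    let qmax = Float((1 << (bits - 1)) - 1) // 127 for INT8, 7 for INT4
    let maxAbs = weights.map { abs($0) }.max() ?? 0
    let scale = maxAbs > 0 ? maxAbs / qmax : 1
    let codes = weights.map { w -> Int8 in
        let q = (w / scale).rounded()
        return Int8(max(-qmax, min(qmax, q))) // clamp into the representable range
    }
    return (codes, scale)
}

/// Maps integer codes back to approximate floating-point weights.
func dequantize(_ codes: [Int8], scale: Float) -> [Float] {
    return codes.map { Float($0) * scale }
}

let weights: [Float] = [0.7, -0.3, 0.12, 0.0]
let int4 = quantize(weights, bits: 4)
print(int4.codes) // [7, -3, 1, 0]
print(dequantize(int4.codes, scale: int4.scale)) // close to the originals, within half a step
```

The third weight (0.12) comes back as roughly 0.1, which shows the kind of error that calibration-based methods try to place where it hurts output quality least.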
Can They Work Together?
Yes. A single app can use both frameworks. Common patterns:
- Core ML for preprocessing, Core AI for generation: Use a Core ML vision model to extract features from an image, then pass those features to a Core AI multimodal LLM for description or reasoning.
- Core ML for classification, Core AI for explanation: Detect an object with Core ML (fast, cheap), then use Core AI to generate a natural language explanation of what was detected.
- Core ML for real-time, Core AI for background: Run Core ML models on the camera feed at 30 fps for live annotations. When the user asks a question about what they’re seeing, trigger a Core AI generation in the background.
The frameworks share the same Neural Engine and Metal stack, so they can coexist without conflicting resource allocation.
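One way to structure the classification-then-explanation pattern is to hide each framework behind a small protocol. Everything in this sketch is hypothetical: the protocols, the stubs, and the confidence threshold are illustrative seams, not Apple APIs.

```swift
/// Hypothetical seams for the two frameworks; neither protocol is a real Apple API.
protocol ImageClassifying { // would wrap a Core ML model
    func classify(imageNamed name: String) -> (label: String, confidence: Double)
}
protocol TextGenerating {   // would wrap a Core AI model
    func generate(prompt: String) -> String
}

struct StubClassifier: ImageClassifying {
    func classify(imageNamed name: String) -> (label: String, confidence: Double) {
        return ("golden retriever", 0.93)
    }
}
struct StubGenerator: TextGenerating {
    func generate(prompt: String) -> String {
        return "Generated text for: \(prompt)"
    }
}

/// Classify first (fast, cheap); only invoke the generator when the label is trustworthy.
/// The 0.8 default threshold is an assumption made for this example.
func explain(imageNamed name: String,
             classifier: ImageClassifying,
             generator: TextGenerating,
             minConfidence: Double = 0.8) -> String? {
    let result = classifier.classify(imageNamed: name)
    guard result.confidence >= minConfidence else { return nil }
    return generator.generate(prompt: "Explain what a \(result.label) is in one sentence.")
}

if let text = explain(imageNamed: "photo.jpg",
                      classifier: StubClassifier(),
                      generator: StubGenerator()) {
    print(text)
}
```

Keeping the expensive generation step behind a cheap gate means the large model is loaded and run only when the classifier has something worth explaining.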
Migration Guide: Moving Transformer Models from Core ML to Core AI
If you have transformer models currently running through Core ML (which was technically possible but suboptimal), here’s the migration path:
Step 1: Export to PyTorch
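If you converted from PyTorch originally, start from your source model. If you only have the .mlmodel, there is no official tool for converting it back, so you’ll need to track down the original checkpoint or rebuild the PyTorch weights by hand.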
Step 2: Convert with coreai-torch
```shell
coreai-torch convert \
  --model ./transformer-model.pt \
  --architecture decoder-only \
  --output ./model.coreaimodel
```
Step 3: Optimize for target devices
```shell
coreai-optimization quantize \
  --model ./model.coreaimodel \
  --precision int4 \
  --target-device iphone \
  --output ./model-phone.coreaimodel
```
Step 4: Update Swift code
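Replace Core ML inference calls with Core AI’s streaming generation API. The biggest change: Core AI generation is built around async/await and delivers output incrementally, while Core ML code is usually written as a single prediction call that returns the complete result (Core ML has offered an async prediction variant since iOS 17, but it still returns one result per call).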
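The shape of the change on the Swift side can be sketched with the standard library's `AsyncStream`. The Core AI types are not shown in this article, so `streamTokens` below is a hypothetical stand-in for whatever streaming call the framework exposes; only the consumption pattern is the point.

```swift
/// Hypothetical token stream; a stand-in for a framework-provided streaming API.
func streamTokens(for prompt: String) -> AsyncStream<String> {
    let fakeTokens = ["Hello", ",", " world"]
    return AsyncStream { continuation in
        for token in fakeTokens {
            continuation.yield(token) // a real model would yield as each token is decoded
        }
        continuation.finish()
    }
}

/// Consumes the stream with `for await`, appending each token as it arrives.
func collect(_ prompt: String) async -> String {
    var text = ""
    for await token in streamTokens(for: prompt) {
        text += token // in an app, update the UI here instead of accumulating
    }
    return text
}
```

From UI code you would typically call this inside a `Task { ... }` so that generation never blocks the main thread, and update the view on each token instead of waiting for the full string.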
Performance Benchmarks
Apple hasn’t published official benchmarks comparing the same model on both frameworks. The figures below are approximate ranges based on the WWDC sessions and developer testing:
| Metric | Core ML (7B transformer) | Core AI (7B transformer) |
|---|---|---|
| Model load time | 8-12 seconds | 2-4 seconds |
| Time to first token | 1200-2000ms | 150-300ms |
| Tokens per second | 5-8 | 30-50 |
| Peak memory | 6+ GB (full model) | 3-4 GB (paged) |
| Battery drain (5min gen) | High (generic Metal) | Moderate (optimized kernels) |
The performance gap is dramatic because Core AI was designed for this workload. Core ML running a transformer is like using a spreadsheet app to edit photos — it technically works but isn’t what the tool is built for.
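To translate the table into user-visible wait time, the sketch below combines time to first token with decode speed for a 200-token reply. It uses the midpoints of the ranges in the table; the function name and the 200-token length are assumptions made for this example.

```swift
/// End-to-end time for a response: time to first token plus steady-state decoding.
func responseSeconds(timeToFirstTokenMs: Double, tokensPerSecond: Double, tokens: Int) -> Double {
    return timeToFirstTokenMs / 1000.0 + Double(tokens) / tokensPerSecond
}

// Midpoints of the ranges in the table above, for a 200-token reply.
let coreML = responseSeconds(timeToFirstTokenMs: 1600, tokensPerSecond: 6.5, tokens: 200)
let coreAI = responseSeconds(timeToFirstTokenMs: 225, tokensPerSecond: 40, tokens: 200)
print(coreML)          // roughly 32.4 seconds
print(coreAI)          // roughly 5.2 seconds
print(coreML / coreAI) // roughly 6x
```

Decode speed dominates for anything longer than a sentence, so tokens per second matters more than time to first token once replies get long.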
What About the GPU vs. CPU Question?
Both frameworks use the GPU (via Metal) and Neural Engine. Neither does CPU-only inference by default. The difference is that Core AI’s Metal 4 kernels are specifically optimized for transformer attention patterns, while Core ML uses more general compute shaders.
On Apple silicon, the unified memory architecture means there’s no VRAM constraint in the traditional sense — both frameworks access the same memory pool. The constraint is total system memory, and Core AI manages it more efficiently for large models through paging.
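Since total system memory is the constraint, a quick estimate of weight storage at different precisions shows which models can fit at all. The helper below is illustrative and ignores activations, the KV cache, and runtime overhead.

```swift
/// Approximate weight storage in gigabytes (10^9 bytes), ignoring activations and KV cache.
func weightGigabytes(parameters: Double, bitsPerWeight: Double) -> Double {
    return parameters * bitsPerWeight / 8.0 / 1_000_000_000
}

let sevenB = 7_000_000_000.0
print(weightGigabytes(parameters: sevenB, bitsPerWeight: 16)) // 14.0 (FP16)
print(weightGigabytes(parameters: sevenB, bitsPerWeight: 8))  // 7.0  (INT8)
print(weightGigabytes(parameters: sevenB, bitsPerWeight: 4))  // 3.5  (INT4)
```

A 7B model at FP16 exceeds the RAM of an 8 GB device outright, while the INT4 figure lands in the 3-4 GB range quoted for Core AI above.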
The Future: Will Core ML Be Deprecated?
Almost certainly not in the near term. Core ML serves a massive installed base of apps doing traditional ML tasks. These apps work fine and don’t need generative AI. Apple isn’t going to break millions of apps to force a migration.
The likely path:
- Core ML continues for classification, detection, and small model inference
- Core AI handles all generative AI development going forward
- New model types and architectures get Core AI support only
- Core ML receives maintenance updates but no major new features
If you’re starting a new project in 2026 and it involves any form of text/image generation, start with Core AI. If you’re maintaining existing Core ML integrations for classification tasks, there’s no urgent reason to migrate.
Comparison to Cross-Platform Alternatives
If you’re not Apple-exclusive, here’s how both compare to platform-agnostic options:
| Solution | Platform | LLM Support | Ecosystem |
|---|---|---|---|
| Core AI | Apple only | Excellent | Apple models + converted PyTorch |
| Core ML | Apple only | Poor | Broad ML model support |
| Ollama | Mac, Linux, Windows | Excellent | Thousands of GGUF models |
| llama.cpp | All platforms | Excellent | Open source, community-driven |
| ONNX Runtime | All platforms | Moderate | Enterprise focused |
For Apple-native apps shipping to the App Store, Core AI is the clear winner for generative tasks. For developer tools, servers, or cross-platform apps, Ollama and similar tools remain more flexible.
FAQ
Can I use Core AI for image classification?
Technically yes (you could run a vision transformer), but Core ML is more efficient and easier for classification tasks. Core AI’s overhead (memory management, streaming, KV cache) adds complexity you don’t need for a simple classification forward pass.
Do both frameworks support the Neural Engine?
Yes. Both route compatible operations to the Neural Engine for power efficiency. Core AI additionally uses the Neural Engine for specific transformer operations that Core ML would send to the GPU.
Can I convert a Core ML model directly to Core AI format?
No direct converter exists. You need to go back to the PyTorch source and convert using coreai-torch. The internal representations are fundamentally different.
Which framework does Apple’s Foundation Models API use internally?
Foundation Models uses Core AI for on-device generative inference and Core ML for supporting tasks like embedding computation. The routing is automatic — you don’t choose at the Foundation Models layer.
What if my model is 500M parameters — which framework?
It depends on the task. If it’s a 500M classification model (like a ViT), use Core ML. If it’s a 500M generative model (small LLM or diffusion model), use Core AI. The parameter count alone doesn’t determine the choice — the workload pattern does.
Is there a performance penalty for using Core ML with transformers?
Yes, a significant one. Core ML doesn’t have KV caching, paged attention, or transformer-specific Metal kernels. Going by the estimates above, a transformer running through Core ML will be roughly 4-8x slower at generation than the same model through Core AI. For a single forward pass (not generation), the gap is smaller, but Core AI still wins thanks to its optimized attention kernels.