Running an LLM directly on an iPhone, with no internet connection, no cloud API, and no network round-trip latency, is now possible with Apple’s Core AI framework. Announced at WWDC 2026, Core AI gives developers a path to deploy language models on-device, using the Neural Engine and GPU in modern iPhones.
This isn’t an abstraction over Apple’s built-in Siri models. It’s a framework for bringing your own models to iPhone hardware. Want to run a fine-tuned 3B parameter model for your app’s specific use case? Core AI makes that possible.
This guide covers everything we know so far: hardware requirements, the model conversion pipeline, quantization strategies, which models actually fit, and what performance to expect. Note that Core AI is currently in beta — we’ll update with real code examples once the SDK is publicly released.
What is Core AI?
Core AI is Apple’s framework for on-device machine learning inference, specifically optimized for language models. It sits alongside Core ML (which handles vision, audio, and general ML) but is purpose-built for the transformer architectures that power LLMs.
Key capabilities:
- Run transformer models on iPhone’s Neural Engine, GPU, and CPU
- Optimized attention mechanisms for Apple Silicon
- KV-cache management for efficient multi-turn conversations
- Streaming token generation
- Memory-mapped model loading (fast startup)
Think of it as Apple’s answer to the question: “How do we let developers run models like the ones that power Siri, but for their own use cases?”
Hardware requirements: which iPhones work?
Not every iPhone can run meaningful LLMs. The Neural Engine capability and available RAM are the limiting factors.
| iPhone | RAM | Neural Engine | Max Model Size (approx) | Usable? |
|---|---|---|---|---|
| iPhone 16 Pro / Pro Max | 8 GB | 16-core | ~4B parameters (4-bit) | ✅ Best option |
| iPhone 16 / 16 Plus | 8 GB | 16-core | ~4B parameters (4-bit) | ✅ Good |
| iPhone 15 Pro / Pro Max | 8 GB | 16-core | ~4B parameters (4-bit) | ✅ Good |
| iPhone 15 / 15 Plus | 6 GB | 16-core | ~2-3B parameters (4-bit) | ⚠️ Limited |
| iPhone 14 Pro | 6 GB | 16-core | ~2-3B parameters (4-bit) | ⚠️ Limited |
| iPhone 14 / older | 4-6 GB | Varies | Not practical | ❌ Too constrained |
The sweet spot is iPhone 15 Pro and newer with 8GB RAM. You need RAM headroom for the operating system, your app, and the model — so a 4-bit quantized 3-4B parameter model is the practical maximum on current hardware.
For comparison, running LLMs on Apple Silicon Macs gives you 16-192GB of unified memory. iPhone is much more constrained, but the framework handles memory management aggressively.
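The headroom argument can be made concrete with some back-of-the-envelope arithmetic. The sketch below is illustrative only: the function names are hypothetical, and `reserved_gb` and `runtime_overhead_gb` are assumed values for what iOS, the rest of the app, and the KV cache consume, not figures published by Apple.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring metadata."""
    return params_billion * bits_per_weight / 8


def fits_on_device(params_billion: float, bits_per_weight: float,
                   device_ram_gb: float, reserved_gb: float = 3.5,
                   runtime_overhead_gb: float = 0.75) -> tuple[bool, float]:
    """Return (fits, headroom_gb) under an assumed RAM budget.

    reserved_gb: assumed RAM kept back for iOS and the rest of the app.
    runtime_overhead_gb: assumed allowance for KV cache and activations.
    """
    needed = weights_gb(params_billion, bits_per_weight) + runtime_overhead_gb
    available = device_ram_gb - reserved_gb
    return needed <= available, round(available - needed, 2)


if __name__ == "__main__":
    # (parameters in billions, bits per weight, device RAM in GB)
    for params, bits, ram in [(3, 4, 6), (4, 4, 6), (4, 4, 8), (8, 3, 8)]:
        ok, headroom = fits_on_device(params, bits, ram)
        print(f"{params}B @ {bits}-bit on {ram} GB: fits={ok}, headroom={headroom} GB")
```

With these assumed budgets, a 4-bit 3B model fits on a 6 GB device with little room to spare, while a 4-bit 4B model needs an 8 GB device. Change the two budget parameters to match what you measure on real hardware.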
The model conversion pipeline
Core AI doesn’t run PyTorch, GGUF, or Hugging Face models directly. You need to convert models to Apple’s optimized format. The pipeline looks like:
PyTorch (.pt/.safetensors) → Core AI Converter → .mlaimodel bundle
Step 1: Start with a compatible model
Your source model needs to be a standard transformer architecture. Models that work:
- Llama-family architectures (Llama, CodeLlama, TinyLlama)
- Qwen architectures
- Mistral architectures (dense models; Mixtral is mixture-of-experts and falls under the caveats below)
- Phi architectures
- Gemma architectures
- Custom models using standard attention patterns
Models that are tricky or unsupported:
- Mixture-of-experts models (too large, routing overhead)
- Models with custom CUDA kernels (need pure PyTorch implementation)
- Models larger than ~8B parameters (won’t fit on device even after aggressive quantization)
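Before attempting a conversion, it can help to sanity-check a model's `config.json` against the lists above. The following is a rough sketch, not part of Core AI: the field names follow Hugging Face `config.json` conventions, the allow-list and size ceiling are assumptions drawn from this article, and the example values are chosen to resemble a 3B Llama-style model (check them against your own file). The parameter estimate assumes a Llama-style decoder without bias terms, so it will be slightly low for architectures that use them.

```python
SUPPORTED_PREFIXES = ("llama", "qwen", "mistral", "phi", "gemma")  # hypothetical allow-list


def estimate_params(cfg: dict) -> int:
    """Estimate the parameter count of a Llama-style decoder (no bias terms)."""
    h = cfg["hidden_size"]
    heads = cfg["num_attention_heads"]
    kv_heads = cfg.get("num_key_value_heads", heads)
    head_dim = cfg.get("head_dim", h // heads)
    attn = 2 * h * heads * head_dim + 2 * h * kv_heads * head_dim  # q, o + k, v
    mlp = 3 * h * cfg["intermediate_size"]                         # gate, up, down
    per_layer = attn + mlp + 2 * h                                 # plus two norm vectors
    embed = cfg["vocab_size"] * h
    lm_head = 0 if cfg.get("tie_word_embeddings", False) else embed
    return embed + lm_head + cfg["num_hidden_layers"] * per_layer + h  # + final norm


def preflight(cfg: dict, max_params: int = 8_000_000_000) -> list[str]:
    """Return a list of likely problems; an empty list means no red flags found."""
    problems = []
    if not cfg.get("model_type", "").startswith(SUPPORTED_PREFIXES):
        problems.append(f"unrecognized architecture: {cfg.get('model_type')!r}")
    if "num_local_experts" in cfg or "num_experts" in cfg:
        problems.append("mixture-of-experts model")
    if estimate_params(cfg) > max_params:
        problems.append("too many parameters for on-device use")
    return problems


# Example values (assumed) resembling a 3B Llama-style model.
example_cfg = {
    "model_type": "llama",
    "hidden_size": 3072,
    "num_hidden_layers": 28,
    "num_attention_heads": 24,
    "num_key_value_heads": 8,
    "intermediate_size": 8192,
    "vocab_size": 128256,
    "tie_word_embeddings": True,
}

if __name__ == "__main__":
    print(f"{estimate_params(example_cfg) / 1e9:.2f}B parameters")
    print(preflight(example_cfg) or "no red flags")
```

In practice you would load the dictionary with `json.load` from the model's `config.json`. Passing this check says nothing about custom kernels, which still need a manual review of the model code.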
Step 2: Quantization
Quantization is mandatory for iPhone deployment. Unquantized half-precision (FP16) models are far too large. Core AI supports:
- 4-bit quantization (INT4): Best balance of quality and size. A 3B model goes from ~6GB (FP16) to ~1.5GB. Minimal quality loss for most tasks.
- 4-bit with group quantization: Slightly better quality than naive INT4 at nearly the same size (the per-group scale factors add a small overhead).
- Mixed precision: Critical layers stay at higher precision, others at 4-bit.
The quantization happens during conversion, not at runtime. You ship the quantized model with your app.
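To show why grouping helps, here is a minimal pure-Python sketch of symmetric 4-bit quantization with one scale per group. It is a generic illustration of the technique, not Core AI's converter: the function names are hypothetical, and the 16-bit scale in `effective_bits` is an assumption about how scales are stored.

```python
def quantize_group(values: list[float]) -> tuple[float, list[int]]:
    """Symmetric 4-bit quantization of one group: (scale, codes in [-8, 7])."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return scale, codes


def quantize(weights: list[float], group_size: int) -> list[tuple[float, list[int]]]:
    """Split weights into consecutive groups and quantize each with its own scale."""
    return [quantize_group(weights[i:i + group_size])
            for i in range(0, len(weights), group_size)]


def dequantize(quantized: list[tuple[float, list[int]]]) -> list[float]:
    """Reconstruct approximate weights from (scale, codes) pairs."""
    return [scale * code for scale, codes in quantized for code in codes]


def mean_abs_error(original: list[float], restored: list[float]) -> float:
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


def effective_bits(group_size: int, scale_bits: int = 16) -> float:
    """Bits per weight once the per-group scale is included (assumed 16-bit scale)."""
    return 4 + scale_bits / group_size


# Seven small weights and one outlier.
weights = [0.02, -0.03, 0.05, -0.01, 0.04, -0.02, 0.01, 2.0]

per_tensor_error = mean_abs_error(weights, dequantize(quantize(weights, len(weights))))
grouped_error = mean_abs_error(weights, dequantize(quantize(weights, 4)))

if __name__ == "__main__":
    print(f"one scale for all weights: {per_tensor_error:.4f}")
    print(f"one scale per 4 weights:   {grouped_error:.4f}")
    print(f"bits per weight at group size 32: {effective_bits(32)}")
```

With a single scale, the outlier forces every small weight to round to zero. With groups of four, the outlier only inflates the scale of its own group, so the first group keeps its detail. The cost is the extra storage for the scales, which is why grouped 4-bit models are marginally larger than naive INT4.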
For context on quantization tradeoffs, the same principles that apply to running models locally on Mac apply here — just with tighter constraints.
Step 3: Conversion with Core AI Tools
Apple provides command-line tools (part of Xcode 27) for conversion:
```bash
# Conceptual: exact CLI syntax pending public release
coreai-convert \
  --input ./model/pytorch_model.safetensors \
  --config ./model/config.json \
  --quantization int4-grouped \
  --output ./MyModel.mlaimodel
```
The converter handles:
- Weight quantization
- Architecture optimization for Neural Engine
- Attention kernel selection
- KV-cache configuration
- Memory layout optimization
Step 4: Bundle with your app
The converted .mlaimodel bundle goes into your Xcode project. From there it can ship inside your app bundle, be delivered as an on-demand resource, or be fetched from your own server at runtime.
For models over 500MB (which most useful LLMs are), Apple recommends on-demand resources so the model isn’t included in the initial download.
Which models actually fit on iPhone?
Here’s what’s practical given the hardware constraints:
Recommended models (up to 4B parameters, 4-bit)
| Model | Parameters | Quantized Size | Good For |
|---|---|---|---|
| Phi-4 Mini | 3.8B | ~2.0 GB | General chat, reasoning |
| Llama 3.2 3B | 3B | ~1.7 GB | General purpose |
| Qwen 2.5 3B | 3B | ~1.7 GB | Multilingual, coding |
| Gemma 2 2B | 2.6B | ~1.4 GB | Light tasks, fast |
| SmolLM 2 1.7B | 1.7B | ~0.9 GB | Very fast, limited capability |
Pushing the limits (7-8B parameters, aggressive quantization)
| Model | Parameters | Quantized Size | Notes |
|---|---|---|---|
| Llama 3.1 8B | 8B | ~4.5 GB (3-bit) | Barely fits, slow on iPhone 16 Pro |
| Mistral 7B | 7.3B | ~4.0 GB (3-bit) | Tight fit, reduced quality at 3-bit |
| Qwen 2.5 7B | 7.6B | ~4.2 GB (3-bit) | Only Pro models with 8GB |
The 7-8B models are technically possible on 8GB devices but leave very little RAM for your app and the OS. Expect thermal throttling during extended use. For reliable production apps, stick to 3-4B models.
Expected performance
Based on Apple’s WWDC demos and early developer beta reports:
| Model Size | Device | Tokens/second | First Token Latency |
|---|---|---|---|
| 2-3B (4-bit) | iPhone 16 Pro | 20-30 tok/s | ~300ms |
| 2-3B (4-bit) | iPhone 15 Pro | 15-25 tok/s | ~400ms |
| 3-4B (4-bit) | iPhone 16 Pro | 12-20 tok/s | ~500ms |
| 7-8B (3-bit) | iPhone 16 Pro | 5-10 tok/s | ~1.5s |
For context: comfortable reading speed is about 4-5 tokens per second. So the 3-4B models at 12-20 tok/s still generate faster than users can read, making streaming responses feel smooth, while the 7-8B models at 5-10 tok/s only just keep pace. The 3-4B range hits the sweet spot of usable speed and decent quality.
Compare this to local inference on Apple Silicon Macs where you’d see 50-100+ tok/s for similar models — iPhone is roughly 3-5x slower, but still very usable.
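To turn the throughput numbers into something a user would feel, the sketch below estimates how long a full reply takes and whether generation keeps ahead of a reader. The function names are hypothetical, the scenarios use the lower end of each range in the table above, and the 5 tok/s reading speed is the upper end of the figure quoted in this section.

```python
def response_seconds(n_tokens: int, tokens_per_second: float,
                     first_token_latency_s: float) -> float:
    """Total time to produce a reply: first-token latency plus steady generation."""
    return first_token_latency_s + n_tokens / tokens_per_second


def outpaces_reader(tokens_per_second: float,
                    reading_tokens_per_second: float = 5.0) -> bool:
    """True when streaming output arrives at least as fast as it is read."""
    return tokens_per_second >= reading_tokens_per_second


# (label, tokens per second, first-token latency in seconds)
SCENARIOS = [
    ("2-3B on iPhone 16 Pro", 20, 0.3),
    ("3-4B on iPhone 16 Pro", 12, 0.5),
    ("7-8B on iPhone 16 Pro", 5, 1.5),
]

if __name__ == "__main__":
    for label, tps, latency in SCENARIOS:
        total = response_seconds(150, tps, latency)
        print(f"{label}: 150-token reply in {total:.1f}s, "
              f"outpaces reader: {outpaces_reader(tps)}")
```

At the low end of each range, a 150-token reply takes about 13 seconds on a 3-4B model and over 30 seconds on a 7-8B model, which is why streaming the output matters more than the total time.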
Use cases that make sense on-device
Not every AI feature belongs on-device. Here’s where on-device LLMs shine vs. when you should use a cloud API:
Great for on-device:
- Text summarization (process locally, no data leaves device)
- Autocomplete and writing suggestions
- Smart reply generation
- Local document Q&A
- Code explanation and commenting
- Grammar and style checking
- Accessibility features (describing UI elements)
- Offline-first apps
Better with cloud APIs:
- Complex reasoning tasks
- Long document analysis (context window limits on-device)
- Image generation
- Multi-turn conversations with deep context
- Tasks requiring frontier model intelligence
The privacy angle is compelling. An on-device LLM means user data never leaves the phone. For health apps, financial apps, private messaging, and enterprise use — this is a major advantage over cloud APIs.
Memory management and battery impact
Core AI handles memory aggressively:
- Memory-mapped loading: Model weights are paged in from flash storage as they are needed, rather than read into RAM all at once
- KV-cache eviction: Old conversation context is evicted when memory pressure rises
- Background limits: Models are unloaded when your app goes to the background
- Thermal management: Inference rate throttles as the device heats up
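The KV cache is worth budgeting for explicitly, because it grows with every token of context. The sketch below shows the standard size calculation and one possible eviction policy (drop the oldest turns, keep the first one). It is an assumption-laden illustration: the source does not describe Core AI's actual policy, the function names are hypothetical, and the layer and head counts are example values for a 3B-class model.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Size of the KV cache: one K and one V tensor per layer, FP16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


def evict_oldest(turns: list[tuple[str, int]], max_tokens: int,
                 keep_first: int = 1) -> list[tuple[str, int]]:
    """Drop the oldest unpinned turns until the total token count fits.

    turns: (label, token_count) pairs in chronological order.
    keep_first: number of leading turns (e.g. a system prompt) never evicted.
    """
    pinned, rest = list(turns[:keep_first]), list(turns[keep_first:])
    total = sum(count for _, count in pinned + rest)
    while rest and total > max_tokens:
        _, count = rest.pop(0)
        total -= count
    return pinned + rest


if __name__ == "__main__":
    # Example values (assumed): 28 layers, 8 KV heads, head dimension 128.
    per_token = kv_cache_bytes(1, n_layers=28, n_kv_heads=8, head_dim=128)
    print(f"{per_token / 1024:.0f} KiB per token")
    print(f"{kv_cache_bytes(4096, 28, 8, 128) / 2**20:.0f} MiB at 4096 tokens")

    history = [("system", 50), ("q1", 200), ("a1", 300), ("q2", 100), ("a2", 250)]
    print(evict_oldest(history, max_tokens=500))
```

With these example dimensions the cache costs 112 KiB per token, so a 4096-token conversation adds 448 MiB on top of the weights. That is a large share of the headroom on a 6 GB or 8 GB phone, which explains why older context has to go when memory gets tight.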
Battery impact varies by usage pattern:
- Short bursts (autocomplete, quick generation): minimal impact
- Extended generation (long conversations): noticeable battery drain, similar to gaming
- Always-on monitoring: not recommended without careful duty cycling
Apple provides APIs to check thermal state and available memory before starting inference. Good apps should degrade gracefully — fall back to simpler heuristics or defer AI tasks when the device is constrained.
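The decision logic behind graceful degradation can be sketched independently of any SDK. In the example below, the `ThermalState` levels mirror the four levels iOS exposes through `ProcessInfo.thermalState`; everything else, including the strategy names, the memory margin, and the thresholds, is a hypothetical policy rather than anything Core AI prescribes.

```python
from enum import IntEnum


class ThermalState(IntEnum):
    """Mirrors the four thermal levels reported by iOS."""
    NOMINAL = 0
    FAIR = 1
    SERIOUS = 2
    CRITICAL = 3


def choose_strategy(thermal: ThermalState, available_mb: int,
                    model_mb: int, margin_mb: int = 512) -> str:
    """Pick how to serve an AI request given device conditions.

    Returns one of: "fallback_heuristic", "defer", "run_reduced", "run".
    margin_mb is an assumed safety margin on top of the model size.
    """
    if available_mb < model_mb + margin_mb:
        return "fallback_heuristic"   # not enough memory to load the model safely
    if thermal >= ThermalState.CRITICAL:
        return "defer"                # wait until the device cools down
    if thermal >= ThermalState.SERIOUS:
        return "run_reduced"          # e.g. cap the number of generated tokens
    return "run"


if __name__ == "__main__":
    print(choose_strategy(ThermalState.NOMINAL, available_mb=3000, model_mb=1700))
    print(choose_strategy(ThermalState.SERIOUS, available_mb=3000, model_mb=1700))
    print(choose_strategy(ThermalState.FAIR, available_mb=1800, model_mb=1700))
```

Checking memory first is a deliberate choice: a model that cannot be loaded is a hard failure, while heat only affects how much work is sensible to attempt.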
What’s coming: roadmap and expectations
Core AI is in beta with Xcode 27 and the matching 2026 iOS release. Here’s what we expect:
Available now (beta):
- Model conversion pipeline
- Basic inference API
- Neural Engine optimization
- 4-bit quantization
Expected at public release (Fall 2026):
- Complete documentation and code samples
- App Store submission support for apps with bundled models
- On-demand resource delivery for models
- Performance profiling tools in Instruments
Likely future (2027+):
- Larger models as iPhone hardware improves
- On-device fine-tuning (adapter layers)
- Multi-model orchestration
- Integration with Core ML for multimodal pipelines
Getting started today
While Core AI is in beta, here’s how to prepare:
- Identify your use case: What AI feature would benefit from on-device inference? Think privacy, offline access, and latency.
- Choose your model: Pick a 3-4B parameter model that fits your task. Test it first on Mac to validate quality.
- Install Xcode 27 beta: Access Core AI tools and documentation through Apple’s developer portal.
- Experiment with conversion: Try converting small models and running them in the Simulator.
- Profile on device: Real performance testing requires an actual iPhone 15 Pro or newer.
If you’re building an AI-powered app architecture, consider a hybrid approach: on-device models for latency-sensitive, privacy-critical features, and cloud APIs for capability-demanding tasks. The Language Model Protocol makes switching between on-device and cloud providers seamless.
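A hybrid setup mostly comes down to a shared interface and a routing rule. The sketch below is a generic illustration and does not represent any Apple API: `TextModel`, `Request`, the two stub backends, and the routing thresholds are all hypothetical names and values.

```python
from dataclasses import dataclass
from typing import Protocol


class TextModel(Protocol):
    """Minimal interface both backends agree on."""
    def generate(self, prompt: str) -> str: ...


@dataclass
class Request:
    prompt: str
    private: bool = False                 # data must not leave the device
    needs_frontier_quality: bool = False  # task is beyond a small model


class OnDeviceStub:
    """Stand-in for a local model wrapper."""
    def generate(self, prompt: str) -> str:
        return f"[on-device] {prompt[:20]}"


class CloudStub:
    """Stand-in for a cloud API client."""
    def generate(self, prompt: str) -> str:
        return f"[cloud] {prompt[:20]}"


def route(request: Request, on_device: TextModel, cloud: TextModel,
          online: bool = True, max_on_device_chars: int = 4000) -> TextModel:
    """Choose a backend: privacy and offline use win, then capability needs."""
    if request.private or not online:
        return on_device
    if request.needs_frontier_quality or len(request.prompt) > max_on_device_chars:
        return cloud
    return on_device


if __name__ == "__main__":
    local, remote = OnDeviceStub(), CloudStub()
    for req in [Request("Summarize my health notes", private=True),
                Request("Prove this theorem step by step", needs_frontier_quality=True),
                Request("Suggest a reply to this message")]:
        print(route(req, local, remote).generate(req.prompt))
```

The ordering of the checks encodes the priorities: a private request stays on the device even if it would benefit from a larger model, so the app should be ready to return a weaker answer rather than send the data out.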
How Core AI compares to other on-device options
Core AI isn’t the only way to run models on mobile:
| Approach | Platform | Ease of Use | Performance | Model Ecosystem |
|---|---|---|---|---|
| Core AI | Apple only | High (first-party) | Best on iPhone | Apple-optimized |
| llama.cpp (mobile) | Cross-platform | Medium | Good | GGUF models |
| MediaPipe LLM | Cross-platform | High | Good | Limited models |
| ONNX Runtime Mobile | Cross-platform | Medium | Good | ONNX-compatible |
Core AI will have the best performance on Apple hardware because it’s optimized for the Neural Engine. Cross-platform options like llama.cpp work on both iOS and Android but can’t leverage Apple’s hardware as efficiently.
For developers committed to the Apple ecosystem, Core AI is the clear choice. For cross-platform apps, consider llama.cpp or MediaPipe to maintain a single model deployment across iOS and Android.
Frequently Asked Questions
Can I run ChatGPT or Claude on my iPhone offline with Core AI?
No. Core AI runs open-weight or custom models that you convert and bundle with your app. You cannot run proprietary models like GPT-5 or Claude locally. For open models in the 3-4B range, think Phi-4 Mini, Llama 3.2 3B, or Qwen 2.5 3B, which are among the best models for local inference at this size.
Will Core AI work on iPad and Mac too?
Yes. Core AI works across Apple platforms. On iPad Pro (M-series chips) and Mac, you have significantly more RAM and compute available, so you can run larger models (7B-13B on iPad, much larger on Mac). The same converted model works across devices, scaling performance to available hardware.
How does Core AI differ from the built-in Apple Foundation Models?
Apple Foundation Models (AFM) are Apple’s own pre-trained models that power Siri, Writing Tools, and system features. You can’t customize or replace them. Core AI is a framework for running your own models on-device. They’re complementary: AFM handles system features, Core AI handles your app’s custom AI features.
What about App Store review — will Apple approve apps with bundled LLMs?
Apple has confirmed that apps using Core AI with properly converted models will pass App Store review, subject to standard content and capability guidelines. The model must be converted through the official pipeline, and your app must handle model output responsibly (content filtering, appropriate error handling).
How much storage space will a bundled LLM add to my app?
A typical 3B 4-bit model is 1.5-2GB. Apple recommends using on-demand resources so the model isn’t included in the initial app download. Users download the model when they first use the AI feature. This keeps initial app size small while enabling large model deployment.
Can I fine-tune models for Core AI, or only use pre-trained ones?
At launch, Core AI supports inference only — you convert pre-trained (or already fine-tuned) models. You’d do any fine-tuning on a Mac or cloud GPU using standard tools (PyTorch, Hugging Face), then convert the result for Core AI deployment. On-device fine-tuning (adapter training) is expected in a future release.