Jun 9, 2026 · 9 min read

How to Run LLMs on iPhone with Core AI (2026 Guide)

Running an LLM directly on an iPhone, with no internet connection, no cloud API, and no network round trip, is now possible with Apple’s Core AI framework. Announced at WWDC 2026, Core AI gives developers a path to deploy language models on-device, using the Neural Engine and GPU in modern iPhones.

This isn’t an abstraction over Apple’s built-in Siri models. It’s a framework for bringing your own models to iPhone hardware. Want to run a fine-tuned 3B parameter model for your app’s specific use case? Core AI makes that possible.

This guide covers everything we know so far: hardware requirements, the model conversion pipeline, quantization strategies, which models actually fit, and what performance to expect. Note that Core AI is currently in beta — we’ll update with real code examples once the SDK is publicly released.

What is Core AI?

Core AI is Apple’s framework for on-device machine learning inference, specifically optimized for language models. It sits alongside Core ML (which handles vision, audio, and general ML) but is purpose-built for the transformer architectures that power LLMs.

Key capabilities:

Run transformer models on iPhone’s Neural Engine, GPU, and CPU
Optimized attention mechanisms for Apple Silicon
KV-cache management for efficient multi-turn conversations
Streaming token generation
Memory-mapped model loading (fast startup)

Think of it as Apple’s answer to the question: “How do we let developers run models like the ones that power Siri, but for their own use cases?”

Hardware requirements: which iPhones work?

Not every iPhone can run meaningful LLMs. The Neural Engine capability and available RAM are the limiting factors.

iPhone	RAM	Neural Engine	Max Model Size (approx)	Usable?
iPhone 16 Pro / Pro Max	8 GB	16-core	~4B parameters (4-bit)	✅ Best option
iPhone 16 / 16 Plus	8 GB	16-core	~4B parameters (4-bit)	✅ Good
iPhone 15 Pro / Pro Max	8 GB	16-core	~4B parameters (4-bit)	✅ Good
iPhone 15 / 15 Plus	6 GB	16-core	~2-3B parameters (4-bit)	⚠️ Limited
iPhone 14 Pro	6 GB	16-core	~2-3B parameters (4-bit)	⚠️ Limited
iPhone 14 / older	4-6 GB	Varies	Not practical	❌ Too constrained

The sweet spot is iPhone 15 Pro and newer with 8GB RAM. You need RAM headroom for the operating system, your app, and the model — so a 4-bit quantized 3-4B parameter model is the practical maximum on current hardware.
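The headroom reasoning above can be turned into a back-of-the-envelope estimate. The sketch below is illustrative only: the share of RAM an app can use (`app_share`), the runtime overhead for KV cache and activations (`runtime_overhead_gb`), and the effective bits per weight are assumptions chosen for the example, not figures published by Apple.

```python
def max_params_billions(ram_gb, app_share=0.4, runtime_overhead_gb=0.7,
                        bits_per_weight=4.5):
    """Rough upper bound on model size, in billions of parameters.

    All defaults are illustrative assumptions:
      app_share            fraction of device RAM the app can realistically use
      runtime_overhead_gb  KV cache, activations, and the app's own memory
      bits_per_weight      4-bit weights plus per-group scale overhead
    """
    budget_gb = ram_gb * app_share - runtime_overhead_gb
    if budget_gb <= 0:
        return 0.0
    bytes_per_weight = bits_per_weight / 8
    # GB divided by bytes-per-parameter gives billions of parameters.
    return budget_gb / bytes_per_weight


for ram in (8, 6, 4):
    print(f"{ram} GB RAM -> about {max_params_billions(ram):.1f}B parameters")
```

With these assumed defaults the estimate lands in the same range as the table: a little over 4B parameters for 8 GB devices and about 3B for 6 GB devices. Changing the assumptions moves the result considerably, so treat it as a planning aid rather than a limit.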

For comparison, running LLMs on Apple Silicon Macs gives you 16-192GB of unified memory. iPhone is much more constrained, but the framework handles memory management aggressively.

The model conversion pipeline

Core AI doesn’t run PyTorch, GGUF, or Hugging Face models directly. You need to convert models to Apple’s optimized format. The pipeline looks like:

PyTorch (.pt/.safetensors) → Core AI Converter → .mlaimodel bundle

Step 1: Start with a compatible model

Your source model needs to be a standard transformer architecture. Models that work:

Llama-family architectures (Llama, CodeLlama, TinyLlama)
Qwen architectures
Mistral architectures (dense models; Mixtral is a mixture-of-experts model, see the next list)
Phi architectures
Gemma architectures
Custom models using standard attention patterns

Models that are tricky or unsupported:

Mixture-of-experts models (too large, routing overhead)
Models with custom CUDA kernels (need pure PyTorch implementation)
Models larger than ~8B parameters (won’t fit on device even after quantization)
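Before attempting a conversion, it can help to sanity-check the source model against the two lists above. The sketch below reads fields from a Hugging Face style `config.json` (`architectures`, `hidden_size`, `num_hidden_layers`, and so on) and estimates the parameter count assuming a Llama-style layer layout with a gated MLP and no biases. The `precheck` helper, the prefix list, and the 8B cutoff are hypothetical and not part of Core AI; the example values are believed to match Llama 3.2 3B's published config and should be verified against the real file.

```python
SUPPORTED_PREFIXES = ("Llama", "Qwen", "Mistral", "Phi", "Gemma")


def estimate_params(cfg):
    """Approximate parameter count for a Llama-style decoder (biases ignored)."""
    h = cfg["hidden_size"]
    heads = cfg["num_attention_heads"]
    kv_heads = cfg.get("num_key_value_heads", heads)
    head_dim = cfg.get("head_dim", h // heads)
    # Query and output projections, then key and value projections.
    attn = 2 * h * heads * head_dim + 2 * h * kv_heads * head_dim
    mlp = 3 * h * cfg["intermediate_size"]  # gate, up, and down projections
    per_layer = attn + mlp + 2 * h          # plus two norm weight vectors
    embed = cfg["vocab_size"] * h
    if not cfg.get("tie_word_embeddings", False):
        embed *= 2                          # separate output head
    return cfg["num_hidden_layers"] * per_layer + embed + h


def precheck(cfg, max_params=8e9):
    """Return a list of likely problems; an empty list means none were found."""
    problems = []
    arch = (cfg.get("architectures") or ["unknown"])[0]
    if not arch.startswith(SUPPORTED_PREFIXES):
        problems.append(f"unrecognized architecture: {arch}")
    if cfg.get("num_local_experts", 1) > 1:
        problems.append("mixture-of-experts model")
    if estimate_params(cfg) > max_params:
        problems.append("too many parameters for on-device use")
    return problems


# Example values (assumed to match Llama 3.2 3B; check the real config.json).
llama_cfg = {
    "architectures": ["LlamaForCausalLM"],
    "hidden_size": 3072,
    "intermediate_size": 8192,
    "num_hidden_layers": 28,
    "num_attention_heads": 24,
    "num_key_value_heads": 8,
    "vocab_size": 128256,
    "tie_word_embeddings": True,
}
print(f"{estimate_params(llama_cfg) / 1e9:.2f}B parameters", precheck(llama_cfg))
```

In practice you would load the dictionary with `json.load` from the model directory. The estimate is only as good as the layout assumption, so architectures with biases or a different MLP shape will be slightly off.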

Step 2: Quantization

Quantization is mandatory for iPhone deployment. Unquantized half-precision (FP16) models are far too large. Core AI supports:

4-bit quantization (INT4): Best balance of quality and size. A 3B model goes from ~6GB (FP16) to ~1.5GB. Minimal quality loss for most tasks.
4-bit with group quantization: Slightly better quality than naive INT4 at nearly the same size, since each group of weights stores its own scale.
Mixed precision: Critical layers stay at higher precision, others at 4-bit.
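To make the difference between naive and grouped 4-bit quantization concrete, here is a small pure-Python sketch. It uses symmetric rounding to the INT4 range with one scale per tensor or one scale per group, and it assumes a 16-bit scale per group when counting storage. The group sizes and the scale format are assumptions for illustration; Core AI's actual scheme may differ.

```python
def quantize_int4(values, group_size=None):
    """Symmetric INT4 round trip; returns the dequantized values.

    group_size=None uses one scale for the whole tensor (naive INT4).
    """
    n = len(values)
    group_size = group_size or n
    out = []
    for start in range(0, n, group_size):
        group = values[start:start + group_size]
        # Map the largest magnitude in the group to 7; guard against all zeros.
        scale = max(abs(v) for v in group) / 7 or 1.0
        for v in group:
            q = max(-8, min(7, round(v / scale)))
            out.append(q * scale)
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def bits_per_weight(group_size, scale_bits=16):
    """4 bits per weight plus the amortized cost of one scale per group."""
    return 4 + scale_bits / group_size


def size_gb(num_params, bits):
    return num_params * bits / 8 / 1e9


# Small weights with a single outlier, which is where a shared scale hurts.
weights = [0.01 * ((i % 7) - 3) for i in range(64)]
weights[5] = 2.0

naive_err = mean_abs_error(weights, quantize_int4(weights))
grouped_err = mean_abs_error(weights, quantize_int4(weights, group_size=16))
print(f"naive error {naive_err:.4f}, grouped error {grouped_err:.4f}")
print(f"3B model: FP16 {size_gb(3e9, 16)} GB, INT4 {size_gb(3e9, 4)} GB, "
      f"grouped (size 32) {size_gb(3e9, bits_per_weight(32))} GB")
```

The outlier forces a large scale, so with a single shared scale the small weights all round to zero. Per-group scales confine that damage to one group, at a cost of a fraction of a bit per weight. Shipped bundles are usually somewhat larger than these nominal figures because some layers are kept at higher precision.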

The quantization happens during conversion, not at runtime. You ship the quantized model with your app.

For context on quantization tradeoffs, the same principles that apply to running models locally on Mac apply here — just with tighter constraints.

Step 3: Conversion with Core AI Tools

Apple provides command-line tools (part of Xcode 27) for conversion:

# Conceptual — exact CLI syntax pending public release
coreai-convert \
  --input ./model/pytorch_model.safetensors \
  --config ./model/config.json \
  --quantization int4-grouped \
  --output ./MyModel.mlaimodel

The converter handles:

Weight quantization
Architecture optimization for Neural Engine
Attention kernel selection
KV-cache configuration
Memory layout optimization

Step 4: Bundle with your app

The converted .mlaimodel bundle goes into your Xcode project. You can ship it inside your app bundle, deliver it on demand through on-demand resources (part of app thinning), or fetch it from your own server at runtime.

For models over 500MB (which most useful LLMs are), Apple recommends on-demand resources so the model isn’t included in the initial download.

Which models actually fit on iPhone?

Here’s what’s practical given the hardware constraints:

Recommended models (3-4B parameters, 4-bit)

Model	Parameters	Quantized Size	Good For
Phi-4 Mini	3.8B	~2.0 GB	General chat, reasoning
Llama 3.2 3B	3B	~1.7 GB	General purpose
Qwen 2.5 3B	3B	~1.7 GB	Multilingual, coding
Gemma 2 2B	2.6B	~1.4 GB	Light tasks, fast
SmolLM 2 1.7B	1.7B	~0.9 GB	Very fast, limited capability

Pushing the limits (7-8B parameters, aggressive quantization)

Model	Parameters	Quantized Size	Notes
Llama 3.1 8B	8B	~4.5 GB (3-bit)	Barely fits, slow on iPhone 16 Pro
Mistral 7B	7.3B	~4.0 GB (3-bit)	Tight fit, reduced quality at 3-bit
Qwen 2.5 7B	7.6B	~4.2 GB (3-bit)	Only Pro models with 8GB

The 7B models are technically possible on 8GB devices but leave very little RAM for your app and the OS. Expect thermal throttling during extended use. For reliable production apps, stick to 3-4B models.

Expected performance

Based on Apple’s WWDC demos and early developer beta reports:

Model Size	Device	Tokens/second	First Token Latency
2-3B (4-bit)	iPhone 16 Pro	20-30 tok/s	~300ms
2-3B (4-bit)	iPhone 15 Pro	15-25 tok/s	~400ms
3-4B (4-bit)	iPhone 16 Pro	12-20 tok/s	~500ms
7-8B (3-bit)	iPhone 16 Pro	5-10 tok/s	~1.5s

For context: comfortable reading speed is about 4-5 tokens per second. So even the slowest configuration above (7-8B at 3-bit, 5-10 tok/s) generates at least as fast as users can read, making streaming responses feel smooth. The 3-4B range hits the sweet spot of usable speed and decent quality.
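The table's two columns combine into a simple estimate of how long a full response takes: first-token latency plus the token count divided by the generation rate. The sketch below applies that formula to figures taken from the low end of the ranges above; the 150-token response length and the 5 tok/s reading rate are illustrative assumptions.

```python
def response_time_s(num_tokens, tokens_per_s, first_token_latency_s):
    """Total wall-clock time to produce a complete response."""
    return first_token_latency_s + num_tokens / tokens_per_s


def keeps_up_with_reader(tokens_per_s, reading_tokens_per_s=5.0):
    """True if streaming output arrives at least as fast as it is read."""
    return tokens_per_s >= reading_tokens_per_s


# Low end of the ranges in the table, for a 150-token answer.
small = response_time_s(150, tokens_per_s=12, first_token_latency_s=0.5)
large = response_time_s(150, tokens_per_s=5, first_token_latency_s=1.5)
print(f"3-4B model: {small:.1f} s, 7-8B model: {large:.1f} s")
```

Streaming hides most of this time from the user, but the total still matters for features that need the whole output before acting on it, such as a summary inserted into a document.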

Compare this to local inference on Apple Silicon Macs where you’d see 50-100+ tok/s for similar models — iPhone is roughly 3-5x slower, but still very usable.

Use cases that make sense on-device

Not every AI feature belongs on-device. Here’s where on-device LLMs shine vs. when you should use a cloud API:

Great for on-device:

Text summarization (process locally, no data leaves device)
Autocomplete and writing suggestions
Smart reply generation
Local document Q&A
Code explanation and commenting
Grammar and style checking
Accessibility features (describing UI elements)
Offline-first apps

Better with cloud APIs:

Complex reasoning tasks
Long document analysis (context window limits on-device)
Image generation
Multi-turn conversations with deep context
Tasks requiring frontier model intelligence

The privacy angle is compelling. With an on-device LLM, prompts and outputs never have to leave the phone. For health apps, financial apps, private messaging, and enterprise use, this is a major advantage over cloud APIs.

Memory management and battery impact

Core AI handles memory aggressively:

Memory-mapped loading: Model weights are paged in from flash storage on demand rather than loaded into RAM all at once
KV-cache eviction: Old conversation context is evicted when memory pressure rises
Background limits: Models are unloaded when your app goes to background
Thermal management: Inference rate throttles as the device heats up
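To see why KV-cache eviction matters, it helps to estimate how much memory the cache takes. For a standard transformer, each token stores one key and one value vector per layer per KV head. The sketch below computes that size and shows a sliding window as one simple eviction policy. The model dimensions are assumed to match a Llama 3.2 3B style configuration, and the sliding window is an illustration only; how Core AI actually evicts context is not documented here.

```python
from collections import deque


def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """Key and value vectors for one token, across all layers (FP16 default)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


class SlidingWindowCache:
    """Keeps only the most recent max_tokens entries, dropping the oldest."""

    def __init__(self, max_tokens):
        self.entries = deque(maxlen=max_tokens)

    def append(self, entry):
        self.entries.append(entry)  # deque discards the oldest when full

    def __len__(self):
        return len(self.entries)


# Assumed dimensions: 28 layers, 8 KV heads, head size 128.
per_token = kv_bytes_per_token(layers=28, kv_heads=8, head_dim=128)
print(f"{per_token / 1024:.0f} KiB per token, "
      f"{per_token * 4096 / 2**20:.0f} MiB for a 4096-token context")

cache = SlidingWindowCache(max_tokens=3)
for token_id in range(5):
    cache.append(token_id)
print(list(cache.entries))
```

At these dimensions a 4096-token conversation costs several hundred megabytes on top of the weights, which is a meaningful share of the RAM available on a 6-8 GB phone.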

Battery impact varies by usage pattern:

Short bursts (autocomplete, quick generation): minimal impact
Extended generation (long conversations): noticeable battery drain, similar to gaming
Always-on monitoring: not recommended without careful duty cycling

Apple provides APIs to check thermal state and available memory before starting inference. Good apps should degrade gracefully — fall back to simpler heuristics or defer AI tasks when the device is constrained.
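One way to structure graceful degradation is a small policy function that maps device conditions to an inference mode. The sketch below is a platform-neutral illustration: the four thermal levels mirror the nominal, fair, serious, and critical states that iOS reports through `ProcessInfo.thermalState`, while the mode names, the `headroom_mb` default, and the function itself are hypothetical. The Core AI calls for querying these values are not shown because the SDK is not yet public.

```python
from enum import IntEnum


class ThermalState(IntEnum):
    NOMINAL = 0
    FAIR = 1
    SERIOUS = 2
    CRITICAL = 3


def choose_mode(thermal, available_mb, model_mb, headroom_mb=500):
    """Pick an inference mode from the current device conditions.

    Returns one of:
      "defer"    skip AI work until the device cools down
      "fallback" use a non-LLM heuristic because the model will not fit
      "reduced"  run the model with shorter outputs or a lower rate
      "full"     run the model normally
    """
    if thermal >= ThermalState.CRITICAL:
        return "defer"
    if available_mb < model_mb + headroom_mb:
        return "fallback"
    if thermal == ThermalState.SERIOUS:
        return "reduced"
    return "full"


print(choose_mode(ThermalState.NOMINAL, available_mb=3000, model_mb=1700))
print(choose_mode(ThermalState.SERIOUS, available_mb=3000, model_mb=1700))
```

Keeping the policy in one pure function makes it easy to unit test and to tune thresholds after profiling on real devices.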

What’s coming: roadmap and expectations

Core AI is in beta with Xcode 27 and iOS 27 (the 2026 release). Here’s what we expect:

Available now (beta):

Model conversion pipeline
Basic inference API
Neural Engine optimization
4-bit quantization

Expected at public release (Fall 2026):

Complete documentation and code samples
App Store submission support for apps with bundled models
On-demand resource delivery for models
Performance profiling tools in Instruments

Likely future (2027+):

Larger models as iPhone hardware improves
On-device fine-tuning (adapter layers)
Multi-model orchestration
Integration with Core ML for multimodal pipelines

Getting started today

While Core AI is in beta, here’s how to prepare:

Identify your use case: What AI feature would benefit from on-device inference? Think privacy, offline access, and latency.
Choose your model: Pick a 3-4B parameter model that fits your task. Test it first on Mac to validate quality.
Install Xcode 27 beta: Access Core AI tools and documentation through Apple’s developer portal.
Experiment with conversion: Try converting small models and running them in the Simulator.
Profile on device: Real performance testing requires an actual iPhone 15 Pro or newer.

If you’re building an AI-powered app architecture, consider a hybrid approach: on-device models for latency-sensitive, privacy-critical features, and cloud APIs for capability-demanding tasks. The Language Model Protocol makes switching between on-device and cloud providers seamless.
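A hybrid architecture needs a routing rule somewhere. The sketch below shows one possible shape for it: private or offline requests stay on-device, while requests that need frontier-level quality or exceed the local context window go to the cloud. The `Request` fields, the 4096-token limit, and the `route` function are hypothetical names for illustration and are not part of Core AI or the Language Model Protocol.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int
    contains_private_data: bool
    needs_frontier_quality: bool


def route(req, online, on_device_context=4096):
    """Return "on-device" or "cloud" for a single request."""
    # Privacy and connectivity are hard constraints, so check them first.
    # The caller must truncate or chunk prompts that exceed the local window.
    if req.contains_private_data or not online:
        return "on-device"
    if req.needs_frontier_quality or req.prompt_tokens > on_device_context:
        return "cloud"
    # Default to local inference for lower latency and zero API cost.
    return "on-device"


print(route(Request(300, contains_private_data=True,
                    needs_frontier_quality=True), online=True))
print(route(Request(20000, contains_private_data=False,
                    needs_frontier_quality=False), online=True))
```

Ordering the checks so that privacy wins over quality is a design choice; an app with different priorities could ask the user before sending a sensitive request to the cloud instead.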

How Core AI compares to other on-device options

Core AI isn’t the only way to run models on mobile:

Approach	Platform	Ease of Use	Performance	Model Ecosystem
Core AI	Apple only	High (first-party)	Best on iPhone	Apple-optimized
llama.cpp (mobile)	Cross-platform	Medium	Good	GGUF models
MediaPipe LLM	Cross-platform	High	Good	Limited models
ONNX Runtime Mobile	Cross-platform	Medium	Good	ONNX-compatible

Core AI should have the best performance on Apple hardware because it’s optimized for the Neural Engine. Cross-platform options like llama.cpp work on both iOS and Android but are unlikely to use Apple’s hardware as efficiently.

For developers committed to the Apple ecosystem, Core AI is the clear choice. For cross-platform apps, consider llama.cpp or MediaPipe to maintain a single model deployment across iOS and Android.

Frequently Asked Questions

Can I run ChatGPT or Claude on my iPhone offline with Core AI?

No. Core AI runs open-weight or custom models that you convert and bundle with your app. You cannot run proprietary models like GPT-5 or Claude locally. For open models in the 3-4B range, think Phi-4 Mini, Llama 3.2 3B, or Qwen 2.5 3B, which are among the strongest options for local inference at this size.

Will Core AI work on iPad and Mac too?

Yes. Core AI works across Apple platforms. On iPad Pro (M-series chips) and Mac, you have significantly more RAM and compute available, so you can run larger models (7B-13B on iPad, much larger on Mac). The same converted model works across devices, scaling performance to available hardware.

How does Core AI differ from the built-in Apple Foundation Models?

Apple Foundation Models (AFM) are Apple’s own pre-trained models that power Siri, Writing Tools, and system features. You can’t customize or replace them. Core AI is a framework for running your own models on-device. They’re complementary: AFM handles system features, Core AI handles your app’s custom AI features.

What about App Store review — will Apple approve apps with bundled LLMs?

Apple has confirmed that apps using Core AI with properly converted models will pass App Store review, subject to standard content and capability guidelines. The model must be converted through the official pipeline, and your app must handle model output responsibly (content filtering, appropriate error handling).

How much storage space will a bundled LLM add to my app?

A typical 3B 4-bit model is 1.5-2GB. Apple recommends using on-demand resources so the model isn’t included in the initial app download. Users download the model when they first use the AI feature. This keeps initial app size small while enabling large model deployment.

Can I fine-tune models for Core AI, or only use pre-trained ones?

At launch, Core AI supports inference only — you convert pre-trained (or already fine-tuned) models. You’d do any fine-tuning on a Mac or cloud GPU using standard tools (PyTorch, Hugging Face), then convert the result for Core AI deployment. On-device fine-tuning (adapter training) is expected in a future release.