WWDC 2026: what Apple just announced for its on-device models

15 June 2026

At WWDC on 8 June, Apple announced its third generation of Foundation Models. Five in total: three that run in the cloud, two on the device. The cloud ones I'll leave to one side, because my interest has always been the model in your pocket. And the new on-device one has no business being there.

It's a 20-billion-parameter model. The one I've spent this series testing is 3 billion. Apple are now running something nearly seven times the size on the same phone, without melting the battery or filling the memory, and the trick that makes that possible is the most interesting thing they showed.

The 20-billion-parameter model, in plain terms

The problem is memory. Normally every weight in a model has to sit in active memory at once, and 20 billion of them is far too big for a phone's budget. Classic mixture-of-experts doesn't save you either, because it picks its experts fresh for every token, which means all of them still have to be loaded and ready to go. The memory bill comes due no matter what.
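A back-of-envelope sketch of that memory bill. The bit widths below are my own illustrative assumptions, not Apple's published quantisation, and the sum covers the weights only, not the working memory a model needs on top.

```python
# Rough weight-memory arithmetic. Bit widths are illustrative assumptions,
# not figures Apple has published.
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Gigabytes (10^9 bytes) needed to hold the weights alone."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for bits in (16, 8, 4):
    small = weight_gb(3, bits)
    large = weight_gb(20, bits)
    print(f"{bits:>2}-bit weights: 3B = {small:4.1f} GB, 20B = {large:4.1f} GB")
```

Even at an assumed 4 bits per weight, 20 billion parameters comes to about 10 GB of weights, which is why keeping all of them resident at once isn't on the table.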

Apple's move is to change when the choosing happens. The full 20-billion-parameter model lives in flash storage. A small, always-on core sits in active memory. Then, once per prompt, a lightweight predictor reads what you've asked and pages in only the handful of expert weights that request needs. Because the decision is made per prompt and not per token, you pay the loading cost occasionally instead of constantly. It's a chef who keeps salt and oil to hand and walks to the pantry for the one spice tonight's dish needs, rather than tipping the whole pantry onto the counter for every meal. That's the heart of it. Per-prompt routing is what makes a flash-backed model work on a device that small.
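To make the per-prompt versus per-token difference concrete, here is a toy simulation that counts trips to flash. Everything in it is hypothetical: the expert counts are made up, random choice stands in for a real router, and I assume memory holds only the currently active experts. It is not Apple's implementation, just the counting argument.

```python
import random

NUM_EXPERTS = 16       # hypothetical total number of experts held in flash
EXPERTS_ACTIVE = 2     # hypothetical number of experts resident at once

def per_token_loads(num_tokens: int, rng: random.Random) -> int:
    """Choose experts afresh for every token and count flash loads.

    A load is counted whenever a chosen expert is not already resident.
    """
    resident = set()
    loads = 0
    for _ in range(num_tokens):
        chosen = set(rng.sample(range(NUM_EXPERTS), EXPERTS_ACTIVE))
        loads += len(chosen - resident)
        resident = chosen
    return loads

def per_prompt_loads(num_tokens: int, rng: random.Random) -> int:
    """Choose experts once per prompt, then reuse them for every token."""
    chosen = set(rng.sample(range(NUM_EXPERTS), EXPERTS_ACTIVE))
    return len(chosen)  # paid once, however long the response runs

tokens = 200
print("per-token loads: ", per_token_loads(tokens, random.Random(0)))
print("per-prompt loads:", per_prompt_loads(tokens, random.Random(0)))
```

The per-prompt count stays fixed at the number of active experts however long the answer is, while the per-token count grows with every token generated.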

The technique under it is Apple's own, a 2025 paper they call Instruction-Following Pruning. A small predictor reads the prompt and switches on only the slices of the model that prompt actually needs. In the paper, a model with 3 billion active parameters matched a dense 9-billion one and beat the equivalent 3-billion dense baseline by 5 to 8 points on maths and coding. Roughly 3 billion of active compute, roughly 9-billion quality.

Two things I like about it. The model can dial its active size up for a hard request and down for an easy one, instead of running one fixed size for everything. That's a clean answer to a problem I keep hitting: small models are jagged, brilliant at one thing and useless at the next, and a model that can flex its own effort is a model that wastes less of it. And it's the first dynamic-sparse model of this kind to ship to consumers at scale, which is a deployment milestone, not just a nice graph in a paper. I want it on the bench.

There are three places where I'm holding the enthusiasm back. First, the clever part is memory engineering, and paging experts out of flash is disk I/O, which could land right on the thing the last model was worst at: speed. Second, Apple's headline quality claim is a preference score against its own previous model, with the new model preferred on 45.6% of prompts against 23.3% for the old one, and nothing else: no MMLU, no SWE-bench, no comparison to anyone else. It tells me people liked the new answers better. It doesn't tell me whether it can do my actual jobs. Third, the good model is high-end only. The free baseline almost everyone gets is still the 3-billion Core. The 20-billion one wants serious hardware, reportedly phones with 12GB of memory, and the exact device list isn't pinned down yet.
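For what the preference numbers do and don't say, a quick bit of arithmetic. The one assumption here is mine: that both percentages are shares of the same prompt set, so whatever is left over is ties or no stated preference.

```python
new_preferred = 45.6   # % of prompts where the new model's answer was preferred
old_preferred = 23.3   # % of prompts where the previous model's answer was preferred

# Assumption: both figures are shares of the same prompt set, so the
# remainder is ties or no stated preference.
neither = round(100 - new_preferred - old_preferred, 1)

# Share of wins among the prompts where a side was picked at all.
win_share = new_preferred / (new_preferred + old_preferred)

print(f"no preference either way: {neither}%")
print(f"new model's share of decided prompts: {win_share:.1%}")
```

On that reading the new model wins roughly two to one where there's a preference at all, and close to a third of prompts show no preference either way. None of which says anything about a specific task.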

What I'll be watching for

A bigger model doesn't automatically fix the failures I found last time, so the first thing I'll measure is speed. The last model took 18 to 27 seconds to reach its first word, and a flash-backed one could be quicker or slower. There's no calling it from a slide. After that, the refusals, because ordinary professional content got blocked and games material more so, and games is where a real slice of my users work. And then the one I'd have bet money on: asked which of three or four ideas from a meeting was worth keeping, the last model simply couldn't say.
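The speed measurement itself is simple enough to sketch. This is a minimal harness for time to first token, assuming only that the model call can be wrapped as something that yields text chunks. fake_model is a hypothetical stand-in, not the real SDK, and the 0.05-second delay is invented.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple

def time_to_first_token(make_stream: Callable[[], Iterable[str]]) -> Tuple[float, str]:
    """Seconds from issuing the request until the first chunk arrives."""
    start = time.perf_counter()      # start the clock before the request is made
    stream = iter(make_stream())
    first = next(stream)             # blocks until the model yields something
    return time.perf_counter() - start, first

def fake_model(delay_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model call."""
    time.sleep(delay_s)              # pretend prompt processing and weight loading
    yield "Hello"
    yield " world"

latency, token = time_to_first_token(lambda: fake_model(0.05))
print(f"first token {token!r} after {latency:.3f}s")
```

Passing a callable instead of an already-open stream matters: the clock has to start before the request goes out, or any up-front loading cost gets left out of the number.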

That last one is the one I care about most, and not because I expect 20 billion parameters to crack it. A better model raises the floor on the cheap, tireless work, the capturing and tidying and sorting. It doesn't change who decides what matters and how it connects. That stays a human job, and it's the whole reason I build the way I do.

The sleeper: on-device RAG, backed by Spotlight

The announcement I think most people skimmed past is the one most relevant to what I'm building. Apple shipped a fully local retrieval path: a query goes through Spotlight's rebuilt semantic index first, and the results feed straight into the model's context. No cloud, no vector database to stand up, no embeddings to manage. And the index is open to third-party apps, so an app can put its own content into it.

That matters for something like iXnote. A small model with a tiny context window can't be handed the user's whole knowledge graph, so the real trick was never the model, it was fetching the right few things to put in front of it. That retrieval layer is exactly the piece I'd been expecting to build myself. Apple just shipped it, on the device, for free. I haven't tested the retrieval quality, the index is a black box, and 'open to developers' needs proving in practice. But of everything announced, this is the one I'm most keen to wire up.
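The shape of that retrieval step, as a toy. The word-overlap scoring is a crude stand-in for a semantic index, the notes are made up, and none of this is Spotlight's API. It only shows the idea of ranking candidates and packing the best ones into a small context budget.

```python
def score(query: str, note: str) -> int:
    """Crude relevance: how many distinct query words appear in the note."""
    return len(set(query.lower().split()) & set(note.lower().split()))

def build_context(query: str, notes: list, budget_words: int) -> list:
    """Pick the most relevant notes that fit within a word budget."""
    ranked = sorted(notes, key=lambda n: score(query, n), reverse=True)
    picked, used = [], 0
    for note in ranked:
        size = len(note.split())
        # Skip notes with no overlap, and notes that would blow the budget.
        if score(query, note) == 0 or used + size > budget_words:
            continue
        picked.append(note)
        used += size
    return picked

# Made-up notes for illustration only.
notes = [
    "Meeting notes: pricing ideas for the spring launch",
    "Sourdough starter recipe",
    "Launch checklist and pricing page copy",
    "Holiday packing list",
]
context = build_context("pricing for the launch", notes, budget_words=14)
print(context)
```

With a 14-word budget, the two launch notes make it into the context and the unrelated ones never reach the model. Swap the scoring for a real semantic index and the packing logic stays the same.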

So: can you even test it yet?

Yes. The developer betas of iOS 27 and macOS 27 went live on keynote day, 8 June. There's a new command-line tool, fm chat, that drops you into a session with the on-device model straight from the terminal, plus a Python SDK for poking at it outside an app. The public beta lands in July, the full release in September. The new Siri features ship on their own slower clock, English first, later in the year, but developer access to the model is here now.

Back to the bench

The best kind of keynote hands you a list of things to test, and this was that: a 20-billion-parameter model on a phone, image input on the framework, and a local retrieval layer to feed it.

But first I finish shipping the AI-enhanced iXnote on the engine I've already validated. That's the next post: what worked, what didn't. Then the new models go on the bench and I find out what they really do. Follow along. Results next.
