The raw results: what surprised me when I ran real meetings through the rig

Last time was about the harness: one SwiftUI app, three tabs, a memory gauge in the top bar, a runs.jsonl file catching every trace. That was the scaffolding. This is what the model actually did once I started pushing real meetings through it.

A word on the numbers first, because I want to be fair to my own work and to the model. Everything below is measured (the rig logged every run), but read each figure as a snapshot, not a verdict. It's the best I've pulled out of the model so far, with this harness and these prompts, both of which I worked on with Claude Opus. Where a result's poor, the failure might be mine and not the model's, and a sharper prompter might clear a bar I couldn't. My numbers will move as the harness improves. That's the point of measuring it this way: not what the model can do in a lab, but what a builder can squeeze out of it today, me and my tools and the model together. These are also figures for Apple's model on its own: its limits and its failure rates. The head-to-head against the open models I bench it against is still moving under my hands, so I'm keeping that back for now.

Start with the wins, because the criticism only lands if I'm honest about what's good first. The model ships in the OS, so there's no download and no API bill. And the headline feature, guided generation, is the real thing. You annotate a Swift struct and the framework guarantees the output decodes into that type. I pushed it across nine tiers of rising complexity, flat fields up to runtime-built schemas, over dozens of passages, and schema validity held at 100% at every tier. Zero decode failures, even on the schemas the published small-model work expects to fall over. That's not 'usually works'. It's a grammar-level guarantee, and for an app that spends its life tagging and sorting, it's the best thing about the model.

Then you look at what actually fills those perfectly-shaped structures, and it gets more interesting.

It cannot keep its hands off the tools

Apple's tool calling is mechanically clean. When a question genuinely needs a tool, the model picks the right one and fills the arguments in correctly, near enough every time. Bury a city inside a sentence about a flight, ask about the weather, and it pulls the city out and calls the weather tool. On using tools, it's strong.

On not using them, it's a disaster. I gave it a pile of prompts that needed no tool at all, and it left the tools alone only 12% of the time. The rest of the time it grabbed one anyway. Ask it what color the sky is and it called the weather tool three times. Ask it to tell a joke and it fired three different tools at once before it got to the joke. It behaves as if the mere presence of a tool is an instruction to use it. For an in-meeting feature that's a real liability, because a model that fires off a contact lookup when you only asked a general question is doing work nobody asked for. Sterner instructions helped but didn't fix it. What fixed it was pulling the decision out into its own small routing pass, instead of asking the model to chat and decide about tools in the same breath. Calling tools and knowing when to leave them alone are two different skills, and the model's much further along on the first.
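A minimal sketch of that two-pass shape, under stated assumptions. The names `Route`, `Reply` and `respond` are hypothetical, and the `decide` and `answer` closures stand in for two separate model calls. Offering the second pass only the tool the router picked is one plausible way to wire it, not necessarily how the rig does it.

```swift
// Hypothetical two-pass flow: decide about tools first, then answer.
enum Route: Equatable {
    case noTool
    case tool(name: String)
}

struct Reply: Equatable {
    let text: String
    let toolsOffered: [String]
}

// `decide` and `answer` are stand-ins for two separate model calls.
// Assumption: the answering pass only ever sees the tools the router approved.
func respond(
    to prompt: String,
    availableTools: [String],
    decide: (String, [String]) -> Route,
    answer: (String, [String]) -> String
) -> Reply {
    let route = decide(prompt, availableTools)
    let offered: [String]
    switch route {
    case .noTool:
        // No tool needed: the answering pass gets no tools to grab.
        offered = []
    case .tool(let name):
        // Ignore a routed name that does not match a real tool.
        offered = availableTools.contains(name) ? [name] : []
    }
    return Reply(text: answer(prompt, offered), toolsOffered: offered)
}
```

The point of the split is that the chat pass never sees a tool it wasn't cleared to use, so "the tool is present" can no longer be read as "use the tool".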

The structured-output paradox

This is the finding that shaped the most code, and it's the sharp edge hiding behind that lovely 100% validity number.

Guided generation guarantees the shape of the answer. It guarantees nothing about whether the answer's true. I fed it passages where no valid answer existed, the equivalent of asking it to pull chemical elements out of a paragraph about democracy. A sensible system declines. Across nine of those traps it declined zero times. It fabricated a perfectly typed lie every time instead: 'Mercury' handed back as an organization, 'Democracy' given a founding year of 0. It fills every slot in the schema with something plausible because the schema says the slot has to be filled, and the model's got no way to say 'nothing here'. The same reflex floods ordinary work. Ask it to list the entities in a document and it returns 300 to 500 against a real count of about 50, because it can't hand back an empty list. Give it an explicit 'other' bucket as an escape hatch for a batch of obvious garbage, and it never once reaches for it, force-sorting all of it into real types instead.

So the mental model I've landed on is simple: a guaranteed-valid struct is not guaranteed-valid information. The structure guarantee can actively hurt you, because it quietly turns 'I don't know' into a confident, well-typed lie. Anything the model extracts now gets checked back against the source before I trust it.
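A check back against the source can be as blunt as this sketch. `ExtractedEntity`, `normalized` and `grounded` are illustrative names, and the rule itself (the extracted name must appear in the source text after case and whitespace folding) is an assumption about what "checked" means, not the rig's actual test.

```swift
import Foundation

// Hypothetical shape for one extracted item.
struct ExtractedEntity: Equatable {
    let name: String
    let kind: String
}

// Lowercase and collapse runs of whitespace so matching is not tripped by layout.
func normalized(_ s: String) -> String {
    s.lowercased()
        .components(separatedBy: .whitespacesAndNewlines)
        .filter { !$0.isEmpty }
        .joined(separator: " ")
}

// Keep only entities whose name literally occurs in the source text.
func grounded(_ entities: [ExtractedEntity], in source: String) -> [ExtractedEntity] {
    let haystack = normalized(source)
    return entities.filter { entity in
        let needle = normalized(entity.name)
        return !needle.isEmpty && haystack.contains(needle)
    }
}
```

This is only a floor. It would drop a 'Mercury' that never appears in the passage, but a 'Democracy' with a founding year of 0 passes the name check, so fabricated attributes still need their own verification.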

There's a second twist, and it cost me a day. The exact same prompt, on the exact same transcript, ran clean as free text but tripped a content-safety refusal the moment I routed it through guided generation. Structured mode runs a stricter filter than plain text. And in one classify-the-note task the input guardrail fired before the model even got to classify: every empty-bodied input came back as a safety violation rather than a label, a clean zero on exactly that category, not because the model got it wrong but because it never got to answer. So the rule I've taped to the monitor: test structured mode against your weird inputs, not your tidy ones. The failures cluster on the edges, and they're invisible until you go looking.

The filter that refuses a battle scene

That filter has a sharper problem in my corner of the world, because a good slice of the people who'd use iXnote work in games. The moment a transcript turns to the dynamics of a battle scene, the pacing of a fight, how an enemy behaves when it's cornered, the filter's far more likely to decide the conversation has wandered out of bounds and shut it down, and more likely again once I've asked for structured output. There's nothing unsafe about a design meeting describing combat mechanics. It's a normal day's work for a large industry, and most people in it would laugh at the idea it needs a guardrail. The model doesn't always agree. And you can't fully prompt your way out of it: a stubborn handful of perfectly ordinary prompts were refused under every mitigation I tried. So I engineered around it instead. The rig now spots a safety refusal, counts it as a skipped batch, and shows that count in the interface, so the failure's visible and honest rather than a summary that quietly stalls.
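The bookkeeping side of that can be sketched in a few lines. `BatchOutcome`, `RunTally` and the banner wording are hypothetical, and how a refusal gets detected in the first place is left out on purpose, since that depends on the framework's error type.

```swift
// Hypothetical outcome of one windowed batch.
enum BatchOutcome: Equatable {
    case summarized(String)
    case skippedForSafety      // the filter refused; nothing was produced
    case failed(String)        // any other error, kept separate on purpose
}

struct RunTally: Equatable {
    var summarized = 0
    var skippedForSafety = 0
    var failed = 0

    // Text for the interface; nil when there is nothing to report.
    var banner: String? {
        skippedForSafety == 0
            ? nil
            : "\(skippedForSafety) batch(es) skipped by the safety filter"
    }
}

// Count each outcome so refusals show up as a number, not as a silent gap.
func tally(_ outcomes: [BatchOutcome]) -> RunTally {
    var result = RunTally()
    for outcome in outcomes {
        switch outcome {
        case .summarized: result.summarized += 1
        case .skippedForSafety: result.skippedForSafety += 1
        case .failed: result.failed += 1
        }
    }
    return result
}
```

Keeping refusals apart from ordinary failures matters here: a skipped batch is the filter's call, not a bug in the pass, and the user should be able to tell the two apart.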

The one I would have bet money on

I saved this for last, because it's the finding I was most sure would go the other way, and it's the one that matters most for what iXnote is.

The task sounds trivial. Give the model three or four candidate ideas already pulled from a transcript, plus a short note of what the user tends to keep, and ask which ones to keep. No generation, no facts, just comparison. It reads like the easiest language task there is. The model couldn't do it.

What I got was a yes-bias so strong it amounts to no judgment at all. In one controlled test where the right answer was 'reject' about two-thirds of the time, the model picked 'reject' zero times out of two hundred. It's structurally unable to say 'none of these'. Reframe it as a ranking and it hands back canned position lists that track the order I fed the items in, not their content. It contradicts itself, arguing in prose to keep an item while its structured answer drops it. I tried everything reasonable. Shuffling the order on every call to break the positional tell. Greedy decoding to get its real preference. Majority voting across a stack of shuffled rounds. None of it gave me a usable signal. Across repeated replays of the same meeting the verdict was a coin flip, so I switched the feature off. Two other results point the same way: ask it to count the distinct ideas in a passage and it settles on a comfortable number and rarely budges off it, and set it to judge its own earlier output and it marks itself far harder than a human does. It's not just a weak judge. It's a biased one, and you can't predict which way the bias runs.

What replaced it was boring deterministic code. A small rule-based gate over cheap signals: is the title noun-shaped, is it already known, has it got a rare or capitalized word in it. It held up across every replay where the model had fallen over. The boring heuristic beat the language model at the language-ish job.
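For a sense of scale, a gate over those three signals fits in a screenful. Everything specific in this sketch is an assumption: the type names, the filler-word list, the one-to-four-word limit standing in for "noun-shaped", the `commonWords` vocabulary, and the way the signals are combined (noun-shaped, not already known, at least one distinctive word).

```swift
// Hypothetical candidate idea pulled from a transcript.
struct Candidate {
    let title: String
}

struct Gate {
    let knownTitles: Set<String>   // lowercased titles already in the notebook
    let commonWords: Set<String>   // lowercased everyday vocabulary

    private func words(_ title: String) -> [String] {
        title.split(whereSeparator: { !$0.isLetter && !$0.isNumber }).map(String.init)
    }

    // Assumption: "noun-shaped" means short and not opening or closing on a filler word.
    func isNounShaped(_ title: String) -> Bool {
        let w = words(title)
        guard (1...4).contains(w.count) else { return false }
        let fillers: Set<String> = ["the", "a", "an", "to", "and", "of", "is", "we", "it"]
        return !fillers.contains(w[0].lowercased())
            && !fillers.contains(w[w.count - 1].lowercased())
    }

    func isKnown(_ title: String) -> Bool {
        knownTitles.contains(title.lowercased())
    }

    // A word counts as distinctive if it is outside the common vocabulary,
    // or capitalized anywhere after the first position.
    func hasDistinctiveWord(_ title: String) -> Bool {
        words(title).enumerated().contains { pair in
            let rare = !commonWords.contains(pair.element.lowercased())
            let capitalizedMidTitle = pair.offset > 0 && (pair.element.first?.isUppercase ?? false)
            return rare || capitalizedMidTitle
        }
    }

    func keeps(_ candidate: Candidate) -> Bool {
        isNounShaped(candidate.title)
            && !isKnown(candidate.title)
            && hasDistinctiveWord(candidate.title)
    }
}
```

Because every signal is a pure function of the title and two word sets, the same input gives the same verdict on every replay, which is exactly the property the model's judgment lacked.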

I'm convinced that gap isn't a bug to prompt away. It's the line. The model's genuinely good at the cheap, tireless work, catching what was said, tidying it, dropping it in the right bin, and I want all of that. What it can't do is the part that actually builds understanding, which is deciding what matters and how it connects to everything else you know. That's not a lookup. It's a judgment, and on the evidence of my own rig the model can't take it off your hands even when you beg it to. Which is the whole bet iXnote's built on. Linking is thinking. Let the machine lift the heavy, boring weight of capture, because that frees you up, and keep the rep that makes you smarter for yourself.

Two walls worth knowing before you build

Two hard edges you can't coax away. You design around them or you move off them.

First, latency. The model takes the better part of half a minute to produce its first token, and prewarming it barely helped. Once it's going it's quick enough, so the answer wasn't to fight it, it was to design around it. iXnote does its in-meeting work as a windowed background pass that runs every so often, rather than trying to react to every sentence as you speak. Put the model where nobody's watching a spinner and it's a fine engine.

Second, the context window's a cliff, not a slope. The budget's 4096 tokens total, and 'total' is the trap, because your instructions, your prompt, and the output it's generating all share the one pool. With a realistic schema the room left for your actual input is a good deal smaller than that. Cross the line and you don't get a degraded answer, you get a thrown error and nothing at all. You size your windows small and you live inside them.
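The arithmetic is worth writing down once. In this sketch only the 4096 total comes from the text above; the other budget figures in the usage note, the four-characters-per-token estimate, and the names `ContextBudget`, `estimatedTokens` and `windows` are illustrative assumptions.

```swift
// Everything shares one pool, so the input room is whatever is left over.
struct ContextBudget {
    let totalTokens: Int            // 4096 for this model
    let instructionTokens: Int
    let schemaTokens: Int
    let reservedOutputTokens: Int

    var inputTokens: Int {
        max(0, totalTokens - instructionTokens - schemaTokens - reservedOutputTokens)
    }
}

// Crude estimate. Assumption: roughly four characters per token.
func estimatedTokens(_ text: String) -> Int {
    (text.count + 3) / 4
}

// Greedily pack sentences into windows that stay inside the input budget.
// A single sentence larger than the budget still gets its own window,
// so the caller must split or drop it before sending.
func windows(from sentences: [String], budget: ContextBudget) -> [[String]] {
    var result: [[String]] = []
    var current: [String] = []
    var used = 0
    for sentence in sentences {
        let cost = estimatedTokens(sentence)
        if !current.isEmpty && used + cost > budget.inputTokens {
            result.append(current)
            current = []
            used = 0
        }
        current.append(sentence)
        used += cost
    }
    if !current.isEmpty { result.append(current) }
    return result
}
```

With made-up figures of 300 tokens of instructions, 500 for the schema and 800 held back for output, the room for actual transcript is 2496 tokens, not 4096. Since the estimate is rough and overshooting means a thrown error, it's safer to leave headroom than to pack windows to the limit.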

Where this leaves me

There's a number I keep coming back to. I measured the model's entity extraction against a baseline that uses no model at all, just statistics, and the model came out a little ahead, for roughly 140 times the compute. That's the question worth asking at every step. Not 'can the model do this', because it usually can, a bit, but 'is the little it adds worth what it costs'. Surprisingly often the honest answer was no, and the app got faster and better every time I took a model call out instead of putting one in.

That's the real shape of building on a small on-device model. It's brilliant at a few things, hopeless at a few others, and the craft is knowing which is which before you commit. None of it was knowable from a keynote slide. It came from building the rig, running real meetings through it, and reading the traces. The next question is the practical one, what you actually build once you know all this. That's the next post.

Testing Apple's LLM