Responsible LLM integration starts with a test rig

I'm convinced AI is moving to the edge. More and more of it is going to run on the device in your hand, not in a data center you rent by the token. Two things are driving that. The obvious one is that small language models keep getting better. The one people miss is the harness around the model: the prompts and tools and output formats and memory management that turn a raw engine into something useful. That's where the real gains are, and we've barely started.

Think about car engines.

A Le Mans Bentley of around 1930, the Speed Six that won the race in 1929 and 1930, ran a 6.6 litre straight-six. About 180 horsepower, race-tuned closer to 200. A magnificent machine, and the better part of 2 tonnes, riding on the suspension, brakes and tyres of its day. Now park a modern Mazda MX-5 next to it: 130 horsepower from a 1.5 litre engine, a quarter of the displacement, in a car half the weight. On paper the Bentley wins. On any real road the little Mazda is quicker, more capable and far more efficient, and not because of the engine. It's the brakes, the tyres, the suspension, the aero, the gearbox. The engine got smaller and cleaner and more powerful per litre, but the leap was everything built around it.

That's the bet for small language models. The engines are improving, but the harness is where the race is won, and we're early. And the work I need these models to do isn't getting any bigger. As the models improve, more and more of the everyday jobs I actually want to run on a phone come within reach. The target sits still while the engine catches up to it.

So why does the edge matter this much? Because once the model and the harness are good enough together, the economics only point one way. The developer stops paying a token bill for every interaction, because the work happens on hardware the user already owns. The marginal cost of a token at the edge is basically nothing: the hardware is a sunk cost and the power it draws is negligible. But the saved invoice is the small half of the story. The big half is what the user gets:

low latency, because nothing makes a round trip to a server
privacy by construction, because the data doesn't have to leave the device
offline use, because it keeps working with no connection and no signal
no monthly subscription holding the feature hostage

I won't pretend 'good enough' isn't doing a lot of work in that sentence. It is. But the direction of travel is clear, and as the engines and the harnesses improve together, more workloads have a real reason to move to the edge.

The question I actually wanted to answer is a narrow one. There's now a genuinely capable small model sitting free in hundreds of millions of pockets: the system model iOS 26 hands every app through the Foundation Models framework. No download, no API bill, no server. So the question isn't whether the edge will matter. It's whether the engine already in people's phones is good enough to build a real feature on, and how good my harness has to be to get it there. You don't answer that from a keynote slide. So I built a rig.

There's a personal reason it mattered now. I've held off putting AI into my own product for a long time, because I didn't think the models were ready. I'm still not sure they are. But they're close enough that I don't want to be caught flat-footed when they cross the line, so I wanted tooling that lets me size up each new model the week it lands: new families, quantization tricks, KV cache improvements, whatever runtime gain turns up next. The harness is how I keep pace instead of guessing. It's also, I'll admit, good fun.

One thing about who's doing the testing, because it changes what the results mean. I'm an enthusiastic hobbyist, not a machine-learning researcher who can wring the last drop out of a model. That isn't a disclaimer. It's the whole point. Real-world capability isn't what a model can do in a lab under perfect conditions. It's what a builder like me can get out of it with the docs, the APIs and the tooling the provider actually ships. If I can't get what I need on those terms, then what the model can do in theory doesn't matter, because I can't ship it. And the honest flip side: when something fails, the failure might be mine and not the model's, and a sharper prompter might clear a bar I couldn't. I flag that where I suspect it.

About the numbers: I'm keeping this qualitative. Partly because I watched hundreds of runs go past rather than keeping a tidy spreadsheet, so words are the honest way to report what I saw. And partly because this was hard-won work, and the detail is mine. I'll share what's useful to another builder. The granular measurements and the head-to-head comparisons I'm keeping.

The rig

It's one SwiftUI app, three tabs, three different kinds of test. The aim was simple: throw the same prompt at different models and watch what they do, with real metrics attached, so I'm comparing evidence and not vibes.

Everything sits behind a tiny protocol I call LLMProvider: a display name, an isReady check, a prepare() call, and a stream(prompt:options:) method that hands back text as it arrives. That's the whole contract. Apple's model gets an AppleFoundationProvider wrapping SystemLanguageModel and LanguageModelSession. Open models get an MLXProvider on Apple's MLX framework, pulling quantized weights and running them on the GPU. Both conform to the same protocol, so the rest of the app neither knows nor cares which one it's talking to, and adding a model is one line in a registry. I keep a small bench of open models around, Qwen and Llama and earlier Gemma builds, purely as comparison. Apple's model is the free baseline everything else has to beat to earn its download. This piece is about Apple's model. The cross-model bake-off is a story for another day.
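A minimal sketch of what that contract could look like. The four members come from the description above; everything else is an assumption for illustration: the GenerationOptions fields, the choice of AsyncThrowingStream as the return type, and EchoProvider, a stub that stands in for the real Apple and MLX providers.

```swift
import Foundation

// Hypothetical option bag; the real fields are not listed in this post.
struct GenerationOptions {
    var temperature: Double = 0.7
    var maxTokens: Int = 512
}

// The contract: a name, a readiness check, a prepare step, a streaming call.
protocol LLMProvider {
    var displayName: String { get }
    var isReady: Bool { get }
    func prepare() async throws
    func stream(prompt: String, options: GenerationOptions) -> AsyncThrowingStream<String, Error>
}

// Stand-in provider for illustration: streams the prompt back word by word.
final class EchoProvider: LLMProvider {
    let displayName = "Echo (stub)"
    private(set) var isReady = false

    func prepare() async throws {
        // A real provider would load weights or check model availability here.
        isReady = true
    }

    func stream(prompt: String, options: GenerationOptions) -> AsyncThrowingStream<String, Error> {
        let words = prompt.split(separator: " ").prefix(options.maxTokens).map { String($0) }
        return AsyncThrowingStream { continuation in
            for word in words { continuation.yield(word + " ") }
            continuation.finish()
        }
    }
}

// Adding a model is one more entry in this array.
let registry: [any LLMProvider] = [
    EchoProvider(),
]
```

A caller would await prepare() and then loop over the stream with for try await, appending each chunk to the visible text. Nothing in that loop depends on which provider is behind the protocol.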

Three bits of plumbing do the quiet work. A memory probe samples the app's footprint on every run, and a live gauge in the top bar shows it against the device's budget, going amber then red as it climbs. A metrics helper records wall-clock time and time to first token. And every run gets appended to a runs.jsonl file on the device, one JSON object a line, so there's a durable record I can pull off later. That last part matters more than it sounds. The most interesting failures are exactly the ones you'd otherwise scroll past and forget.
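The run log is simple enough to sketch. This is one way to append one JSON object per line using only Foundation. The RunRecord fields are a guess at a useful shape, not the rig's actual schema, and appendRun is a hypothetical helper name.

```swift
import Foundation

// Hypothetical record shape; field names are illustrative only.
struct RunRecord: Codable, Equatable {
    var model: String
    var prompt: String
    var output: String
    var wallClockSeconds: Double
    var timeToFirstTokenSeconds: Double?
    var memoryFootprintBytes: UInt64?
}

// Encodes a record as a single line of JSON terminated by a newline.
func jsonLine(for record: RunRecord) throws -> Data {
    let encoder = JSONEncoder()
    // No pretty-printing, so each object stays on one line.
    // Newlines inside strings are escaped by the encoder.
    encoder.outputFormatting = [.sortedKeys]
    var data = try encoder.encode(record)
    data.append(0x0A)
    return data
}

// Appends one record to the log, creating the file on first use.
func appendRun(_ record: RunRecord, to url: URL) throws {
    let line = try jsonLine(for: record)
    if FileManager.default.fileExists(atPath: url.path) {
        let handle = try FileHandle(forWritingTo: url)
        defer { try? handle.close() }
        try handle.seekToEnd()
        try handle.write(contentsOf: line)
    } else {
        try line.write(to: url)
    }
}
```

Because every line is a complete object, a half-written final line after a crash costs one record and leaves the rest of the file readable.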

What it tests

Three tabs, mapped to the three jobs I actually need a model to do.

The first, the bake-off, is general chat and writing. The prompt library is built to find weak spots, not to admire the prose: a summary capped at 2 sentences and 40 words, to see if the model respects a hard limit; the old bat-and-ball question, where the tempting answer is the wrong one; a strict JSON-only format test; a long-form persona test; a code test; and a hallucination probe that asks for the population of a town that doesn't exist. Then a graded ladder of meeting summaries, from a short excerpt up to a full, messy hour with the filler and crosstalk left in. The hardest version hands the model a background briefing alongside the transcript and asks it to summarize only what was actually said in the room. That one turned out to matter.
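The hard-limit and JSON-only tests are the two in that list that a few lines of code can check mechanically. A rough sketch, with assumptions stated plainly: the function names are made up, and the sentence counter is a crude split on terminal punctuation that will overcount on abbreviations and decimals.

```swift
import Foundation

// Counts whitespace-separated words.
func wordCount(_ text: String) -> Int {
    text.split(whereSeparator: { $0.isWhitespace }).count
}

// Rough sentence count: any run of ".", "!" or "?" ends a sentence.
func sentenceCount(_ text: String) -> Int {
    text.split(whereSeparator: { ".!?".contains($0) })
        .filter { !$0.trimmingCharacters(in: .whitespacesAndNewlines).isEmpty }
        .count
}

// True when the summary stays inside both caps (defaults match the preset: 2 sentences, 40 words).
func respectsLimit(_ summary: String, maxSentences: Int = 2, maxWords: Int = 40) -> Bool {
    sentenceCount(summary) <= maxSentences && wordCount(summary) <= maxWords
}

// Strict JSON-only check: the reply must start as JSON and parse as JSON,
// so a chatty preamble or a Markdown code fence fails.
func isJSONOnly(_ reply: String) -> Bool {
    let trimmed = reply.trimmingCharacters(in: .whitespacesAndNewlines)
    guard trimmed.first == "{" || trimmed.first == "[" else { return false }
    return (try? JSONSerialization.jsonObject(with: Data(trimmed.utf8))) != nil
}
```

Checks like these only flag a miss. Deciding whether a 41-word summary is a real failure or a near miss is still a job for whoever reads the run.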

The second tab, the meeting cockpit, tests live summarization for real. It loads a transcript, replays it paragraph by paragraph at a speed I can dial up, and sends a batch to the model every 10 paragraphs or 30 seconds. Each batch is summarized on its own, so the model never sees the whole meeting at once. This leans hard on guided generation, where you hand the model a typed Swift struct and it fills it in.
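The "10 paragraphs or 30 seconds" rule can be sketched as a small value type. Two assumptions here that the description above does not settle: the 30 seconds is measured from the first paragraph in the pending batch, and the time check only fires when a new paragraph arrives (a live app would probably also want a timer). ParagraphBatcher is a hypothetical name.

```swift
import Foundation

// Collects paragraphs and releases a batch when either threshold is hit.
// Time is passed in by the caller, so a replay can run faster than real time.
struct ParagraphBatcher {
    let maxParagraphs: Int
    let maxInterval: TimeInterval
    private var pending: [String] = []
    private var batchStart: TimeInterval?

    init(maxParagraphs: Int = 10, maxInterval: TimeInterval = 30) {
        self.maxParagraphs = maxParagraphs
        self.maxInterval = maxInterval
    }

    // Adds one paragraph at transcript time `now` (in seconds).
    // Returns a batch if one is due, otherwise nil.
    mutating func add(_ paragraph: String, at now: TimeInterval) -> [String]? {
        if pending.isEmpty { batchStart = now }
        pending.append(paragraph)
        let elapsed = now - (batchStart ?? now)
        guard pending.count >= maxParagraphs || elapsed >= maxInterval else { return nil }
        return flush()
    }

    // Releases whatever is pending, for example at the end of the meeting.
    mutating func flush() -> [String]? {
        guard !pending.isEmpty else { return nil }
        defer { pending = []; batchStart = nil }
        return pending
    }
}
```

Each returned batch would then go to the model in its own request, which is what keeps the model from ever seeing the whole meeting at once.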

The third tab, tools, tests function calling. I wrote 5 fake tools, weather, a calculator, a knowledge lookup, a clock and a contact lookup, all returning canned, deterministic results. The model isn't graded on whether the weather is right. It's graded on what it calls, with which arguments, and whether it uses the result it gets back. The prompts fall into 5 buckets: selection, restraint, parameter extraction, composition and integration. Restraint is the one people forget. Knowing when to call nothing is a skill in itself. Keep that one in mind.
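To make those grading criteria concrete, here is one way they could be written down in code. This is only an illustration: the rig itself is read by eye, and the types, tool outputs and grade function below are all hypothetical. The framework's own tool-calling API is deliberately left out.

```swift
import Foundation

// One recorded tool call: which tool, with which arguments.
struct ToolCall: Equatable {
    var name: String
    var arguments: [String: String]
}

// Canned tools: same input, same output, every time.
let fakeTools: [String: ([String: String]) -> String] = [
    "weather": { args in "Sunny, 21C in \(args["city"] ?? "unknown")" },
    "clock": { _ in "09:30" },
]

// What a test case expects. An empty call list is a restraint case:
// the right move is to call nothing.
struct Expectation {
    var calls: [ToolCall]
    var answerMustContain: String?
}

struct Grade: Equatable {
    var rightCalls: Bool   // right tools, right arguments, right order
    var usedResult: Bool   // the final answer reflects what the tool returned
}

func grade(trace: [ToolCall], answer: String, against expected: Expectation) -> Grade {
    let rightCalls = trace == expected.calls
    let usedResult = expected.answerMustContain.map { answer.contains($0) } ?? true
    return Grade(rightCalls: rightCalls, usedResult: usedResult)
}
```

A restraint case is just an Expectation with an empty calls array. Any tool call at all then marks rightCalls as false, however sensible the final answer reads.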

How I run it

The rig is built for looking, not for automated scoring. I pick a model, fire a preset or a prompt of my own, and read the trace. On the tools tab the trace is the whole point: every call, the exact arguments, the result handed back, the final answer, one tap to copy a run out. On the cockpit, the summary builds line by line with timestamps, next to a count of batches that got skipped, which became a story all of its own. The bake-off runs two models side by side on the same prompt.

One deliberate choice on the Apple side: every request runs in a fresh LanguageModelSession, nothing carried over. There's a practical reason as well as a tidy one. The on-device model's context window is small, so reusing a session means each earlier turn keeps eating the budget until your prompt starts getting truncated. Start fresh and every run gets the full window and a clean slate. One run can't contaminate the next, and I never end up blaming the model for a wedged session that was really my own fault.

Why bother

The keynote demo is genuinely impressive, and some of it is even real. But a demo is a sales pitch with better lighting. Almost nothing I learned from running real meetings through this rig is in the documentation. The keynote tells you what a model can do. The rig tells you what it does when nobody's watching. That gap, between the advertised capability and the behavior you actually see, is where responsible integration lives, and you don't get to it by reading the spec. You get to it by building the harness and watching.

This post was about building that harness. The results come next: what actually happened when I pushed real meetings through it, including the one finding I'd have bet money on that turned out to be wrong. Follow along.
