It started, as these things do, with a shortcut I was certain would work.
I’ve been building SwiftAgents, my Swift framework for talking to language models, and one of the local providers it supports is LM Studio — the app a lot of us reach for to run models on our own Macs. LM Studio recently grew support for the newer “Responses” API, the OpenAI-style endpoint that can remember a conversation for you. Instead of re-sending the whole chat history on every turn, you send only the new message plus a little breadcrumb — previous_response_id — that tells the server “you already remember the rest.” Less data over the wire, less bookkeeping on the client. An obvious win, and I wanted it in SwiftAgents.
Before wiring it in for good, I asked Claude Code to benchmark it. Ten turns of the same little conversation, run two ways: once with the new chaining trick, and once the old-fashioned way where you resend the entire history every single time. I just wanted to confirm the clever path was faster before committing to it.
The numbers came back backwards.
When the shortcut is the long way
Here is what the benchmark found, running a small Qwen3 model inside LM Studio. The left column is the “optimization” — chaining with previous_response_id, sending only the new message each turn. The right column is the brute-force approach — resending the entire conversation, every time, like a caveman.
The number shown is how many input tokens the server actually had to process on that turn:
| Turn | Chaining (only the new message sent) | Full resend (whole history every time) |
|---|---|---|
| 1 | 26 | 26 |
| 2 | 48 | 48 |
| 3 | 98 | 69 |
| 4 | 206 | 95 |
| 5 | 415 | 120 |
| 6 | 829 | 141 |
| 7 | 1,669 | 169 |
| 8 | 3,338 | 191 |
| 9 | 6,677 | 211 |
| 10 | 13,364 | 238 |
Read it twice, because I had to. The wasteful approach — resending everything — keeps the workload flat, around 240 tokens by turn ten. The clever approach, where I send almost nothing, somehow makes the server grind through thirteen thousand.
And look at the shape of that left column: 26, 48, 98, 206, 415, 829… it doubles every turn. A textbook geometric balloon. Whatever the server does internally when it “remembers” the conversation for you, it rebuilds the whole thing roughly twice as large each time. Since the model has to read all of those tokens before it can say a word, the wait balloons right along with the token count. By turn ten a single reply took 28 seconds with chaining, against 3 seconds without.
The optimization was, comfortably, the slowest possible way to hold the conversation.
Making sure it wasn’t just me
A result that silly deserves suspicion, so the next step was to check whether I’d misconfigured something or stumbled onto one bad model. The first idea was to run the benchmark against official GPT 5.5 – and there the caching behaved exactly as you’d expect. Then I asked Claude Code to run the same probe across a number of LLMs I had previously downloaded.
The balloon showed up every single time — small models and large, old architectures and brand-new ones, the plain ones and the fancy “reasoning” ones, and even a mixture-of-experts model. Same fingerprint each time: the chained path doubles every turn, the full-resend path stays flat.
A few of the more memorable data points:
- gpt-oss (a 20-billion-parameter mixture-of-experts model): ballooned to 16,833 tokens by turn ten — for a conversation that was genuinely 283 tokens long. That’s a 59× tax. The lovely irony here is that this model barely “thinks” out loud at all, yet it scored the worst blowup of the lot, which told us the bug has nothing to do with how much the model generates and everything to do with how the server rebuilds the history.
- A 12-billion Gemma model: by turn ten, a single reply took 37.6 seconds instead of the ~2.6 seconds the same conversation needed over the plain chat endpoint.
Importantly, this isn’t the Responses API being a bad idea, and it isn’t LM Studio being bad software — its ordinary chat endpoint is quick and caches beautifully. It’s one specific feature, the server-side conversation reconstruction behind previous_response_id, that misbehaves. I know it’s specific to LM Studio because the obvious points of comparison don’t do it: OpenAI’s own servers keep the token count equal to the real conversation, and Ollama — which simply declines to be stateful — keeps it flat too. Only LM Studio’s reconstruction inflates.
So rather than ship a feature that makes things slower, I did the boring, correct thing in SwiftAgents: on LM Studio it resends the full history and skips the chaining entirely. And I wrote the whole thing up, with a runnable reproduction script, as a bug report on LM Studio’s tracker. Sometimes the deliverable is a paper trail.
A side quest: the app I loved versus the one I didn’t
Somewhere in the middle of all this benchmarking, a different question crept in.
I’ve always preferred LM Studio. It’s the better-looking app, it feels more modern, and — the reason that actually mattered to me — it supported MLX, Apple’s on-device machine-learning framework, long before Ollama did. On Apple Silicon, MLX is the fast path, so for a good while LM Studio was simply the quicker way to run a model on a Mac. Ollama was the command-line workhorse I respected but didn’t reach for.
While poking at Gemma 4, I noticed Ollama had quietly closed that gap — it now runs the same modern, accelerated model formats I’d switched to LM Studio for in the first place. Which meant, for the first time, I could put the two of them on a truly level playing field: the same model, in the same quantization, and just race them.
So I did. Here’s Gemma-4-E4B, identical nvfp4 build on both:
| Ollama | LM Studio | |
|---|---|---|
| Reading your prompt (prompt processing) | 910 tok/s | 445 tok/s |
| Writing the answer (generation) | 62.7 tok/s | 51.7 tok/s |
| Time until the first word appears | 72 ms | 121 ms |
| Re-reading a 1,780-token prompt it just saw (warm cache) | 65 ms | 657 ms |
Ollama wins every row. It reads prompts twice as fast, generates noticeably quicker, starts answering sooner, and — the one that surprised me most — reuses its cache about ten times more cheaply. Ask it to re-read a prompt it just processed and it’s done in 65 milliseconds; LM Studio takes the better part of a second to do the same thing.
I want to be fair, because there’s an honest caveat buried in here. The first time I raced them I had LM Studio on MLX and Ollama on the older format, and in that mismatched setup LM Studio’s generation looked faster. It was a trap — I was comparing the fast format against the slow one. The moment I matched them quant-for-quant, the apparent win evaporated and Ollama pulled ahead on everything. So I won’t claim Ollama is universally faster at everything for everyone; I’ll claim the thing my data actually supports, which is that on the same model in the same format, Ollama came out ahead everywhere I looked.
That’s a slightly uncomfortable conclusion for me, given how much I liked the other app. But the stopwatch doesn’t care what’s prettier.
The part I keep thinking about
Here’s the bit that genuinely tickles me, and it’s not really about tokens at all.
I didn’t write any of these benchmarks. I described what I wanted to know — “load a model, run ten turns each way, track the response time” — and Claude Code wrote the Python, ran it and computed all the statistics. When it needed a model that wasn’t loaded, it drove LM Studio’s command-line tool to load it, checked the API to confirm it was really resident, and benchmarked it.
At one point it quoted a generation speed that looked too good, paused, decided the measurement window had been too short to trust, rewrote the benchmark to generate a longer sample, and re-ran it to get an honest number. It even filed the bug report on my behalf. You can see how additional info was added as comments as I was discovering more data.
At the same time my agentic CI loop was ticking as well on the SwiftAgents PR. When the pull request’s continuous-integration build went red on Linux — because a type I’d used lives in a different module off the Mac — it diagnosed the failure, reached for my own SwiftCross shim to fix it, pushed, watched the build, found a second spot with the same problem, fixed that too, and waited with me until all six platforms went green. I mostly watched.
A few months ago, writing a benchmark harness by hand would have been too much work for me. So I wouldn’t have done this research, but I would have just complained on Twitter about another problem in somebody else’s code. And I would have been frustrated that I couldn’t do anything about it. In this new reality agents do the research, the write-up and the filing of the issue. The ball is now in LM Studio’s court. This new reality still feels faintly like cheating.
I put the benchmarking scripts in gist for reference.
What I changed
Two things came out of an afternoon that was only ever meant to confirm a one-line optimization.
SwiftAgents now does the sensible thing on LM Studio: it resends the full conversation and leaves previous_response_id chaining well alone until the underlying balloon is fixed. The “optimization” stays on the shelf.
And on my own machine, my default has quietly shifted from the app I liked to the one that’s faster. I still think LM Studio is the nicer thing to look at. But I’ve been doing this long enough to know that when the numbers are that consistent, you go where the numbers point — even when they point somewhere you didn’t expect, and even when an AI is the one holding the stopwatch.
Do you use any local inferencing? If so, which do you prefer?

Categories: Bug Reports