What Context Length Actually Costs on CPU

Last post I built a benchmark suite and found that most local models are either fast or smart, but not both. The problem with those benchmarks: they were short. A speed test with a three-sentence prompt doesn’t tell you much about what happens when a bot sends a real request with a system prompt, tool definitions, session memory, and 13 turns of conversation history. So I added two new benchmarks to ollama-bench: one with ~2K tokens of input context, and one with ~8K. Then I ran all 14 models through the full suite. ...

February 26, 2026 · 6 min · Warren Parks

Local Models Are Exciting. My CPU Is Not.

The appeal of running your own language models is real: no API costs, no rate limits, no data leaving your network, and a fallback chain that still works when a cloud provider has an outage. I’ve been chasing that for a while. This week I finally sat down and measured what I actually have. The short version: the potential is there. The hardware isn’t. Yet.

Moving Ollama to the Server

I’d been running Ollama on my desktop. The problem with that is obvious once you think about it: the desktop sleeps, reboots, and isn’t shared. If a bot wants to use a local model at 3am, it’s out of luck. ...

February 24, 2026 · 6 min · Warren Parks