Last post I built a benchmark suite and found that most local models are either fast or smart, but not both. The problem with those benchmarks: they were short. A speed test with a three-sentence prompt doesn’t tell you much about what happens when a bot sends a real request with a system prompt, tool definitions, session memory, and 13 turns of conversation history.
So I added two new benchmarks to ollama-bench: one with ~2K tokens of input context, and one with ~8K. Then I ran all 14 models through the full suite.
The results changed my model rankings.
What the Benchmarks Simulate
The original benchmarks tested models in isolation: here’s a short prompt, give me a response. That’s useful for baselines, but it’s not how my bots actually work.
A real OpenClaw bot request looks more like this:
- A system prompt defining personality, rules, and infrastructure context
- Tool definitions (web search, code execution, file I/O, browser automation)
- Session memory from previous conversations
- The actual conversation — sometimes 10+ turns deep
The context_2k benchmark simulates a lightweight version of this: a minimal system prompt, three tool definitions, and five conversation turns. About 2,000 tokens before the model even starts generating.
The context_8k benchmark simulates a full bot session: comprehensive system prompt, eight tool definitions, detailed session memory, and 13 conversation turns. Around 8,000 tokens of input. This is close to what Bob actually processes on a busy Discord exchange.
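For a sense of what these fixtures look like, here's a minimal sketch of how a context-heavy chat payload can be assembled and sized. The prompt text, tool names, and the 4-characters-per-token heuristic are my illustrative assumptions, not ollama-bench's actual fixtures:

```python
# Sketch of assembling a context-heavy benchmark payload.
# All fixture text and the chars-per-token heuristic are illustrative.

def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English prose."""
    return len(text) // 4

def build_context_payload(system_prompt: str, tools: list[str],
                          turns: list[tuple[str, str]]) -> dict:
    """Assemble a chat request the way a real bot call looks:
    system prompt + tool definitions + alternating user/assistant turns."""
    messages = [{"role": "system",
                 "content": system_prompt + "\n\nTools:\n" + "\n".join(tools)}]
    for user_msg, assistant_msg in turns:
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": assistant_msg})
    total = sum(estimate_tokens(m["content"]) for m in messages)
    return {"messages": messages, "approx_input_tokens": total}

# Five turns of filler plus a small system prompt lands near the 2K target.
filler = "The deployment pipeline restarts the bot container on merge. " * 12
payload = build_context_payload(
    system_prompt="You are Bob, a Discord ops bot. Be terse. " * 10,
    tools=["web_search(query)", "run_code(source)", "read_file(path)"],
    turns=[(filler, filler)] * 5,
)
print(payload["approx_input_tokens"])  # close to 2,000 with these fixtures
```

The real benchmark would send `payload["messages"]` to the model; the token estimate is only there to hit the 2K/8K input targets.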
The Speed Tax
Here’s what happens to generation speed as context grows. Same hardware, same models, same day.
| Model | Size | Speed (short) | Speed (2K ctx) | Speed (8K ctx) | 8K wall time |
|---|---|---|---|---|---|
| llama3.2:1b | 1B | 38.8 tok/s | 34.6 tok/s | 27.6 tok/s | 22s |
| llama3.2:3b | 3B | 23.7 tok/s | 14.8 tok/s | 12.4 tok/s | 50s |
| gemma3:4b | 4B | 15.1 tok/s | 10.7 tok/s | 10.5 tok/s | 51s |
| phi4-mini:3.8b | 3.8B | 21.2 tok/s | 16.7 tok/s | 12.4 tok/s | 46s |
| qwen2.5-coder:7b | 7B | 11.2 tok/s | 8.5 tok/s | 6.7 tok/s | 66s |
| qwen3-coder:30b | 30B | 12.6 tok/s | 11.1 tok/s | 8.8 tok/s | 59s |
| lfm2:24b | 24B | 16.4 tok/s | 23.9 tok/s | 17.9 tok/s | 41s |
| gemma3:12b | 12B | 4.4 tok/s | 5.0 tok/s | 4.4 tok/s | 108s |
| phi4:14b | 14B | 4.9 tok/s | 5.1 tok/s | 4.5 tok/s | 139s |
| qwen3:14b | 14B | 4.2 tok/s | 4.4 tok/s | 3.4 tok/s | 248s |
| gemma3:27b | 27B | 2.3 tok/s | 2.0 tok/s | 2.6 tok/s | 218s |
| qwen3-coder-next | 30B | 3.7 tok/s | 4.0 tok/s | 3.9 tok/s | 111s |
Hardware: server01 — dual Xeon E5-2695 v4 @ 2.10GHz, 256GB DDR4, no GPU.
The pattern: small models lose the most. llama3.2:3b drops from 23.7 to 12.4 tok/s — a 48% speed hit going from a short prompt to 8K context. llama3.2:1b drops 29%. The models that were already slow barely move; they were bottlenecked on raw compute, not context processing.
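Those speed-hit percentages are just the relative drop between the short-prompt and 8K columns:

```python
# Relative slowdown from short-prompt speed to 8K-context speed,
# using the tok/s numbers straight from the table above.
def speed_drop(short_tps: float, ctx8k_tps: float) -> float:
    """Percent slowdown going from a short prompt to 8K of context."""
    return (1 - ctx8k_tps / short_tps) * 100

print(round(speed_drop(23.7, 12.4)))  # llama3.2:3b → 48
print(round(speed_drop(38.8, 27.6)))  # llama3.2:1b → 29
```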
The Quality Picture
Speed only matters if the model can actually use all that context. Here’s how quality holds up:
| Model | Size | Short score | 2K score | 8K score |
|---|---|---|---|---|
| qwen3-coder:30b | 30B | 100 | 100 | 100 |
| lfm2:24b | 24B | 100 | 100 | 100 |
| gemma3:27b | 27B | 100 | 100 | 100 |
| gemma3:12b | 12B | 100 | 100 | 100 |
| phi4:14b | 14B | 100 | 100 | 100 |
| qwen3:14b | 14B | 100 | 100 | 100 |
| qwen3-coder-next | 30B | 100 | 100 | 100 |
| qwen2.5-coder:7b | 7B | 100 | 100 | 100 |
| phi4-mini:3.8b | 3.8B | 100 | 70 | 70 |
| gemma3:4b | 4B | 100 | 70 | 100 |
| llama3.2:3b | 3B | 100 | 100 | 80 |
| llama3.2:1b | 1B | 100 | 70 | 80 |
| functiongemma:270m | 270M | 100 | 70 | 70 |
(Short score here is the speed benchmark — all models score 100 on raw generation. The 2K and 8K scores test whether the model can give a coherent, contextually grounded answer using the provided information.)
The 7B+ models all handle context perfectly. Below that, cracks appear. phi4-mini and gemma3:4b stumble on the 2K benchmark — they can generate text just fine, but struggle to synthesize information spread across a system prompt and conversation history. llama3.2:3b holds at 2K but drops to 80 at 8K, which matches what I see in production: it occasionally misses details from earlier in a conversation.
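A context-quality check like this can be as simple as planting facts in the system prompt and conversation history, then scoring the answer on how many it actually uses. A minimal sketch — the planted facts, sample answer, and pass threshold are hypothetical, not ollama-bench's actual rubric:

```python
# Score an answer by how many planted context facts it mentions.
# The facts and the model answer below are made-up examples.

def grounding_score(answer: str, required_facts: list[str]) -> int:
    """Return 0-100: the fraction of planted facts the answer mentions."""
    answer_lower = answer.lower()
    hits = sum(1 for fact in required_facts if fact.lower() in answer_lower)
    return round(100 * hits / len(required_facts))

# Facts seeded into the benchmark's system prompt / earlier turns.
facts = ["server01", "256GB", "dual Xeon"]

answer = "Bob runs on server01 with 256GB of RAM, so memory isn't the issue."
print(grounding_score(answer, facts))  # mentions 2 of 3 facts → 67
```

Substring matching is crude — it misses paraphrases — but it's deterministic and cheap, which matters when a full suite run already takes hours.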
The Standout: lfm2:24b
This model wasn’t in my last post because it didn’t exist yet. Liquid Foundation Model 2 dropped a few days ago and it’s genuinely surprising.
At 24B parameters it generates at 17.9 tok/s on 8K context — faster than llama3.2:3b at 12.4 tok/s on the same context. It scores 100 across every benchmark. It handles tool calls. The 8K wall time is 41 seconds, nine seconds faster than llama3.2:3b's 50.
The architecture departs from the standard transformer in a way that apparently scales much better on CPU. I don't fully understand the internals yet, but the numbers speak for themselves: it's the only model whose speed jumps sharply under context load, going from 16.4 tok/s on the short prompt to 23.9 tok/s at 2K, and it still holds 17.9 tok/s at 8K.
I’m testing it as a replacement for llama3.2:3b on Bob this week.
What This Changes
My previous conclusion was: llama3.2:3b for real-time, gemma3:12b for async, everything else is too slow. The context benchmarks shift that:
llama3.2:3b is slower than I thought. The speed test showed 23.7 tok/s and I built my expectations around that. In practice, with a real bot prompt, it’s running at 12.4 tok/s and taking 50 seconds for a response. That’s the difference between “snappy” and “Discord shows the typing indicator for a while.”
lfm2:24b is the new default candidate. Faster than llama3.2 under real context, perfect quality scores, tool use works. If it holds up in production, the 3B model drops to a fast fallback.
qwen3-coder:30b is the async workhorse. Perfect scores everywhere at 8.8 tok/s under 8K context. A minute per response is fine for code review jobs and background analysis. It replaces gemma3:12b in that role because it handles tool calls — gemma3 still can’t.
Context length is a real cost. When I’m optimizing bot prompts, every unnecessary paragraph in the system prompt is eating into generation speed. The benchmarks made this concrete: trimming a bot’s system prompt from 8K to 2K could nearly double response speed on the smaller models. That’s a real optimization target.
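The "nearly double" estimate falls out of a simple two-phase latency model: response time ≈ prefill time + generation time. A sketch — the 250 tok/s prefill rate is my assumption, chosen so the 8K case lands near the measured 50-second wall time, not a measured figure:

```python
# Back-of-envelope latency model: response time = prefill + generation.
# Generation speeds come from the table; the prefill rate is assumed.

def response_time(prompt_tokens: int, output_tokens: int,
                  prefill_tps: float, gen_tps: float) -> float:
    """Seconds to ingest the prompt plus seconds to generate the reply."""
    return prompt_tokens / prefill_tps + output_tokens / gen_tps

# llama3.2:3b, ~250-token reply, assumed 250 tok/s CPU prefill.
full = response_time(8000, 250, prefill_tps=250, gen_tps=12.4)
trimmed = response_time(2000, 250, prefill_tps=250, gen_tps=14.8)
print(f"8K prompt: {full:.0f}s, 2K prompt: {trimmed:.0f}s")
```

Under these assumptions the 8K prompt costs roughly 52 seconds and the 2K prompt roughly 25 — about a 2x difference, with most of the savings coming from the prefill phase, not generation.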
The GPU Update
Last post I was debating between a Tesla P4 ($80-125) and a T4 ($500+). The context benchmarks pushed me to a decision: I ordered three P4s.
Total came to just under $300 for everything: the cards, fans, shrouds, and a cable setup to connect the fans, since the R630 doesn't have standard GPU fan headers. Hopefully all of that works when it arrives. Three P4s give me 24GB of VRAM total (8GB per card), which is enough to fully offload lfm2:24b or run multiple smaller models simultaneously.
The T4s were tempting at 16GB VRAM each, but at over $600 per card the math didn't add up, especially for three of them. From what I've been reading, GPU memory prices aren't expected to come down this year, so waiting for a better deal isn't a real strategy. The P4s get me in the game now.
A GPU doesn’t just speed up generation — it speeds up prompt processing. The 8K context benchmark spends a significant chunk of wall time just ingesting the prompt before generation starts. A GPU would compress that phase dramatically. I’m expecting the biggest gains on exactly the benchmarks where CPU performance falls apart: the context-heavy ones.
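Ollama makes the two phases visible: every `/api/generate` response includes `prompt_eval_duration` (prefill) and `eval_duration` (generation), both in nanoseconds. A sketch of splitting a response into phases — the field names match Ollama's API, but the values in the sample dict are invented for illustration:

```python
# Split an Ollama response's timing into prefill vs. generation phases.
# Field names match Ollama's /api/generate metrics (nanosecond durations);
# the sample values below are made up, not from a real run.

NS = 1_000_000_000  # nanoseconds per second

def phase_breakdown(resp: dict) -> dict:
    """Seconds and tok/s for the prefill and generation phases."""
    prefill_s = resp["prompt_eval_duration"] / NS
    gen_s = resp["eval_duration"] / NS
    return {
        "prefill_s": prefill_s,
        "prefill_tps": resp["prompt_eval_count"] / prefill_s,
        "gen_s": gen_s,
        "gen_tps": resp["eval_count"] / gen_s,
    }

sample = {  # shaped like an /api/generate response, values invented
    "prompt_eval_count": 8000,
    "prompt_eval_duration": 30 * NS,
    "eval_count": 250,
    "eval_duration": 20 * NS,
}
stats = phase_breakdown(sample)
print(f"prefill {stats['prefill_s']:.0f}s at {stats['prefill_tps']:.0f} tok/s, "
      f"generation {stats['gen_s']:.0f}s at {stats['gen_tps']:.0f} tok/s")
```

Logging this split per benchmark run would show exactly how much of each wall time a GPU's faster prefill could claw back.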
When the cards arrive, I’ll rerun the full suite and find out which models cross the threshold from async-only to real-time usable.
Running 14 models through 12 benchmarks on CPU takes about 4 hours. That's the kind of patience local inference requires right now. But the data is worth it: I now know exactly where each model breaks down, and the answer under real context load turns out to be very different from the answer on a three-sentence prompt.