May 2026

By Davide Ciffa and Erik LaBianca

Gemma 4 26B edges out DeepSeek V4 Flash (284B) on ds4-eval-92, at 5x the speed

ds4-eval-92 head-to-head. Gemma 4 26B on a 24 GB RTX 5090 Laptop ties DeepSeek V4 Flash (284B) running on a 192 GB Mac, edges it when both run on that Mac, and decodes about 5x faster on the laptop.

Gemma 4 26B on an RTX 5090 Laptop next to DeepSeek V4 Flash on a Mac Studio

We expected the 192 GB Mac to have the edge: far more memory, a far bigger model. It didn't pan out. On ds4-eval-92, Gemma 4 26B (26B params, 4-bit, in 24 GB) tied DeepSeek V4 Flash (284B params, run at ~2-bit to fit the Mac) at 78.3% and decoded about 5x faster. Run on that same Mac, Gemma edged ahead at 79.3%.

The benchmark

ds4-eval-92 comes from antirez/ds4, Salvatore Sanfilippo's DeepSeek V4 Flash engine, the same one we're running on the Mac here. We ported its 92-case ds4_eval.c set into luce-bench; the provenance, grading, and request rules are in our "Running the benchmarks" intro to luce-bench. The set is adversarial enough that nothing we ran scored above ~82%, so the differences here are small and each one comes down to a case or two.

What we compared

Local Gemma 4 26B (26B total, ~4B active, Q4_K_M in 24 GB; RTX 5090 Laptop, lucebox + DFlash) against local DeepSeek V4 Flash (284B total, 13B active, run at ds4's ~2-bit IQ2_XXS to fit; 192 GB Mac Studio, M2 Ultra, 60-core GPU, native ds4_server). Each runs at the size, quant, and mode you'd use to run it locally, with the same eval set, the same max_tokens policy, and the same scoring. We run Gemma in nothink. DeepSeek runs in think, and it lands at ~78% in every mode and provider we tried, so the comparison doesn't hinge on its mode.
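The parameter counts and quant levels above can be turned into rough memory numbers. The sketch below is back-of-envelope arithmetic, not a measurement: it assumes nominal bit widths of 4 and 2 bits per weight (real Q4_K_M and IQ2_XXS files carry extra metadata and mixed-precision layers, so actual sizes will differ), ignores KV cache and activations, and treats "active parameters" as the weights read per decoded token. The helper name `weight_gb` is ours.

```python
# Back-of-envelope weight sizes from parameter count and nominal bit width.
# Assumption: flat 4-bit and 2-bit weights; real quant formats differ somewhat.

def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8

models = {
    # name: (total params in B, active params in B, nominal bits per weight)
    "Gemma 4 26B": (26, 4, 4),
    "DeepSeek V4 Flash": (284, 13, 2),
}

for name, (total_b, active_b, bits) in models.items():
    total = weight_gb(total_b, bits)
    active = weight_gb(active_b, bits)
    print(f"{name}: ~{total:.0f} GB of weights, ~{active:.2f} GB read per decoded token")
```

Under these assumptions Gemma's weights come to roughly 13 GB against roughly 71 GB for DeepSeek, and DeepSeek touches more weight per token even at half the bit width.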

Head to head

Each model at the size, quant, and mode you'd run locally:

| Model | Where | Mode | Pass | Accuracy | Wall (median) | Decode (tok/s) |
| --- | --- | --- | --- | --- | --- | --- |
| Gemma 4 26B (26B, 4-bit) | RTX 5090 Laptop, 24 GB | nothink | 72/92 | 78.3% | 9.8 s | 101.4 |
| DeepSeek V4 Flash (284B, ~2-bit) | Mac Studio, 192 GB | think | 72/92 | 78.3% | 144.8 s | 20.7 |

Accuracy

ds4-eval-92 accuracy, nothink Gemma vs think DeepSeek. Both 78.3% (72/92).

It's close. On the laptop in nothink it's a dead heat, 78.3% each (72/92). On the same Mac via MLX, Gemma noses ahead (79.3%); in think, it leads by about three points (81.5%, one seed). So a 26B model edges a 284B one on this set. One qualifier, stated plainly: to run on the Mac at all, the big model is squeezed to ~2-bit while Gemma runs at 4-bit, so part of the story is that 284B at ~2-bit doesn't pull away from 26B at 4-bit here. DeepSeek V4 Flash is a strong model and this is one benchmark. The point isn't that Gemma is better; it's that a model about an eleventh the size kept pace, far more efficiently.
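To see how small these margins are in cases, the percentages can be converted back to pass counts out of 92. This is a small sketch of that arithmetic; only the 72/92 figure is reported directly in the post, so the other pass counts it prints are back-calculated from the rounded percentages and should be read as implied, not reported. The function names are ours.

```python
TOTAL_CASES = 92  # size of ds4-eval-92

def accuracy_pct(passed: int, total: int = TOTAL_CASES) -> float:
    """Accuracy as a percentage, rounded to one decimal place."""
    return round(100 * passed / total, 1)

def implied_passes(pct: float, total: int = TOTAL_CASES) -> int:
    """Nearest whole pass count consistent with a reported percentage."""
    return round(pct * total / 100)

for pct in (78.3, 79.3, 81.5):
    passed = implied_passes(pct)
    print(f"{pct}% -> {passed}/{TOTAL_CASES} (check: {accuracy_pct(passed)}%)")

# How much a single case moves the score.
print(f"one case = {100 / TOTAL_CASES:.2f} points")
```

With 92 cases, one case is worth about 1.09 points, so the 78.3% vs 79.3% gap corresponds to a single case.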

Throughput

Decode throughput, local. Gemma 4 26B (nothink) on RTX 5090 Laptop = 101.4 tok/s, DeepSeek V4 Flash (think) on Mac 192 GB = 20.7 tok/s.

On the same prompts, the laptop Gemma run decodes about 4.9x faster (101.4 vs 20.7 tok/s), and because nothink Gemma answers in far fewer tokens, a median answer lands in 9.8 s against 144.8 s. The Mac isn't slow because it's a Mac. It's slow because DeepSeek V4 Flash (13B active) reads far more weight per decoded token than Gemma's ~4B-active (A4B) MoE does, and DeepSeek's ~2-bit decode on the Mac doesn't get the matmul throughput that Q4_K_M weights through DFlash get on dedicated VRAM.
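The two gaps in that paragraph are different sizes, and it helps to separate them. The sketch below uses only the four figures from the table; the interpretation of the leftover factor is our reading, since median wall time also includes prefill and serving overhead, not just decode.

```python
# Figures from the head-to-head table.
gemma = {"tok_s": 101.4, "median_wall_s": 9.8}
deepseek = {"tok_s": 20.7, "median_wall_s": 144.8}

# How much faster Gemma emits each token.
decode_speedup = gemma["tok_s"] / deepseek["tok_s"]

# How much sooner a median answer finishes end to end.
latency_ratio = deepseek["median_wall_s"] / gemma["median_wall_s"]

# What the decode rate alone does not explain (answer length, prefill, overhead).
leftover = latency_ratio / decode_speedup

print(f"decode speedup:         {decode_speedup:.1f}x")
print(f"median wall-time ratio: {latency_ratio:.1f}x")
print(f"not explained by decode rate: {leftover:.1f}x")
```

The wall-time gap (about 14.8x) is roughly three times the decode gap (about 4.9x), which is consistent with think-mode DeepSeek producing much longer outputs than nothink Gemma.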

Why the laptop is so fast

A 192 GB Mac has 8x the memory and runs a model 11x the size, so the result looks backwards. We think three things explain it, plus a fourth that doesn't.

Same Mac, smaller model

The laptop makes the speed gap dramatic, but it isn't what wins on quality. We also ran Gemma 4 26B on the same 192 GB Mac Studio through Apple MLX (8-bit, nothink): 79.3% on ds4-eval-92, a point above DeepSeek V4 Flash's 78.3% on that box, at ~40 tok/s against 20.7. On identical hardware, each model on its own local stack, the small MoE held its own and then some. This doesn't crown Gemma the better model; it shows that a 4B-active MoE can match a much larger one on this eval while costing far less to run. (Caveats: the engines and quants differ, MLX at 8-bit vs ds4_server at ~2-bit, and Gemma's nothink runs about level with its think.)

A note on serving

Gemma 4 26B is sensitive to how it's served: the same model on OpenRouter scored 73.9%, about 4.4 points under the 78.3% from our local lucebox serve, while DeepSeek holds its ~78% across providers.

Takeaway

DeepSeek V4 Flash is a quality model, and this isn't a knock on it. But on ds4-eval-92 a 24 GB laptop running Gemma 4 26B matched it on accuracy and decoded ~5x faster, and on the same Mac the smaller model came out a point ahead. Read it this way: a small MoE that happens to be strong on your workload can deliver that quality at a fraction of the memory and latency. Find the model that does well on your tasks, then size the hardware to it.


ds4-eval-92 from antirez/ds4 (MIT), run via luce-bench, single seed. Hardware: 24 GB RTX 5090 Laptop vs 192 GB Mac Studio (M2 Ultra, 60-core GPU), local serving. Project: github.com/Luce-Org/lucebox-hub.

Related

Run Gemma 4 26B locally with lucebox

DFlash + Q4_K_M on a single 24 GB GPU. Get the numbers in this post.
