April 2026

By Davide Ciffa

DFlash on ggml: up to 207 tok/s Qwen3.5-27B on an RTX 3090

We built a standalone C++/ggml speculative decoder for Qwen3.5-27B Q4_K_M with a DFlash block-diffusion draft. The demo video shows a 207.6 tok/s run (5.46x over AR); the HumanEval 10-prompt bench averages 129.5 tok/s at DDTree budget 22 on a single RTX 3090 (24 GB): 3.43x over autoregressive and 2.8x over the best public SGLang AWQ number.



Why the experiment exists

Qwen3.5-27B is a hybrid model: every 4th layer is full softmax attention, the rest (48 of 64) are Gated DeltaNet. M-RoPE with dimension sections [11, 11, 10, 0], 24 Q heads, 4 KV heads, key/value length 256, and an SSM state cache alongside the usual KV cache.

That combination doesn't have a good single-3090 decode path today.

We wanted the fastest single-3090 decode we could get on a 24 GB consumer card. The answer turned out to be: port only the graph glue to ggml, keep the DeltaNet kernel that already exists, run DFlash block-diffusion draft with a DDTree verifier, and compress the KV cache to Q4_0 for long context.

Architecture

The library is hardcoded for one model pair:

| Role | Model | Size |
|---|---|---|
| Target | Qwen3.5-27B-Q4_K_M.gguf | ~16 GB |
| Draft | z-lab/Qwen3.5-27B-DFlash | 3.46 GB bf16 |

Greedy verify, block size 16, CUDA only, single RTX 3090. The build links libggml*.a and nothing from libllama. If you deleted deps/llama.cpp/src/ it would still compile.

Layout

```
include/dflash27b.h          Public C API
src/gguf_target_loader.cpp   Q4_K_M qwen35 weights -> ggml tensors
src/safetensors_draft.cpp    bf16 DFlash draft weights -> ggml tensors
src/qwen35_target_graph.cpp  Hybrid forward + feature capture
src/qwen3_dflash_graph.cpp   5-layer non-causal DFlash draft graph
src/delta_net_chunked.cpp    Chunked Gated DeltaNet wrapper
src/kv_cache.cpp             Rolling target_hidden buffer
src/f16_convert.cu           F32/F16 conversion helpers
test/test_dflash.cpp         Prefill -> draft -> verify -> accept driver
test/test_vs_oracle.cpp      Numerics checks vs PyTorch oracle
examples/chat.py             Multi-turn CLI + OpenAI-compat server driver
```

Oracle: the PyTorch reference at megaqwen3_27b_dflash/reference/dflash_reference.py, cross-checked against the z-lab AutoModel forward at cos sim 0.999812.

From autoregressive to DDTree

Configurations on the same 10-prompt HumanEval bench, n_gen=256, RTX 3090, Q4_K_M target, bf16 draft. Rows 1–5 are the historical tuning sweep (commit f1cb9bf, AR baseline 37.44 tok/s). Row 6 is the fresh 2026-04-20 run on commit 5bb7f8c (AR baseline 37.78 tok/s). Speedup is computed against each row's contemporaneous AR:

| Mode | Mean AL | Mean tok/s | Max tok/s | Speedup |
|---|---|---|---|---|
| Autoregressive (historical, 37.44) | 1.00 | 37.44 | 45 | 1.00x |
| Chain DFlash | 7.67 | 112.82 | 150.06 | 3.01x |
| DDTree budget 20 (f32 inter) | 8.44 | 127.93 | 160.36 | 3.42x mean / 4.28x max |
| DDTree budget 22 (f32 inter) | 8.77 | 130.35 | 171.38 | 3.48x mean / 4.58x max |
| DDTree budget 20 (f16 inter) | 8.64 | 133.91 | 171.68 | 3.58x mean / 4.59x max |
| DDTree budget 22 (f16 inter, historical peak) | 8.88 | 135.8 | 159.7 | 3.63x mean / 4.26x max |
| DDTree budget 22 (fresh run, 2026-04-20) | 8.31 | 129.52 | 158.40 | 3.43x mean / 4.20x peak |

AL = average accept length (tokens accepted per verify step). The DDTree paper reports +35–42% over chain DFlash on pure-attention Qwen3-4B/8B/30B-MoE (A100/B200, BF16); on our hybrid Q4_K_M/RTX 3090 combo we see about +15% over chain. We think the gap comes from Q4 quantization flattening the draft softmax, which we partially patched with a chain pre-seed in build_ddtree.

Draft-accuracy ceiling. Budget sweep at 20/30/40 with f16 intermediate cache plateaus AL at ~8.9. Budget 30 gives AL 8.86 (120.49 tok/s), budget 40 gives AL 8.90 (105.10 tok/s). We are draft-ceiling bound, not verify-memory bound: a bigger tree would not help, only a better draft would.

Key wins (day-by-day log, condensed)

128K context on 24 GB

Flash attention in ggml-cuda supports Q4_0 K+V natively, so KV-cache compression is just ggml_cpy with the built-in F32→Q4_0 quantizer on write: roughly 3.6x smaller than an f16 cache (and ~7x smaller than f32), since Q4_0 packs 32 values into 18 bytes.

Combined with a rolling 4096-slot target_feat ring (wrap-around writes, split reads across the wrap boundary), target_feat shrinks from 6.6 GB to 0.2 GB at 128K. One binary, env-selectable:

```
DFLASH27B_KV_Q4=1
max_ctx = 131072
DRAFT_CTX_MAX = 2048
DFLASH27B_PREFILL_UBATCH = 16
--ddtree-budget = 16
```
| Prompt length | Prefill | Decode |
|---|---|---|
| HE 10-prompt (ctx=131072) | n/a | 134.78 tok/s (AL 8.33) |
| 13K tokens, n_gen=64 | 42 s | 99 tok/s |
| 32K tokens, n_gen=64 | 106 s | 35 tok/s |

Tradeoffs: Q4_0 KV costs ~3% accept length on HE (AL 8.56 -> 8.33) at short context but wins dramatically at long context; it is the only thing that lets 128K allocate at all on 24 GB.

Prefill

What comes next


Source: github.com/Luce-Org/lucebox-hub (open source, MIT). Cross-validated against the PyTorch oracle at cos sim 0.999812. Numbers above are from test_dflash on HumanEval 10-prompt bench, RTX 3090 Ampere sm_86, Q4_K_M target, bf16 draft, greedy verify.
