May 2026
Laguna XS.2 on a 3090: 107 tok/s, 5.4x prefill, first MoE target for PFlash
Poolside released Laguna XS.2, a 33B-A3B MoE model. We ported it into Luce Dflash + PFlash, making Laguna the first MoE target supported by PFlash. Result: ~107 tok/s decode at short context and 15.91 s TTFT at 128K on a single RTX 3090, 5.4x faster prefill than llama.cpp. A fast local model on consumer hardware, with a 128K context window that no longer costs minutes to fill.
TL;DR
- First MoE target in PFlash. PFlash now compresses long context for sparse-MoE models, not just dense. Laguna's 256-expert top-8 routing, per-layer head counts, partial RoPE with YaRN, and sliding-window attention all flow through the existing PFlash pipeline unchanged.
- Week-1 port. Laguna XS.2 (33B-A3B) running inside dflash on a 3090 within ten days of release. Hand-rolled CUDA, ggml-only, no libllama dependency.
- Decode. ~107 tok/s greedy on RTX 3090 at short context (bench_laguna_generate, n_gen=128). Drops with context: ~97 tok/s at 1K, ~59 tok/s at 4K. Autoregressive only until a Laguna spec-decode draft ships.
- Long-context prefill. 128K TTFT at 15.91 s with PFlash, vs 86.60 s for llama.cpp pp on the same machine. 5.4x speedup. Dense dflash OOMs at 128K. PFlash does not.
- NIAH passes at every measured (context, keep) pair from 256 to 131K, including aggressive keep=0.10 at 64K and 128K.
- Use case. Fast local model. PFlash compresses 128K of context in 16 s; the target then decodes at ~100 tok/s for short replies, on a single 24 GB GPU.
Where Laguna fits in the local-coder lineup
For most local-coding workloads on a 24 GB GPU, the choice is between Qwen3.6 35B-A3B (the current open MoE benchmark leader on hard agentic tasks) and a smaller, faster runner-up. Laguna XS.2 is a competitive pick: same MoE class (33B-A3B vs 35B-A3B, both 3B active), Apache 2.0, fully open training stack from Poolside. For the light end of the workload, Laguna is a viable alternative.
Where we would reach for Laguna over Qwen3.6 35B-A3B today:
- Lighter coding tasks where short replies and a clean MoE forward are enough; ~107 tok/s decode is plenty.
- Multilingual codebases. Strong open MoE coder for non-English identifiers, comments, and issue trackers; this is where Poolside's mix shines.
- Long-context Q&A over 128K context, where PFlash compresses the prompt in 16 s and the target then decodes in seconds.
The angle that matters for the easier workloads is throughput-per-cost. PFlash compresses 128K of context in 16 s, then the target decodes at ~107 tok/s at short reply lengths on a single 24 GB GPU. Long-context prompts that used to be a coffee break are now a few seconds of waiting.
Other places it earns a slot in the local lineup:
- License clarity. Apache 2.0, no field-of-use restrictions, no acceptable-use policy attached.
- Lab diversity. Different training data and RL recipe than the Qwen family. Useful as a routing target or an ensemble tail.
- Independent open-weight release. Poolside trains from scratch on their own stack. Running the model is one way to keep that pipeline alive.
What it took to run Laguna in Lucebox-hub
Laguna XS.2 is not architecturally vanilla. The loader and forward graph had to handle several non-standard features:
- MoE routing. 256 experts, top-8 per token, sigmoid router with score-correction bias, sum-norm and a 2.5x scale, plus an always-on shared expert.
- Per-layer head count [48, 64, 64, 64] x 10 layers. Not constant across the stack.
- Partial RoPE with YaRN on the full-attention layers, plus per-layer RoPE type.
- Sliding-window attention with a proper two-mask design (causal AND sliding window, not one or the other).
- Per-head softplus attention gate.
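The routing bullet above can be sketched in a few lines of plain Python. This is a minimal illustration, not the dflash kernel: the function name `route_token` and the toy logits are made up for the example, and it assumes the score-correction bias only affects which experts are selected while the mixing weights come from the unbiased sigmoid scores. The always-on shared expert sits outside this function; its output would simply be added to the weighted sum of the routed experts.

```python
import math

NUM_EXPERTS = 256   # from the post
TOP_K = 8           # from the post
ROUTE_SCALE = 2.5   # from the post


def route_token(logits, bias, top_k=TOP_K, scale=ROUTE_SCALE):
    """Return [(expert_index, weight), ...] for a single token.

    Assumption (not confirmed by the post): the correction bias is used
    only to pick experts; the weights use the unbiased sigmoid scores.
    """
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    biased = [s + b for s, b in zip(scores, bias)]
    # Select the top_k experts by biased score.
    chosen = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:top_k]
    # Sum-normalise the unbiased scores of the chosen experts, then apply the scale.
    total = sum(scores[i] for i in chosen)
    return [(i, scale * scores[i] / total) for i in chosen]


# Toy router logits for one token (hypothetical values in [-2, 2]).
example_logits = [(i % 17) / 4.0 - 2.0 for i in range(NUM_EXPERTS)]
picks = route_token(example_logits, [0.0] * NUM_EXPERTS)

# A large bias on expert 0 forces it into the selection even though its
# raw score is the lowest.
forced_bias = [10.0] + [0.0] * (NUM_EXPERTS - 1)
forced_picks = route_token(example_logits, forced_bias)
```

With sum-norm followed by the 2.5x scale, the eight weights always add up to 2.5 regardless of the raw scores.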
The result is a roughly 2.9K-node ggml graph, no libllama dependency, hand-rolled CUDA only. The loader places 678 tensors at 18.77 GiB on the GPU plus 110 MiB of token embeddings on the CPU. Fits comfortably on a 24 GB card alongside a Q4_0 KV cache for 128K context.
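The two-mask design from the list above is easiest to see on a tiny sequence. The sketch below builds the masks as nested Python lists; the helper names and the window size of 3 are hypothetical (the post does not state Laguna's window length), and the real graph would apply this inside the attention kernel rather than materialise a full matrix.

```python
def causal_mask(n):
    """mask[q][k] is True when key k is at or before query q."""
    return [[k <= q for k in range(n)] for q in range(n)]


def window_mask(n, window):
    """mask[q][k] is True when key k is not more than window-1 steps behind q.

    On its own this mask still lets a query see future keys, which is why
    it has to be combined with the causal mask.
    """
    return [[q - k < window for k in range(n)] for q in range(n)]


def sliding_causal_mask(n, window):
    """Logical AND of the causal mask and the sliding-window mask."""
    c = causal_mask(n)
    w = window_mask(n, window)
    return [[c[q][k] and w[q][k] for k in range(n)] for q in range(n)]


mask = sliding_causal_mask(6, window=3)
for row in mask:
    print("".join("x" if visible else "." for visible in row))
```

Each query ends up seeing itself plus at most `window - 1` earlier positions, which is what keeps the KV footprint of the sliding-window layers bounded.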
Numbers on a single RTX 3090
Time to first token, Q4_K_M weights
| Context | KV | dense dflash | PFlash dflash | llama.cpp pp | PFlash vs llama.cpp |
|---|---|---|---|---|---|
| 4 096 | Q8_0 | 0.82 s | 0.56 s | 1.73 s | 3.1x |
| 16 384 | Q4_0 | 3.73 s | 2.54 s | 8.81 s | 3.5x |
| 65 536 | Q4_0 | 23.50 s | 6.35 s | 32.85 s | 5.2x |
| 131 072 | Q4_0 | OOM | 15.91 s | 86.60 s | 5.4x |
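The last column of the table is just the ratio of the llama.cpp and PFlash times. A quick check, using only the numbers from the table (the variable names are ours):

```python
# context length -> (PFlash dflash TTFT, llama.cpp pp TTFT), seconds, from the table
ttft_seconds = {
    4096: (0.56, 1.73),
    16384: (2.54, 8.81),
    65536: (6.35, 32.85),
    131072: (15.91, 86.60),
}

# Speedup of PFlash over llama.cpp, rounded the same way as the table.
speedups = {
    ctx: round(llama / pflash, 1)
    for ctx, (pflash, llama) in ttft_seconds.items()
}

# Effective prompt tokens per second of wall-clock time with PFlash.
# "Effective" because the target only prefills the tokens PFlash keeps.
effective_tok_s = {ctx: ctx / pflash for ctx, (pflash, _) in ttft_seconds.items()}

for ctx, s in speedups.items():
    print(f"{ctx:>7} tokens: {s}x, {effective_tok_s[ctx]:.0f} prompt tok/s effective")
```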
Dense dflash OOMs at 128K. PFlash compresses the prompt with a Qwen3-0.6B drafter using block-sparse attention scoring, then hands the compressed token stream to the Laguna target. Cross-tokenizer round-trip uses byte-level BPE plus a word-boundary recovery pass that pulls dropped sub-token fragments back into the kept set.
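As a rough mental model of the compression step, here is a sketch of keep-ratio selection over fixed-size chunks. It is an illustration under stated assumptions, not the PFlash implementation: in PFlash the importance scores come from the Qwen3-0.6B drafter's block-sparse attention, whereas here they are hand-written numbers, and the function names and chunk size are hypothetical.

```python
import math


def select_chunks(scores, keep):
    """Return the indices of the highest-scoring chunks, in document order."""
    n_keep = max(1, math.ceil(keep * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


def compress(tokens, scores, chunk_size, keep):
    """Keep only the tokens that fall inside the selected chunks."""
    kept_tokens = []
    for c in select_chunks(scores, keep):
        kept_tokens.extend(tokens[c * chunk_size:(c + 1) * chunk_size])
    return kept_tokens


# 32 toy tokens in 8 chunks of 4, with made-up importance scores per chunk.
tokens = list(range(32))
chunk_scores = [0.1, 0.9, 0.2, 0.8, 0.05, 0.3, 0.0, 0.4]
compressed = compress(tokens, chunk_scores, chunk_size=4, keep=0.25)
print(compressed)
```

Sorting the selected chunk indices back into document order matters: the target still sees a left-to-right token stream, just a shorter one.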
Decode throughput
Measured on RTX 3090 with bench_laguna_generate (Q4_K_M target, default KV, n_gen=128, greedy):
| Prompt ctx | Decode tok/s |
|---|---|
| 128 | 107.4 |
| 1 024 | 97.1 |
| 4 096 | 59.0 |
Decode is autoregressive (single token per forward) until a Laguna spec-decode draft model is published. The dflash daemon's draft-loaded path is reserved for that drop-in; on Qwen3.5/3.6-27B the same machinery delivers a 3.4–3.75x speedup, so a future Laguna draft should land Laguna in the same range.
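To put the decode numbers in per-token terms, and to see what the Qwen speedup range would mean if it transferred unchanged, a small back-of-the-envelope calculation helps. The projection is an extrapolation only: the 3.4–3.75x range was measured on Qwen3.5/3.6-27B, not on Laguna, and no Laguna draft exists yet.

```python
# prompt context -> measured decode tok/s, from the table above
decode_tok_s = {128: 107.4, 1024: 97.1, 4096: 59.0}

# Per-token latency in milliseconds at each context length.
ms_per_token = {ctx: 1000.0 / tps for ctx, tps in decode_tok_s.items()}

# Speedup range measured on Qwen3.5/3.6-27B (assumption: it carries over).
SPEC_SPEEDUP_RANGE = (3.4, 3.75)
projected_tok_s = tuple(decode_tok_s[128] * s for s in SPEC_SPEEDUP_RANGE)

for ctx, ms in ms_per_token.items():
    print(f"ctx {ctx:>5}: {ms:.2f} ms/token")
print(f"hypothetical spec-decode range at short context: "
      f"{projected_tok_s[0]:.0f} to {projected_tok_s[1]:.0f} tok/s")
```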
NIAH retrieval, depth 0.5, BLUEHORIZON-7421 needle
| Context | KV | keep | drafter | target prefill | end-to-end TTFT | NIAH |
|---|---|---|---|---|---|---|
| 16 384 | Q8_0 | 0.10 | ~1.5 s | ~3 s | ~4.5 s | PASS |
| 65 536 | Q4_0 | 0.10 | ~5 s | ~6 s | ~11 s | PASS |
| 65 536 | Q4_0 | 0.30 | ~5 s | ~10 s | ~15 s | PASS |
| 131 072 | Q4_0 | 0.10 | 11.11 s | 4.79 s | 15.91 s | PASS |
| 131 072 | Q4_0 | 0.20 | 11.20 s | 13.55 s | 24.75 s | PASS |
| 131 072 | Q4_0 | 0.30 | 11.41 s | 26.43 s | 37.84 s | PASS |
Every (context, keep) point passes including the previously failing 64K+ at keep=0.10. The earlier failure was not a drafter bug. The cross-tokenizer step was truncating multi-token needles at PFlash chunk boundaries: BLUEH survived but ORIZON-7421 got dropped when its tokens fell in low-importance chunks. The fix is a word-boundary expansion pass that pulls partial-word fragments back into the kept set before decoding.
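The idea behind the word-boundary expansion pass can be sketched as follows. This is a simplified stand-in, not the code from the PR: it works on decoded token strings, treats a token that starts with whitespace as the start of a new word, and the function name and example tokens are hypothetical.

```python
def expand_to_word_boundaries(tokens, kept):
    """Grow a set of kept token indices so no word is kept only partially.

    tokens: decoded token strings, in order.
    kept:   indices that survived compression.
    Returns the expanded indices in document order.
    """
    # Assign each token to a word; a leading space starts a new word.
    word_ids = []
    current = -1
    for i, tok in enumerate(tokens):
        if i == 0 or tok[:1].isspace():
            current += 1
        word_ids.append(current)
    wanted_words = {word_ids[i] for i in kept}
    return [i for i, w in enumerate(word_ids) if w in wanted_words]


# The needle is split into four tokens; only the first one was kept.
toks = ["The", " code", " is", " BLUEH", "ORIZON", "-", "7421", " today"]
expanded = expand_to_word_boundaries(toks, kept={3})
recovered = "".join(toks[i] for i in expanded)
print(expanded, repr(recovered))
```

With only `BLUEH` kept, the pass pulls the remaining three fragments back in, so the full needle reaches the target.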
OpenAI-compatible server, with sampling
Same server.py as qwen35. Point --target at the Laguna GGUF and the binary detects arch=laguna from the metadata, then routes to run_laguna_daemon. You get the existing FastAPI surface: /v1/chat/completions (stream and non-stream), /v1/models, /health, CORS, prefix cache, prefill cache. Sampling parameters from the request body forward through to a CPU sampler chain on the daemon side. No new server, no new code path.
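For scripting against the endpoint, a standard-library client is enough. The sketch below assumes the server accepts the usual OpenAI field names `temperature` and `top_p` and returns the usual `choices[0].message.content` shape; the post only says sampling parameters are forwarded, so treat the exact field names as an assumption. The helper names are ours, and the model name and port are the ones used in the Reproduce section.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000",
                       temperature=None, top_p=None):
    """Build a non-streaming POST request for /v1/chat/completions."""
    body = {
        "model": "luce-dflash",
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    # Only send sampling fields the caller actually set (assumed field names).
    if temperature is not None:
        body["temperature"] = temperature
    if top_p is not None:
        body["top_p"] = top_p
    return urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **sampling):
    """Send the request and return the reply text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt, **sampling)) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]


request = build_chat_request("Tell me a one-line haiku about clouds.",
                             temperature=1.0, top_p=0.5)
```

With the server running, `chat("hello", temperature=1.0)` would return the generated text; nothing is sent until `chat` is called.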
Smoke test, prompt = "Tell me a one-line haiku about clouds." on a 3090:
| Sampler tail | Output (first 90 chars) |
|---|---|
| (greedy) | Fluffy white giants / Sail through the sky on gentle / Wings of summer breeze |
| 1.0,0.5,0,1.0,99 | Clouds drift like cotton dreams floating through the sky. |
Two distinct decodes from the same prompt confirm that the chain wires the HTTP body all the way to sample_logits.
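Why greedy and sampled decodes diverge is easy to show with a toy sampler. The following is a generic temperature plus top-p sketch, not the daemon's sample_logits; the function name `sample_token`, its parameters, and the example logits are hypothetical, and the real chain may include other stages.

```python
import math
import random


def sample_token(logits, temperature=0.0, top_p=1.0, rng=None):
    """Pick a token index: argmax when temperature <= 0, otherwise nucleus sampling."""
    if temperature <= 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    # Softmax with temperature, shifted by the max for numerical stability.
    peak = max(logits)
    weights = [math.exp((x - peak) / temperature) for x in logits]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Keep the smallest set of top tokens whose probability mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in order:
        nucleus.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(nucleus, weights=[probs[i] for i in nucleus], k=1)[0]


toy_logits = [0.1, 2.0, -1.0, 1.5]
greedy_pick = sample_token(toy_logits)
sampled_pick = sample_token(toy_logits, temperature=1.0, top_p=0.9,
                            rng=random.Random(99))
```

Greedy always returns the same index for the same logits; the sampled path depends on the temperature, the nucleus cutoff, and the seed, which is why a different sampler tail yields a different haiku.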
Reproduce
Model GGUF: Lucebox/Laguna-XS.2-GGUF (Q4_K_M 20.3 GB, BF16 66.9 GB, imatrix included).
# clone
git clone https://github.com/Luce-Org/lucebox-hub
cd lucebox-hub/dflash
# build with sm_86 (3090 / A6000)
cmake -B build -DCMAKE_CUDA_ARCHITECTURES=86
cmake --build build -j
# fetch the Q4_K_M GGUF + Poolside tokenizer
hf download Lucebox/Laguna-XS.2-GGUF laguna-xs2-Q4_K_M.gguf --local-dir models/
hf download poolside/Laguna-XS.2 chat_template.jinja tokenizer.json tokenizer_config.json \
special_tokens_map.json config.json --local-dir models/Laguna-XS-2
# run the OpenAI server (same server.py as qwen35, arch auto-detected from GGUF).
# -ctk/-ctv q4_0 keeps the 131K KV cache under ~6 GB so weights + KV fit on 24 GB.
python3 scripts/server.py \
--target models/laguna-xs2-Q4_K_M.gguf \
--tokenizer models/Laguna-XS-2 \
--port 8000 --max-ctx 131072 \
-ctk q4_0 -ctv q4_0
# chat
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"luce-dflash","messages":[{"role":"user","content":"hello"}],"stream":true}'
Note: rather than building for both 75;86, pass -DCMAKE_CUDA_ARCHITECTURES=86 to skip the extra arch and shave compile time on a 3090.
What is missing
- Spec-decode draft. Currently autoregressive at ~107 tok/s greedy at short context. Dflash with 27B suggests a proper draft would push decode past 200 tok/s.
- 256K context. The pipeline is tested to 131K. TQ3_0 KV on the target side is already proven at 256K context on qwen35. Laguna at 256K is the next push.
Bottom line
Laguna XS.2 is the first MoE target on PFlash and a clean fit for consumer hardware. Apache 2.0, 3B active parameters, fits next to a Q4_0 KV cache in 24 GB, and PFlash compresses 128K context in 16 seconds on a used RTX 3090, so the long-context loop is no longer a wait state. A solid second model to keep loaded next to your dense Qwen.
Source: PR #116 on github.com/Luce-Org/lucebox-hub. Model GGUF: Lucebox/Laguna-XS.2-GGUF. Benchmark numbers from dflash/RESULTS.md on the integration branch, measured on RTX 3090 24 GB. Upstream model: poolside/Laguna-XS.2; deeper dive at poolside.ai/blog/laguna-a-deeper-dive.