Blog

Engineering notes on custom kernels, local inference, and hardware design.

Laguna XS 2.1 on an RTX 3090: 296 tok/s peak, 152 tok/s at 256K context

Poolside's coding MoE with its official DFlash drafter: 296 tok/s peak at short context, a flat 152 tok/s at 256K tokens, prefill at 3,500 tok/s. Lossless speculative decoding plus KVFlash paging plus two new model-agnostic engine optimizations, on one 24 GB card.

July 2026 8 min read

Luce KVFlash: a small resident pool of KV on the GPU, the rest of a 256K context paged to host RAM

LONG CONTEXT

Luce KVFlash: 256K context with 72 MiB of KV on the GPU

On Qwen3.6-27B the KV cache costs 4.6 GiB at 256K and drags decode to 13 tok/s. KVFlash pages cold 64-token chunks to host RAM bit-exact: decode holds a flat 38.6 tok/s from 64K to 256K on a 3090, accuracy unchanged. One flag, every model family.

June 2026 9 min read

DOCKER

Lucebox in a container: one image for every supported GPU

A prebuilt image spanning the RTX 2080 Ti to the RTX 5090 (sm_75 to sm_120, CUDA 12.8). The ~25-minute fat-binary compile happens once in CI, not on your box. Two host deps (docker + nvidia-smi), self-tuning, provenance at /props. docker run --gpus all.

June 2026 8 min read

Luce Spark serving a 33-35B MoE from a fraction of the experts on consumer memory

MOE

Luce Spark: fit Qwen3.6 35B and Laguna XS.2 on a 16 GB GPU

A 33-35B MoE fires ~8 of 256 experts per token but pays for all of them in VRAM. Spark keeps resident only the experts traffic uses and swaps the rest: Qwen3.6 35B-A3B in 13.3 GiB, Laguna XS.2 in 14.6 GiB, self-tuning, one flag.

June 2026 8 min read

Gemma 4 26B on an RTX 5090 Laptop next to DeepSeek V4 on a MacBook

BENCHMARK

Gemma 4 26B edges out DeepSeek V4 Flash (284B) on ds4-eval-92, at 5x the speed

ds4-eval-92 head-to-head: Gemma 4 26B (26B, 4-bit) on a 24 GB RTX 5090 Laptop ties DeepSeek V4 Flash (284B, ~2-bit) on a 192 GB Mac at 78.3%, and decodes about 5x faster.

May 2026 7 min read

Lucebox client harness experiments on an RTX 3090

AGENTS

Launch and tune Lucebox with real agent harnesses

Real-client profiles, launch scripts, and TQ3/DDTree results for OpenCode, Hermes, OpenClaw, Open WebUI, Codex, Claude Code, and Pi.

May 2026 9 min read

Laguna XS.2 running on a single RTX 3090 inside the dflash daemon

MOE

Laguna XS.2 on a 3090: 111 tok/s, 5.4x prefill, first MoE target for PFlash

Poolside Laguna XS.2 (33B-A3B) ported into dflash + PFlash in ten days as the first MoE target supported by PFlash. ~107 tok/s decode at short context, 15.91 s TTFT at 128K on a single RTX 3090, 5.4x faster prefill than llama.cpp.

May 2026 10 min read

AMD Strix Halo running Qwen3.6-27B locally via lucebox

AMD

DFlash + PFlash on AMD Strix Halo: 2.5× end-to-end vs llama.cpp HIP

PR #119 lands DFlash + PFlash on the Ryzen AI MAX+ 395 iGPU (gfx1151, 128 GiB unified). Qwen3.6-27B Q4_K_M: 26.85 tok/s DFlash decode (2.23×), 20.2 s PFlash prefill at 16K (3.05×), 2.51× end-to-end at 16K + 1K gen vs vanilla llama.cpp HIP on the same silicon.

May 2026 8 min read

PFlash speculative prefill compression for DFlash

SPEC PREFILL

PFlash: 10× prefill speedup over llama.cpp at 128K on an RTX 3090

Long context overwhelms Q4 27B targets on 24 GB GPUs. PFlash compresses 128K → 2.6K with a small drafter before dflash sees the prompt. Head-to-head cold-vs-cold: 24.8 s TTFT vs ~257 s llama.cpp (10.4×); NIAH retrieval preserved at every measured context.

April 2026 12 min read

SPEC DECODE

DFlash on ggml: up to 207 tok/s Qwen3.5-27B on an RTX 3090

Standalone C++/ggml speculative decoder for Qwen3.5-27B Q4_K_M with DFlash block-diffusion draft + DDTree verifier. 3.43x AR, 2.8x SGLang AWQ, 128K context on 24 GB.

April 2026 10 min read

RTX 3090 + eGPU dock + MacBook, running NVIDIA on macOS over USB4

BENCHMARK

The eGPU Myth: Why a ~$300 Dock Won't Turn Your GPU Into an AI Workstation

tinygrad wrote an NVIDIA driver from scratch. We ran real models on an RTX 3090 over USB4. The engineering is brilliant. The numbers aren't there yet. Full benchmarks and profiling.

April 2026 12 min read

CUDA

Megakernel: Matching Apple Silicon Efficiency at 2x the Throughput on an RTX 3090

The first megakernel for hybrid DeltaNet/Attention LLMs. All 24 layers fused into a single CUDA dispatch. 1.87 tok/J, matching M5 Max efficiency at 1.8x the throughput on a 2020 GPU.

April 2026 15 min read