Blog

Engineering notes on custom kernels, local inference, and hardware design.

Gemma 4 26B on an RTX 5090 Laptop next to DeepSeek V4 on a MacBook

Gemma 4 26B edges out DeepSeek V4 Flash (284B) on ds4-eval-92, at 5x the speed

ds4-eval-92 head-to-head: Gemma 4 26B (26B, 4-bit) on a 24 GB RTX 5090 Laptop ties DeepSeek V4 Flash (284B, ~2-bit) on a 192 GB Mac at 78.3%, and decodes about 5x faster.

Lucebox client harness experiments on an RTX 3090

Launch and tune Lucebox with real agent harnesses

Real-client profiles, launch scripts, and TQ3/DDTree results for OpenCode, Hermes, OpenClaw, Open WebUI, Codex, Claude Code, and Pi.

Laguna XS.2 running on a single RTX 3090 inside the dflash daemon

Laguna XS.2 on a 3090: 111 tok/s, 5.4x prefill, first MoE target for PFlash

Poolside Laguna XS.2 (33B-A3B) ported into dflash + PFlash in ten days as the first MoE target supported by PFlash. ~107 tok/s decode at short context, 15.91 s TTFT at 128K on a single RTX 3090, 5.4x faster prefill than llama.cpp.

AMD Strix Halo running Qwen3.6-27B locally via Lucebox

DFlash + PFlash on AMD Strix Halo: 2.5× end-to-end vs llama.cpp HIP

PR #119 lands DFlash + PFlash on the Ryzen AI MAX+ 395 iGPU (gfx1151, 128 GiB unified). Qwen3.6-27B Q4_K_M: 26.85 tok/s DFlash decode (2.23×), 20.2 s PFlash prefill at 16K (3.05×), 2.51× end-to-end at 16K + 1K gen vs vanilla llama.cpp HIP on the same silicon.

PFlash speculative prefill compression for DFlash

PFlash: 10× prefill speedup over llama.cpp at 128K on an RTX 3090

Long context overwhelms Q4 27B targets on 24 GB GPUs. PFlash compresses 128K → 2.6K with a small drafter before dflash sees the prompt. Head-to-head cold-vs-cold: 24.8 s TTFT vs ~257 s llama.cpp (10.4×); NIAH retrieval preserved at every measured context.

Qwen3.5-27B DFlash on ggml

DFlash on ggml: up to 207 tok/s Qwen3.5-27B on an RTX 3090

Standalone C++/ggml speculative decoder for Qwen3.5-27B Q4_K_M with a DFlash block-diffusion draft and DDTree verifier. 3.43x vs. autoregressive (AR) decoding, 2.8x vs. SGLang AWQ, 128K context on 24 GB.

RTX 3090 + eGPU dock + MacBook, running NVIDIA on macOS over USB4

The eGPU Myth: Why a ~$300 Dock Won't Turn Your GPU Into an AI Workstation

tinygrad wrote an NVIDIA driver from scratch. We ran real models on an RTX 3090 over USB4. The engineering is brilliant. The numbers aren't there yet. Full benchmarks and profiling.

RTX 3090, the GPU behind the megakernel

Megakernel: Matching Apple Silicon Efficiency at 2x the Throughput on an RTX 3090

The first megakernel for hybrid DeltaNet/Attention LLMs. All 24 layers fused into a single CUDA dispatch. 1.87 tok/J, matching M5 Max efficiency at 1.8x the throughput on a 2020 GPU.