Blog
Engineering notes on custom kernels, local inference, and hardware design.
Gemma 4 26B edges out DeepSeek V4 Flash (284B) on ds4-eval-92, at 5x the speed
ds4-eval-92 head-to-head: Gemma 4 26B (26B, 4-bit) on a 24 GB RTX 5090 Laptop ties DeepSeek V4 Flash (284B, ~2-bit) on a 192 GB Mac at 78.3%, and decodes about 5x faster.
Launch and tune Lucebox with real agent harnesses
Real-client profiles, launch scripts, and TQ3/DDTree results for OpenCode, Hermes, OpenClaw, Open WebUI, Codex, Claude Code, and Pi.
Laguna XS.2 on a 3090: 111 tok/s, 5.4x prefill, first MoE target for PFlash
Poolside Laguna XS.2 (33B-A3B) ported into dflash + PFlash in ten days as the first MoE target supported by PFlash. ~107 tok/s decode at short context, 15.91 s TTFT at 128K on a single RTX 3090, 5.4x faster prefill than llama.cpp.
DFlash + PFlash on AMD Strix Halo: 2.5× end-to-end vs llama.cpp HIP
PR #119 lands DFlash + PFlash on the Ryzen AI MAX+ 395 iGPU (gfx1151, 128 GiB unified). Qwen3.6-27B Q4_K_M: 26.85 tok/s DFlash decode (2.23×), 20.2 s PFlash prefill at 16K (3.05×), 2.51× end-to-end at 16K + 1K gen vs vanilla llama.cpp HIP on the same silicon.
PFlash: 10× prefill speedup over llama.cpp at 128K on a RTX 3090
Long context overwhelms Q4 27B targets on 24 GB GPUs. PFlash compresses 128K → 2.6K with a small drafter before dflash sees the prompt. Head-to-head cold-vs-cold: 24.8 s TTFT vs ~257 s llama.cpp (10.4×); NIAH retrieval preserved at every measured context.
DFlash on ggml: up to 207 tok/s Qwen3.5-27B on a RTX 3090
Standalone C++/ggml speculative decoder for Qwen3.5-27B Q4_K_M with DFlash block-diffusion draft + DDtree verifier. 3.43x AR, 2.8x SGLang AWQ, 128K context on 24 GB.
The eGPU Myth: Why a ~$300 Dock Won't Turn Your GPU Into an AI Workstation
tinygrad wrote an NVIDIA driver from scratch. We ran real models on an RTX 3090 over USB4. The engineering is brilliant. The numbers aren't there yet. Full benchmarks and profiling.
Megakernel: Matching Apple Silicon Efficiency at 2x the Throughput on a RTX 3090
The first megakernel for hybrid DeltaNet/Attention LLMs. All 24 layers fused into a single CUDA dispatch. 1.87 tok/J, matching M5 Max efficiency at 1.8x the throughput on a 2020 GPU.