Blog
Engineering notes on custom kernels, local inference, and hardware design.
BENCHMARK
The eGPU Myth: Why a ~$300 Dock Won't Turn Your GPU Into an AI Workstation
tinygrad wrote an NVIDIA driver from scratch. We ran real models on an RTX 3090 over USB4. The engineering is brilliant. The numbers aren't there yet. Full benchmarks and profiling.
CUDA
Megakernel: Matching Apple Silicon Efficiency at 2x the Throughput on an RTX 3090
The first megakernel for hybrid DeltaNet/Attention LLMs. All 24 layers fused into a single CUDA dispatch. 1.87 tok/J, matching M5 Max efficiency at 1.8x the throughput on a 2020 GPU.