April 2026

By Sandro Puppo and Davide Ciffa

NVIDIA GPUs Work on macOS Again. We Benchmarked Them. The Driver Is a Miracle. The Inference Is Not.

tinygrad wrote an NVIDIA driver from scratch and got Blackwell GPUs running on Macs over Thunderbolt. We ran real models and profiled everything. The engineering is beautiful. The numbers need time.

RTX 3090 + eGPU dock + MacBook running NVIDIA on macOS over USB4

TL;DR

For the first time since 2019, NVIDIA GPUs compute on macOS. tinygrad built the driver. Apple approved it. Alex Ziskind tested Blackwell GPUs (5060 Ti, 5070 Ti, 5090). We deep-profiled an RTX 3090 over USB4. The pattern across every card:

- The driver works, and setup takes minutes.
- The USB4/Thunderbolt link is not the bottleneck; token generation barely touches it.
- The bottleneck is software maturity: tinygrad's NV backend uses 1-2% of the GPU's own memory bandwidth, so the eGPUs beat tinygrad's Metal backend but trail llama.cpp on the built-in chip by roughly 10x.

Seven Years in the Desert

In 2018, Apple and NVIDIA had a falling out. Apple dropped NVIDIA support in macOS Mojave, killing CUDA entirely, and went all in on its own Metal GPU framework. For seven years, if you wanted NVIDIA compute on macOS, you were out of luck.

Then tinygrad did something nobody else would. They wrote their own NVIDIA GPU driver from scratch. A macOS driver extension (DEXT) called TinyGPU. No NVIDIA drivers needed. No Linux needed. You plug a GPU into your Mac's Thunderbolt port, approve the system extension, and it computes.

It's not a hack. Apple signed the extension. It supports Ampere, Ada Lovelace, and Blackwell on the NVIDIA side, and RDNA3+ on AMD. Setup is documented end-to-end: a curl command for the DEXT, a system-extension approval, and Docker Desktop for tinygrad's nvcc/ptxas path. NVIDIA stopped shipping a CUDA toolkit for macOS years ago, so the compiler runs in a container.
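
A quick way to confirm the stack is alive after the approval step is a one-line compute test. This is a minimal sketch, not part of the documented setup, assuming tinygrad is installed and that (as with tinygrad's other backends) the NV env var selects the NVIDIA device:

Smoke test (Python, tinygrad)
import os
os.environ.setdefault("NV", "1")  # assumption: tinygrad's NV backend selector, set before first use
from tinygrad import Tensor

t = (Tensor.rand(1024, 1024) @ Tensor.rand(1024, 1024)).realize()
print("computed on:", t.device)   # expect the NV device, not METAL or CPU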

The Blackwell Numbers

Alex Ziskind tested three Blackwell GPUs on a 64 GB Mac Mini with the latest tinygrad. USB4 dock for the 5060 Ti, Razer Core X V2 ($349, PSU sold separately) for the bigger cards. Here's what he found:

Matrix multiplication

On tinygrad's matmul benchmark, the M4 Pro's Metal backend hit ~33 TFLOPS. The RTX 5060 Ti managed 22.7 TFLOPS, 31% slower than the built-in chip. The 5070 Ti got about a 64% boost over the 5060 Ti, and the 5090 came in slightly behind the 5070 Ti. None of the Blackwell cards pulled meaningfully ahead of the M4 Pro on this test.

External Blackwell GPUs matching but not beating an M4 Pro on raw matmul was not the result anyone expected. But matmul benchmarks measure tinygrad's compiler efficiency on both backends, not the GPUs themselves. The inference numbers are what matter.
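
For readers who want to reproduce the shape of these numbers, here is a rough sketch, not tinygrad's actual benchmark script: time a square matmul and convert wall-clock time to TFLOPS. The matrix size and iteration count are our assumptions; backend selection follows tinygrad's env-var convention (NV=1, METAL=1, and so on).

Matmul TFLOPS sketch (Python, tinygrad)
import time
from tinygrad import Tensor

N, iters = 4096, 10
a, b = Tensor.rand(N, N), Tensor.rand(N, N)
(a @ b).realize()                      # warmup: pays the one-time compile cost

t0 = time.perf_counter()
for _ in range(iters):
    (a @ b).realize()
dt = (time.perf_counter() - t0) / iters
print(f"~{2 * N**3 / dt / 1e12:.1f} TFLOPS")   # a square matmul is ~2*N^3 FLOPs
# timing is approximate; a rigorous run would synchronize the device each iteration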

LLM inference (tinygrad built-in benchmark)

GPU                       Model                      tok/s
RTX 5090 (eGPU)           Qwen3-8B                   ~6.0
RTX 5090 (eGPU)           Qwen3-30B MoE              ~6.5
RTX 5090 (eGPU)           Llama 3.1-8B (INT8 quant)  ~7.5
RTX 5070 Ti (eGPU)        Qwen3-8B                   ~5.5
RTX 5060 Ti (eGPU)        Qwen3-8B                   ~4.6
M4 Pro (Metal, tinygrad)  Qwen3-8B                   3.66

Every eGPU beats tinygrad's own Metal backend. On tinygrad's built-in benchmark (Qwen3-8B Q4), the 5090 is about 64% faster than Metal (6.0 vs 3.66 tok/s). Switching to Qwen3-4B Q4 via llama-benchy for an end-to-end test, Alex measured 7.39 vs 4.29 tok/s, a 72% gap, with time to first token 3-4x faster on the eGPU. That's real progress.

But then Alex ran llama.cpp.

The llama.cpp Reality Check

Same Mac Mini. Same model (Qwen3-4B Q4). Different software:

llama-bench vs tinygrad on Qwen3-4B Q4
# tinygrad on RTX 5090 (eGPU)
tok/s: 7.39   TTFT: ~5,000 ms

# tinygrad on Metal (M4 Pro)
tok/s: 4.29

# llama.cpp on Metal (M4 Pro)
tok/s: ~74   TTFT: 651 ms

# speedup: llama.cpp is 10x faster than tinygrad eGPU
# speedup: llama.cpp is ~17x faster than tinygrad Metal

On Qwen3-4B Q4, llama.cpp on the built-in M4 Pro is 10x faster than tinygrad on an RTX 5090 through an eGPU dock. Time to first token: 651 ms vs almost 5 seconds. (Note: tinygrad's first-token latency includes JIT shader compilation, which is a one-time cost per model. Subsequent prompts in the same session are faster, but the gap to llama.cpp remains large.)
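
To see that warmup effect in isolation, here is a minimal sketch using tinygrad's TinyJit with a toy function (not the Qwen3 graph): the first couple of calls pay compile-and-capture costs that later calls don't.

JIT warmup sketch (Python, tinygrad)
import time
from tinygrad import Tensor, TinyJit

@TinyJit
def step(x: Tensor) -> Tensor:
    return (x @ x).relu().realize()

x = Tensor.rand(1024, 1024)
for i in range(4):
    t0 = time.perf_counter()
    step(x)
    print(f"call {i}: {(time.perf_counter() - t0) * 1e3:.1f} ms")
# early calls compile and capture the graph; later calls replay cached kernels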

That 10x gap is massive, but it isn't surprising. llama.cpp has years of hand-tuned Metal kernels, fused dequant+matmul paths for every GGUF format, KV cache layouts tuned per architecture, and thousands of contributors squeezing performance out of every path. tinygrad's NV backend is months old and emits kernels from a general-purpose compiler rather than hand-writing them per model. Different stack, different point in its lifecycle.

We measured this ourselves on an RTX 3090 from two angles:

Setup                                Model                  tok/s
Native 3090 (llama.cpp CUDA)         Qwen3.5-35B-A3B (MoE)  109
MBP M5 Max (llama.cpp Metal)         Qwen3.5-35B-A3B (MoE)  89
Native 3090 (tinygrad CUDA, RunPod)  Qwen3-8B Q4            9.75
eGPU 3090 (tinygrad NAK)             Qwen3-8B Q4            2.28

Two separate signals. The top rows benchmark the 3090 against an M5 Max on a MoE model where only 3B params activate per token: native llama.cpp CUDA edges Metal by 22%, not a blowout. The bottom rows compare the same backend and same model native (on RunPod) vs on an eGPU: a 4.3x gap that isolates tinygrad's NAK kernel path from the hardware, the cable, and everything else.

The gap is real. But to understand whether it's permanent, you need to understand where the bottleneck actually is.

It's Not the Cable

Once model weights are loaded into GPU VRAM (a one-time transfer at startup), token generation is almost entirely GPU-internal. Weights are read from VRAM, computed on the GPU, and results written back. The only data crossing Thunderbolt per token is at most a few hundred kilobytes of embeddings and logits, negligible against a 5 GB/s link.
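
The arithmetic backs this up. A back-of-envelope sketch, with assumed round numbers (a ~4K hidden dimension, a ~151K-entry vocabulary, fp16 activations and logits):

Per-token link traffic, back of envelope (Python)
hidden, vocab, fp16_bytes = 4096, 151_000, 2
per_token = (hidden + vocab) * fp16_bytes      # embedding in + logits out, bytes
link_bw = 5e9                                  # USB4 payload bandwidth, bytes/s
print(f"{per_token / 1024:.0f} KiB/token, {per_token / link_bw * 1e6:.0f} us on the link")
# ~303 KiB and ~62 microseconds per token: orders of magnitude below the link's capacity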

So the question is: how much of the GPU's own memory bandwidth is the software actually using?

GPU memory bandwidth utilization
# RTX 3090 (our profiling, tinygrad eGPU)
VRAM BW available: 936 GB/s
VRAM BW used:      10.8 GB/s (1.2%)

# RTX 5090 (Alex's test, tinygrad eGPU)
VRAM BW available: 1,792 GB/s
VRAM BW used:      ~28.8 GB/s (1.6%)

# RTX 3090 (native tinygrad CUDA, RunPod)
VRAM BW available: 936 GB/s
VRAM BW used:      209 GB/s (22%)

# USB4 link: 5 GB/s. Not the constraint.

The RTX 5090 can move 1.8 TB/s through its own memory. tinygrad uses 28.8 GB/s. That's not a cable problem. That's a stack maturity problem. It's also a tractable one: llama.cpp went through the same gap years ago and closed it.
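
The utilization figures themselves fall out of simple roofline arithmetic: decode is memory-bound, so effective bandwidth is roughly the bytes of weights read per token times tokens per second. A sketch, assuming Qwen3-8B Q4 occupies ~4.8 GB of VRAM and is fully read each token:

Bandwidth cross-check (Python)
model_bytes = 4.8e9                  # assumed resident size of Qwen3-8B Q4
for name, tok_s, peak in [("3090 eGPU", 2.28, 936e9), ("5090 eGPU", 6.0, 1792e9)]:
    used = model_bytes * tok_s       # bytes actually streamed per second
    print(f"{name}: {used / 1e9:.1f} GB/s ({used / peak:.1%} of peak), "
          f"roofline ~{peak / model_bytes:.0f} tok/s")
# 3090: 10.9 GB/s (1.2%), roofline ~195 tok/s
# 5090: 28.8 GB/s (1.6%), roofline ~373 tok/s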

The reason this is good news: the hardware path works perfectly. USB4 and Thunderbolt are not holding anything back. As tinygrad's NVIDIA inference path matures, the numbers move. No hardware change needed.

Why It's Slow (For Now)

This is where our deep profiling on the RTX 3090 explains what Alex's numbers can't. We ran the experiments with DEBUG=4 and traced every kernel dispatch.

(A note on compiler paths: both tests used tinygrad's NV backend, but with different SASS generators. Our test, run a month earlier, used the NAK assembler (NV_NAK=1), which skips the Docker dependency. Alex's later test used the documented setup: NVIDIA's nvcc/ptxas running inside Docker. Two SASS paths, one runtime, same ~1-2% bandwidth utilization. The bottleneck lives upstream of the assembler, in tinygrad's kernel scheduling and fusion.)

On dense models (Qwen3-8B), tinygrad's profiler reports a low per-token dispatch count in both setups (the JIT fuses aggressively, so the count is much smaller than a layer-by-layer trace would imply). The eGPU 3090 over USB4 and the native PCIe x16 RunPod 3090 land at the same number. So the 4.3x gap between them isn't dispatch count. It's some mix of SASS quality (the eGPU build used the NAK assembler, RunPod used tinygrad's default nvcc/ptxas path) and per-launch latency over USB4. We didn't isolate the split.

On MoE models, the picture changes completely. Qwen3-30B-A3B logged thousands of dispatches per token in our DEBUG=4 trace (~2,800), and unlike the dense case the breakdown is dominated by per-expert work. Each dispatch is fast on its own, but at this density the GPU spends more time waiting between launches than computing. Result: 0.74 tok/s.

DEBUG=4 profiling: Qwen3-30B-A3B MoE per-token breakdown
Batch 1:   32 kernels    2.5 ms   37 GB/s  (only fast batch)
Batch 2:   64 kernels     34 ms    5 GB/s
Batch 3:  128 kernels     68 ms    5 GB/s
Batch 4:  256 kernels    145 ms    6 GB/s
Batch 5:  512 kernels    210 ms    7 GB/s
Batch 6: 1024 kernels    465 ms    6 GB/s
Batch 7:  829 kernels    418 ms    7 GB/s

Total: 2,845 kernels, ~1,343 ms per token → 0.74 tok/s
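
A toy launch-overhead model reproduces that total. The per-launch latency and compute figures below are assumptions chosen to illustrate the regime, not measurements; only the dispatch count comes from the trace:

Launch-bound decode, toy model (Python)
dispatches = 2_845                   # from the DEBUG=4 trace above
launch_ms, compute_ms = 0.4, 200     # assumed per-launch overhead and real kernel work
total_ms = dispatches * launch_ms + compute_ms
print(f"{total_ms:.0f} ms/token -> {1000 / total_ms:.2f} tok/s")
# ~1338 ms/token, ~0.75 tok/s, consistent with the measured ~1,343 ms and 0.74 tok/s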

For reference, tinygrad's AMD backend already delivers ~50 tok/s on Qwen3.5-9B (on a 7900 XTX). The NV (NVIDIA) backend is younger. It has room to grow.

The Hard Part Is Done

Already shipped (the hard part):

- A signed macOS driver extension (TinyGPU) covering Ampere, Ada Lovelace, and Blackwell
- Compiler integration: NVIDIA's nvcc/ptxas in Docker, or the NAK assembler
- GPU memory management over Thunderbolt/USB4

What's left (the maturity part):

- Fused operator paths (dequant+matmul and friends)
- KV cache layouts tuned per architecture
- MoE expert routing that doesn't explode into thousands of dispatches per token
- JIT warmup amortization for time to first token

The driver, the compiler integration, the memory manager. That's the infrastructure nobody else was willing to build, and it's done. The rest is the same kind of work that fills llama.cpp's commit history every week: fused operator paths, dequant+matmul (sketched below), KV cache layouts, expert routing, JIT warmup amortization. Hard work, but known work.
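
As one illustration of what a fused path means, here is a toy sketch in tinygrad terms: an int8 weight with per-row scales standing in for a real GGUF format. Whether tinygrad's scheduler fuses this exact pattern today depends on its current heuristics; the point is that dequant expressed inline can avoid materializing a full fp32 weight in VRAM.

Fused dequant+matmul, toy sketch (Python, tinygrad)
from tinygrad import Tensor, dtypes

w_q = Tensor.randint(4096, 4096, low=-128, high=128).cast(dtypes.int8)  # toy quantized weight
scale = Tensor.rand(4096, 1)                                            # per-row scales
x = Tensor.rand(1, 4096)                                                # one token's activation

# one expression: the cast-and-scale dequant can fuse into the matmul
y = (x @ (w_q.cast(dtypes.float32) * scale)).realize()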

Where Things Stand

                                         eGPU (tinygrad)            llama.cpp (Metal)        llama.cpp (native CUDA)
NVIDIA on Mac?                           Yes (first since 2019)     No (Metal only)          No (Linux)
Qwen3-8B Q4 tok/s (tinygrad built-in)    2.28 (3090) to 6.0 (5090)  3.66 (M4 Pro, tinygrad)  not run
Qwen3-4B Q4 tok/s (llama-bench)          7.39 (5090)                74 (M4 Pro, llama.cpp)   not run
Qwen3.5-35B-A3B MoE tok/s (llama.cpp)    not run                    89 (M5 Max)              109 (3090)
GPU BW utilization (measured)            1.2-1.6%                   not measured             22% (tinygrad CUDA)
Setup time                               ~5 min                     ~2 min                   N/A
Dock cost                                $130-300                   $0                       N/A
Maturity                                 Months old                 Years                    Years

The Bottom Line

For fast LLM inference on a Mac today, the answer is llama.cpp or MLX on Metal. On Qwen3-4B Q4, llama.cpp is roughly 10x faster than tinygrad over an eGPU dock, it's free, and it works out of the box.

As for whether NVIDIA GPUs will ever work on macOS again: they already do. The driver, the compiler integration, and the runtime are all in place. The inference stack on top of them is months old, and it shows.

Don't buy the dock expecting M5 Max throughput today. But don't dismiss what tinygrad built, either. We'll watch the NV backend closely. If it gets a year of llama.cpp-style attention, the eGPU dock turns from a curiosity into a real option for local LLMs on a Mac. Just not yet.

For the full community benchmark context, see Alex Ziskind's Blackwell tests on YouTube.


The gap between eGPU and native isn't a hardware gap, it's a software maturity gap. The hardware does its job. The inference stack on top of it needs time.

What we're working on

We're a small team pushing LLM inference efficiency on local hardware: custom CUDA kernels, hybrid architectures, power tuning. This eGPU investigation is one corner of that work. The most closely related writeup is our megakernel post: a single-dispatch CUDA kernel that, on the same RTX 3090, hits 411 tok/s at 220 W (1.87 tok/J) on a small hybrid DeltaNet/Attention model. Same silicon, much more throughput when the software actually uses it.

Mac + NVIDIA, finally working.

We'll keep watching tinygrad's NV backend mature.
