Partners

The gap
Most machines pair a stock GPU with a desktop CPU and run stock inference, missing the GPU-and-unified-memory pairing local AI needs. Four to six times the real throughput, left on the table.
01
A GPU bolted to a desktop CPU serves desktops, not LLMs. No pairing of fast VRAM with large unified memory for context.
02
Today's runtimes run one generic path on every GPU, with no kernel-level tuning for the silicon underneath. Most of what the hardware can actually do is left on the table.
03
On the same 27B model, a DGX Spark or Mac Studio leaves four to six times the real throughput on the table, simply because the stack was never tuned to the silicon.
We pair the silicon the way local AI needs and hand-tune the engine to the chip. The box we always wanted, built for four-to-six-times the throughput.
The machine
Big models and long context in memory.
10,496 CUDA cores for raw tok/s.
Pre-loaded and tuned. Plug in, go.
Launch price, ends June 30.
Specs & design
A 9.56-litre chassis machined from premium aluminium, designed and built by hand, in-house.
Performance
Hand-tuned inference, custom CUDA kernels, speculative decoding. Real numbers on real hardware, no synthetic scores.
Throughput · Qwen3.5-27B Q4_K_M
DFlash on ggml, ~130 tok/s sustained, vs a DGX Spark or Mac Studio (estimated) on the same model class.
Competitor figures are estimates for the same model class.
Cost · USD over 2 years, 8h/day
The full machine, one fixed cost, against equivalent cloud spend over two years at 8h/day.
Cloud: Claude Sonnet 4.5 ($3 / $15 per 1M), GPT-4o list.
Open source
Fully open in software and built in the open. Loved by a community of builders, with thousands of developers using and contributing to our repo already, and counting.
I like your stuff so far, keep going
this guy just cracked 134 tok/s on qwen 3.5-27b dense and 73 on new qwen 3.6-27b on a single 3090. open source moves at godspeed in 2026.
Interesting run w/ Dflash from the lucebox-hub guys
speculative PREFILL?????
I have tested some LLM server software for home PCs for Linux and Windows. Fastest and best for running home is Linux running 145 t/s, Lucebox. @pupposandro @luceboxai
๐ฅ PFlash just killed the 4-minute blank screen problem. ๐ 128K token prefill in 25 seconds, same GPU, same model, no compromises
Consumer-grade GPUs actually have sufficient hardware potential, general-purpose frameworks just waste most of it on overhead. Lucebox releases that potential through hand-written kernels, letting even a 2020 RTX 3090 rival Apple's latest chips on efficiency.
Crazy I was litteraly wondering how can I increase my token speed 10 min ago
Crazy what @pupposandro just dropped on Qwen3.5-27B. 207 tok/s on a single 3090 with Q4_K_M and full DFlash speculative? Chinese labs + ggml hacks are just cooking on consumer hardware right now. This is the kind of local win I like to see.
RTX 3090 ready for a new life! Bringing it to @luceboxai team to make some experiments together ๐
github.com/Luce-Org/lucebox-hub looks promising as a way to run "dense" models (eg. Qwen 27B) more efficiently. It's janky, but on my 5090 laptop it seems to be ~2x more tok/s than llama.cpp
This is very very good work BRAVO. I love it
Nearly 10x faster! After finishing Decoding, it starts cranking through Prefill. The previous DFlash was already stunning enough, and now they've added PFlash. Speculative prefill, up to 10x speedup. Go try it right now.
First time reading about speculative prefill, and it's crazy. 257s down to 24s for a 128K prompt on a single RTX 3090. Great article, definitely go ahead and give this a read.
impressive... so this is what it looks like when you focus on a set ram limit and optimizing for a single model above everything
What you get
Plug in, pair over Bluetooth, point your tools at it. From box to agent in about a minute.
No CUDA install, no Docker, no model downloads to babysit. Plug in, pair over Bluetooth, and your model is already loaded in VRAM. About a minute from box to first token.
Our harness ships launchers and regression tests for the agents below. Any OpenAI or Anthropic compatible client just works. Local by default, cloud fallback on demand.
RTX 3090 with 24GB GDDR6X next to the Ryzen AI MAX+ 395 and 128GB LPDDR5X unified memory, over a custom 90-degree PCIe Gen4 x4 riser. 35B models sit entirely in VRAM.
Natively supports






We do not ship anything we have not run, every box is rebuilt and proven before it leaves.
Every RTX 3090 is fully disassembled, repasted, and re-padded, then benchmarked until it performs like a new card, with many years of life ahead.
We measure the box under sustained inference and confirm there are no hotspots and no throttling. The cooling is sized for the real load, not a spec sheet.
We push every machine to its limits for three full days, real inference and memory under maximum load, so any weak part fails here on our bench, never later on your desk.
Apply
A strictly limited first production batch, and we build only a handful at a time. $4,900 USD per machine until June 30, then the price moves to $5,900. Applying is free, you only pay once we reach out with the details and you decide to go ahead. Applications are reviewed daily, and multi-unit orders get priority on selection.
Ends June 30 · --d --h --m --s
Apply with your use case. Free, about a minute.
We review daily and email you the details.
Pay only if you choose to go ahead.
Thesis
01
Coding agents now burn $100 to $1000 per developer every month. Frontier model prices keep climbing as demand grows. Local inference flips the math: fixed hardware cost, unlimited tokens, predictable budget.
02
Regulated industries cannot ship customer data, source code, or patient records to a cloud LLM. CTOs need an air-gapped option that runs on-prem and stays inside the building. Lucebox is that option, end to end.
03
Qwen3.6, GLM-4.6, and DeepSeek V4-Flash are catching up to closed frontier models fast. Open weights mean no vendor lock-in, real auditability, and a stack you actually own. The hardware to run them well is the missing piece.
Engineering Blog
Speculative Prefill
24.8s TTFT vs ~257s on llama.cpp for Qwen3.6-27B on a single RTX 3090. NIAH retrieval preserved.
Read the blogpost →Speculative Decode
Qwen3.5-27B Q4_K_M at up to 207 tok/s, 3.43× over autoregressive, 128K context on 24 GB.
Read the blogpost →MoE
Qwen3.6 35B-A3B in 13.3 GiB, Laguna XS.2 in 14.6 GiB. Only the experts traffic uses stay resident. Self-tuning, one flag.
Read the blogpost →AMD Strix Halo
26.85 tok/s decode on the gfx1151 iGPU, 2.23× faster decode than llama.cpp HIP on the same silicon.
Read the blogpost →MoE
Poolside MoE ported into DFlash + PFlash in ten days. 111 tok/s decode, 5.4× faster prefill.
Read the blogpost →Docker
One prebuilt image from the RTX 2080 Ti to the RTX 5090. Two host deps, self-tuning, docker run --gpus all.
Read the blogpost →