The computer for local AI.

Lucebox pairs an RTX 3090 with a Ryzen AI MAX+ 395 and a custom inference engine tuned for it. It beats every machine at its price on tokens per second, and ships with everything pre-installed.

We have a limited number of Lucebox in production.
Priority given to the ones who come first.

Partners

AMD
NVIDIA
Poolside

The gap

Nothing else was built for local AI inference.

Most machines pair a stock GPU with a desktop CPU and run stock inference, missing the GPU-and-unified-memory pairing local AI needs. Four to six times the real throughput, left on the table.

01

Hardware not designed for inference.

A GPU bolted to a desktop CPU serves desktops, not LLMs. No pairing of fast VRAM with large unified memory for context.

02

No software tuned to the chip.

Today's runtimes run one generic path on every GPU, with no kernel-level tuning for the silicon underneath. Most of what the hardware can actually do is left on the table.

03

Four to six times slower than they should be.

On the same 27B model, a DGX Spark or Mac Studio leaves four to six times the real throughput on the table, simply because the stack was never tuned to the silicon.

We know. That's why we built Lucebox.

We pair the silicon the way local AI needs and hand-tune the engine to the chip. The box we always wanted, built for four-to-six-times the throughput.

The machine

Strix Halo and RTX 3090, in one small box.

Memory

128 GB unified

Big models and long context in memory.

Throughput

RTX 3090, 24 GB

10,496 CUDA cores for raw tok/s.

Ready

Tuned and pre-loaded

Pre-loaded and tuned. Plug in, go.

Price

$5,900 $4,900 -17%

Launch price, ends June 30.

Specs & design

Premium aluminium, 9.56 litres.

A 9.56-litre chassis machined from premium aluminium, designed and built by hand, in-house.

128 GB LPDDR5X Unified on the Ryzen AI MAX+ 395, 16 cores, ~256 GB/s.
24 GB GDDR6X RTX 3090, 10,496 CUDA cores at 936 GB/s.
2 TB NVMe SSD Fast on-device storage for models and data.
Full local privacy Inference stays on the box. Data never leaves.
500 W thermal system About 500 W, cooled for sustained inference.
Corsair PSU 750 W 80+ Gold, headroom to spare, 24/7 ready.
Custom PCIe x4 GPU to SoC over PCIe Gen4 x4, custom riser.
Zero setup Pre-loaded and tuned. Pair over Bluetooth, go.
Lucebox technical blueprint: GPU, APU, memory bandwidth and chassis dimensions

Performance

Fastest tok/s in its price range.

Hand-tuned inference, custom CUDA kernels, speculative decoding. Real numbers on real hardware, no synthetic scores.

Throughput · Qwen3.5-27B Q4_K_M

Up to 207 tok/s

DFlash on ggml, ~130 tok/s sustained, vs a DGX Spark or Mac Studio (estimated) on the same model class.

4–6×faster

Competitor figures are estimates for the same model class.

Cost · USD over 2 years, 8h/day

$5,900 $4,900 once

The full machine, one fixed cost, against equivalent cloud spend over two years at 8h/day.

cheaper

Cloud: Claude Sonnet 4.5 ($3 / $15 per 1M), GPT-4o list.

Open source

Proudly open source, and built in public.

Fully open in software and built in the open. Loved by a community of builders, with thousands of developers using and contributing to our repo already, and counting.

AAhmad Ahmad @TheAhmadOsman

I like your stuff so far, keep going

SSudo su Sudo su @sudoingX

this guy just cracked 134 tok/s on qwen 3.5-27b dense and 73 on new qwen 3.6-27b on a single 3090. open source moves at godspeed in 2026.

LLotto Lotto @LottoLabs

Interesting run w/ Dflash from the lucebox-hub guys

RRijndael Rijndael @rot13maxi

speculative PREFILL?????

RRiku Pasonen ๐ŸŒž Riku Pasonen ๐ŸŒž @Raitziger

I have tested some LLM server software for home PCs for Linux and Windows. Fastest and best for running home is Linux running 145 t/s, Lucebox. @pupposandro @luceboxai

ffahd Mirza fahd Mirza @fahdmirza

๐Ÿ’ฅ PFlash just killed the 4-minute blank screen problem. ๐Ÿš€ 128K token prefill in 25 seconds, same GPU, same model, no compromises

GGeek Lite Geek Lite @QingQ77

Consumer-grade GPUs actually have sufficient hardware potential, general-purpose frameworks just waste most of it on overhead. Lucebox releases that potential through hand-written kernels, letting even a 2020 RTX 3090 rival Apple's latest chips on efficiency.

TTakyonโˆž Takyonโˆž @Takyon

Crazy I was litteraly wondering how can I increase my token speed 10 min ago

KKyz2ren Kyz2ren @ky2renzz

Crazy what @pupposandro just dropped on Qwen3.5-27B. 207 tok/s on a single 3090 with Q4_K_M and full DFlash speculative? Chinese labs + ggml hacks are just cooking on consumer hardware right now. This is the kind of local win I like to see.

IIvan Fioravanti Ivan Fioravanti @ivanfioravanti

RTX 3090 ready for a new life! Bringing it to @luceboxai team to make some experiments together ๐Ÿ˜Ž

vvitalik.eth vitalik.eth @VitalikButerin

github.com/Luce-Org/lucebox-hub looks promising as a way to run "dense" models (eg. Qwen 27B) more efficiently. It's janky, but on my 5090 laptop it seems to be ~2x more tok/s than llama.cpp

CCAPET โ˜€๏ธ CAPET โ˜€๏ธ @Capetlevrai

This is very very good work BRAVO. I love it

nnash_su nash_su @nash_su

Nearly 10x faster! After finishing Decoding, it starts cranking through Prefill. The previous DFlash was already stunning enough, and now they've added PFlash. Speculative prefill, up to 10x speedup. Go try it right now.

AAJ AJ @ItsmeAjayKV

First time reading about speculative prefill, and it's crazy. 257s down to 24s for a 128K prompt on a single RTX 3090. Great article, definitely go ahead and give this a read.

TTwon. Twon. @Web3Twon

impressive... so this is what it looks like when you focus on a set ram limit and optimizing for a single model above everything

What you get

Plug-and-play, end to end.

Plug in, pair over Bluetooth, point your tools at it. From box to agent in about a minute.

$ brew install lucebox && luce โ–‘โ–ˆโ–ˆ โ–‘โ–ˆโ–ˆ โ–‘โ–ˆโ–ˆ โ–‘โ–ˆโ–ˆ โ–‘โ–ˆโ–ˆ โ–‘โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–‘โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–‘โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–‘โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–‘โ–ˆโ–ˆ โ–‘โ–ˆโ–ˆ โ–‘โ–ˆโ–ˆ โ–‘โ–ˆโ–ˆ โ–‘โ–ˆโ–ˆ โ–‘โ–ˆโ–ˆ โ–‘โ–ˆโ–ˆ โ–‘โ–ˆโ–ˆ โ–‘โ–ˆโ–ˆ โ–‘โ–ˆโ–ˆ โ–‘โ–ˆโ–ˆ โ–‘โ–ˆโ–ˆ โ–‘โ–ˆโ–ˆ โ–‘โ–ˆโ–ˆ โ–‘โ–ˆโ–ˆ โ–‘โ–ˆโ–ˆ โ–‘โ–ˆโ–ˆ โ–‘โ–ˆโ–ˆ โ–‘โ–ˆโ–ˆ โ–‘โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–‘โ–ˆโ–ˆ โ–‘โ–ˆโ–ˆ โ–‘โ–ˆโ–ˆ โ–‘โ–ˆโ–ˆ โ–‘โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–‘โ–ˆโ–ˆ โ–‘โ–ˆโ–ˆ โ–‘โ–ˆโ–ˆโ–ˆ โ–‘โ–ˆโ–ˆ โ–‘โ–ˆโ–ˆ โ–‘โ–ˆโ–ˆ โ–‘โ–ˆโ–ˆโ–ˆ โ–‘โ–ˆโ–ˆ โ–‘โ–ˆโ–ˆ โ–‘โ–ˆโ–ˆ โ–‘โ–ˆโ–ˆ โ–‘โ–ˆโ–ˆ โ–‘โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–‘โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–ˆโ–ˆ โ–‘โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–‘โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–‘โ–ˆโ–ˆโ–‘โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–‘โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–‘โ–ˆโ–ˆ โ–‘โ–ˆโ–ˆ Looking for your LuceBox via Bluetooth... Found: LuceBox-A7F3 Scanning WiFi networks... Available networks: home-5g, home-2.4, guest, xfinitywifi Select (1-4): 1 Password: โ—โ—โ—โ—โ—โ—โ—โ— Connecting to home-5g... โœ“ Connected. Dashboard at http://lucebox.local โœ“ LuceBox is ready. โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ โ”‚ LuceBox โ”‚ โ”‚ โ”‚ โ”‚ Model: Qwen3.6-27B Q4_K_M โ”‚ โ”‚ LLM: โ— online โ”‚ โ”‚ Endpoint: http://lucebox.local:8080 โ”‚ โ”‚ โ”‚ โ”‚ Commands โ”‚ โ”‚ โ”‚ โ”‚ /status System overview โ”‚ โ”‚ /models List models loaded โ”‚ โ”‚ /engine Runs our lucebox-hub engine โ”‚ โ”‚ /harness Switch agent harness โ”‚ โ”‚ /shell Open SSH shell โ”‚ โ”‚ /help All commands โ”‚ โ”‚ โ”‚ โ”‚ Type a message, or / for commands โ”‚ โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ > Hey, I'm your Lucebox. How can I help you?โ–Š

Zero setup, instant go.

No CUDA install, no Docker, no model downloads to babysit. Plug in, pair over Bluetooth, and your model is already loaded in VRAM. About a minute from box to first token.

Works with the tools you already use.

Our harness ships launchers and regression tests for the agents below. Any OpenAI or Anthropic compatible client just works. Local by default, cloud fallback on demand.

The chips we'd pick ourselves.

RTX 3090 with 24GB GDDR6X next to the Ryzen AI MAX+ 395 and 128GB LPDDR5X unified memory, over a custom 90-degree PCIe Gen4 x4 riser. 35B models sit entirely in VRAM.

Natively supports

Claude Code
Claude Code
Codex
Codex
OpenCode
OpenCode
Hermes
Hermes
OpenClaw
OpenClaw
Open WebUI
Open WebUI
Ollama
Ollama

Every machine, battle-tested.

We do not ship anything we have not run, every box is rebuilt and proven before it leaves.

Refurbished to like-new

Every RTX 3090 is fully disassembled, repasted, and re-padded, then benchmarked until it performs like a new card, with many years of life ahead.

Thermals proven, no surprises

We measure the box under sustained inference and confirm there are no hotspots and no throttling. The cooling is sized for the real load, not a spec sheet.

72-hour max stress test

We push every machine to its limits for three full days, real inference and memory under maximum load, so any weak part fails here on our bench, never later on your desk.

Disassembly and inspection
01Disassembly & inspection
Die repaste and VRAM pad replacement
02Die repaste & VRAM pads
03Thermal & memory burn-in
04LLM benchmarking
One year warranty badge
051-year warranty issued

One-year hardware warranty. Full system, parts and labor. A refurbished RTX 3090 has years of life left in it, and if anything fails under normal use we repair or replace it, no questions.

Apply

Apply to reserve your Lucebox.

A strictly limited first production batch, and we build only a handful at a time. $4,900 USD per machine until June 30, then the price moves to $5,900. Applying is free, you only pay once we reach out with the details and you decide to go ahead. Applications are reviewed daily, and multi-unit orders get priority on selection.

Launch price $4,900 $5,900 17% off

Ends June 30 · --d --h --m --s

Free to apply · No card required · You only pay after we confirm a unit for you.

1

Apply with your use case. Free, about a minute.

2

We review daily and email you the details.

3

Pay only if you choose to go ahead.

FAQ

Frequently asked.

A plug-and-play computer for local AI inference. It ships pre-tuned with custom CUDA kernels and speculative decoding, exposes a single CLI, and gives you the best tok/s speed in its price range.
RTX 3090 blower with 24GB GDDR6X, paired with the AMD Ryzen AI MAX+ 395 and 128GB LPDDR5X unified memory. The GPU runs over four PCIe Gen4 x4 lanes on a custom 90-degree riser between the SoC and the GPU. Chassis is 9.56 liters, fits in a backpack.
$4,900 USD per machine until June 30, then $5,900. That's a 17% launch discount. First production batch ships July 2026 on a rolling basis. Apply above to lock in the launch price.
First production batch starts shipping July 2026. We review applications daily and email selected applicants with payment and shipping details.
Yes. Companies ordering multiple units get priority on both selection and shipping speed. Mention your quantity in the application.
Yes. Our inference work lives at github.com/Luce-Org/lucebox-hub with 2,200+ stars, 45+ contributors, hand-tuned CUDA kernels, DFlash speculative decoding, PFlash prefill, and a megakernel for hybrid LLMs.
Far more than fits in 24GB of VRAM. The Strix Halo's 128GB of unified memory lets you run much larger models too, like MiniMax 2.7 and DeepSeek V4-Flash. We pre-tune Qwen3.6, GLM-4.6, DeepSeek V4-Flash, and the Llama 4 family, and you can bring your own GGUF, the CLI takes care of loading it.
Yes. Any tool that speaks the Anthropic Messages API works. Set the base URL to your Lucebox and you are done.
Works out of the box. The OS exposes an OpenAI-compatible endpoint as well, so any client that calls OpenAI can call Lucebox instead.
No. Lucebox runs fully offline once your model is loaded. Internet is only needed for updates and the optional cloud fallback.
Quiet under normal load. Under sustained inference the blower fan ramps up but the chassis directs exhaust out the back and keeps the system in a comfortable office-noise range.
Around 500 watts under full inference load, roughly 350W for the RTX 3090 and 150W for the Ryzen platform. Idle sits much lower. Plug it into any standard wall outlet.
One year on the full machine, parts and labor. We replace or repair any unit that fails under normal use.
Yes. Full root access. It is your machine. Open the shell, install whatever you want, run whatever you want.
If the machine arrives DOA or fails under warranty, we cover return shipping and either repair, replace, or refund. Details in the order confirmation email.
Yes. Pricing in the application is the base unit price, import duties and shipping are quoted on selection based on destination country.
The $4,900 launch price is for the machine only and excludes VAT and shipping. Duties, import taxes, and shipping are the buyer's responsibility, quoted and charged on top of the base price once we confirm your unit and destination.
Yes. The 128 GB of unified memory plus 24 GB of VRAM lets you keep a hot model resident in VRAM while a second sits in unified memory ready to swap in. The CLI handles loading and routing per request.

Thesis

Why Local AI, and why now.

01

Tokens are getting expensive.

Coding agents now burn $100 to $1000 per developer every month. Frontier model prices keep climbing as demand grows. Local inference flips the math: fixed hardware cost, unlimited tokens, predictable budget.

02

Privacy is non-negotiable.

Regulated industries cannot ship customer data, source code, or patient records to a cloud LLM. CTOs need an air-gapped option that runs on-prem and stays inside the building. Lucebox is that option, end to end.

03

Open source AI must win.

Qwen3.6, GLM-4.6, and DeepSeek V4-Flash are catching up to closed frontier models fast. Open weights mean no vendor lock-in, real auditability, and a stack you actually own. The hardware to run them well is the missing piece.

Engineering Blog

Read more about our experiments and benchmarks.

Speculative Prefill

PFlash: 10× prefill at 128K

24.8s TTFT vs ~257s on llama.cpp for Qwen3.6-27B on a single RTX 3090. NIAH retrieval preserved.

Read the blogpost →

Speculative Decode

DFlash on ggml: 207 tok/s

Qwen3.5-27B Q4_K_M at up to 207 tok/s, 3.43× over autoregressive, 128K context on 24 GB.

Read the blogpost →

MoE

Luce Spark: 35B MoE on 16 GB

Qwen3.6 35B-A3B in 13.3 GiB, Laguna XS.2 in 14.6 GiB. Only the experts traffic uses stay resident. Self-tuning, one flag.

Read the blogpost →

AMD Strix Halo

DFlash + PFlash on Ryzen

26.85 tok/s decode on the gfx1151 iGPU, 2.23× faster decode than llama.cpp HIP on the same silicon.

Read the blogpost →

MoE

Laguna XS.2 on a 3090

Poolside MoE ported into DFlash + PFlash in ten days. 111 tok/s decode, 5.4× faster prefill.

Read the blogpost →

Docker

One image for every supported GPU

One prebuilt image from the RTX 2080 Ti to the RTX 5090. Two host deps, self-tuning, docker run --gpus all.

Read the blogpost →
Read all posts