A plug-and-play computer for local AI inference. It ships pre-tuned with custom CUDA kernels and speculative decoding, exposes a single CLI, and gives you the best tok/s speed in its price range.

What hardware is inside?

RTX 3090 blower with 24GB GDDR6X, paired with the AMD Ryzen AI MAX+ 395 and 128GB LPDDR5X unified memory. The GPU runs over four PCIe Gen4 x4 lanes on a custom 90-degree riser between the SoC and the GPU. Chassis is 9.56 liters, fits in a backpack.

$4,900 USD per machine until June 30, then $5,900. That's a 17% launch discount. First production batch ships July 2026 on a rolling basis. Apply above to lock in the launch price.

First production batch starts shipping July 2026. We review applications daily and email selected applicants with payment and shipping details.

Can larger orders ship sooner?

Yes. Companies ordering multiple units get priority on both selection and shipping speed. Mention your quantity in the application.

Is the software open source?

Yes. Our inference work lives at github.com/Luce-Org/lucebox-hub with 2,200+ stars, 45+ contributors, hand-tuned CUDA kernels, DFlash speculative decoding, PFlash prefill, and a megakernel for hybrid LLMs.

What models can I run?

Far more than fits in 24GB of VRAM. The Strix Halo's 128GB of unified memory lets you run much larger models too, like MiniMax 2.7 and DeepSeek V4-Flash. We pre-tune Qwen3.6, GLM-4.6, DeepSeek V4-Flash, and the Llama 4 family, and you can bring your own GGUF, the CLI takes care of loading it.

Can I use it with Claude Code, Cursor, OpenCode?

Yes. Any tool that speaks the Anthropic Messages API works. Set the base URL to your Lucebox and you are done.

What about Ollama and LM Studio?

Works out of the box. The OS exposes an OpenAI-compatible endpoint as well, so any client that calls OpenAI can call Lucebox instead.

Does it need internet?

No. Lucebox runs fully offline once your model is loaded. Internet is only needed for updates and the optional cloud fallback.

Quiet under normal load. Under sustained inference the blower fan ramps up but the chassis directs exhaust out the back and keeps the system in a comfortable office-noise range.

What's the power draw?

Around 500 watts under full inference load, roughly 350W for the RTX 3090 and 150W for the Ryzen platform. Idle sits much lower. Plug it into any standard wall outlet.

One year on the full machine, parts and labor. We replace or repair any unit that fails under normal use.

Yes. Full root access. It is your machine. Open the shell, install whatever you want, run whatever you want.

What about returns and refunds?

If the machine arrives DOA or fails under warranty, we cover return shipping and either repair, replace, or refund. Details in the order confirmation email.

Do you ship internationally?

Yes. Pricing in the application is the base unit price, import duties and shipping are quoted on selection based on destination country.

The $4,900 launch price is for the machine only and excludes VAT and shipping. Duties, import taxes, and shipping are the buyer's responsibility, quoted and charged on top of the base price once we confirm your unit and destination.

Can I run multiple models at once?

Yes. The 128 GB of unified memory plus 24 GB of VRAM lets you keep a hot model resident in VRAM while a second sits in unified memory ready to swap in. The CLI handles loading and routing per request.

Lucebox: the plug-and-play computer for local AI inference

Name: Lucebox
Brand: Lucebox
Price: 4900 USD
Availability: PreOrder

The gap

Nothing else was built for local AI inference.

Most machines pair a stock GPU with a desktop CPU and run stock inference, missing the GPU-and-unified-memory pairing local AI needs. Four to six times the real throughput, left on the table.

01

Hardware not designed for inference.

A GPU bolted to a desktop CPU serves desktops, not LLMs. No pairing of fast VRAM with large unified memory for context.

02

No software tuned to the chip.

Today's runtimes run one generic path on every GPU, with no kernel-level tuning for the silicon underneath. Most of what the hardware can actually do is left on the table.

03

Four to six times slower than they should be.

On the same 27B model, a DGX Spark or Mac Studio leaves four to six times the real throughput on the table, simply because the stack was never tuned to the silicon.

We know. That's why we built Lucebox.

We pair the silicon the way local AI needs and hand-tune the engine to the chip. The box we always wanted, built for four-to-six-times the throughput.

The machine

Strix Halo and RTX 3090, in one small box.

Memory

128 GB unified

Big models and long context in memory.

Throughput

RTX 3090, 24 GB

10,496 CUDA cores for raw tok/s.

Ready

Tuned and pre-loaded

Pre-loaded and tuned. Plug in, go.

Price

$5,900 $4,900 -17%

Launch price, ends June 30.

Lucebox machine, front three-quarter view

Lucebox machine, side view with rear vent

Specs & design

Premium aluminium, 9.56 litres.

A 9.56-litre chassis machined from premium aluminium, designed and built by hand, in-house.

128 GB LPDDR5X Unified on the Ryzen AI MAX+ 395, 16 cores, ~256 GB/s.

24 GB GDDR6X RTX 3090, 10,496 CUDA cores at 936 GB/s.

2 TB NVMe SSD Fast on-device storage for models and data.

Full local privacy Inference stays on the box. Data never leaves.

500 W thermal system About 500 W, cooled for sustained inference.

Corsair PSU 750 W 80+ Gold, headroom to spare, 24/7 ready.

Custom PCIe x4 GPU to SoC over PCIe Gen4 x4, custom riser.

Zero setup Pre-loaded and tuned. Pair over Bluetooth, go.

Lucebox technical blueprint: GPU, APU, memory bandwidth and chassis dimensions

Performance

Fastest tok/s in its price range.

Hand-tuned inference, custom CUDA kernels, speculative decoding. Real numbers on real hardware, no synthetic scores.

Throughput · Qwen3.5-27B Q4_K_M

Up to 207 tok/s

DFlash on ggml, ~130 tok/s sustained, vs a DGX Spark or Mac Studio (estimated) on the same model class.

4–6×faster

Competitor figures are estimates for the same model class.

Cost · USD over 2 years, 8h/day

$5,900 $4,900 once

The full machine, one fixed cost, against equivalent cloud spend over two years at 8h/day.

5×cheaper

Cloud: Claude Sonnet 4.5 ($3 / $15 per 1M), GPT-4o list.

Open source

Proudly open source, and built in public.

Fully open in software and built in the open. Loved by a community of builders, with thousands of developers using and contributing to our repo already, and counting.

A

Ahmad @TheAhmadOsman

I like your stuff so far, keep going

S

Sudo su @sudoingX

this guy just cracked 134 tok/s on qwen 3.5-27b dense and 73 on new qwen 3.6-27b on a single 3090. open source moves at godspeed in 2026.

L

Lotto @LottoLabs

Interesting run w/ Dflash from the lucebox-hub guys

R

Rijndael @rot13maxi

speculative PREFILL?????

R

Riku Pasonen 🌞 @Raitziger

I have tested some LLM server software for home PCs for Linux and Windows. Fastest and best for running home is Linux running 145 t/s, Lucebox. @pupposandro @luceboxai

f

fahd Mirza @fahdmirza

💥 PFlash just killed the 4-minute blank screen problem. 🚀 128K token prefill in 25 seconds, same GPU, same model, no compromises

G

Geek Lite @QingQ77

Consumer-grade GPUs actually have sufficient hardware potential, general-purpose frameworks just waste most of it on overhead. Lucebox releases that potential through hand-written kernels, letting even a 2020 RTX 3090 rival Apple's latest chips on efficiency.

T

Takyon∞ @Takyon

Crazy I was litteraly wondering how can I increase my token speed 10 min ago

K

Kyz2ren @ky2renzz

Crazy what @pupposandro just dropped on Qwen3.5-27B. 207 tok/s on a single 3090 with Q4_K_M and full DFlash speculative? Chinese labs + ggml hacks are just cooking on consumer hardware right now. This is the kind of local win I like to see.

I

Ivan Fioravanti @ivanfioravanti

RTX 3090 ready for a new life! Bringing it to @luceboxai team to make some experiments together 😎

v

vitalik.eth @VitalikButerin

github.com/Luce-Org/lucebox-hub looks promising as a way to run "dense" models (eg. Qwen 27B) more efficiently. It's janky, but on my 5090 laptop it seems to be ~2x more tok/s than llama.cpp

C

CAPET ☀️ @Capetlevrai

This is very very good work BRAVO. I love it

n

nash_su @nash_su

Nearly 10x faster! After finishing Decoding, it starts cranking through Prefill. The previous DFlash was already stunning enough, and now they've added PFlash. Speculative prefill, up to 10x speedup. Go try it right now.

A

AJ @ItsmeAjayKV

First time reading about speculative prefill, and it's crazy. 257s down to 24s for a 128K prompt on a single RTX 3090. Great article, definitely go ahead and give this a read.

T

Twon. @Web3Twon

impressive... so this is what it looks like when you focus on a set ram limit and optimizing for a single model above everything

What you get

Plug-and-play, end to end.

Plug in, pair over Bluetooth, point your tools at it. From box to agent in about a minute.

 $ brew install lucebox && luce   ░██                                          ░██  ░██         ░██    ░██  ░███████   ░███████  ░████████   ░███████  ░██    ░██  ░██         ░██    ░██ ░██    ░██ ░██    ░██ ░██    ░██ ░██    ░██  ░██  ░██  ░██         ░██    ░██ ░██        ░█████████ ░██    ░██ ░██    ░██   ░█████  ░██         ░██   ░███ ░██    ░██ ░██        ░███   ░██ ░██    ░██  ░██  ░██  ░██████████  ░█████░██  ░███████   ░███████  ░██░█████   ░███████  ░██    ░██   Looking for your LuceBox via Bluetooth...  Found: LuceBox-A7F3  Scanning WiFi networks...  Available networks: home-5g, home-2.4, guest, xfinitywifi  Select (1-4): 1  Password: ●●●●●●●●  Connecting to home-5g...  ✓ Connected. Dashboard at http://lucebox.local  ✓ LuceBox is ready.   ╭──────────────────────────────────────────────────────────────╮  │ LuceBox │  │  │  │ Model: Qwen3.6-27B Q4_K_M │  │ LLM: ● online │  │ Endpoint: http://lucebox.local:8080 │  │  │  │ Commands │  │  │  │ /status System overview │  │ /models List models loaded │  │ /engine Runs our lucebox-hub engine │  │ /harness Switch agent harness │  │ /shell Open SSH shell │  │ /help All commands │  │  │  │ Type a message, or / for commands │  ╰──────────────────────────────────────────────────────────────╯   > Hey, I'm your Lucebox. How can I help you?▊ 

Zero setup, instant go.

No CUDA install, no Docker, no model downloads to babysit. Plug in, pair over Bluetooth, and your model is already loaded in VRAM. About a minute from box to first token.

Works with the tools you already use.

Our harness ships launchers and regression tests for the agents below. Any OpenAI or Anthropic compatible client just works. Local by default, cloud fallback on demand.

The chips we'd pick ourselves.

RTX 3090 with 24GB GDDR6X next to the Ryzen AI MAX+ 395 and 128GB LPDDR5X unified memory, over a custom 90-degree PCIe Gen4 x4 riser. 35B models sit entirely in VRAM.

Natively supports

Claude Code

Codex

OpenCode

Hermes

OpenClaw

Open WebUI

Ollama

Every machine, battle-tested.

We do not ship anything we have not run, every box is rebuilt and proven before it leaves.

Refurbished to like-new

Every RTX 3090 is fully disassembled, repasted, and re-padded, then benchmarked until it performs like a new card, with many years of life ahead.

Thermals proven, no surprises

We measure the box under sustained inference and confirm there are no hotspots and no throttling. The cooling is sized for the real load, not a spec sheet.

72-hour max stress test

We push every machine to its limits for three full days, real inference and memory under maximum load, so any weak part fails here on our bench, never later on your desk.

01Disassembly & inspection

02Die repaste & VRAM pads

03Thermal & memory burn-in

04LLM benchmarking

051-year warranty issued

One-year hardware warranty. Full system, parts and labor. A refurbished RTX 3090 has years of life left in it, and if anything fails under normal use we repair or replace it, no questions.

Apply

Apply to reserve your Lucebox.

A strictly limited first production batch, and we build only a handful at a time. $4,900 USD per machine until June 30, then the price moves to $5,900. Applying is free, you only pay once we reach out with the details and you decide to go ahead. Applications are reviewed daily, and multi-unit orders get priority on selection.

Launch price $4,900 ~~$5,900~~ 17% off

Ends June 30 · --d --h --m --s

1

Apply with your use case. Free, about a minute.

2

We review daily and email you the details.

3

Pay only if you choose to go ahead.

Thesis

Why Local AI, and why now.

01

Tokens are getting expensive.

Coding agents now burn $100 to $1000 per developer every month. Frontier model prices keep climbing as demand grows. Local inference flips the math: fixed hardware cost, unlimited tokens, predictable budget.

02

Privacy is non-negotiable.

Regulated industries cannot ship customer data, source code, or patient records to a cloud LLM. CTOs need an air-gapped option that runs on-prem and stays inside the building. Lucebox is that option, end to end.

03

Open source AI must win.

Qwen3.6, GLM-4.6, and DeepSeek V4-Flash are catching up to closed frontier models fast. Open weights mean no vendor lock-in, real auditability, and a stack you actually own. The hardware to run them well is the missing piece.

Engineering Blog

The computer for local AI.