June 2026

Lucebox in a container: one image for every supported GPU

Until now the only way to run Lucebox was to build it yourself: clone the repo with its submodules, install uv, CMake and a CUDA toolkit, wait around 25 minutes while nvcc compiled kernels for six GPU architectures, then download the weights and figure out the right flags. None of that is hard, but it is a lot of steps if you just want to try the server. There are now prebuilt images on GHCR, one for NVIDIA cards from the RTX 2080 Ti to the RTX 5090 and one for AMD starting with Strix Halo, so the whole thing becomes a pull, a mount and a docker run.


TL;DR

The images. ghcr.io/luce-org/lucebox-hub:cuda12 for NVIDIA, :rocm for AMD. docker run -p 8000:8080 with your models mounted gives you the DFlash server with its OpenAI-compatible API on host port 8000. No clone, no submodules, no nvcc.
One binary for every supported NVIDIA card. sm_75 (RTX 2080 Ti) through sm_120 (RTX 5090), CUDA 12.8. The kernels are compiled once in CI and the same image runs on whichever card it lands on.
AMD in the same registry. :rocm is the HIP build of the same sources, covering gfx1151 (Strix Halo / Ryzen AI MAX) plus RX 7900 and RDNA4. You swap --gpus all for --device /dev/kfd --device /dev/dri and the rest of the command stays the same.
Nothing to build on the host. It needs only docker and its GPU runtime (the NVIDIA Container Toolkit, or the kernel's amdgpu driver on AMD). Weights live in a directory you mount, and the entrypoint finds the target and the draft on its own.
Sensible defaults. Run the image with no env vars and the entrypoint picks a working config for your VRAM. Anything you set yourself wins over the heuristic.

The setup was the annoying part

The decode engine is our own C++/CUDA server. It pulls in a pinned llama.cpp fork as a submodule, but only for the ggml tensor library and GGUF loading underneath; the serving stack (DFlash, DDTree, PFlash) is ours. Until now, running it meant cloning recursively, installing a Python toolchain, configuring CMake against a CUDA 12+ toolkit, and building. For a single GPU architecture that is about three minutes of nvcc, which is fine. But a binary that runs on any supported card needs the kernels compiled for all of them, and the template-heavy CUDA translation units stretch that to roughly twenty-five minutes. After the build you still had to download the weights by hand and work out the right DFLASH_* flags for your VRAM.

None of this was hard if you build CUDA software regularly, but it was a lot of friction for someone who only wanted to point an agent at a local endpoint.

The cuda12 image is about 18 GB on disk after a one-time pull, and in exchange the compile drops to zero.

One image per vendor

The cuda12 image is a fat binary: nvcc emits device code for every architecture in the list, and the right kernels get picked at runtime. Nothing to detect, nothing to rebuild if you swap GPUs.

Arch	GPUs	sm	cuda12
Turing	RTX 2080 Ti	75	✓
Ampere	RTX 3090	86	✓
Ada	RTX 4090	89	✓
Blackwell	RTX 5090 / 5090 Laptop	120	✓

Each architecture adds roughly 50 to 200 MB of kernel code and a few more minutes of nvcc, which adds up quickly when you compile for all of them. Pre-Turing cards (Pascal sm_60/61, Volta sm_70) are intentionally left out: DFlash's BF16/WMMA paths assume sm_75 and have no fallback below it. DGX Spark (GB10) and Jetson Thor live on different CUDA stacks and stay out of this image.
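If you are not sure whether a card clears the sm_75 floor, a small check like the following may help. It is only a sketch: meets_sm75_floor is a hypothetical helper, not something the image ships, and passing the floor does not by itself prove that the card's architecture is in the compiled list.

```shell
#!/bin/sh
# Sketch: compare a CUDA compute capability against the sm_75 floor.
# meets_sm75_floor is a hypothetical helper, not part of the image.
meets_sm75_floor() {
  cc="$1"              # compute capability, e.g. "8.6" for an RTX 3090
  major="${cc%%.*}"    # text before the first dot
  minor="${cc#*.}"     # text after the first dot
  # "8.6" becomes 86, "12.0" becomes 120, "6.1" becomes 61
  [ "$((major * 10 + minor))" -ge 75 ]
}

# Example (needs a driver recent enough to support the compute_cap query):
# cc="$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)"
# meets_sm75_floor "$cc" && echo "at or above sm_75" || echo "below sm_75"
```

The function takes the capability as an argument so it can be tried without a GPU present.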

:rocm is the HIP build of the same server sources via Dockerfile.rocm. It covers gfx1151 (Strix Halo / Ryzen AI MAX, the chip from the Strix Halo post), gfx1100 (RX 7900) and gfx1200 (RDNA4); other targets can be added through the DFLASH_HIP_ARCHES build arg. Two things worth knowing before you pull it. Block-Sparse-Attention is a CUDA-only kernel set, so PFlash's block-sparse path is off in the HIP image. And the ROCm userspace inside the container should match your host driver's major version: the published base is ROCm 6.4.1, and on a host running a 7.x driver it can segfault while loading the model. We hit exactly this on our own Strix Halo box, so we also publish a :rocm-7.2 tag built on the 7.2.2 base; pull that one if your host is on a 7.x driver, or rebuild with ROCM_VERSION=7.2.2 docker buildx bake rocm-local.
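The tag choice described above can be written down as a tiny helper. This is a sketch under assumptions: pick_rocm_tag is a hypothetical name, the host version is passed in by hand because how you read it depends on your distribution, and anything other than a 6.x or 7.x host is treated as unknown rather than guessed.

```shell
#!/bin/sh
# Sketch: map the host's ROCm/driver version to an image tag.
# pick_rocm_tag is a hypothetical helper, not part of the image.
pick_rocm_tag() {
  host_version="$1"              # e.g. "6.4.1" or "7.2.2"
  major="${host_version%%.*}"    # keep only the major version
  case "$major" in
    6) echo "rocm" ;;            # matches the published ROCm 6.4.1 base
    7) echo "rocm-7.2" ;;        # the tag built on the 7.2.2 base
    *) echo "no known tag for ROCm $host_version" >&2; return 1 ;;
  esac
}

# Example:
# docker pull "ghcr.io/luce-org/lucebox-hub:$(pick_rocm_tag 7.2.2)"
```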

Under the hood it is a two-stage build. The first stage is the CUDA -devel image: it pulls submodules, configures CMake, and compiles the server plus tests. The second stage is the much smaller -runtime image with no nvcc and no headers. Only the binaries and the ggml shared libraries cross over, and the build tree is pruned first (about 1 GB of object files, static archives and CMake state dropped per image).

Nothing on the host but docker

There is no host toolchain to set up. The image carries the compiled server and its Python runtime; the host contributes docker, the GPU runtime (the NVIDIA Container Toolkit, or the in-kernel amdgpu driver on AMD) and a directory with your weights in it. It comes down to three steps, where the third depends on your GPU vendor:

# 1. Pull the image for your GPU
docker pull ghcr.io/luce-org/lucebox-hub:cuda12   # NVIDIA
docker pull ghcr.io/luce-org/lucebox-hub:rocm     # AMD

# 2. Target model into server/models/, DFlash draft into server/models/draft/
hf download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-Q4_K_M.gguf \
  --local-dir server/models/
hf download Lucebox/Qwen3.6-27B-DFlash-GGUF dflash-draft-3.6-q4_k_m.gguf \
  --local-dir server/models/draft/

# 3a. NVIDIA (CUDA 12+)
docker run --rm --gpus all -p 8000:8080 \
  -v "$PWD/server/models:/opt/lucebox-hub/server/models" \
  ghcr.io/luce-org/lucebox-hub:cuda12

# 3b. AMD (ROCm, Strix Halo / RX 7900)
docker run --rm --device /dev/kfd --device /dev/dri \
  --group-add video --group-add render --security-opt seccomp=unconfined \
  -p 8000:8080 -v "$PWD/server/models:/opt/lucebox-hub/server/models" \
  ghcr.io/luce-org/lucebox-hub:rocm

curl http://localhost:8000/v1/models
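Since steps 3a and 3b differ only in the device flags and the tag, a wrapper script could isolate that difference. The sketch below copies the flags from the two commands above; gpu_run_flags and image_tag are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: the vendor-specific parts of the docker run command.
# gpu_run_flags and image_tag are hypothetical helpers.
gpu_run_flags() {
  case "$1" in
    nvidia) echo "--gpus all" ;;
    amd)    echo "--device /dev/kfd --device /dev/dri --group-add video --group-add render --security-opt seccomp=unconfined" ;;
    *)      echo "unknown vendor: $1" >&2; return 1 ;;
  esac
}

image_tag() {
  case "$1" in
    nvidia) echo "cuda12" ;;
    amd)    echo "rocm" ;;
    *)      echo "unknown vendor: $1" >&2; return 1 ;;
  esac
}

# Example (the flags are left unquoted on purpose so they split into words):
# vendor=amd
# docker run --rm $(gpu_run_flags "$vendor") -p 8000:8080 \
#   -v "$PWD/server/models:/opt/lucebox-hub/server/models" \
#   "ghcr.io/luce-org/lucebox-hub:$(image_tag "$vendor")"
```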

One detail that is easy to miss: the draft has its own directory. The target GGUF goes in server/models/, the DFlash draft in server/models/draft/. Without the draft the server still works, just target-only and noticeably slower. If you want to look around inside the image, docker run ... shell gets you a prompt.
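A quick pre-flight check on the host can catch a misplaced draft before the container starts. This is a sketch, not what the entrypoint does internally: check_models_dir and has_gguf are hypothetical helpers that only look for any .gguf file in the two directories named above.

```shell
#!/bin/sh
# Sketch: verify the models directory layout before mounting it.
# has_gguf and check_models_dir are hypothetical helpers.

# Succeeds if the given directory contains at least one *.gguf file.
has_gguf() {
  for f in "$1"/*.gguf; do
    [ -f "$f" ] && return 0
  done
  return 1
}

# Returns 0 when target and draft are both present,
# 1 when the target is missing, 2 when only the draft is missing.
check_models_dir() {
  dir="$1"
  if ! has_gguf "$dir"; then
    echo "no target GGUF in $dir" >&2
    return 1
  fi
  if ! has_gguf "$dir/draft"; then
    echo "no draft GGUF in $dir/draft: expect target-only serving" >&2
    return 2
  fi
  return 0
}

# Example:
# check_models_dir "$PWD/server/models"
```

Separate exit codes let a wrapper treat a missing draft as a warning rather than a failure, since the server still runs without one.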

Defaults that fit your card

With no DFLASH_* env vars set, the entrypoint looks at your VRAM and picks a config: roughly 112K of context with TQ3_0 KV cache on a 24 GB card, the full 128K on 32 GB and up. WSL2 on 24 GB cards gets more conservative defaults, because in our stress tests the aggressive setting left too little headroom for CUDA scratch allocations under sustained tool traffic. Any env var you set yourself wins over the heuristic.
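The shape of that heuristic, with a user-set value taking precedence, might look roughly like the sketch below. It is not the entrypoint's code. It encodes only the two figures quoted above, assumes cards between 24 and 32 GB fall into the 24 GB bucket, leaves out the WSL2 case, and uses EXAMPLE_CTX_OVERRIDE as a made-up stand-in for whichever DFLASH_* variable the real entrypoint reads.

```shell
#!/bin/sh
# Sketch: VRAM-based default with an environment override.
# default_context, effective_context and EXAMPLE_CTX_OVERRIDE are
# hypothetical; the real entrypoint logic is not shown in this post.
default_context() {
  vram_gb="$1"                      # nominal VRAM in GB, e.g. 24
  if [ "$vram_gb" -ge 32 ]; then
    echo "128K"                     # full context on 32 GB and up
  elif [ "$vram_gb" -ge 24 ]; then
    echo "112K"                     # roughly 112K on a 24 GB card
  else
    echo "unspecified"              # the post gives no figure below 24 GB
  fi
}

# A value set by the user wins over the heuristic.
effective_context() {
  echo "${EXAMPLE_CTX_OVERRIDE:-$(default_context "$1")}"
}

# Example:
# effective_context 24
# EXAMPLE_CTX_OVERRIDE=64K effective_context 24
```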

Switching models means swapping the GGUFs in the mounted directory. The entrypoint loads whichever target it finds in server/models/, pairs it with the matching draft from server/models/draft/, and reads per-model serving defaults from the model-card files that ship inside the image. The combinations we currently run:

Target	DFlash draft
Qwen3.6-27B Q4_K_M	Qwen3.6-27B-DFlash
Gemma-4-26B-A4B Q4_K_M	gemma-4-26B-A4B-it-DFlash
Gemma-4-31B Q4_K_M	gemma-4-31B-it-DFlash
Qwen3.6-35B-A3B Q4_K_M	target-only
Laguna-XS.2 Q4_K_M	target-only

Knowing which build is running

When something misbehaves on a box you cannot see, the first question is "which build is this, exactly". CI stamps the git SHA, image tag and build time into the image and the server exposes them at /props.build; host facts (GPU, driver, OS, kernel) show up at /props.host.

The release pipeline

The images come out of CI, and we rebuild only when something that actually reaches the image changes: the Dockerfiles, the bake file, server sources, the lockfile. A merge that only touches docs or tooling does not produce a new image, so the rolling :cuda12 and :rocm tags move when the server changes and stay put otherwise. Published releases add a pinned :X.Y.Z-<variant> tag. Pull requests that touch the Docker surface get a build-only run as a guard, and tokens from forked PRs never get push access.

Tag	Points at
`:cuda12` / `:rocm`	rolling latest, tracks main
`:X.Y.Z-cuda12`	a specific release
`:X.Y-cuda12`	latest patch in a minor series
`:sha-<short>-cuda12`	an exact commit
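To make the naming scheme concrete, here is a sketch that expands one release into the tags listed in the table. release_tags is a hypothetical helper, the version and SHA in the example are placeholders, and applying the minor-series and sha patterns to variants other than cuda12 is an assumption.

```shell
#!/bin/sh
# Sketch: expand a release into the tag names from the table.
# release_tags is a hypothetical helper; it only formats strings.
release_tags() {
  version="$1"                      # full release version, X.Y.Z
  variant="$2"                      # image variant, e.g. cuda12
  short_sha="$3"                    # abbreviated commit SHA
  minor="${version%.*}"             # drop the patch component: X.Y.Z -> X.Y
  echo "$variant"                   # rolling tag, tracks main
  echo "$version-$variant"          # a specific release
  echo "$minor-$variant"            # latest patch in a minor series
  echo "sha-$short_sha-$variant"    # an exact commit
}

# Example with placeholder values:
# release_tags 1.4.2 cuda12 abc1234
```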

One thing the image does not yet carry is the megakernel: its CUDA extension links against a torch cpp_extension wheel at build time and has to be compiled in your venv, so megakernel benchmarks stay a from-source path for now. Everything in the DFlash + DDTree + PFlash decode stack ships in the image.

Bottom line

The setup is now: pull the image for your GPU, put the weights in a directory, docker run. The compile happens once in CI instead of on every box, the same stack runs on NVIDIA and AMD, and the server picks workable defaults from your VRAM. If you tried Lucebox before and gave up somewhere between CMake and the driver stub, this removes that part. It is the difference between an evening of setup and a few minutes of downloading.

Source: the Docker stack on github.com/Luce-Org/lucebox-hub (Dockerfile, Dockerfile.rocm, docker-bake.hcl, server/scripts/entrypoint.sh, .github/workflows/docker.yml). Images at ghcr.io/luce-org/lucebox-hub:cuda12 (CUDA 12.8.1, Ubuntu 22.04, arches sm_75;80;86;89;90;120) and :rocm (ROCm 6.4.1, Ubuntu 22.04, gfx1151). Compile-time and image-size figures are from the build itself; the AMD runs were verified on a Strix Halo box.

