June 2026
Lucebox in a container: one image for every supported GPU
Getting a Lucebox server running used to mean clone with submodules, install uv + CMake + a CUDA toolkit, sit through a ~25-minute fat-binary compile, then hand-fetch ~17 GB of weights and wire the flags. The Docker stack collapses that to two host dependencies (docker + nvidia-smi) and one prebuilt image that runs every card from the RTX 2080 Ti to the RTX 5090. docker run --gpus all and you have an OpenAI-compatible server. The compile now happens once in CI, not on your box.
TL;DR
- One prebuilt image, no build step. ghcr.io/luce-org/lucebox-hub:cuda12. docker run --gpus all -p 8080:8080 gives you the native DFlash server on an OpenAI-compatible port. No clone, no submodules, no nvcc.
- One fat binary, six architectures. sm_75 (RTX 2080 Ti) through sm_120 (RTX 5090), CUDA 12.8. The kernels are compiled once in CI for all six and shipped in a single image, so the ~25-minute compile never lands on a user's machine.
- Two host dependencies. The lucebox wrapper needs nothing but docker + nvidia-smi. No host Python, uv, CMake or huggingface-cli. Every orchestration command (config, model download, autotune, smoke, bench) runs inside the image.
- Foreground or a service. lucebox serve runs it in front of you; lucebox install writes a user systemd unit and lucebox start/status/logs manage it. Prefer raw docker? lucebox print-run prints the exact command and gets out of the way.
- Self-tuning out of the box. Run the image with no env vars and the entrypoint picks a VRAM-tiered config: ~112K context with TQ3_0 KV on a 24 GB card, full 128K on 32 GB and up.
- Provenance baked in. Git SHA, image tag and build time are stamped into the image and surfaced at /props.build; host facts (GPU, driver, OS) at /props.host. You always know exactly which build and which silicon answered a request.
The build was the barrier
Everything Lucebox publishes is fast, and almost none of it was easy to start. The decode engine is C++/CUDA on top of a pinned llama.cpp fork, pulled in as a submodule. To run it you cloned recursively, installed a Python toolchain, configured CMake against a CUDA 12+ toolkit, and built. On a single architecture that is about three minutes of nvcc. To get a binary that runs on any supported card, you compile the kernels for all of them, and the template-heavy CUDA translation units turn that into roughly twenty-five minutes. Then you still had to fetch ~17 GB of GGUF weights by hand and pass the right DFLASH_* flags for your VRAM.
None of that is hard if you build CUDA software for a living. All of it is friction if you just want to point an agent at a local endpoint. The Docker stack moves the entire build to CI and ships the result.
The image is about 14 GB to pull once. In exchange the compile that every user used to pay, every time they set up a box or moved to a new card, drops to zero. The kernels are already in the image for all six architectures.
One image, every GPU
The image is a fat binary. CMake is handed the full architecture list and nvcc emits device code for each, so a single :cuda12 image dispatches the right kernels on whatever card it lands on. No per-host arch detection, no rebuild when you swap GPUs.
| Arch | GPUs | sm | cuda12 |
|---|---|---|---|
| Turing | RTX 2080 Ti | 75 | ✓ |
| Ampere | A100 | 80 | ✓ |
| Ampere | RTX 3090 / A40 / A10 | 86 | ✓ |
| Ada | RTX 4090 / L40 | 89 | ✓ |
| Hopper | H100 | 90 | ✓ |
| Blackwell | RTX 5090 / 5090 Laptop | 120 | ✓ |
Each architecture adds roughly 50 to 200 MB of kernel code and 3 to 5 minutes of nvcc per translation unit, which is exactly why doing it once in CI matters. Pre-Turing cards (Pascal sm_60/61, Volta sm_70) are intentionally left out: DFlash's BF16/WMMA paths assume sm_75 and have no fallback below it. DGX Spark (GB10) and Jetson Thor live on different CUDA stacks and stay out of this image. AMD is its own track: the same DFlash + PFlash stack runs on Strix Halo (gfx1151) and RDNA3 (gfx1100) through the HIP backend, but that is a from-source build today rather than part of this CUDA image (see the Strix Halo post).
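To make the coverage table concrete, here is a minimal sketch of checking a card against it. It assumes you already have the compute capability as a string such as "8.6" (recent drivers report one through nvidia-smi --query-gpu=compute_cap --format=csv,noheader). The function names are illustrative and not part of the lucebox wrapper, and the check is a plain lookup against the six listed targets, nothing more.

```python
# Architectures compiled into the :cuda12 image, per the table above.
SUPPORTED_SM = {
    75: "Turing",
    80: "Ampere",
    86: "Ampere",
    89: "Ada",
    90: "Hopper",
    120: "Blackwell",
}


def sm_from_compute_cap(compute_cap: str) -> int:
    """Turn a compute capability string such as '8.6' into 86."""
    major, minor = compute_cap.strip().split(".")
    return int(major) * 10 + int(minor)


def image_covers(compute_cap: str) -> bool:
    """True if the card's sm value is one of the six listed targets."""
    return sm_from_compute_cap(compute_cap) in SUPPORTED_SM
```

With this, image_covers("12.0") is true for an RTX 5090, while a Pascal card reporting "6.1" falls outside the list, matching the pre-Turing exclusion described above.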
Under the hood it is a two-stage build. The first stage is the CUDA -devel image: it pulls submodules, configures CMake, and compiles the server plus tests. The second stage is the much smaller -runtime image with no nvcc and no headers. Only the binaries and the ggml shared libraries cross over, and the build tree is pruned first (about 1 GB of object files, static archives and CMake state dropped per image). A few sharp edges had to be filed down to make that split work:
- A libcuda stub for the link. The -devel image ships the driver stub as libcuda.so but not libcuda.so.1, which is the SONAME the linker chases. A symlink plus an ld.so.conf entry resolves the build-time symbols; at runtime the real driver arrives through --gpus all.
- RPATH that survives the COPY. The binaries embed a $ORIGIN-relative RPATH so they still find ggml's shared libs after they are copied into the runtime stage. An ld.so.conf.d entry covers the loader-path gap for the transitively loaded libraries.
- Runs as your UID. Python and the venv are installed world-readable and the container runs as the invoking host UID, so bind-mounted config.toml and profile files come out user-owned, not root-owned.
Two dependencies on the host
The supported way to drive the image is the lucebox host wrapper. It is a shell script with no dependency beyond docker and nvidia-smi: it probes your driver and GPU, selects the image, and either runs the server in the foreground or manages it as a user systemd service. Everything with real logic (config parsing, the model registry, the autotune sweep, the benchmarks) lives in a typed Python CLI inside the image, so the host stays clean.
# Install the wrapper. The installer records where it came from, so a later
# `lucebox update` re-pulls from the same channel (canonical, fork, branch).
curl -fsSL https://raw.githubusercontent.com/Luce-Org/lucebox-hub/main/install.sh | bash
lucebox check # driver, docker, NVIDIA Container Toolkit, VRAM, systemd
lucebox pull # the image (~14 GB)
lucebox models download # default target + DFlash draft (~17 GB), via the container
lucebox serve # foreground
# or run it as a service:
lucebox install # writes ~/.config/systemd/user/lucebox.service
lucebox start && lucebox logs
curl http://localhost:8080/v1/models
If you would rather own the command, lucebox print-run emits the exact docker run line without executing it. Or skip the wrapper entirely:
docker run --rm --gpus all -p 8080:8080 \
-v "$PWD/models:/opt/lucebox-hub/server/models" \
  ghcr.io/luce-org/lucebox-hub:cuda12
The same entrypoint also takes shell for a debug prompt, lucebox <subcommand> to reach the in-container CLI for anything the host script does not handle directly, and a pass-through for docker run ... python -m foo during development.
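If you script around the raw command, it helps to build it as an argument list and only join it for display. The sketch below reproduces the docker run line shown above from a models directory. It is not the wrapper's print-run implementation, and the helper names are made up for illustration.

```python
import shlex
from pathlib import Path
from typing import List

IMAGE = "ghcr.io/luce-org/lucebox-hub:cuda12"
MODELS_MOUNT = "/opt/lucebox-hub/server/models"  # mount point from the command above


def docker_run_argv(models_dir: Path, port: int = 8080, image: str = IMAGE) -> List[str]:
    """Argument list for the raw docker run shown in the post."""
    return [
        "docker", "run", "--rm", "--gpus", "all",
        "-p", f"{port}:8080",                  # host port -> server port in the container
        "-v", f"{models_dir}:{MODELS_MOUNT}",  # bind-mount the downloaded weights
        image,
    ]


def docker_run_line(models_dir: Path, port: int = 8080) -> str:
    """Shell-quoted one-liner, suitable for printing rather than executing."""
    return shlex.join(docker_run_argv(models_dir, port))
```

Passing the list to subprocess.run avoids quoting problems when the models path contains spaces; the joined string is only for showing a human what would run.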
It tunes itself when you give it nothing
Run the image bare, with no DFLASH_* env vars, and the entrypoint does a minimal VRAM-tiered autotune before exec'ing the server: the same tiers as lucebox autotune, so a direct docker run and the wrapper land on the same config. On a 24 GB card that is roughly 112K context with TQ3_0 KV; on 32 GB and up it opens to the full 128K. WSL2 on 24 GB-class cards gets safer defaults (DFLASH_MAX_CTX=65536, DFLASH_BUDGET=16) because stress testing showed the aggressive setting can leave only a few hundred MiB of headroom under sustained tool traffic, which is not enough for CUDA scratch allocations.
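The tiering described above can be sketched as a small lookup. This is an illustration under assumptions, not the entrypoint's code: DFLASH_MAX_CTX and DFLASH_BUDGET are the variables named in this post, DFLASH_KV_TYPE is a hypothetical name standing in for however the KV type is really set, "roughly 112K" is taken as 112 × 1024, and the WSL2 check is the common /proc/version heuristic rather than whatever the entrypoint does.

```python
from typing import Dict


def looks_like_wsl2(proc_version: str) -> bool:
    """Heuristic: WSL2 kernels mention 'microsoft' in /proc/version."""
    return "microsoft" in proc_version.lower()


def pick_defaults(vram_gb: int, wsl2: bool = False) -> Dict[str, str]:
    """Return DFLASH_* defaults for a nominal VRAM class (sketch only)."""
    if vram_gb >= 32:
        # Full 128K context on 32 GB and up.
        return {"DFLASH_MAX_CTX": str(128 * 1024)}
    if vram_gb >= 24:
        if wsl2:
            # Safer WSL2 values quoted in the post.
            return {"DFLASH_MAX_CTX": "65536", "DFLASH_BUDGET": "16"}
        # Roughly 112K context with TQ3_0 KV; DFLASH_KV_TYPE is a made-up name.
        return {"DFLASH_MAX_CTX": str(112 * 1024), "DFLASH_KV_TYPE": "tq3_0"}
    # The post does not describe tiers below 24 GB, so this sketch returns nothing.
    return {}
```

Keeping the tiers in one pure function is what lets two entry points, here the bare docker run and the wrapper, agree on the same config.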
Switching models is one command against a small registry. lucebox models download <preset> fetches the GGUFs and flips the active preset in config.toml, so the entrypoint never has to guess which weights to load on a box that has several model families on disk.
| Preset | Target | DFlash draft |
|---|---|---|
| qwen3.6-27b | Qwen3.6-27B Q4_K_M | Qwen3.6-27B-DFlash |
| gemma-4-26b | Gemma-4-26B-A4B Q4_K_M | gemma-4-26B-A4B-it-DFlash |
| gemma-4-31b | Gemma-4-31B Q4_K_M | gemma-4-31B-it-DFlash |
| laguna-xs.2 | Laguna-XS.2 Q4_K_M | target-only |
Want a tuned config instead of the heuristic? lucebox autotune --sweep brackets candidate DFLASH_* configs for your VRAM tier, cycles the live server through each via restart, measures decode tok/s with the in-image bench, and writes the fastest cell back to config.toml. The pre-sweep config is backed up and restored on interrupt.
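The shape of that sweep (try each candidate, measure, keep the fastest, restore on interrupt) fits in a few lines. The sketch below is a generic version with the side effects injected as callables: apply stands in for "write the config and restart the server" and measure_tok_s for the in-image bench. Both names are placeholders, and how candidates are bracketed is left out entirely.

```python
from typing import Callable, Dict, List

Config = Dict[str, str]


def sweep(
    current: Config,
    candidates: List[Config],
    apply: Callable[[Config], None],
    measure_tok_s: Callable[[], float],
) -> Config:
    """Try each candidate, keep the fastest, restore the original on interrupt."""
    backup = dict(current)  # pre-sweep config, kept for restore
    best, best_rate = None, float("-inf")
    try:
        for cfg in candidates:
            apply(cfg)              # e.g. write config and restart the server
            rate = measure_tok_s()  # e.g. run the decode benchmark
            if rate > best_rate:
                best, best_rate = cfg, rate
    except KeyboardInterrupt:
        apply(backup)               # put the pre-sweep config back
        raise
    if best is None:
        apply(backup)               # nothing to try: leave things as they were
        return backup
    apply(best)                     # persist the fastest cell
    return best
```

Re-raising after the restore keeps Ctrl-C behaving like Ctrl-C while still leaving the server on the config it started with.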
Provenance: which build, which silicon
For a tool people install on machines they ship to customers, "which exact build is running" is not a nice-to-have. The build stamps three values into the image at build time (git SHA, image tag, build time) and the server reads them at startup and exposes them at /props.build. The entrypoint separately writes a JSON sidecar of host facts (GPU model and count, driver, OS, kernel) that surfaces at /props.host. Run a request, query /props, and you know precisely which image and which card answered it. Missing fields (a bare docker build with no CI metadata) come through as JSON null rather than breaking the read.
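Reading the provenance back is a single GET. The sketch below assumes /props returns a JSON object with build and host sections, as described above; the key names inside them (git_sha, image_tag, gpu, driver) are guesses for illustration, so check a real response before relying on them. Null or missing fields are printed as "unknown", mirroring the JSON null behavior.

```python
import json
from typing import Any, Dict
from urllib.request import urlopen


def fetch_props(base_url: str = "http://localhost:8080") -> Dict[str, Any]:
    """GET /props from a running server and decode the JSON body."""
    with urlopen(f"{base_url}/props", timeout=5) as resp:
        return json.load(resp)


def describe(props: Dict[str, Any]) -> str:
    """One-line summary; null or missing fields are shown as 'unknown'."""
    build = props.get("build") or {}
    host = props.get("host") or {}

    def show(section: Dict[str, Any], key: str) -> str:
        value = section.get(key)
        return "unknown" if value is None else str(value)

    # Key names below (git_sha, image_tag, gpu, driver) are hypothetical.
    return (
        f"build {show(build, 'git_sha')} ({show(build, 'image_tag')}) "
        f"on {show(host, 'gpu')}, driver {show(host, 'driver')}"
    )
```

Against a live server, print(describe(fetch_props())) gives a line you can paste into a bug report next to the failing request.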
The release pipeline
The image is wired to CI, not built by hand. A GitHub Actions workflow builds and pushes to GHCR on every merge to main (the rolling :cuda12 tag), and tags a pinned :vX.Y.Z-cuda12 on published releases. Pull requests that touch the Docker surface get a build-only guard so an arch-list or Dockerfile regression is caught before it lands, and forked-PR tokens never get push access. Because the fat-binary build is heavy on both CPU and disk for a hosted runner, the job first reclaims about 30 GB of disk by stripping the preinstalled Android/.NET/Haskell toolchains, which is what lets a 14 GB image plus build cache fit.
| Tag | Points at |
|---|---|
| :cuda12 | rolling latest, tracks main |
| :vX.Y.Z-cuda12 | a specific release |
| :X.Y-cuda12 | latest patch in a minor series |
| :sha-<short>-cuda12 | an exact commit |
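If a deploy script needs to know whether it is pinned or rolling, the four tag shapes in the table are easy to tell apart. A minimal sketch, assuming the short SHA is lowercase hex and versions are plain digits; the function name is illustrative.

```python
import re

# One pattern per row of the tag table; tags are given without the leading colon.
_TAG_KINDS = [
    (re.compile(r"^sha-[0-9a-f]+-cuda12$"), "an exact commit"),
    (re.compile(r"^v\d+\.\d+\.\d+-cuda12$"), "a specific release"),
    (re.compile(r"^\d+\.\d+-cuda12$"), "latest patch in a minor series"),
    (re.compile(r"^cuda12$"), "rolling latest, tracks main"),
]


def describe_tag(tag: str) -> str:
    """Map an image tag to the meaning listed in the table."""
    for pattern, meaning in _TAG_KINDS:
        if pattern.match(tag):
            return meaning
    return "not one of the documented tag shapes"
```

A check like this is handy in a customer deploy: refuse to roll out anything that resolves to the rolling tag and insist on a release or sha tag.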
One thing the image does not yet carry is the megakernel: its CUDA extension links against a torch cpp_extension wheel at build time and has to be compiled in your venv, so megakernel benchmarks stay a from-source path for now. Everything in the DFlash + DDTree + PFlash decode stack ships in the image.
Bottom line
The fast part of Lucebox was never the problem. The setup was. A prebuilt image that runs from the RTX 2080 Ti to the RTX 5090, a wrapper that needs nothing but docker and nvidia-smi, a server that tunes itself from your VRAM, and provenance you can query, turn "build the CUDA stack and hope it picks the right flags" into docker run --gpus all. The kernels still get hand-tuned per architecture. You just stop paying for it on every box. On a local-inference PC that is the difference between an afternoon of setup and a single pull.
Source: the Docker stack on github.com/Luce-Org/lucebox-hub (Dockerfile, docker-bake.hcl, server/scripts/entrypoint.sh, .github/workflows/docker.yml). Image at ghcr.io/luce-org/lucebox-hub:cuda12, CUDA 12.8.1 on Ubuntu 22.04, arches sm_75;80;86;89;90;120. Compile-time and image-size figures are from the build itself.