fix: store Flux-dev transformer as fp8 to fit the 3090's VRAM by atomantic · Pull Request #965 · atomantic/PortOS

atomantic · 2026-06-05T16:52:16Z

Problem

Local Flux-dev/schnell image generation (the legacy Windows imagine_win.py runner) was getting killed mid-render:

⏱️ media-job [94237fcc] watchdog fired after 321750ms idle (limit 300000ms) — marking failed
❌ Image generation failed [94237fcc]: Killed by signal SIGTERM

Root cause: imagine_win.py loaded the flux1-*-fp8.safetensors checkpoint with torch_dtype=bfloat16, upcasting the fp8 weights to ~23 GB. On the 24 GB RTX 3090 — with ~5 GB already consumed by the Windows desktop/browsers — that overflows VRAM, the driver silently spills to shared system RAM, and each diffusion step balloons from ~1.6 s to 322 s. The media-job idle watchdog (300 s, server/services/mediaJobQueue/index.js) then correctly SIGTERMs the job as hung. The watchdog was the messenger, not the bug.
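The VRAM budget behind this root cause can be checked with a few lines of arithmetic. This is a rough sketch using only the figures quoted in this PR (24 GB card, ~5 GB desktop baseline, ~23 GB vs 15.2 GB resident transformer); the helper name `vram_headroom_gb` is hypothetical, and it ignores activations, text encoders, and the VAE, so real headroom is smaller.

```python
def vram_headroom_gb(transformer_gb, total_gb=24.0, baseline_gb=5.0):
    """VRAM left over after the desktop baseline and the resident transformer.

    Hypothetical helper for illustration; figures default to the ones in this PR.
    Activations and other pipeline components are deliberately not counted.
    """
    return total_gb - baseline_gb - transformer_gb


for label, weights_gb in [("bf16 upcast", 23.0), ("fp8 storage", 15.2)]:
    headroom = vram_headroom_gb(weights_gb)
    status = "fits" if headroom > 0 else "spills to shared system RAM"
    print(f"{label}: {headroom:+.1f} GB headroom -> {status}")
```

With the bf16 upcast the budget is about 4 GB short before any activations are allocated, which is consistent with the driver spilling to shared memory; with fp8 storage roughly 3.8 GB remains.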

Fix

Keep the transformer stored as fp8 and upcast per-layer to bf16 during the forward pass via transformer.enable_layerwise_casting(storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.bfloat16).

Ampere (3090, sm_86) has no fp8 tensor cores (those start at Ada sm_89), so we store in fp8 but compute in bf16 — the correct strategy here, and a better fit than the enable_model_cpu_offload() the sibling runners use (which would page the model in/out and run slower). Opt out with IMAGINE_WIN_FP8=0 (or false/never).
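A minimal sketch of how the opt-out flag and the casting call might fit together is shown below. It is not the code from scripts/imagine_win.py: the helper names `fp8_storage_enabled` and `apply_fp8_storage` are hypothetical, the set of "off" values is taken only from the description above (0, false, never), and the actual parsing convention in _runner_common.py may differ. The `enable_layerwise_casting` call is the diffusers model method named in this PR.

```python
import os

# "Off" values named in the PR description; the real convention lives in
# _runner_common.py and may accept more (assumption).
_OFF_VALUES = {"0", "false", "never"}


def fp8_storage_enabled(env=None):
    """Hypothetical helper: True unless IMAGINE_WIN_FP8 is set to an off value."""
    env = os.environ if env is None else env
    return env.get("IMAGINE_WIN_FP8", "1").strip().lower() not in _OFF_VALUES


def apply_fp8_storage(transformer, env=None):
    """Store weights as fp8 and upcast each layer to bf16 for its forward pass.

    Returns True if layerwise casting was enabled, False if opted out.
    """
    if not fp8_storage_enabled(env):
        return False
    import torch  # imported lazily so the flag logic works without a GPU stack

    transformer.enable_layerwise_casting(
        storage_dtype=torch.float8_e4m3fn,  # resident format in VRAM
        compute_dtype=torch.bfloat16,       # Ampere has no fp8 tensor cores
    )
    return True
```

In a runner this would be called once on the loaded pipeline's transformer (for example `apply_fp8_storage(pipe.transformer)`) before moving the pipeline to the GPU.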

Verification

Ran real generations through the configured interpreter (miniconda3 python, torch 2.7.1+cu118):

| Metric | Before (bf16) | After (fp8 storage) |
| --- | --- | --- |
| Resident transformer | ~23 GB | 15.2 GB (measured) |
| Per diffusion step | 322 s (spilled) | 1.6 s |
| Outcome | watchdog SIGTERM | completes, valid 1024×1024 PNG |

All silent load gaps stay ≤71 s and every step emits a progress line, so the idle watchdog can no longer fire. Image quality is unaffected (fp8-storage / bf16-compute is the established ComfyUI-default technique).
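The idle-watchdog reasoning can be illustrated with a small check over progress timestamps. This is a simplified Python model, not the real implementation in server/services/mediaJobQueue/index.js: the function name `watchdog_would_fire` is hypothetical, and it only inspects gaps between recorded progress events rather than running a live timer. The 300000 ms limit and the 321750 ms idle gap come from the log above.

```python
IDLE_LIMIT_MS = 300_000  # idle limit quoted in the watchdog log line


def watchdog_would_fire(progress_times_ms, limit_ms=IDLE_LIMIT_MS):
    """True if any gap between consecutive progress events exceeds the limit.

    Simplified model (assumption): the real watchdog is timer-based and resets
    whenever the job emits output.
    """
    gaps = (later - earlier
            for earlier, later in zip(progress_times_ms, progress_times_ms[1:]))
    return any(gap > limit_ms for gap in gaps)


# Before the fix: a single spilled step left the job silent for 321750 ms.
before = [0, 321_750]
# After the fix: a 71 s load gap, then steps between 1.6 s and 34 s apart.
after = [0, 71_000, 72_600, 106_600]

print(watchdog_would_fire(before), watchdog_would_fire(after))
```

Even the slowest step reported here (34 s) stays well below the limit, so the job keeps resetting the idle timer.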

Note: at 15 GB resident + ~5 GB desktop baseline, 1024×1024 dev is still borderline — step time can vary 1.6 s–34 s under activation pressure, but it always completes now. Closing Chrome/Edge frees the headroom for consistently fast renders.

Scope

One self-contained legacy script (scripts/imagine_win.py). /simplify confirmed reuse/altitude are correctly scoped (no shared fp8 helper exists; fp8-storage is the right technique vs the siblings' cpu-offload) and aligned the env-flag parsing with the codebase convention (_runner_common.py).

🤖 Generated with Claude Code

imagine_win.py loaded the flux1-*-fp8 checkpoint with torch_dtype=bfloat16, upcasting the fp8 weights to ~23GB. On a 24GB card that overflows once the desktop/browser baseline (~5GB) and activations are counted, so the driver spills to shared system RAM and each diffusion step takes minutes -- long enough to trip the media-job idle watchdog (300s), which SIGTERMs the job. Enable layerwise casting (storage fp8, compute bf16): weights stay ~15GB resident and upcast per-layer during the forward pass. Ampere/3090 has no fp8 tensor cores, so compute stays bf16. Verified: 1.6s/step (was 322s), valid 1024x1024 output. Toggle off with IMAGINE_WIN_FP8=0.

atomantic added 2 commits June 5, 2026 16:51

docs: correct fp8 resident footprint to measured ~15GB

280b8b1

atomantic merged commit 51fd5af into main Jun 5, 2026
2 checks passed

atomantic deleted the fix/imagine-win-fp8-vram branch June 5, 2026 20:54

atomantic merged 2 commits into main from fix/imagine-win-fp8-vram
