fix: store Flux-dev transformer as fp8 to fit the 3090's VRAM#965

Merged: atomantic merged 2 commits into main from fix/imagine-win-fp8-vram on Jun 5, 2026
@atomantic
Owner

Problem

Local Flux-dev/schnell image generation (the legacy Windows imagine_win.py runner) was getting killed mid-render:

⏱️ media-job [94237fcc] watchdog fired after 321750ms idle (limit 300000ms) — marking failed
❌ Image generation failed [94237fcc]: Killed by signal SIGTERM

Root cause: imagine_win.py loaded the flux1-*-fp8.safetensors checkpoint with torch_dtype=bfloat16, upcasting the fp8 weights to ~23 GB. On the 24 GB RTX 3090 — with ~5 GB already consumed by the Windows desktop/browsers — that overflows VRAM, the driver silently spills to shared system RAM, and each diffusion step balloons from ~1.6 s to 322 s. The media-job idle watchdog (300 s, server/services/mediaJobQueue/index.js) then correctly SIGTERMs the job as hung. The watchdog was the messenger, not the bug.
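
The overflow can be checked with simple arithmetic. The sketch below uses only figures quoted in this PR (24 GB card, ~5 GB desktop baseline, ~23 GB for the bf16 upcast, 15.2 GB measured with fp8 storage); the helper names are illustrative and do not come from imagine_win.py. It ignores activations, so the real headroom is smaller than what it prints.

```python
CARD_GB = 24.0     # RTX 3090 VRAM
DESKTOP_GB = 5.0   # approximate Windows desktop/browser baseline quoted in the PR


def headroom_gb(resident_weights_gb: float) -> float:
    """VRAM left for activations after the desktop baseline and the weights."""
    return round(CARD_GB - DESKTOP_GB - resident_weights_gb, 1)


def spills(resident_weights_gb: float) -> bool:
    """True when the weights alone exceed what the desktop leaves free."""
    return headroom_gb(resident_weights_gb) < 0


print(headroom_gb(23.0), spills(23.0))  # bf16 upcast: -4.0 True
print(headroom_gb(15.2), spills(15.2))  # fp8 storage: 3.8 False
```

A negative headroom is the condition under which the driver starts spilling to shared system RAM; the positive but small headroom in the fp8 case matches the "still borderline" note further down.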

Fix

Keep the transformer stored as fp8 and upcast per-layer to bf16 during the forward pass via transformer.enable_layerwise_casting(storage_dtype=float8_e4m3fn, compute_dtype=bf16).

Ampere (3090, sm_86) has no fp8 tensor cores (those start at Ada sm_89), so we store in fp8 but compute in bf16 — the correct strategy here, and a better fit than the enable_model_cpu_offload() the sibling runners use (which would page the model in/out and run slower). Opt out with IMAGINE_WIN_FP8=0 (or false/never).
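
A minimal sketch of how the toggle and the casting call could fit together, not the code that was merged. `fp8_storage_enabled` and `apply_fp8_storage` are hypothetical names, and the case-insensitive, whitespace-trimmed matching is an assumption; the real parsing follows `_runner_common.py`, which is not shown here. `enable_layerwise_casting` is the diffusers model method named above and requires a diffusers release that provides it.

```python
import os

# Opt-out spellings named in the PR description.
_OFF_VALUES = {"0", "false", "never"}


def fp8_storage_enabled(env=None) -> bool:
    """Hypothetical helper: fp8 storage stays on unless IMAGINE_WIN_FP8 opts out."""
    env = os.environ if env is None else env
    raw = env.get("IMAGINE_WIN_FP8", "")
    # Assumption: matching is case-insensitive and ignores surrounding whitespace.
    return raw.strip().lower() not in _OFF_VALUES


def apply_fp8_storage(transformer, env=None) -> bool:
    """Keep weights in fp8 and upcast each layer to bf16 for compute.

    Returns True if layerwise casting was enabled, False if the flag opted out.
    """
    if not fp8_storage_enabled(env):
        return False
    import torch  # lazy import so the flag parsing works without torch installed

    transformer.enable_layerwise_casting(
        storage_dtype=torch.float8_e4m3fn,  # what sits in VRAM
        compute_dtype=torch.bfloat16,       # what each layer runs in (no fp8 tensor cores on sm_86)
    )
    return True
```

In a runner this would be called once on the pipeline's transformer after loading and before moving it to the GPU, for example `apply_fp8_storage(pipe.transformer)`, where `pipe` stands for whatever pipeline object the script builds.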

Verification

Ran real generations through the configured interpreter (miniconda3 python, torch 2.7.1+cu118):

| Metric | Before (bf16) | After (fp8 storage) |
| --- | --- | --- |
| Resident transformer | ~23 GB | 15.2 GB (measured) |
| Per diffusion step | 322 s (spilled) | 1.6 s |
| Outcome | watchdog SIGTERM | completes, valid 1024×1024 PNG |

All silent load gaps stay ≤71 s and every step emits a progress line, so the idle watchdog can no longer fire. Image quality is unaffected (fp8-storage / bf16-compute is the established ComfyUI-default technique).
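
The watchdog argument can be modelled as a check on the longest silence between output events. This is a sketch under the assumption that any output line resets the idle timer; the actual logic lives in server/services/mediaJobQueue/index.js and is not reproduced here. The two timelines are illustrative, built from the figures in this PR (a 321750 ms silent step, a 71 s load gap, 1.6 s steps), not copied from real logs.

```python
IDLE_LIMIT_MS = 300_000  # watchdog limit from the failure log


def max_idle_gap_ms(event_times_ms):
    """Longest silence between consecutive output events (job start included)."""
    pairs = zip(event_times_ms, event_times_ms[1:])
    return max((later - earlier for earlier, later in pairs), default=0)


def watchdog_would_fire(event_times_ms, limit_ms=IDLE_LIMIT_MS):
    """True if any silent stretch exceeds the idle limit."""
    return max_idle_gap_ms(event_times_ms) > limit_ms


# Before: one spilled diffusion step stays silent for longer than the limit.
before = [0, 321_750]
# After: a 71 s load gap, then a progress line roughly every 1.6 s.
after = [0, 71_000, 72_600, 74_200, 75_800]

print(watchdog_would_fire(before))  # True
print(watchdog_would_fire(after))   # False
```

The point of the model is that total render time does not matter to an idle watchdog; only the largest gap does, which is why per-step progress lines plus a bounded load gap are enough to keep the job alive.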

Note: at 15 GB resident + ~5 GB desktop baseline, 1024×1024 dev is still borderline — step time can vary 1.6 s–34 s under activation pressure, but it always completes now. Closing Chrome/Edge frees the headroom for consistently fast renders.

Scope

One self-contained legacy script (scripts/imagine_win.py). /simplify confirmed reuse/altitude are correctly scoped (no shared fp8 helper exists; fp8-storage is the right technique vs the siblings' cpu-offload) and aligned the env-flag parsing with the codebase convention (_runner_common.py).

🤖 Generated with Claude Code

atomantic added 2 commits June 5, 2026 16:51
imagine_win.py loaded the flux1-*-fp8 checkpoint with torch_dtype=bfloat16,
upcasting the fp8 weights to ~23GB. On a 24GB card that overflows once the
desktop/browser baseline (~5GB) and activations are counted, so the driver
spills to shared system RAM and each diffusion step takes minutes -- long
enough to trip the media-job idle watchdog (300s), which SIGTERMs the job.

Enable layerwise casting (storage fp8, compute bf16): weights stay ~15GB
resident and upcast per-layer during the forward pass. Ampere/3090 has no
fp8 tensor cores, so compute stays bf16. Verified: 1.6s/step (was 322s),
valid 1024x1024 output. Toggle off with IMAGINE_WIN_FP8=0.
@atomantic atomantic merged commit 51fd5af into main Jun 5, 2026
2 checks passed
@atomantic atomantic deleted the fix/imagine-win-fp8-vram branch June 5, 2026 20:54