fix: store Flux-dev transformer as fp8 to fit the 3090's VRAM #965
Merged
imagine_win.py loaded the flux1-*-fp8 checkpoint with torch_dtype=bfloat16, upcasting the fp8 weights to ~23GB. On a 24GB card that overflows once the desktop/browser baseline (~5GB) and activations are counted, so the driver spills to shared system RAM and each diffusion step takes minutes -- long enough to trip the media-job idle watchdog (300s), which SIGTERMs the job. Enable layerwise casting (storage fp8, compute bf16): weights stay ~15GB resident and upcast per-layer during the forward pass. Ampere/3090 has no fp8 tensor cores, so compute stays bf16. Verified: 1.6s/step (was 322s), valid 1024x1024 output. Toggle off with IMAGINE_WIN_FP8=0.
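The change described above can be sketched roughly as follows. This is a minimal illustration, not the code from imagine_win.py: the helper names fp8_storage_enabled and apply_fp8_storage are hypothetical, the case-insensitive flag matching is an assumption, and the only library call relied on is the enable_layerwise_casting(storage_dtype=..., compute_dtype=...) method named in this PR. The dtypes are passed in by the caller so the sketch does not need torch to be imported.

```python
import os

# Opt-out values named in this PR. Matching them case-insensitively and
# ignoring surrounding whitespace is an assumption of this sketch.
_OFF_VALUES = {"0", "false", "never"}


def fp8_storage_enabled(env=None, var="IMAGINE_WIN_FP8"):
    """Return True unless the flag is set to one of the opt-out values.

    An unset flag counts as enabled, since the PR describes it as a toggle
    to switch the behaviour off.
    """
    source = os.environ if env is None else env
    return source.get(var, "1").strip().lower() not in _OFF_VALUES


def apply_fp8_storage(transformer, storage_dtype, compute_dtype, env=None):
    """Enable layerwise casting on `transformer` when the flag allows it.

    Returns True if casting was enabled, False if the user opted out.
    """
    if not fp8_storage_enabled(env):
        return False
    # Weights stay resident in storage_dtype and are upcast to
    # compute_dtype layer by layer during the forward pass.
    transformer.enable_layerwise_casting(
        storage_dtype=storage_dtype,
        compute_dtype=compute_dtype,
    )
    return True
```

With a loaded pipeline object (here assumed to be called pipe), the call would look like apply_fp8_storage(pipe.transformer, torch.float8_e4m3fn, torch.bfloat16). Keeping the flag check separate from the casting call makes the opt-out path easy to exercise without a GPU.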
Problem
Local Flux-dev/schnell image generation (the legacy Windows imagine_win.py runner) was getting killed mid-render.

Root cause: imagine_win.py loaded the flux1-*-fp8.safetensors checkpoint with torch_dtype=bfloat16, upcasting the fp8 weights to ~23 GB. On the 24 GB RTX 3090, with ~5 GB already consumed by the Windows desktop and browsers, that overflows VRAM, the driver silently spills to shared system RAM, and each diffusion step balloons from ~1.6 s to 322 s. The media-job idle watchdog (300 s, server/services/mediaJobQueue/index.js) then correctly SIGTERMs the job as hung. The watchdog was the messenger, not the bug.

Fix
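To see why fp8 storage fits where the bf16 load did not, the VRAM budget can be checked with the figures quoted in this PR (24 GB card, ~5 GB desktop baseline, ~23 GB of bf16 weights, ~15 GB of fp8 weights). The function name fits_in_vram is hypothetical, and the check ignores activations, so it is optimistic by design.

```python
def fits_in_vram(weights_gb, total_gb=24.0, baseline_gb=5.0):
    """True if the weights fit in what is left after the desktop baseline.

    Activations are not counted, so a True result is an upper bound on
    how comfortable the fit really is.
    """
    return weights_gb <= total_gb - baseline_gb


BF16_WEIGHTS_GB = 23.0  # fp8 checkpoint upcast to bf16 at load time
FP8_WEIGHTS_GB = 15.0   # fp8 storage with per-layer upcasting

print(fits_in_vram(BF16_WEIGHTS_GB))  # False: 23 GB > 19 GB free
print(fits_in_vram(FP8_WEIGHTS_GB))   # True: 15 GB <= 19 GB free
```

The bf16 load exceeds the free budget before any activations are counted, which is consistent with the driver spilling to shared system RAM.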
Keep the transformer stored as fp8 and upcast per-layer to bf16 during the forward pass via transformer.enable_layerwise_casting(storage_dtype=float8_e4m3fn, compute_dtype=bf16). Ampere (3090, sm_86) has no fp8 tensor cores (those start at Ada, sm_89), so we store in fp8 but compute in bf16. That is the correct strategy here, and a better fit than the enable_model_cpu_offload() the sibling runners use (which would page the model in/out and run slower). Opt out with IMAGINE_WIN_FP8=0 (or false/never).

Verification
Ran real generations through the configured interpreter (miniconda3 python, torch 2.7.1+cu118). All silent load gaps stay ≤71 s and every step emits a progress line, so the idle watchdog can no longer fire. Image quality is unaffected (fp8-storage / bf16-compute is the established ComfyUI-default technique).
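The watchdog argument can be expressed as a small check. This is only a sketch: it assumes the watchdog fires when the time since the last progress line strictly exceeds the 300 s idle timeout, which may differ in detail from the logic in mediaJobQueue, and the function name watchdog_would_fire is hypothetical.

```python
def watchdog_would_fire(silent_gaps_s, idle_timeout_s=300.0):
    """True if any gap between progress lines exceeds the idle timeout."""
    return any(gap > idle_timeout_s for gap in silent_gaps_s)


# Before the fix: a single diffusion step took 322 s with no output.
print(watchdog_would_fire([322.0]))      # True

# After the fix: longest silent load gap is 71 s, steps take ~1.6 s.
print(watchdog_would_fire([71.0, 1.6]))  # False
```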
Scope
One self-contained legacy script (scripts/imagine_win.py). /simplify confirmed reuse/altitude are correctly scoped (no shared fp8 helper exists; fp8-storage is the right technique vs the siblings' cpu-offload) and aligned the env-flag parsing with the codebase convention (_runner_common.py).

🤖 Generated with Claude Code