[Enhancement] Extend quantization support to InfuseNet via --infusenet_quant (NF4,FP8,INT8,INT4)#49
Open
dzhengAP wants to merge 2 commits into bytedance:main from
Conversation
added 2 commits on April 5, 2026 14:26
## Motivation
The existing --quantize_8bit flag quantizes the FLUX transformer and T5
text encoder via optimum.quanto, but InfuseNet (the identity-injection
side-network, ~6 GB in bf16) is always loaded in full precision.
This PR adds --infusenet_quant to quantize InfuseNet independently.
## Changes
pipelines/pipeline_infu_flux.py:
- Add qfloat8, qint4 to optimum.quanto imports
- Add optional bitsandbytes import (BNB_AVAILABLE flag, graceful fallback)
- Add quantize_infusenet parameter to InfUFluxPipeline.__init__
- NF4 mode: uses BitsAndBytesConfig + FluxControlNetModel.from_pretrained
for true 4-bit inference (weights stay in 4-bit during compute)
- fp8/int8/int4 modes: use optimum.quanto (weight-only, dequantizes to
bf16 for compute - useful for load-time memory, not peak inference VRAM)
test.py:
- Add --infusenet_quant {nf4,fp8,int8,int4} argument
- Pass through to InfUFluxPipeline constructor
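The test.py side of the change can be sketched as a plain argparse flag whose choices mirror the modes above. This is an illustrative sketch, not the diff itself; the exact help strings and how args are wired into InfUFluxPipeline are assumptions:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Minimal sketch of the CLI surface; only --infusenet_quant is new.
    parser = argparse.ArgumentParser(description="InfU FLUX inference")
    parser.add_argument("--quantize_8bit", action="store_true",
                        help="Quantize FLUX transformer + T5 via optimum.quanto")
    parser.add_argument("--infusenet_quant", default=None,
                        choices=["nf4", "fp8", "int8", "int4"],
                        help="Quantize InfuseNet (default: None = bf16)")
    return parser

args = build_parser().parse_args(["--quantize_8bit", "--infusenet_quant", "nf4"])
print(args.infusenet_quant)  # nf4
print(args.quantize_8bit)    # True
```

Leaving the default at None preserves the current bf16 behavior for anyone not passing the flag.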
## Benchmark results (NVIDIA H100 80GB HBM3, sm_90, seed=42, 10 steps)
| Config | Load VRAM | Peak VRAM | Delta |
|-----------------------------------------|-----------|-----------|----------|
| --quantize_8bit (baseline) | 19.26 GB | 21.54 GB | - |
| --quantize_8bit --infusenet_quant nf4 | 17.94 GB | 20.19 GB | -1.35 GB |
| --quantize_8bit --infusenet_quant int8 | 19.29 GB | 21.54 GB | ~0 |
| --infusenet_quant nf4 (FLUX bf16) | 33.21 GB | 35.47 GB | N/A |
| --infusenet_quant int8 (FLUX bf16) | 34.57 GB | 36.82 GB | N/A |
| --infusenet_quant fp8 (FLUX bf16) | 34.58 GB | 36.83 GB | N/A |
| --infusenet_quant int4 (FLUX bf16) | 33.26 GB | 35.52 GB | N/A |
Key finding: --quantize_8bit --infusenet_quant nf4 reduces peak inference
VRAM by 1.35 GB with no visible quality degradation.
NF4 (bitsandbytes) is the only mode that reduces peak inference VRAM
because weights stay in 4-bit during compute. optimum.quanto modes
(fp8/int8/int4) dequantize to bf16 for every matmul so peak activation
memory is unchanged - these modes reduce load-time and serialization
size only.
Effective combination: --quantize_8bit --infusenet_quant nf4
Using --infusenet_quant alone without --quantize_8bit does not help much,
because the FLUX transformer in bf16 (~24 GB) dominates memory use.
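The backend-vs-peak-VRAM distinction above can be captured in a small lookup table (the names below are illustrative, not from the PR):

```python
# Which backend serves each --infusenet_quant mode, and whether weights
# stay quantized during compute (i.e. whether peak inference VRAM drops).
QUANT_MODES = {
    # mode: (backend, compute stays quantized)
    "nf4":  ("bitsandbytes", True),     # true 4-bit compute
    "fp8":  ("optimum.quanto", False),  # weight-only; dequantizes to bf16
    "int8": ("optimum.quanto", False),
    "int4": ("optimum.quanto", False),
}

def reduces_peak_vram(mode: str) -> bool:
    """True only for modes whose weights stay quantized during matmuls."""
    return QUANT_MODES[mode][1]

print([m for m in QUANT_MODES if reduces_peak_vram(m)])  # ['nf4']
```

This matches the benchmark table: only the nf4 rows show a lower peak than their baseline.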
## Requirements
nf4 : pip install bitsandbytes
fp8/int8/int4: pip install optimum-quanto (already in requirements.txt)
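The graceful fallback mentioned in the changes can look like the sketch below. BNB_AVAILABLE follows the flag name in the diff; the helper function and error wording are assumptions:

```python
# Optional bitsandbytes import: nf4 needs it, the quanto modes do not.
try:
    import bitsandbytes  # noqa: F401
    BNB_AVAILABLE = True
except ImportError:
    BNB_AVAILABLE = False

def check_quant_requirements(mode):
    """Fail early with an actionable message instead of deep in model loading."""
    if mode == "nf4" and not BNB_AVAILABLE:
        raise RuntimeError(
            "--infusenet_quant nf4 requires bitsandbytes: pip install bitsandbytes"
        )

check_quant_requirements(None)    # bf16 default: nothing to check
check_quant_requirements("int8")  # quanto mode: no bitsandbytes needed
```

Checking at argument-parse time keeps a missing optional dependency from surfacing as an obscure failure mid-load.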
## Backward compatibility
Default is None (bf16, unchanged behavior). All existing flags work as before.
Signed-off-by: David Zheng <dqzheng1996@gmail.com>
## Motivation
InfuseNet is architecturally a full DiT side-network that mirrors the FLUX transformer — at ~6 GB in bf16, it's the second largest component after FLUX itself (~24 GB). Unlike IP-Adapter approaches that inject identity through attention layers, InfuseNet runs a complete parallel forward pass at every denoising step, making it a significant and persistent memory resident throughout inference.
The existing --quantize_8bit flag covers FLUX + T5 but leaves InfuseNet in bf16, so users who need memory reduction only get partial coverage. Since InfuseNet is frozen at inference time (pure forward pass, no gradients), it is an ideal post-training quantization (PTQ) target, for the same reason the FLUX transformer quantizes cleanly. The motivation is completing the quantization story: if a user is already running --quantize_8bit to fit within a memory budget, they should also be able to quantize InfuseNet with the same ease.

## Usage

Enable the new flag alongside the existing one in test.py, e.g. --quantize_8bit --infusenet_quant nf4.
## Benchmark (H100 80GB, sm_90, seed=42, 10 steps)
| Config | Load VRAM | Peak VRAM | Delta |
|-----------------------------------------|-----------|-----------|----------|
| --quantize_8bit (baseline) | 19.26 GB | 21.54 GB | - |
| --quantize_8bit --infusenet_quant nf4 | 17.94 GB | 20.19 GB | -1.35 GB |
| --quantize_8bit --infusenet_quant int8 | 19.29 GB | 21.54 GB | ~0 |
| --infusenet_quant nf4 (FLUX bf16) | 33.21 GB | 35.47 GB | N/A |
| --infusenet_quant int8 (FLUX bf16) | 34.57 GB | 36.82 GB | N/A |
| --infusenet_quant fp8 (FLUX bf16) | 34.58 GB | 36.83 GB | N/A |
| --infusenet_quant int4 (FLUX bf16) | 33.26 GB | 35.52 GB | N/A |

[Comparison image]
## Notes

- Use --quantize_8bit --infusenet_quant nf4 together for the effective combination.
- Default is None (bf16, fully backward compatible).

## Requirements

- nf4: pip install bitsandbytes
- fp8/int8/int4: optimum-quanto (already in requirements.txt)

cc: @EndlessSora @YuminJia @lark @helloworld575