[Enhancement] Extend quantization support to InfuseNet via --infusenet_quant (NF4,FP8,INT8,INT4)#49
Open
dzhengAP wants to merge 2 commits into bytedance:main from
Conversation
added 2 commits on April 5, 2026 14:26
## Motivation
The existing --quantize_8bit flag quantizes the FLUX transformer and T5
text encoder via optimum.quanto, but InfuseNet (the identity-injection
side-network, ~6 GB in bf16) is always loaded in full precision.
This PR adds --infusenet_quant to quantize InfuseNet independently.
## Changes
pipelines/pipeline_infu_flux.py:
- Add qfloat8, qint4 to optimum.quanto imports
- Add optional bitsandbytes import (BNB_AVAILABLE flag, graceful fallback)
- Add quantize_infusenet parameter to InfUFluxPipeline.__init__
- NF4 mode: uses BitsAndBytesConfig + FluxControlNetModel.from_pretrained
for true 4-bit inference (weights stay in 4-bit during compute)
- fp8/int8/int4 modes: use optimum.quanto (weight-only, dequantizes to
bf16 for compute - useful for load-time memory, not peak inference VRAM)
test.py:
- Add --infusenet_quant {nf4,fp8,int8,int4} argument
- Pass through to InfUFluxPipeline constructor
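The test.py side of the change can be sketched as a plain argparse flag whose choices mirror the modes above. This is an illustrative sketch, not the diff itself; the exact help strings and how args are wired into InfUFluxPipeline are assumptions:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Minimal sketch of the CLI surface; only --infusenet_quant is new.
    parser = argparse.ArgumentParser(description="InfU FLUX inference")
    parser.add_argument("--quantize_8bit", action="store_true",
                        help="Quantize FLUX transformer + T5 via optimum.quanto")
    parser.add_argument("--infusenet_quant", default=None,
                        choices=["nf4", "fp8", "int8", "int4"],
                        help="Quantize InfuseNet (default: None = bf16)")
    return parser

args = build_parser().parse_args(["--quantize_8bit", "--infusenet_quant", "nf4"])
print(args.infusenet_quant)  # nf4
print(args.quantize_8bit)    # True
```

Leaving the default at None preserves the current bf16 behavior for anyone not passing the flag.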
## Benchmark results (NVIDIA H100 80GB HBM3, sm_90, seed=42, 10 steps)
| Config | Load VRAM | Peak VRAM | Delta |
|-----------------------------------------|-----------|-----------|----------|
| --quantize_8bit (baseline) | 19.26 GB | 21.54 GB | - |
| --quantize_8bit --infusenet_quant nf4 | 17.94 GB | 20.19 GB | -1.35 GB |
| --quantize_8bit --infusenet_quant int8 | 19.29 GB | 21.54 GB | ~0 |
| --infusenet_quant nf4 (FLUX bf16) | 33.21 GB | 35.47 GB | N/A |
| --infusenet_quant int8 (FLUX bf16) | 34.57 GB | 36.82 GB | N/A |
| --infusenet_quant fp8 (FLUX bf16) | 34.58 GB | 36.83 GB | N/A |
| --infusenet_quant int4 (FLUX bf16) | 33.26 GB | 35.52 GB | N/A |
Key finding: --quantize_8bit --infusenet_quant nf4 reduces peak inference
VRAM by 1.35 GB with no visible quality degradation.
NF4 (bitsandbytes) is the only mode that reduces peak inference VRAM
because weights stay in 4-bit during compute. optimum.quanto modes
(fp8/int8/int4) dequantize to bf16 for every matmul so peak activation
memory is unchanged - these modes reduce load-time and serialization
size only.
Effective combination: --quantize_8bit --infusenet_quant nf4
Using --infusenet_quant alone without --quantize_8bit does not help much,
because the FLUX transformer in bf16 (~24 GB) dominates memory use.
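The backend-vs-peak-VRAM distinction above can be captured in a small lookup table (the names below are illustrative, not from the PR):

```python
# Which backend serves each --infusenet_quant mode, and whether weights
# stay quantized during compute (i.e. whether peak inference VRAM drops).
QUANT_MODES = {
    # mode: (backend, compute stays quantized)
    "nf4":  ("bitsandbytes", True),     # true 4-bit compute
    "fp8":  ("optimum.quanto", False),  # weight-only; dequantizes to bf16
    "int8": ("optimum.quanto", False),
    "int4": ("optimum.quanto", False),
}

def reduces_peak_vram(mode: str) -> bool:
    """True only for modes whose weights stay quantized during matmuls."""
    return QUANT_MODES[mode][1]

print([m for m in QUANT_MODES if reduces_peak_vram(m)])  # ['nf4']
```

This matches the benchmark table: only the nf4 rows show a lower peak than their baseline.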
## Requirements
nf4 : pip install bitsandbytes
fp8/int8/int4: pip install optimum-quanto (already in requirements.txt)
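The graceful fallback mentioned in the changes can look like the sketch below. BNB_AVAILABLE follows the flag name in the diff; the helper function and error wording are assumptions:

```python
# Optional bitsandbytes import: nf4 needs it, the quanto modes do not.
try:
    import bitsandbytes  # noqa: F401
    BNB_AVAILABLE = True
except ImportError:
    BNB_AVAILABLE = False

def check_quant_requirements(mode):
    """Fail early with an actionable message instead of deep in model loading."""
    if mode == "nf4" and not BNB_AVAILABLE:
        raise RuntimeError(
            "--infusenet_quant nf4 requires bitsandbytes: pip install bitsandbytes"
        )

check_quant_requirements(None)    # bf16 default: nothing to check
check_quant_requirements("int8")  # quanto mode: no bitsandbytes needed
```

Checking at argument-parse time keeps a missing optional dependency from surfacing as an obscure failure mid-load.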
## Backward compatibility
Default is None (bf16, unchanged behavior). All existing flags work as before.
Signed-off-by: David Zheng <dqzheng1996@gmail.com>
## Motivation
InfuseNet is architecturally a full DiT side-network that mirrors the FLUX transformer — at ~6 GB in bf16, it's the second largest component after FLUX itself (~24 GB). Unlike IP-Adapter approaches that inject identity through attention layers, InfuseNet runs a complete parallel forward pass at every denoising step, making it a significant and persistent memory resident throughout inference.
The existing --quantize_8bit flag covers FLUX + T5 but leaves InfuseNet in bf16, so users who need memory reduction only get partial coverage. Since InfuseNet is frozen at inference time (pure forward pass, no gradients), it is an ideal post-training quantization (PTQ) target, for the same reason the FLUX transformer quantizes cleanly. The motivation is completing the quantization story: if a user is already running --quantize_8bit to fit within a memory budget, they should also be able to quantize InfuseNet with the same ease.

## Usage

Enable the new flag alongside the existing one in test.py, e.g. --quantize_8bit --infusenet_quant nf4.
## Benchmark (H100 80GB, sm_90, seed=42, 10 steps)
| Config | Load VRAM | Peak VRAM | Delta |
|-----------------------------------------|-----------|-----------|----------|
| --quantize_8bit (baseline) | 19.26 GB | 21.54 GB | - |
| --quantize_8bit --infusenet_quant nf4 | 17.94 GB | 20.19 GB | -1.35 GB |
| --quantize_8bit --infusenet_quant int8 | 19.29 GB | 21.54 GB | ~0 |
| --infusenet_quant nf4 (FLUX bf16) | 33.21 GB | 35.47 GB | N/A |
| --infusenet_quant int8 (FLUX bf16) | 34.57 GB | 36.82 GB | N/A |
| --infusenet_quant fp8 (FLUX bf16) | 34.58 GB | 36.83 GB | N/A |
| --infusenet_quant int4 (FLUX bf16) | 33.26 GB | 35.52 GB | N/A |

[Comparison image]
## Notes

- Use --quantize_8bit --infusenet_quant nf4 together for the effective combination.
- Default is None (bf16, fully backward compatible).

## Requirements

- nf4: pip install bitsandbytes
- fp8/int8/int4: optimum-quanto (already in requirements.txt)

cc: @EndlessSora @YuminJia @lark @helloworld575