[Enhancement] Extend quantization support to InfuseNet via --infusenet_quant (NF4,FP8,INT8,INT4)#49

Open
dzhengAP wants to merge 2 commits into bytedance:main from dzhengAP:feat/infusenet-quantization
Conversation


dzhengAP commented Apr 6, 2026

Motivation

InfuseNet is architecturally a full DiT side-network that mirrors the FLUX transformer — at ~6 GB in bf16, it's the second largest component after FLUX itself (~24 GB). Unlike IP-Adapter approaches that inject identity through attention layers, InfuseNet runs a complete parallel forward pass at every denoising step, making it a significant and persistent memory resident throughout inference.

The existing --quantize_8bit flag covers FLUX + T5 but leaves InfuseNet in bf16, which means users who need memory reduction are only getting partial coverage. Since InfuseNet is frozen at inference time (pure forward-pass, no gradients), it's an ideal PTQ target — the same reason the FLUX transformer quantizes cleanly.

The motivation is completing the quantization story: if a user is already running --quantize_8bit to fit within a memory budget, they should also be able to quantize InfuseNet with the same ease.

Usage

# Recommended: 1.35 GB peak VRAM reduction, no quality loss
python test.py --quantize_8bit --infusenet_quant nf4 ...

# Other modes (reduce load-time/serialization size, not peak inference VRAM)
python test.py --infusenet_quant int8 ...
python test.py --infusenet_quant fp8  ...
python test.py --infusenet_quant int4 ...

Benchmark (H100 80GB, sm_90, seed=42, 10 steps)

| Config                                  | Load VRAM | Peak VRAM | Δ        |
|-----------------------------------------|-----------|-----------|----------|
| --quantize_8bit (baseline)              | 19.26 GB  | 21.54 GB  | -        |
| --quantize_8bit --infusenet_quant nf4   | 17.94 GB  | 20.19 GB  | -1.35 GB |
| --quantize_8bit --infusenet_quant int8  | 19.29 GB  | 21.54 GB  | ~0       |
| --infusenet_quant nf4  (FLUX bf16)      | 33.21 GB  | 35.47 GB  | N/A      |
| --infusenet_quant int8 (FLUX bf16)      | 34.57 GB  | 36.82 GB  | N/A      |
| --infusenet_quant fp8  (FLUX bf16)      | 34.58 GB  | 36.83 GB  | N/A      |
| --infusenet_quant int4 (FLUX bf16)      | 33.26 GB  | 35.52 GB  | N/A      |

Comparison Image

(comparison image attached to the PR)

Notes

  • NF4 (bitsandbytes) is the only mode that reduces peak inference VRAM — weights stay in 4-bit during compute
  • fp8/int8/int4 (optimum.quanto) dequantize to bf16 for matmuls — reduce load-time memory only
  • Best used as --quantize_8bit --infusenet_quant nf4 together
  • Default is None (bf16, fully backward compatible)
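The NF4-vs-quanto distinction in the notes above can be illustrated with a toy weight-only quantizer. This is a pure-Python sketch with hypothetical helper names, not the optimum.quanto or bitsandbytes API: int8 storage shrinks the serialized weights, but the dequantize step restores full-precision values for each matmul, which is why weight-only modes do not reduce peak compute memory.

```python
# Toy weight-only absmax quantization (illustrative, not a real library API).

def quantize_absmax_int8(weights):
    """Quantize a list of floats to int8 codes with a per-tensor absmax scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Restore full-precision values for compute.

    This is the step that keeps peak activation/compute memory the same
    as bf16: only the *stored* weights are small.
    """
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_absmax_int8(weights)
deq = dequantize(q, scale)
# Round-trip error is bounded by the quantization step size.
assert max(abs(a - b) for a, b in zip(weights, deq)) <= scale
```

NF4 differs in that bitsandbytes keeps the 4-bit representation resident during compute (with blockwise scales), so the saving survives into peak inference VRAM rather than only the load/serialization path.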

Requirements

  • NF4: pip install bitsandbytes
  • fp8/int8/int4: optimum-quanto (already in requirements.txt)

cc: @EndlessSora @YuminJia @lark @helloworld575

root added 2 commits April 5, 2026 14:26

## Motivation

The existing --quantize_8bit flag quantizes the FLUX transformer and T5
text encoder via optimum.quanto, but InfuseNet (the identity injection
side-network, ~6GB in bf16) is always loaded in full precision regardless.
This PR adds --infusenet_quant to quantize InfuseNet independently.

## Changes

pipelines/pipeline_infu_flux.py:
- Add qfloat8, qint4 to optimum.quanto imports
- Add optional bitsandbytes import (BNB_AVAILABLE flag, graceful fallback)
- Add quantize_infusenet parameter to InfUFluxPipeline.__init__
- NF4 mode: uses BitsAndBytesConfig + FluxControlNetModel.from_pretrained
  for true 4-bit inference (weights stay in 4-bit during compute)
- fp8/int8/int4 modes: use optimum.quanto (weight-only, dequantizes to
  bf16 for compute - useful for load-time memory, not peak inference VRAM)
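The optional bitsandbytes import with graceful fallback can be sketched as follows. BNB_AVAILABLE is the flag named in this PR; the helper function and its error message are illustrative:

```python
# Optional-dependency guard: NF4 requires bitsandbytes, but the pipeline
# should still import cleanly when it is absent.
try:
    import bitsandbytes  # noqa: F401
    BNB_AVAILABLE = True
except ImportError:
    BNB_AVAILABLE = False

def require_bnb_for_nf4():
    """Fail with a clear message if nf4 is requested without bitsandbytes."""
    if not BNB_AVAILABLE:
        raise RuntimeError(
            "--infusenet_quant nf4 requires bitsandbytes: pip install bitsandbytes"
        )
```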

test.py:
- Add --infusenet_quant {nf4,fp8,int8,int4} argument
- Pass through to InfUFluxPipeline constructor
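The test.py wiring above can be sketched like this. The --infusenet_quant argument, its choices, and the quantize_infusenet parameter name come from this PR; the surrounding parser setup is illustrative:

```python
import argparse

# Minimal sketch of the CLI flag added in this PR.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--infusenet_quant",
    choices=["nf4", "fp8", "int8", "int4"],
    default=None,  # None => bf16, fully backward compatible
    help="Quantization mode for InfuseNet",
)

args = parser.parse_args(["--infusenet_quant", "nf4"])
assert args.infusenet_quant == "nf4"
# The parsed value is then passed through to the pipeline, e.g.:
# pipe = InfUFluxPipeline(..., quantize_infusenet=args.infusenet_quant)
```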

## Benchmark results (NVIDIA H100 80GB HBM3, sm_90, seed=42, 10 steps)

| Config                                  | Load VRAM | Peak VRAM | Delta    |
|-----------------------------------------|-----------|-----------|----------|
| --quantize_8bit (baseline)              | 19.26 GB  | 21.54 GB  | -        |
| --quantize_8bit --infusenet_quant nf4   | 17.94 GB  | 20.19 GB  | -1.35 GB |
| --quantize_8bit --infusenet_quant int8  | 19.29 GB  | 21.54 GB  | ~0       |
| --infusenet_quant nf4  (FLUX bf16)      | 33.21 GB  | 35.47 GB  | N/A      |
| --infusenet_quant int8 (FLUX bf16)      | 34.57 GB  | 36.82 GB  | N/A      |
| --infusenet_quant fp8  (FLUX bf16)      | 34.58 GB  | 36.83 GB  | N/A      |
| --infusenet_quant int4 (FLUX bf16)      | 33.26 GB  | 35.52 GB  | N/A      |

Key finding: --quantize_8bit --infusenet_quant nf4 reduces peak inference
VRAM by 1.35 GB with no visible quality degradation.

NF4 (bitsandbytes) is the only mode that reduces peak inference VRAM
because weights stay in 4-bit during compute. optimum.quanto modes
(fp8/int8/int4) dequantize to bf16 for every matmul so peak activation
memory is unchanged - these modes reduce load-time and serialization
size only.

Effective combination: --quantize_8bit --infusenet_quant nf4
Using --infusenet_quant alone without --quantize_8bit does not help much,
because the FLUX transformer in bf16 (~24 GB) dominates total memory.

## Requirements

nf4  : pip install bitsandbytes
fp8/int8/int4: pip install optimum-quanto (already in requirements.txt)

## Backward compatibility

Default is None (bf16, unchanged behavior). All existing flags work as before.

Signed-off-by: David Zheng <dqzheng1996@gmail.com>
dzhengAP changed the title from "[Enhancement] Extend quantization support to InfuseNet via --infusenet_quant {nf4,fp8,int8,int4}" to "[Enhancement] Extend quantization support to InfuseNet via --infusenet_quant {NF4,FP8,INT8,INT4}" on Apr 6, 2026
dzhengAP changed the title from "[Enhancement] Extend quantization support to InfuseNet via --infusenet_quant {NF4,FP8,INT8,INT4}" to "[Enhancement] Extend quantization support to InfuseNet via --infusenet_quant (NF4,FP8,INT8,INT4)" on Apr 6, 2026