Releases: huggingface/diffusers
Diffusers 0.36.0: Pipelines galore, new caching method, training scripts, and more 🎄
The release features a number of new image and video pipelines, a new caching method, a new training script, new kernels-powered attention backends, and more. It's packed with a lot of new stuff, so make sure you read the release notes fully 🚀
New image pipelines
- Flux2: Flux2 is the latest generation of image generation and editing models from Black Forest Labs. It can take multiple input images as references, making it versatile for different use cases.
- Z-Image: Z-Image is a best-in-class image generation model in the 6B-parameter regime. Thanks to @JerryWu-code in #12703.
- QwenImage Edit Plus: It’s an upgrade of QwenImage Edit and is capable of taking multiple input images as references. It can act as both a generation and an editing model. Thanks to @naykun for contributing in #12357.
- Bria FIBO: FIBO is trained on structured JSON captions up to 1,000+ words and designed to understand and control different visual parameters such as lighting, composition, color, and camera settings, enabling precise and reproducible outputs. Thanks to @galbria for contributing this in #12545.
- Kandinsky Image Lite: Kandinsky 5.0 Image Lite is a lightweight image generation model (6B parameters). Thanks to @leffff for contributing this in #12664.
- ChronoEdit: ChronoEdit reframes image editing as a video generation task, using input and edited images as start/end frames to leverage pretrained video models with temporal consistency. A temporal reasoning stage introduces reasoning tokens to ensure physically plausible edits and visualize the editing trajectory. Thanks to @zhangjiewu for contributing this in #12593.
New video pipelines
- Sana-Video: Sana-Video is a fast and efficient video generation model, equipped to handle long video sequences, thanks to its incorporation of linear attention. Thanks to @lawrence-cj for contributing this in #12634.
- Kandinsky 5: Kandinsky 5.0 T2V Lite is a lightweight video generation model (2B parameters) that ranks #1 among open-source models in its class. It outperforms larger models and offers the best understanding of Russian concepts in the open-source ecosystem. Thanks to @leffff for contributing this in #12478.
- Hunyuan 1.5: HunyuanVideo-1.5 is a lightweight yet powerful video generation model that achieves state-of-the-art visual quality and motion coherence with only 8.3 billion parameters, enabling efficient inference on consumer-grade GPUs.
- Wan Animate: Wan-Animate is a state-of-the-art character animation and replacement video model based on Wan2.1. Given a reference character image and a driving motion video, it can either animate the character with the motion from the driving video, or replace the existing character in the video with the reference character.
New kernels-powered attention backends
The kernels library helps you save a lot of time by providing pre-built kernel interfaces for various environments and accelerators. This release features three new kernels-powered attention backends:
- Flash Attention 3 (+ its `varlen` variant)
- Flash Attention 2 (+ its `varlen` variant)
- SAGE
This means that if any of the above backends is supported in your development environment, you can skip manually building the corresponding kernels and just use:
# Make sure you have `kernels` installed: `pip install kernels`.
# You can choose `flash_hub` or `sage_hub`, too.
pipe.transformer.set_attention_backend("_flash_3_hub")

For more details, check out the documentation.
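As a slightly fuller sketch (assuming a Flux checkpoint on a CUDA device; adjust the model ID and backend string to your setup), the backend can be switched right after loading the pipeline:

```python
import torch
from diffusers import FluxPipeline

# Requires `pip install kernels`; the kernel binaries are fetched from the Hub on first use.
pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16).to("cuda")

# Swap the attention implementation without building anything locally.
# "_flash_3_hub" uses Flash Attention 3; "flash_hub" and "sage_hub" are the other Hub-backed options.
pipe.transformer.set_attention_backend("_flash_3_hub")

image = pipe("A cat holding a sign that says hello world", num_inference_steps=28).images[0]
image.save("flux_fa3.png")
```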
TaylorSeer cache
TaylorSeer is now supported in Diffusers, delivering up to 3x speedups with negligible quality loss. Thanks to @toilaluan for contributing this in #12648. Check out the documentation here.
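A minimal sketch of how this might look, assuming TaylorSeer follows the same `enable_cache` pattern as the other caching methods in Diffusers (the `TaylorSeerCacheConfig` name below is an assumption; confirm the exact config class in the linked docs):

```python
import torch
from diffusers import FluxPipeline
# NOTE: the config class name is an assumption; check the caching docs for the exact name.
from diffusers import TaylorSeerCacheConfig

pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16).to("cuda")

# Caching methods in Diffusers are enabled on the denoiser via `enable_cache`.
pipe.transformer.enable_cache(TaylorSeerCacheConfig())

image = pipe("A photo of a corgi on a skateboard", num_inference_steps=28).images[0]
```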
New training script
Our Flux.2 integration features a LoRA fine-tuning script, which you can check out here. It includes a number of optimizations to help it run on consumer GPUs.
Misc
- Reusing `AttentionMixin`: Making compatible models subclass the `AttentionMixin` class helped us remove 2K lines of code. Going forward, users can expect more such refactorings that will make the library leaner and simpler. Check out #12463 for more details.
- Diffusers backend in SGLang: sgl-project/sglang#14112.
- We started the Diffusers MVP program to work with talented community members who will help us improve the library across multiple fronts. Check out the link for more information.
All commits
- remove unneeded checkpoint imports. by @sayakpaul in #12488
- [tests] fix clapconfig for text backbone in audioldm2 by @sayakpaul in #12490
- ltx0.9.8 (without IC lora, autoregressive sampling) by @yiyixuxu in #12493
- [docs] Attention checks by @stevhliu in #12486
- [CI] Check links by @stevhliu in #12491
- [ci] xfail more incorrect transformer imports. by @sayakpaul in #12455
- [tests] introduce `VAETesterMixin` to consolidate tests for slicing and tiling by @sayakpaul in #12374
- docs: cleanup of runway model by @EazyAl in #12503
- Kandinsky 5 is finally in Diffusers! by @leffff in #12478
- Remove Qwen Image Redundant RoPE Cache by @dg845 in #12452
- Raise warning instead of error when imports are missing for custom code by @DN6 in #12513
- Fix: Use incorrect temporary variable key when replacing adapter name… by @FeiXie8 in #12502
- [docs] Organize toctree by modality by @stevhliu in #12514
- styling issues. by @sayakpaul in #12522
- Add Photon model and pipeline support by @DavidBert in #12456
- purge HF_HUB_ENABLE_HF_TRANSFER; promote Xet by @Vaibhavs10 in #12497
- Prx by @DavidBert in #12525
- [core] `AutoencoderMixin` to abstract common methods by @sayakpaul in #12473
- Kandinsky5 No cfg fix by @asomoza in #12527
- Fix: Add _skip_keys for AutoencoderKLWan by @yiyixuxu in #12523
- [CI] xfail the test_wuerstchen_prior test by @sayakpaul in #12530
- [tests] Test attention backends by @sayakpaul in #12388
- fix CI bug for kandinsky3_img2img case by @kaixuanliu in #12474
- Fix MPS compatibility in get_1d_sincos_pos_embed_from_grid #12432 by @Aishwarya0811 in #12449
- Handle deprecated transformer classes by @DN6 in #12517
- fix constants.py to use `upper()` by @sayakpaul in #12479
- HunyuanImage21 by @yiyixuxu in #12333
- Loose the criteria tolerance appropriately for Intel XPU devices by @kaixuanliu in #12460
- Deprecate Stable Cascade by @DN6 in #12537
- [chore] Move guiders experimental warning by @sayakpaul in #12543
- Fix Chroma attention padding order and update docs to use `lodestones/Chroma1-HD` by @josephrocca in #12508
- Add AITER attention backend by @lauri9 in #12549
- Fix small inconsistency in output dimension of "_get_t5_prompt_embeds" function in sd3 pipeline by @alirezafarashah in #12531
- Kandinsky 5 10 sec (NABLA suport) by @leffff in #12520
- Improve pos embed for Flux.1 inference on Ascend NPU by @gameofdimension in #12534
- support latest few-step wan LoRA. by @sayakpaul in #12541
- [Pipelines] Enable Wan VACE to run since single transformer by @DN6 in #12428
- fix crash if tiling mode is enabled by @sywangyi in #12521
- Fix typos in kandinsky5 docs by @Meatfucker in #12552
- [ci] don't run sana layerwise casting tests in CI. by @sayakpaul in #12551
- Bria fibo by @galbria in #12545
- Avoiding graph break by changing the way we infer dtype in vae.decoder by @ppadjinTT in #12512
- [Modular] Fix for custom block kwargs by @DN6 in #12561
- [Modular] Allow custom blocks to be saved to `local_dir` by @DN6 in #12381
- Fix Stable Diffusion 3.x pooled prompt embedding with multiple images by @friedrich in #12306
- Fix custom code loading in Automodel by @DN6 in #12571
- [modular] better warn message by @yiyixuxu in #12573
- [tests] add tests for flux modular (t2i, i2i, kontext) by @sayakpaul in #12566
- [modular]pass hub_kwargs to load_config by @yiyixuxu in #12577
- ulysses enabling in native attention path by @sywangyi in #12563
- Kandinsky ...
🐞 fixes for `transformers` models, imports,
All commits
- Release: v0.35.1-patch by @sayakpaul (direct commit on v0.35.2-patch)
- handle offload_state_dict when initing transformers models by @sayakpaul in #12438
- [CI] Fix TRANSFORMERS_FLAX_WEIGHTS_NAME import issue by @DN6 in #12354
- Fix PyTorch 2.3.1 compatibility: add version guard for torch.library.… by @Aishwarya0811 in #12206
- fix scale_shift_factor being on cpu for wan and ltx by @vladmandic in #12347
- Release: v0.35.2-patch by @sayakpaul (direct commit on v0.35.2-patch)
v0.35.1 for improvements in Qwen-Image Edit
Diffusers 0.35.0: Qwen Image pipelines, Flux Kontext, Wan 2.2, and more
This release comes packed with new image generation and editing pipelines, a new video pipeline, new training scripts, quality-of-life improvements, and much more. Read the rest of the release notes fully to not miss out on the fun stuff.
New pipelines 🧨
We welcomed new pipelines in this release:
- Wan 2.2
- Flux-Kontext
- Qwen-Image
- Qwen-Image-Edit
Wan 2.2 📹
This update to Wan provides significant improvements in video fidelity, prompt adherence, and style. Please check out the official doc to learn more.
Flux-Kontext 🎇
Flux-Kontext is a 12-billion-parameter rectified flow transformer capable of editing images based on text instructions. Please check out the official doc to learn more about it.
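A minimal usage sketch (the input image URL and guidance value here are illustrative; see the official doc for the recommended settings):

```python
import torch
from diffusers import FluxKontextPipeline
from diffusers.utils import load_image

pipe = FluxKontextPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Kontext-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Kontext edits an existing image according to a text instruction.
input_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png"
)
edited = pipe(image=input_image, prompt="Make the cat wear a tiny wizard hat", guidance_scale=2.5).images[0]
edited.save("kontext-edit.png")
```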
Qwen-Image 🌅
After a successful run of delivering language models and vision-language models, the Qwen team is back with an image generation model, which is Apache-2.0 licensed! It achieves significant advances in complex text rendering and precise image editing. To learn more about this powerful model, refer to our docs.
Thanks to @naykun for contributing both Qwen-Image and Qwen-Image-Edit via this PR and this PR.
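As a quick, minimal sketch (the prompt and step count are illustrative):

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16).to("cuda")

# Qwen-Image is particularly strong at rendering legible text inside the generated image.
image = pipe(
    prompt='A coffee shop chalkboard that reads "Qwen-Image is here"',
    num_inference_steps=50,
).images[0]
image.save("qwen-image.png")
```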
New training scripts 🎛️
Make these newly added models your own with our training scripts:
Single-file modeling implementations
Following the 🤗 Transformers’ philosophy of single-file modeling implementations, we have started implementing modeling code in single and self-contained files. The Flux Transformer code is one example of this.
Attention refactor
We have massively refactored how we do attention in the models. This allows us to provide support for different attention backends (such as PyTorch native scaled_dot_product_attention, Flash Attention 3, SAGE attention, etc.) in the library seamlessly.
Having attention supported this way also allows us to integrate different parallelization mechanisms, which we’re actively working on. Follow this PR if you’re interested.
Users shouldn’t be affected at all by these changes. Please open an issue if you face any problems.
Regional compilation
Regional compilation trims cold-start latency by only compiling the small and frequently-repeated block(s) of a model - typically a transformer layer - and enables reusing compiled artifacts for every subsequent occurrence. For many diffusion architectures, this delivers the same runtime speedups as full-graph compilation and reduces compile time by 8–10x. Refer to this doc to learn more.
Thanks to @anijain2305 for contributing this feature in this PR.
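A minimal sketch of what this looks like in practice (assuming a Flux checkpoint; the helper compiles only the repeated transformer blocks):

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16).to("cuda")

# Compile only the repeated transformer blocks instead of the whole graph;
# cold-start compile time drops while most of the runtime speedup is kept.
pipe.transformer.compile_repeated_blocks(fullgraph=True)

image = pipe("An astronaut riding a horse on Mars", num_inference_steps=28).images[0]
```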
We have also authored a number of posts that center around the use of torch.compile. You can check them out at the links below:
- Presenting Flux Fast: Making Flux go brrr on H100s
- torch.compile and Diffusers: A Hands-On Guide to Peak Performance
- Fast LoRA inference for Flux with Diffusers and PEFT
Faster pipeline loading ⚡️
Users can now load pipelines directly onto an accelerator device, leading to significantly faster load times. This becomes particularly evident when loading large pipelines like Wan and Qwen-Image.
from diffusers import DiffusionPipeline
import torch
ckpt_id = "Qwen/Qwen-Image"
pipe = DiffusionPipeline.from_pretrained(
- ckpt_id, torch_dtype=torch.bfloat16
- ).to("cuda")
+ ckpt_id, torch_dtype=torch.bfloat16, device_map="cuda"
+ ) You can speed up loading even more by enabling parallelized loading of state dict shards. This is particularly helpful when you’re working with large models like Wan and Qwen-Image, where the model state dicts are typically sharded across multiple files.
import os
os.environ["HF_ENABLE_PARALLEL_LOADING"] = "yes"
# rest of the loading code
....

Better GGUF integration
@Isotr0py contributed support for native GGUF CUDA kernels in this PR. This should provide an approximately 10% improvement in inference speed.
We have also worked on a tool for converting regular checkpoints to GGUF, letting the community easily share their GGUF checkpoints. Learn more here.
We now support loading of Diffusers format GGUF checkpoints.
You can learn more about all of this in our GGUF official docs.
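For reference, a minimal sketch of loading a pre-quantized GGUF checkpoint into a Flux transformer (the GGUF file URL is illustrative; swap in the checkpoint you want to use):

```python
import torch
from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig

# Illustrative community GGUF checkpoint; replace with the quantized file of your choice.
ckpt_path = "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q2_K.gguf"

transformer = FluxTransformer2DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", transformer=transformer, torch_dtype=torch.bfloat16
).to("cuda")
```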
Modular Diffusers (Experimental)
Modular Diffusers is a system for building diffusion pipelines from individual pipeline blocks. It is highly customizable, with blocks that can be mixed and matched to adapt or create a pipeline for a specific workflow or multiple workflows.
The API is currently in active development and is being released as an experimental feature. Learn more in our docs.
All commits
- [tests] skip instead of returning. by @sayakpaul in #11793
- adjust to get CI test cases passed on XPU by @kaixuanliu in #11759
- fix deprecation in lora after 0.34.0 release by @sayakpaul in #11802
- [chore] post release v0.34.0 by @sayakpaul in #11800
- Follow up for Group Offload to Disk by @DN6 in #11760
- [rfc][compile] compile method for DiffusionPipeline by @anijain2305 in #11705
- [tests] add a test on torch compile for varied resolutions by @sayakpaul in #11776
- adjust tolerance criteria for `test_float16_inference` in unit test by @kaixuanliu in #11809
- Flux Kontext by @a-r-r-o-w in #11812
- Kontext training by @sayakpaul in #11813
- Kontext fixes by @a-r-r-o-w in #11815
- remove syncs before denoising in Kontext by @sayakpaul in #11818
- [CI] disable onnx, mps, flax from the CI by @sayakpaul in #11803
- TorchAO compile + offloading tests by @a-r-r-o-w in #11697
- Support dynamically loading/unloading loras with group offloading by @a-r-r-o-w in #11804
- [lora] fix: lora unloading behvaiour by @sayakpaul in #11822
- [lora]feat: use exclude modules to loraconfig. by @sayakpaul in #11806
- ENH: Improve speed of function expanding LoRA scales by @BenjaminBossan in #11834
- Remove print statement in SCM Scheduler by @a-r-r-o-w in #11836
- [tests] add test for hotswapping + compilation on resolution changes by @sayakpaul in #11825
- reset deterministic in tearDownClass by @jiqing-feng in #11785
- [tests] Fix failing float16 cuda tests by @a-r-r-o-w in #11835
- [single file] Cosmos by @a-r-r-o-w in #11801
- [docs] fix single_file example. by @sayakpaul in #11847
- Use real-valued instead of complex tensors in Wan2.1 RoPE by @mjkvaak-amd in #11649
- [docs] Batch generation by @stevhliu in #11841
- [docs] Deprecated pipelines by @stevhliu in #11838
- fix norm not training in train_control_lora_flux.py by @Luo-Yihang in #11832
- [From Single File] support `from_single_file` method for `WanVACE3DTransformer` by @J4BEZ in #11807
- [lora] tests for `exclude_modules` with Wan VACE by @sayakpaul in #11843
- update: FluxKontextInpaintPipeline support by @vuongminh1907 in #11820
- [Flux Kontext] Support Fal Kontext LoRA by @linoytsaban in #11823
- [docs] Add a note of `_keep_in_fp32_modules` by @a-r-r-o-w in #11851
- [benchmarks] overhaul benchmarks by @sayakpaul in #11565
- FIX set_lora_device when target layers differ by @BenjaminBossan in #11844
- Fix Wan AccVideo/CausVid fuse_lora by @a-r-r-o-w in #11856
- [chore] deprecate blip controlnet pipeline. by @sayakpaul in #11877
- [docs] fix references in flux pipelines. by @sayakpaul in #11857
- [tests] remove tests for deprecated pipelines. by @sayakpaul in #11879
- [docs] LoRA metadata by @stevhliu in #11848
- [training ] add Kontext i2i training by @sayakpaul in #11858
- [CI] Fix big GPU test marker by @DN6 in #11786
- First Block Cache by @a-r-r-o-w in #11180
- [tests] annotate compilation test classes with bnb by @sayakpaul in #11715
- Update chroma.md by @shm4r7 in #11891
- [CI] Speed up GPU PR Tests by @DN6 in #11887
- Pin k-diffusion for CI by @sayakpaul in #11894
- [Docker] update doc builder dockerfile to include quant libs. by @sayakpaul in #11728
- [tests] Remove more deprecated tests by @sayakpaul in #11895
- [tests] mark the wanvace lora tester flaky by @sayakpaul in #11883
- [tests] add compile + offload tests for GGUF. by @sayakpaul in #11740
- feat: add multiple input image support in Flux Kontext by @Net-Mist in #11880
- Fix unique memory address when doing group-offloading with disk by @sayakpaul in #11767
- [SD3] CFG Cutoff fix and official callback by @asomoza in #11890
- The Modular Diffusers by @yiyixuxu in #9672
- [quant] QoL improvements for pipeline-level quant config by @sayakpaul in ...
Diffusers 0.34.0: New Image and Video Models, Better torch.compile Support, and more
📹 New video generation pipelines
Wan VACE
Wan VACE supports various generation techniques which achieve controllable video generation. It comes in two variants: a 1.3B model for fast iteration and prototyping, and a 14B model for high-quality generation. Some of the capabilities include:
- Control to Video (Depth, Pose, Sketch, Flow, Grayscale, Scribble, Layout, Bounding Box, etc.). Recommended library for preprocessing videos to obtain control videos: huggingface/controlnet_aux
- Image/Video to Video (first frame, last frame, starting clip, ending clip, random clips)
- Inpainting and Outpainting
- Subject to Video (faces, object, characters, etc.)
- Composition to Video (reference anything, animate anything, swap anything, expand anything, move anything, etc.)
The code snippets available in this pull request demonstrate some examples of how videos can be generated with controllability signals.
Check out the docs to learn more.
Cosmos Predict2 Video2World
Cosmos-Predict2 is a key branch of the Cosmos World Foundation Models (WFMs) ecosystem for Physical AI, specializing in future state prediction through advanced world modeling. It offers two powerful capabilities: text-to-image generation for creating high-quality images from text descriptions, and video-to-world generation for producing visual simulations from video inputs.
The Video2World model comes in 2B and 14B variants. Check out the docs to learn more.
LTX 0.9.7 and Distilled
LTX 0.9.7 and its distilled variants are the latest in the family of models released by Lightricks.
Check out the docs to learn more.
Hunyuan Video Framepack and F1
Framepack is a novel method for enabling long video generation. There are two released variants of Hunyuan Video trained using this technique. Check out the docs to learn more.
FusionX
The FusionX family of models and LoRAs, built on top of Wan2.1-14B, should already be supported. To load the model, use from_single_file():
import torch
from diffusers import WanTransformer3DModel

transformer = WanTransformer3DModel.from_single_file(
    "https://huggingface.co/vrgamedevgirl84/Wan14BT2VFusioniX/blob/main/Wan14Bi2vFusioniX_fp16.safetensors",
    torch_dtype=torch.bfloat16
)

To load the LoRAs, use load_lora_weights():
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-14B-Diffusers",
    torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights(
    "vrgamedevgirl84/Wan14BT2VFusioniX", weight_name="FusionX_LoRa/Wan2.1_T2V_14B_FusionX_LoRA.safetensors"
)

AccVideo and CausVid (only LoRAs)
AccVideo and CausVid are two novel distillation techniques that speed up the generation time of video diffusion models while preserving quality. Diffusers supports loading their extracted LoRAs with their respective models.
🌠 New image generation pipelines
Cosmos Predict2 Text2Image
Text-to-image models from the Cosmos-Predict2 release. The models come in 2B and 14B variants. Check out the docs to learn more.
Chroma
Chroma is an 8.9B-parameter model based on FLUX.1-schnell. It's fully Apache 2.0 licensed, ensuring that anyone can use, modify, and build on top of it. Check out the docs to learn more.
Thanks to @Ednaordinary for contributing it in this PR!
VisualCloze
VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning is an innovative universal image generation framework based on visual in-context learning that offers key capabilities:
- Support for various in-domain tasks
- Generalization to unseen tasks through in-context learning
- Unify multiple tasks into one step and generate both target image and intermediate results
- Support reverse-engineering conditions from target images
Check out the docs to learn more. Thanks to @lzyhha for contributing this in this PR!
Better torch.compile support
We have worked with the PyTorch team to improve how we provide torch.compile() compatibility throughout the library. More specifically, we now test widely used models like Flux for recompilation and graph-break issues that can get in the way of fully realizing torch.compile() benefits. Refer to the following links to learn more:
Additionally, users can combine offloading with compilation to get a better speed-memory trade-off. Below is an example:
Code
import torch
from diffusers import DiffusionPipeline
torch._dynamo.config.cache_size_limit = 10000
pipeline = DiffusionPipeline.from_pretrained(
"black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
pipeline.enable_model_cpu_offload()
# Compile.
pipeline.transformer.compile()
image = pipeline(
prompt="An astronaut riding a horse on Mars",
guidance_scale=0.,
height=768,
width=1360,
num_inference_steps=4,
max_sequence_length=256,
).images[0]
print(f"Max memory reserved: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")This is compatible with group offloading, too. Interested readers can check out the concerned PRs below:
You can substantially reduce memory requirements by combining quantization with offloading and then improving speed with torch.compile(). Below is an example:
Code
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
from diffusers import AutoModel, FluxPipeline
from transformers import T5EncoderModel
import torch
torch._dynamo.config.recompile_limit = 1000
quant_kwargs = {"load_in_4bit": True, "bnb_4bit_compute_dtype": torch_dtype, "bnb_4bit_quant_type": "nf4"}
text_encoder_2_quant_config = TransformersBitsAndBytesConfig(**quant_kwargs)
dit_quant_config = DiffusersBitsAndBytesConfig(**quant_kwargs)
ckpt_id = "black-forest-labs/FLUX.1-dev"
text_encoder_2 = T5EncoderModel.from_pretrained(
ckpt_id,
subfolder="text_encoder_2",
quantization_config=text_encoder_2_quant_config,
torch_dtype=torch_dtype,
)
transformer = AutoModel.from_pretrained(
ckpt_id,
subfolder="transformer",
quantization_config=dit_quant_config,
torch_dtype=torch_dtype,
)
pipe = FluxPipeline.from_pretrained(
ckpt_id,
transformer=transformer,
text_encoder_2=text_encoder_2,
torch_dtype=torch_dtype,
)
pipe.enable_model_cpu_offload()
pipe.transformer.compile()
image = pipe(
prompt="An astronaut riding a horse on Mars",
guidance_scale=3.5,
height=768,
width=1360,
num_inference_steps=28,
max_sequence_length=512,
).images[0]

Starting from bitsandbytes==0.46.0, bnb-quantized models should be fully compatible with torch.compile() without graph breaks. This means that when compiling a bnb-quantized model, users can do model.compile(fullgraph=True). This can significantly improve speed while still providing memory benefits. The figure below provides a comparison with Flux.1-Dev. Refer to this benchmarking script to learn more.
Note that for 4-bit bnb models, you currently need a PyTorch nightly build if fullgraph=True is specified during compilation.
Huge shoutout to @anijain2305 and @StrongerXi from the PyTorch team for the incredible support.
PipelineQuantizationConfig
Users can now provide a quantization config while initializing a pipeline:
import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig
pipeline_quant_config = PipelineQuantizationConfig(
quant_backend="bitsandbytes_4bit",
quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
components_to_quantize=["transformer", "text_encoder_2"],
)
pipe = DiffusionPipeline.from_pretrained(
"black-forest-labs/FLUX.1-dev",
quantization_config=pipeline_quant_config,
torch_dtype=torch.bfloat16,
).to("cuda")
image = pipe("photo of a cute dog").images[0]This reduces the barrier to entry for our users willing to use quantization without having to write too much code. Refer to the documentation to learn more about [different configurations](https://huggingface.co/docs/diffusers/main/en/quantization/overview...
v0.33.1: fix ftfy import
Diffusers 0.33.0: New Image and Video Models, Memory Optimizations, Caching Methods, Remote VAEs, New Training Scripts, and more
New Pipelines for Video Generation
Wan 2.1
Wan2.1 is a comprehensive and open suite of video foundation models that pushes the boundaries of video generation. The release includes four model variants and three pipelines for Text-to-Video, Image-to-Video, and Video-to-Video.
- Wan-AI/Wan2.1-T2V-1.3B-Diffusers
- Wan-AI/Wan2.1-T2V-14B-Diffusers
- Wan-AI/Wan2.1-I2V-14B-480P-Diffusers
- Wan-AI/Wan2.1-I2V-14B-720P-Diffusers
Check out the docs here to learn more.
LTX Video 0.9.5
LTX Video 0.9.5 is the updated version of the super-fast LTX Video model series. The latest model introduces additional conditioning options, such as keyframe-based animation and video extension (both forward and backward).
To support these additional conditioning inputs, we’ve introduced the LTXConditionPipeline and LTXVideoCondition object.
To learn more about the usage, check out the docs here.
Hunyuan Image to Video
Hunyuan utilizes a pre-trained Multimodal Large Language Model (MLLM) with a Decoder-Only architecture as the text encoder. The input image is processed by the MLLM to generate semantic image tokens. These tokens are then concatenated with the video latent tokens, enabling comprehensive full-attention computation across the combined data and seamlessly integrating information from both the image and its associated caption.
To learn more, check out the docs here.
Others
- EasyAnimateV5 (thanks to @bubbliiiing for contributing this in this PR)
- ConsisID (thanks to @SHYuanBest for contributing this in this PR)
New Pipelines for Image Generation
Sana-Sprint
SANA-Sprint is an efficient diffusion model for ultra-fast text-to-image generation. SANA-Sprint is built on a pre-trained foundation model and augmented with hybrid distillation, dramatically reducing inference steps from 20 to 1-4, rivaling the quality of models like Flux.
Shoutout to @lawrence-cj for their help and guidance on this PR.
Check out the pipeline docs of SANA-Sprint to learn more.
Lumina2
Lumina-Image-2.0 is a 2B parameter flow-based diffusion transformer for text-to-image generation released under the Apache 2.0 license.
Check out the docs to learn more. Thanks to @zhuole1025 for contributing this through this PR.
One can also LoRA fine-tune Lumina2, taking advantage of its Apache 2.0 licensing. Check out the guide for more details.
Omnigen
OmniGen is a unified image generation model that can handle multiple tasks including text-to-image, image editing, subject-driven generation, and various computer vision tasks within a single framework. The model consists of a VAE, and a single transformer based on Phi-3 that handles text and image encoding as well as the diffusion process.
Check out the docs to learn more about OmniGen. Thanks to @staoxiao for contributing OmniGen in this PR.
Others
- CogView4 (thanks to @zRzRzRzRzRzRzR for contributing CogView4 in this PR)
New Memory Optimizations
Layerwise Casting
PyTorch supports torch.float8_e4m3fn and torch.float8_e5m2 as weight storage dtypes, but they can’t be used for computation on many devices due to unimplemented kernel support.
However, you can still use these dtypes to store model weights in FP8 precision and upcast them to a widely supported dtype such as torch.float16 or torch.bfloat16 on-the-fly when the layers are used in the forward pass. This is known as layerwise weight-casting. This can potentially cut down the VRAM requirements of a model by 50%.
Code
import torch
from diffusers import CogVideoXPipeline, CogVideoXTransformer3DModel
from diffusers.utils import export_to_video
model_id = "THUDM/CogVideoX-5b"
# Load the model in bfloat16 and enable layerwise casting
transformer = CogVideoXTransformer3DModel.from_pretrained(model_id, subfolder="transformer", torch_dtype=torch.bfloat16)
transformer.enable_layerwise_casting(storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.bfloat16)
# Load the pipeline
pipe = CogVideoXPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch.bfloat16)
pipe.to("cuda")
prompt = (
"A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
"The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
"pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
"casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
"The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
"atmosphere of this unique musical performance."
)
video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
export_to_video(video, "output.mp4", fps=8)

Group Offloading
Group offloading is the middle ground between sequential and model offloading. It works by offloading groups of internal layers (either torch.nn.ModuleList or torch.nn.Sequential), which uses less memory than model-level offloading. It is also faster than sequential-level offloading because the number of device synchronizations is reduced.
On CUDA devices, we also have the option to enable layer prefetching with CUDA streams. The next layer to be executed is loaded onto the accelerator device while the current layer is being executed, which makes inference substantially faster while still keeping VRAM requirements very low. With this, we introduce the idea of overlapping computation with data transfer.
One thing to note is that using CUDA streams can cause a considerable spike in CPU RAM usage. Please ensure that the available CPU RAM is 2 times the size of the model if you choose to set use_stream=True. You can reduce CPU RAM usage by setting low_cpu_mem_usage=True. This should limit the CPU RAM used to be roughly the same as the size of the model, but will introduce slight latency in the inference process.
You can also use record_stream=True when using use_stream=True to obtain more speedups at the expense of slightly increased memory usage.
Code
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video
# Load the pipeline
onload_device = torch.device("cuda")
offload_device = torch.device("cpu")
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
# We can utilize the enable_group_offload method for Diffusers model implementations
pipe.transformer.enable_group_offload(
onload_device=onload_device,
offload_device=offload_device,
offload_type="leaf_level",
use_stream=True
)
prompt = (
"A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
"The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
"pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
"casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
"The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
"atmosphere of this unique musical performance."
)
video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
# This utilized about 14.79 GB. It can be further reduced by using tiling and using leaf_level offloading throughout the pipeline.
print(f"Max memory reserved: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
export_to_video(video, "output.mp4", fps=8)Group offloading can also be applied to non-Diffusers models such as text encoders from the transformers library.
Code
import torch
from diffusers import CogVideoXPipeline
from diffusers.hooks import apply_group_offloading
from diffusers.utils import export_to_video
# Load the pipeline
onload_device = torch.device("cuda")
offload_device = torch.device("cpu")
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
# For any other model implementations, the apply_group_offloading function can be used
apply_group_offloading(pipe.text_encoder, onload_device=onload_device, offload_type="block_level", num_blocks_per_group=2)

Remote Components
Remote components are an experimental feature designed to offload memory-intensive steps of t...
v0.32.2
Fixes for Flux Single File loading, LoRA loading for 4bit BnB Flux, Hunyuan Video
This patch release
- Fixes a regression in loading Comfy UI format single file checkpoints for Flux
- Fixes a regression in loading LoRAs with bitsandbytes 4bit quantized Flux models
- Adds `unload_lora_weights` for Flux Control
- Fixes a bug that prevents Hunyuan Video from running with batch size > 1
- Allow Hunyuan Video to load LoRAs created from the original repository code
All commits
- [Single File] Fix loading Flux Dev finetunes with Comfy Prefix by @DN6 in #10545
- [CI] Update HF Token on Fast GPU Model Tests by @DN6 #10570
- [CI] Update HF Token in Fast GPU Tests by @DN6 #10568
- Fix batch > 1 in HunyuanVideo by @hlky in #10548
- Fix HunyuanVideo produces NaN on PyTorch<2.5 by @hlky in #10482
- Fix hunyuan video attention mask dim by @a-r-r-o-w in #10454
- [LoRA] Support original format loras for HunyuanVideo by @a-r-r-o-w in #10376
- [LoRA] feat: support loading loras into 4bit quantized Flux models. by @sayakpaul in #10578
- [LoRA] clean up `load_lora_into_text_encoder()` and `fuse_lora()` copied from by @sayakpaul in #10495
- [LoRA] feat: support `unload_lora_weights()` for Flux Control. by @sayakpaul in #10206
- Fix Flux multiple Lora loading bug by @maxs-kan in #10388
- [LoRA] fix: lora unloading when using expanded Flux LoRAs. by @sayakpaul in #10397
v0.32.1
TorchAO Quantizer fixes
This patch release fixes a few bugs related to the TorchAO Quantizer introduced in v0.32.0.
- Importing Diffusers would raise an error in PyTorch versions lower than 2.3.0. This should no longer be a problem.
- Device Map does not work as expected when using the quantizer. We now raise an error if it is used. Support for using device maps with different quantization backends will be added in the near future.
- Quantization was not performed due to faulty logic. This is now fixed and better tested.
Refer to our documentation to learn more about how to use different quantization backends.
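For context, a minimal sketch of the TorchAO backend usage this patch fixes (the quantization type string and checkpoint are illustrative; see the quantization docs for the supported options):

```python
import torch
from diffusers import FluxTransformer2DModel, TorchAoConfig

# int8 weight-only quantization; other quant types are listed in the TorchAO docs.
quant_config = TorchAoConfig("int8wo")
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
```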
All commits
- make style for #10368 by @yiyixuxu in #10370
- fix test pypi installation in the release workflow by @sayakpaul in #10360
- Fix TorchAO related bugs; revert device_map changes by @a-r-r-o-w in #10371
Diffusers 0.32.0: New video pipelines, new image pipelines, new quantization backends, new training scripts, and more
This release took a while, but it has many exciting updates. It contains several new pipelines for image and video generation, new quantization backends, and more.
Going forward, to provide more transparency to the community about ongoing developments and releases in Diffusers, we will be making use of a roadmap tracker.
New Video Generation Pipelines 📹
Open video generation models are on the rise, and we’re pleased to provide comprehensive integration support for all of them. The following video pipelines are bundled in this release:
Check out this section to learn more about the fine-tuning options available for these new video models.
New Image Generation Pipelines
- SANA
- Flux Control (including Control LoRA)
- Flux Redux
- Flux Fill Inpainting / Outpainting
- Flux RF-Inversion
- SD3.5 ControlNet
- ControlNet Union XL
- SD3.5 IP Adapter
- Flux IP adapter
Important Note about the new Flux Models
We can combine the regular Flux.1 Dev LoRAs with Flux Control LoRAs, Flux Control, and Flux Fill. For example, you can enable few-step inference with Flux Fill using:
from diffusers import FluxFillPipeline
from diffusers.utils import load_image
import torch
pipe = FluxFillPipeline.from_pretrained(
"black-forest-labs/FLUX.1-Fill-dev", torch_dtype=torch.bfloat16
).to("cuda")
adapter_id = "alimama-creative/FLUX.1-Turbo-Alpha"
pipe.load_lora_weights(adapter_id)
image = load_image("https://huggingface.co/datasets/diffusers/diffusers-images-docs/resolve/main/cup.png")
mask = load_image("https://huggingface.co/datasets/diffusers/diffusers-images-docs/resolve/main/cup_mask.png")
image = pipe(
prompt="a white paper cup",
image=image,
mask_image=mask,
height=1632,
width=1232,
guidance_scale=30,
num_inference_steps=8,
max_sequence_length=512,
generator=torch.Generator("cpu").manual_seed(0)
).images[0]
image.save("flux-fill-dev.png")To learn more, check out the documentation.
Note
SANA is a small model compared to models like Flux; Sana-0.6B can be deployed on a 16GB laptop GPU and takes less than 1 second to generate a 1024×1024 resolution image. We support LoRA fine-tuning of SANA. Check out this section for more details.
Acknowledgements
- Shoutout to @lawrence-cj and @chenjy2003 for contributing SANA in this PR. SANA also features a Deep Compression Autoencoder, which was contributed by @lawrence-cj in this PR.
- Shoutout to @guiyrt for contributing SD3.5 IP Adapter in this PR.
New Quantization Backends
Please be aware of the following caveats:
- TorchAO quantized checkpoints cannot currently be serialized in `safetensors`. This may change in the future.
- GGUF currently only supports loading pre-quantized checkpoints into models in this release. Support for saving models with GGUF quantization will be added in the future.
New training scripts
This release features many new training scripts for the community to play with:
All commits
- post-release 0.31.0 by @sayakpaul in #9742
- fix bug in `require_accelerate_version_greater` by @faaany in #9746
- [Official callbacks] SDXL Controlnet CFG Cutoff by @asomoza in #9311
- [SD3-5 dreambooth lora] update model cards by @linoytsaban in #9749
- config attribute not foud error for FluxImagetoImage Pipeline for multi controlnet solved by @rshah240 in #9586
- Some minor updates to the nightly and push workflows by @sayakpaul in #9759
- [Docs] fix docstring typo in SD3 pipeline by @shenzhiy21 in #9765
- [bugfix] bugfix for npu free memory by @leisuzz in #9640
- [research_projects] add flux training script with quantization by @sayakpaul in #9754
- Add a doc for AWS Neuron in Diffusers by @JingyaHuang in #9766
- [refactor] enhance readability of flux related pipelines by @Luciennnnnnn in #9711
- Added Support of Xlabs controlnet to FluxControlNetInpaintPipeline by @SahilCarterr in #9770
- [research_projects] Update README.md to include a note about NF5 T5-xxl by @sayakpaul in #9775
- [Fix] train_dreambooth_lora_flux_advanced ValueError: unexpected save model: <class 'transformers.models.t5.modeling_t5.T5EncoderModel'> by @rootonchair in #9777
- [Fix] remove setting lr for T5 text encoder when using prodigy in flux dreambooth lora script by @biswaroop1547 in #9473
- [SD 3.5 Dreambooth LoRA] support configurable training block & layers by @linoytsaban in #9762
- [flux dreambooth lora training] make LoRA target modules configurable + small bug fix by @linoytsaban in #9646
- adds the pipeline for pixart alpha controlnet by @raulc0399 in #8857
- [core] Allegro T2V by @a-r-r-o-w in #9736
- Allegro VAE fix by @a-r-r-o-w in #9811
- [CI] add new runner for testing by @sayakpaul in #9699
- [training] fixes to the quantization training script and add AdEMAMix optimizer as an option by @sayakpaul in #9806
- [training] use the lr when using 8bit adam. by @sayakpaul in #9796
- [Tests] clean up and refactor gradient checkpointing tests by @sayakpaul in #9494
- [CI] add a big GPU marker to run memory-intensive tests separately on CI by @sayakpaul in #9691
- [LoRA] fix: lora loading when using with a device_mapped model. by @sayakpaul in #9449
- Revert "[LoRA] fix: lora loading when using with a device_mapped mode… by @yiyixuxu in #9823
- [Model Card] standardize advanced diffusion training sd15 lora by @chiral-carbon in #7613
- NPU Adaption for FLUX by @leisuzz in #9751
- Fixes EMAModel "from_pretrained" method by @SahilCarterr in #9779
- Update train_controlnet_flux.py,Fix size mismatch issue in validation by @ScilenceForest in #9679
- Handling mixed precision for dreambooth flux lora training by @icsl-Jeon in #9565
- Reduce Memory Cost in Flux Training by @leisuzz in #9829
- Add Diffusion Policy for Reinforcement Learning by @DorsaRoh in #9824
- [feat] add `load_lora_adapter()` for compatible models by @sayakpaul in #9712
- Refac training utils.py by @RogerSinghChugh in #9815
- [core] Mochi T2V by @a-r-r-o-w in #9769
- [Fix] Test of sd3 lora by @SahilCarterr in #9843
- Fix: Remove duplicated comma in distributed_inference.md by @vahidaskari in #9868
- Add new community pipeline for 'Adaptive Mask Inpainting', introduced in [ECCV2024] ComA by @jellyheadandrew in #9228
- Updated _encode_prompt_with_clip and encode_prompt in train_dreamboth_sd3 by @SahilCarterr in #9800
- [Core] introduce `controlnet` module by @sayakpaul in #8768
- [Flux] reduce explicit device transfers and typecasting in flux. by @sayakpaul in #9817
- Improve downloads of sharded variants by @DN6 in #9869
- [fix] Replaced shutil.copy with shutil.copyfile by @SahilCarterr in #9885
- Enabling gradient checkpointing in eval() mode by @MikeTkachuk in #9878
- [FIX] Fix TypeError in DreamBooth SDXL when use_dora is False by @SahilCarterr in #9879
- [Advanced LoRA v1.5] fix: gradient unscaling problem by @sayakpaul in #7018
- Revert "[Flux] reduce explicit device transfers and typecasting in flux." by @sayakpaul in #9896
- Feature IP Adapter Xformers Attention Processor by @elismasilva in #9881
- Notebooks for Community Scripts Examples by @ParagEkbote in #9905
- Fix Progress Bar Updates in SD 1.5 PAG Img2Img pipeline by @painebenjamin in #9925
- Update pipeline_flux_img2img.py by @example-git in #9928
- add de...
