Releases: huggingface/diffusers
Diffusers 0.36.0: Pipelines galore, new caching method, training scripts, and more 🎄
The release features a number of new image and video pipelines, a new caching method, a new training script, new kernels-powered attention backends, and more. It's packed with a lot of new stuff, so make sure you read the release notes fully 🚀
New image pipelines
- Flux2: Flux2 is the latest generation of image generation and editing models from Black Forest Labs. It can take multiple input images as references, making it versatile for different use cases.
- Z-Image: Z-Image is a best-in-class image generation model in the 6B-parameter regime. Thanks to @JerryWu-code in #12703.
- QwenImage Edit Plus: It’s an upgrade of QwenImage Edit and is capable of taking multiple input images as references. It can act as both a generation and an editing model. Thanks to @naykun for contributing in #12357.
- Bria FIBO: FIBO is trained on structured JSON captions up to 1,000+ words and designed to understand and control different visual parameters such as lighting, composition, color, and camera settings, enabling precise and reproducible outputs. Thanks to @galbria for contributing this in #12545.
- Kandinsky Image Lite: Kandinsky 5.0 Image Lite is a lightweight image generation model (6B parameters). Thanks to @leffff for contributing this in #12664.
- ChronoEdit: ChronoEdit reframes image editing as a video generation task, using input and edited images as start/end frames to leverage pretrained video models with temporal consistency. A temporal reasoning stage introduces reasoning tokens to ensure physically plausible edits and visualize the editing trajectory. Thanks to @zhangjiewu for contributing this in #12593.
New video pipelines
- Sana-Video: Sana-Video is a fast and efficient video generation model, equipped to handle long video sequences, thanks to its incorporation of linear attention. Thanks to @lawrence-cj for contributing this in #12634.
- Kandinsky 5: Kandinsky 5.0 T2V Lite is a lightweight video generation model (2B parameters) that ranks #1 among open-source models in its class. It outperforms larger models and offers the best understanding of Russian concepts in the open-source ecosystem. Thanks to @leffff for contributing this in #12478.
- Hunyuan 1.5: HunyuanVideo-1.5 is a lightweight yet powerful video generation model that achieves state-of-the-art visual quality and motion coherence with only 8.3 billion parameters, enabling efficient inference on consumer-grade GPUs.
- Wan Animate: Wan-Animate is a state-of-the-art character animation and replacement video model based on Wan2.1. Given a reference character image and a driving motion video, it can either animate the character with the motion from the driving video, or replace the existing character in the video with the reference character.
New kernels-powered attention backends
The kernels library helps you save a lot of time by providing pre-built kernel interfaces for various environments and accelerators. This release features three new kernels-powered attention backends:
- Flash Attention 3 (+ its `varlen` variant)
- Flash Attention 2 (+ its `varlen` variant)
- SAGE
This means that if any of the above backends is supported in your development environment, you can skip manually building the corresponding kernels and just use:
# Make sure you have `kernels` installed: `pip install kernels`.
# You can choose `flash_hub` or `sage_hub`, too.
pipe.transformer.set_attention_backend("_flash_3_hub")

For more details, check out the documentation.
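As a slightly fuller sketch (assuming a Flux checkpoint on a CUDA device; adjust the model ID and backend string to your setup), the backend can be switched right after loading the pipeline:

```python
import torch
from diffusers import FluxPipeline

# Requires `pip install kernels`; the kernel binaries are fetched from the Hub on first use.
pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16).to("cuda")

# Swap the attention implementation without building anything locally.
# "_flash_3_hub" uses Flash Attention 3; "flash_hub" and "sage_hub" are the other Hub-backed options.
pipe.transformer.set_attention_backend("_flash_3_hub")

image = pipe("A cat holding a sign that says hello world", num_inference_steps=28).images[0]
image.save("flux_fa3.png")
```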
TaylorSeer cache
TaylorSeer is now supported in Diffusers, delivering up to 3x speedups with negligible quality loss. Thanks to @toilaluan for contributing this in #12648. Check out the documentation here.
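A minimal sketch of how this might look, assuming TaylorSeer follows the same `enable_cache` pattern as the other caching methods in Diffusers (the `TaylorSeerCacheConfig` name below is an assumption; confirm the exact config class in the linked docs):

```python
import torch
from diffusers import FluxPipeline
# NOTE: the config class name is an assumption; check the caching docs for the exact name.
from diffusers import TaylorSeerCacheConfig

pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16).to("cuda")

# Caching methods in Diffusers are enabled on the denoiser via `enable_cache`.
pipe.transformer.enable_cache(TaylorSeerCacheConfig())

image = pipe("A photo of a corgi on a skateboard", num_inference_steps=28).images[0]
```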
New training script
Our Flux.2 integration features a LoRA fine-tuning script, which you can check out here. It includes a number of optimizations to help it run on consumer GPUs.
Misc
- Reusing `AttentionMixin`: Making compatible models subclass the `AttentionMixin` class helped us remove 2K lines of code. Going forward, users can expect more such refactorings that will make the library leaner and simpler. Check out #12463 for more details.
- Diffusers backend in SGLang: sgl-project/sglang#14112.
- We started the Diffusers MVP program to work with talented community members who will help us improve the library across multiple fronts. Check out the link for more information.
All commits
- remove unneeded checkpoint imports. by @sayakpaul in #12488
- [tests] fix clapconfig for text backbone in audioldm2 by @sayakpaul in #12490
- ltx0.9.8 (without IC lora, autoregressive sampling) by @yiyixuxu in #12493
- [docs] Attention checks by @stevhliu in #12486
- [CI] Check links by @stevhliu in #12491
- [ci] xfail more incorrect transformer imports. by @sayakpaul in #12455
- [tests] introduce `VAETesterMixin` to consolidate tests for slicing and tiling by @sayakpaul in #12374
- docs: cleanup of runway model by @EazyAl in #12503
- Kandinsky 5 is finally in Diffusers! by @leffff in #12478
- Remove Qwen Image Redundant RoPE Cache by @dg845 in #12452
- Raise warning instead of error when imports are missing for custom code by @DN6 in #12513
- Fix: Use incorrect temporary variable key when replacing adapter name… by @FeiXie8 in #12502
- [docs] Organize toctree by modality by @stevhliu in #12514
- styling issues. by @sayakpaul in #12522
- Add Photon model and pipeline support by @DavidBert in #12456
- purge HF_HUB_ENABLE_HF_TRANSFER; promote Xet by @Vaibhavs10 in #12497
- Prx by @DavidBert in #12525
- [core] `AutoencoderMixin` to abstract common methods by @sayakpaul in #12473
- Kandinsky5 No cfg fix by @asomoza in #12527
- Fix: Add _skip_keys for AutoencoderKLWan by @yiyixuxu in #12523
- [CI] xfail the test_wuerstchen_prior test by @sayakpaul in #12530
- [tests] Test attention backends by @sayakpaul in #12388
- fix CI bug for kandinsky3_img2img case by @kaixuanliu in #12474
- Fix MPS compatibility in get_1d_sincos_pos_embed_from_grid #12432 by @Aishwarya0811 in #12449
- Handle deprecated transformer classes by @DN6 in #12517
- fix constants.py to use `upper()` by @sayakpaul in #12479
- HunyuanImage21 by @yiyixuxu in #12333
- Loose the criteria tolerance appropriately for Intel XPU devices by @kaixuanliu in #12460
- Deprecate Stable Cascade by @DN6 in #12537
- [chore] Move guiders experimental warning by @sayakpaul in #12543
- Fix Chroma attention padding order and update docs to use `lodestones/Chroma1-HD` by @josephrocca in #12508
- Add AITER attention backend by @lauri9 in #12549
- Fix small inconsistency in output dimension of "_get_t5_prompt_embeds" function in sd3 pipeline by @alirezafarashah in #12531
- Kandinsky 5 10 sec (NABLA suport) by @leffff in #12520
- Improve pos embed for Flux.1 inference on Ascend NPU by @gameofdimension in #12534
- support latest few-step wan LoRA. by @sayakpaul in #12541
- [Pipelines] Enable Wan VACE to run since single transformer by @DN6 in #12428
- fix crash if tiling mode is enabled by @sywangyi in #12521
- Fix typos in kandinsky5 docs by @Meatfucker in #12552
- [ci] don't run sana layerwise casting tests in CI. by @sayakpaul in #12551
- Bria fibo by @galbria in #12545
- Avoiding graph break by changing the way we infer dtype in vae.decoder by @ppadjinTT in #12512
- [Modular] Fix for custom block kwargs by @DN6 in #12561
- [Modular] Allow custom blocks to be saved to `local_dir` by @DN6 in #12381
- Fix Stable Diffusion 3.x pooled prompt embedding with multiple images by @friedrich in #12306
- Fix custom code loading in Automodel by @DN6 in #12571
- [modular] better warn message by @yiyixuxu in #12573
- [tests] add tests for flux modular (t2i, i2i, kontext) by @sayakpaul in #12566
- [modular]pass hub_kwargs to load_config by @yiyixuxu in #12577
- ulysses enabling in native attention path by @sywangyi in #12563
- Kandinsky ...
🐞 fixes for `transformers` models, imports,
All commits
- Release: v0.35.1-patch by @sayakpaul (direct commit on v0.35.2-patch)
- handle offload_state_dict when initing transformers models by @sayakpaul in #12438
- [CI] Fix TRANSFORMERS_FLAX_WEIGHTS_NAME import issue by @DN6 in #12354
- Fix PyTorch 2.3.1 compatibility: add version guard for torch.library.… by @Aishwarya0811 in #12206
- fix scale_shift_factor being on cpu for wan and ltx by @vladmandic in #12347
- Release: v0.35.2-patch by @sayakpaul (direct commit on v0.35.2-patch)
v0.35.1 for improvements in Qwen-Image Edit
Diffusers 0.35.0: Qwen Image pipelines, Flux Kontext, Wan 2.2, and more
This release comes packed with new image generation and editing pipelines, a new video pipeline, new training scripts, quality-of-life improvements, and much more. Read the rest of the release notes fully to not miss out on the fun stuff.
New pipelines 🧨
We welcomed new pipelines in this release:
- Wan 2.2
- Flux-Kontext
- Qwen-Image
- Qwen-Image-Edit
Wan 2.2 📹
This update to Wan provides significant improvements in video fidelity, prompt adherence, and style. Please check out the official doc to learn more.
Flux-Kontext 🎇
Flux-Kontext is a 12-billion-parameter rectified flow transformer capable of editing images based on text instructions. Please check out the official doc to learn more about it.
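A minimal usage sketch (the input image URL and guidance value here are illustrative; see the official doc for the recommended settings):

```python
import torch
from diffusers import FluxKontextPipeline
from diffusers.utils import load_image

pipe = FluxKontextPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Kontext-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Kontext edits an existing image according to a text instruction.
input_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png"
)
edited = pipe(image=input_image, prompt="Make the cat wear a tiny wizard hat", guidance_scale=2.5).images[0]
edited.save("kontext-edit.png")
```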
Qwen-Image 🌅
After a successful run of delivering language models and vision-language models, the Qwen team is back with an image generation model, which is Apache-2.0 licensed! It achieves significant advances in complex text rendering and precise image editing. To learn more about this powerful model, refer to our docs.
Thanks to @naykun for contributing both Qwen-Image and Qwen-Image-Edit via this PR and this PR.
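As a quick, minimal sketch (the prompt and step count are illustrative):

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16).to("cuda")

# Qwen-Image is particularly strong at rendering legible text inside the generated image.
image = pipe(
    prompt='A coffee shop chalkboard that reads "Qwen-Image is here"',
    num_inference_steps=50,
).images[0]
image.save("qwen-image.png")
```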
New training scripts 🎛️
Make these newly added models your own with our training scripts:
Single-file modeling implementations
Following the 🤗 Transformers’ philosophy of single-file modeling implementations, we have started implementing modeling code in single and self-contained files. The Flux Transformer code is one example of this.
Attention refactor
We have massively refactored how we do attention in the models. This allows us to provide support for different attention backends (such as PyTorch native scaled_dot_product_attention, Flash Attention 3, SAGE attention, etc.) in the library seamlessly.
Having attention supported this way also allows us to integrate different parallelization mechanisms, which we’re actively working on. Follow this PR if you’re interested.
Users shouldn’t be affected at all by these changes. Please open an issue if you face any problems.
Regional compilation
Regional compilation trims cold-start latency by only compiling the small and frequently-repeated block(s) of a model - typically a transformer layer - and enables reusing compiled artifacts for every subsequent occurrence. For many diffusion architectures, this delivers the same runtime speedups as full-graph compilation and reduces compile time by 8–10x. Refer to this doc to learn more.
Thanks to @anijain2305 for contributing this feature in this PR.
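A minimal sketch of what this looks like in practice (assuming a Flux checkpoint; the helper compiles only the repeated transformer blocks):

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16).to("cuda")

# Compile only the repeated transformer blocks instead of the whole graph;
# cold-start compile time drops while most of the runtime speedup is kept.
pipe.transformer.compile_repeated_blocks(fullgraph=True)

image = pipe("An astronaut riding a horse on Mars", num_inference_steps=28).images[0]
```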
We have also authored a number of posts that center around the use of torch.compile. You can check them out at the links below:
- Presenting Flux Fast: Making Flux go brrr on H100s
- torch.compile and Diffusers: A Hands-On Guide to Peak Performance
- Fast LoRA inference for Flux with Diffusers and PEFT
Faster pipeline loading ⚡️
Users can now load pipelines directly onto an accelerator device, leading to significantly faster load times. This becomes particularly evident when loading large pipelines like Wan and Qwen-Image.
from diffusers import DiffusionPipeline
import torch
ckpt_id = "Qwen/Qwen-Image"
pipe = DiffusionPipeline.from_pretrained(
- ckpt_id, torch_dtype=torch.bfloat16
- ).to("cuda")
+ ckpt_id, torch_dtype=torch.bfloat16, device_map="cuda"
+ ) You can speed up loading even more by enabling parallelized loading of state dict shards. This is particularly helpful when you’re working with large models like Wan and Qwen-Image, where the model state dicts are typically sharded across multiple files.
import os
os.environ["HF_ENABLE_PARALLEL_LOADING"] = "yes"
# rest of the loading code
....

Better GGUF integration
@Isotr0py contributed support for native GGUF CUDA kernels in this PR. This should provide an approximately 10% improvement in inference speed.
We have also worked on a tool for converting regular checkpoints to GGUF, letting the community easily share their GGUF checkpoints. Learn more here.
We now support loading of Diffusers format GGUF checkpoints.
You can learn more about all of this in our GGUF official docs.
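For reference, a minimal sketch of loading a pre-quantized GGUF checkpoint into a Flux transformer (the GGUF file URL is illustrative; swap in the checkpoint you want to use):

```python
import torch
from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig

# Illustrative community GGUF checkpoint; replace with the quantized file of your choice.
ckpt_path = "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q2_K.gguf"

transformer = FluxTransformer2DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", transformer=transformer, torch_dtype=torch.bfloat16
).to("cuda")
```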
Modular Diffusers (Experimental)
Modular Diffusers is a system for building diffusion pipelines from individual pipeline blocks. It is highly customizable, with blocks that can be mixed and matched to adapt or create a pipeline for a specific workflow or multiple workflows.
The API is currently in active development and is being released as an experimental feature. Learn more in our docs.
All commits
- [tests] skip instead of returning. by @sayakpaul in #11793
- adjust to get CI test cases passed on XPU by @kaixuanliu in #11759
- fix deprecation in lora after 0.34.0 release by @sayakpaul in #11802
- [chore] post release v0.34.0 by @sayakpaul in #11800
- Follow up for Group Offload to Disk by @DN6 in #11760
- [rfc][compile] compile method for DiffusionPipeline by @anijain2305 in #11705
- [tests] add a test on torch compile for varied resolutions by @sayakpaul in #11776
- adjust tolerance criteria for `test_float16_inference` in unit test by @kaixuanliu in #11809
- Flux Kontext by @a-r-r-o-w in #11812
- Kontext training by @sayakpaul in #11813
- Kontext fixes by @a-r-r-o-w in #11815
- remove syncs before denoising in Kontext by @sayakpaul in #11818
- [CI] disable onnx, mps, flax from the CI by @sayakpaul in #11803
- TorchAO compile + offloading tests by @a-r-r-o-w in #11697
- Support dynamically loading/unloading loras with group offloading by @a-r-r-o-w in #11804
- [lora] fix: lora unloading behvaiour by @sayakpaul in #11822
- [lora]feat: use exclude modules to loraconfig. by @sayakpaul in #11806
- ENH: Improve speed of function expanding LoRA scales by @BenjaminBossan in #11834
- Remove print statement in SCM Scheduler by @a-r-r-o-w in #11836
- [tests] add test for hotswapping + compilation on resolution changes by @sayakpaul in #11825
- reset deterministic in tearDownClass by @jiqing-feng in #11785
- [tests] Fix failing float16 cuda tests by @a-r-r-o-w in #11835
- [single file] Cosmos by @a-r-r-o-w in #11801
- [docs] fix single_file example. by @sayakpaul in #11847
- Use real-valued instead of complex tensors in Wan2.1 RoPE by @mjkvaak-amd in #11649
- [docs] Batch generation by @stevhliu in #11841
- [docs] Deprecated pipelines by @stevhliu in #11838
- fix norm not training in train_control_lora_flux.py by @Luo-Yihang in #11832
- [From Single File] support `from_single_file` method for `WanVACE3DTransformer` by @J4BEZ in #11807
- [lora] tests for `exclude_modules` with Wan VACE by @sayakpaul in #11843
- update: FluxKontextInpaintPipeline support by @vuongminh1907 in #11820
- [Flux Kontext] Support Fal Kontext LoRA by @linoytsaban in #11823
- [docs] Add a note of `_keep_in_fp32_modules` by @a-r-r-o-w in #11851
- [benchmarks] overhaul benchmarks by @sayakpaul in #11565
- FIX set_lora_device when target layers differ by @BenjaminBossan in #11844
- Fix Wan AccVideo/CausVid fuse_lora by @a-r-r-o-w in #11856
- [chore] deprecate blip controlnet pipeline. by @sayakpaul in #11877
- [docs] fix references in flux pipelines. by @sayakpaul in #11857
- [tests] remove tests for deprecated pipelines. by @sayakpaul in #11879
- [docs] LoRA metadata by @stevhliu in #11848
- [training ] add Kontext i2i training by @sayakpaul in #11858
- [CI] Fix big GPU test marker by @DN6 in #11786
- First Block Cache by @a-r-r-o-w in #11180
- [tests] annotate compilation test classes with bnb by @sayakpaul in #11715
- Update chroma.md by @shm4r7 in #11891
- [CI] Speed up GPU PR Tests by @DN6 in #11887
- Pin k-diffusion for CI by @sayakpaul in #11894
- [Docker] update doc builder dockerfile to include quant libs. by @sayakpaul in #11728
- [tests] Remove more deprecated tests by @sayakpaul in #11895
- [tests] mark the wanvace lora tester flaky by @sayakpaul in #11883
- [tests] add compile + offload tests for GGUF. by @sayakpaul in #11740
- feat: add multiple input image support in Flux Kontext by @Net-Mist in #11880
- Fix unique memory address when doing group-offloading with disk by @sayakpaul in #11767
- [SD3] CFG Cutoff fix and official callback by @asomoza in #11890
- The Modular Diffusers by @yiyixuxu in #9672
- [quant] QoL improvements for pipeline-level quant config by @sayakpaul in ...
Diffusers 0.34.0: New Image and Video Models, Better torch.compile Support, and more
📹 New video generation pipelines
Wan VACE
Wan VACE supports various generation techniques which achieve controllable video generation. It comes in two variants: a 1.3B model for fast iteration and prototyping, and a 14B model for high-quality generation. Some of the capabilities include:
- Control to Video (Depth, Pose, Sketch, Flow, Grayscale, Scribble, Layout, Bounding Box, etc.). Recommended library for preprocessing videos to obtain control videos: huggingface/controlnet_aux
- Image/Video to Video (first frame, last frame, starting clip, ending clip, random clips)
- Inpainting and Outpainting
- Subject to Video (faces, object, characters, etc.)
- Composition to Video (reference anything, animate anything, swap anything, expand anything, move anything, etc.)
The code snippets available in this pull request demonstrate some examples of how videos can be generated with controllability signals.
Check out the docs to learn more.
Cosmos Predict2 Video2World
Cosmos-Predict2 is a key branch of the Cosmos World Foundation Models (WFMs) ecosystem for Physical AI, specializing in future state prediction through advanced world modeling. It offers two powerful capabilities: text-to-image generation for creating high-quality images from text descriptions, and video-to-world generation for producing visual simulations from video inputs.
The Video2World model comes in 2B and 14B variants. Check out the docs to learn more.
LTX 0.9.7 and Distilled
LTX 0.9.7 and its distilled variants are the latest in the family of models released by Lightricks.
Check out the docs to learn more.
Hunyuan Video Framepack and F1
Framepack is a novel method for enabling long video generation. There are two released variants of Hunyuan Video trained using this technique. Check out the docs to learn more.
FusionX
The FusionX family of models and LoRAs, built on top of Wan2.1-14B, should already be supported. To load the model, use from_single_file():
import torch
from diffusers import WanTransformer3DModel

transformer = WanTransformer3DModel.from_single_file(
    "https://huggingface.co/vrgamedevgirl84/Wan14BT2VFusioniX/blob/main/Wan14Bi2vFusioniX_fp16.safetensors",
    torch_dtype=torch.bfloat16
)

To load the LoRAs, use load_lora_weights():
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-14B-Diffusers",
    torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights(
    "vrgamedevgirl84/Wan14BT2VFusioniX", weight_name="FusionX_LoRa/Wan2.1_T2V_14B_FusionX_LoRA.safetensors"
)

AccVideo and CausVid (only LoRAs)
AccVideo and CausVid are two novel distillation techniques that speed up the generation time of video diffusion models while preserving quality. Diffusers supports loading their extracted LoRAs with their respective models.
🌠 New image generation pipelines
Cosmos Predict2 Text2Image
Text-to-image models from the Cosmos-Predict2 release. The models come in 2B and 14B variants. Check out the docs to learn more.
Chroma
Chroma is an 8.9B-parameter model based on FLUX.1-schnell. It's fully Apache 2.0 licensed, ensuring that anyone can use, modify, and build on top of it. Check out the docs to learn more.
Thanks to @Ednaordinary for contributing it in this PR!
VisualCloze
VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning is an innovative universal image generation framework based on visual in-context learning that offers key capabilities:
- Support for various in-domain tasks
- Generalization to unseen tasks through in-context learning
- Unify multiple tasks into one step and generate both target image and intermediate results
- Support reverse-engineering conditions from target images
Check out the docs to learn more. Thanks to @lzyhha for contributing this in this PR!
Better torch.compile support
We have worked with the PyTorch team to improve how we provide torch.compile() compatibility throughout the library. More specifically, we now test widely used models like Flux for recompilation and graph-break issues that can get in the way of fully realizing torch.compile() benefits. Refer to the following links to learn more:
Additionally, users can combine offloading with compilation to get a better speed-memory trade-off. Below is an example:
Code
import torch
from diffusers import DiffusionPipeline
torch._dynamo.config.cache_size_limit = 10000
pipeline = DiffusionPipeline.from_pretrained(
"black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
pipeline.enable_model_cpu_offload()
# Compile.
pipeline.transformer.compile()
image = pipeline(
prompt="An astronaut riding a horse on Mars",
guidance_scale=0.,
height=768,
width=1360,
num_inference_steps=4,
max_sequence_length=256,
).images[0]
print(f"Max memory reserved: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")This is compatible with group offloading, too. Interested readers can check out the concerned PRs below:
You can substantially reduce memory requirements by combining quantization with offloading and then improving speed with torch.compile(). Below is an example:
Code
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
from diffusers import AutoModel, FluxPipeline
from transformers import T5EncoderModel
import torch
torch._dynamo.config.recompile_limit = 1000
quant_kwargs = {"load_in_4bit": True, "bnb_4bit_compute_dtype": torch_dtype, "bnb_4bit_quant_type": "nf4"}
text_encoder_2_quant_config = TransformersBitsAndBytesConfig(**quant_kwargs)
dit_quant_config = DiffusersBitsAndBytesConfig(**quant_kwargs)
ckpt_id = "black-forest-labs/FLUX.1-dev"
text_encoder_2 = T5EncoderModel.from_pretrained(
ckpt_id,
subfolder="text_encoder_2",
quantization_config=text_encoder_2_quant_config,
torch_dtype=torch_dtype,
)
transformer = AutoModel.from_pretrained(
ckpt_id,
subfolder="transformer",
quantization_config=dit_quant_config,
torch_dtype=torch_dtype,
)
pipe = FluxPipeline.from_pretrained(
ckpt_id,
transformer=transformer,
text_encoder_2=text_encoder_2,
torch_dtype=torch_dtype,
)
pipe.enable_model_cpu_offload()
pipe.transformer.compile()
image = pipe(
prompt="An astronaut riding a horse on Mars",
guidance_scale=3.5,
height=768,
width=1360,
num_inference_steps=28,
max_sequence_length=512,
).images[0]

Starting from bitsandbytes==0.46.0, bnb-quantized models should be fully compatible with torch.compile() without graph breaks. This means that when compiling a bnb-quantized model, users can do model.compile(fullgraph=True). This can significantly improve speed while still providing memory benefits. The figure below provides a comparison with Flux.1-Dev. Refer to this benchmarking script to learn more.
Note that for 4-bit bnb models, you currently need a PyTorch nightly build if fullgraph=True is specified during compilation.
Huge shoutout to @anijain2305 and @StrongerXi from the PyTorch team for the incredible support.
PipelineQuantizationConfig
Users can now provide a quantization config while initializing a pipeline:
import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig
pipeline_quant_config = PipelineQuantizationConfig(
quant_backend="bitsandbytes_4bit",
quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
components_to_quantize=["transformer", "text_encoder_2"],
)
pipe = DiffusionPipeline.from_pretrained(
"black-forest-labs/FLUX.1-dev",
quantization_config=pipeline_quant_config,
torch_dtype=torch.bfloat16,
).to("cuda")
image = pipe("photo of a cute dog").images[0]This reduces the barrier to entry for our users willing to use quantization without having to write too much code. Refer to the documentation to learn more about [different configurations](https://huggingface.co/docs/diffusers/main/en/quantization/overview...
v0.33.1: fix ftfy import
Diffusers 0.33.0: New Image and Video Models, Memory Optimizations, Caching Methods, Remote VAEs, New Training Scripts, and more
New Pipelines for Video Generation
Wan 2.1
Wan2.1 is a comprehensive and open suite of video foundation models that pushes the boundaries of video generation. The release includes four model variants and three pipelines for Text-to-Video, Image-to-Video, and Video-to-Video.
- Wan-AI/Wan2.1-T2V-1.3B-Diffusers
- Wan-AI/Wan2.1-T2V-14B-Diffusers
- Wan-AI/Wan2.1-I2V-14B-480P-Diffusers
- Wan-AI/Wan2.1-I2V-14B-720P-Diffusers
Check out the docs here to learn more.
LTX Video 0.9.5
LTX Video 0.9.5 is the updated version of the super-fast LTX Video model series. The latest model introduces additional conditioning options, such as keyframe-based animation and video extension (both forward and backward).
To support these additional conditioning inputs, we’ve introduced the LTXConditionPipeline and LTXVideoCondition object.
To learn more about the usage, check out the docs here.
Hunyuan Image to Video
Hunyuan utilizes a pre-trained Multimodal Large Language Model (MLLM) with a Decoder-Only architecture as the text encoder. The input image is processed by the MLLM to generate semantic image tokens. These tokens are then concatenated with the video latent tokens, enabling comprehensive full-attention computation across the combined data and seamlessly integrating information from both the image and its associated caption.
To learn more, check out the docs here.
Others
- EasyAnimateV5 (thanks to @bubbliiiing for contributing this in this PR)
- ConsisID (thanks to @SHYuanBest for contributing this in this PR)
New Pipelines for Image Generation
Sana-Sprint
SANA-Sprint is an efficient diffusion model for ultra-fast text-to-image generation. SANA-Sprint is built on a pre-trained foundation model and augmented with hybrid distillation, dramatically reducing inference steps from 20 to 1-4, rivaling the quality of models like Flux.
Shoutout to @lawrence-cj for their help and guidance on this PR.
Check out the pipeline docs of SANA-Sprint to learn more.
Lumina2
Lumina-Image-2.0 is a 2B parameter flow-based diffusion transformer for text-to-image generation released under the Apache 2.0 license.
Check out the docs to learn more. Thanks to @zhuole1025 for contributing this through this PR.
One can also LoRA fine-tune Lumina2, taking advantage of its Apache 2.0 licensing. Check out the guide for more details.
Omnigen
OmniGen is a unified image generation model that can handle multiple tasks including text-to-image, image editing, subject-driven generation, and various computer vision tasks within a single framework. The model consists of a VAE, and a single transformer based on Phi-3 that handles text and image encoding as well as the diffusion process.
Check out the docs to learn more about OmniGen. Thanks to @staoxiao for contributing OmniGen in this PR.
Others
- CogView4 (thanks to @zRzRzRzRzRzRzR for contributing CogView4 in this PR)
New Memory Optimizations
Layerwise Casting
PyTorch supports torch.float8_e4m3fn and torch.float8_e5m2 as weight storage dtypes, but they can’t be used for computation on many devices due to unimplemented kernel support.
However, you can still use these dtypes to store model weights in FP8 precision and upcast them to a widely supported dtype such as torch.float16 or torch.bfloat16 on-the-fly when the layers are used in the forward pass. This is known as layerwise weight-casting. This can potentially cut down the VRAM requirements of a model by 50%.
Code
import torch
from diffusers import CogVideoXPipeline, CogVideoXTransformer3DModel
from diffusers.utils import export_to_video
model_id = "THUDM/CogVideoX-5b"
# Load the model in bfloat16 and enable layerwise casting
transformer = CogVideoXTransformer3DModel.from_pretrained(model_id, subfolder="transformer", torch_dtype=torch.bfloat16)
transformer.enable_layerwise_casting(storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.bfloat16)
# Load the pipeline
pipe = CogVideoXPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch.bfloat16)
pipe.to("cuda")
prompt = (
"A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
"The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
"pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
"casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
"The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
"atmosphere of this unique musical performance."
)
video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
export_to_video(video, "output.mp4", fps=8)

Group Offloading
Group offloading is the middle ground between sequential and model offloading. It works by offloading groups of internal layers (either torch.nn.ModuleList or torch.nn.Sequential), which uses less memory than model-level offloading. It is also faster than sequential-level offloading because the number of device synchronizations is reduced.
On CUDA devices, we also have the option to enable layer prefetching with CUDA streams. The next layer to be executed is loaded onto the accelerator device while the current layer is being executed, which makes inference substantially faster while still keeping VRAM requirements very low. With this, we introduce the idea of overlapping computation with data transfer.
One thing to note is that using CUDA streams can cause a considerable spike in CPU RAM usage. Please ensure that the available CPU RAM is 2 times the size of the model if you choose to set use_stream=True. You can reduce CPU RAM usage by setting low_cpu_mem_usage=True. This should limit the CPU RAM used to be roughly the same as the size of the model, but will introduce slight latency in the inference process.
You can also use record_stream=True when using use_stream=True to obtain more speedups at the expense of slightly increased memory usage.
Code
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video
# Load the pipeline
onload_device = torch.device("cuda")
offload_device = torch.device("cpu")
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
# We can utilize the enable_group_offload method for Diffusers model implementations
pipe.transformer.enable_group_offload(
onload_device=onload_device,
offload_device=offload_device,
offload_type="leaf_level",
use_stream=True
)
prompt = (
"A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
"The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
"pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
"casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
"The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
"atmosphere of this unique musical performance."
)
video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
# This utilized about 14.79 GB. It can be further reduced by using tiling and using leaf_level offloading throughout the pipeline.
print(f"Max memory reserved: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
export_to_video(video, "output.mp4", fps=8)Group offloading can also be applied to non-Diffusers models such as text encoders from the transformers library.
Code
import torch
from diffusers import CogVideoXPipeline
from diffusers.hooks import apply_group_offloading
from diffusers.utils import export_to_video
# Load the pipeline
onload_device = torch.device("cuda")
offload_device = torch.device("cpu")
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
# For any other model implementations, the apply_group_offloading function can be used
apply_group_offloading(pipe.text_encoder, onload_device=onload_device, offload_type="block_level", num_blocks_per_group=2)

Remote Components
Remote components are an experimental feature designed to offload memory-intensive steps of t...
v0.32.2
Fixes for Flux Single File loading, LoRA loading for 4bit BnB Flux, Hunyuan Video
This patch release
- Fixes a regression in loading Comfy UI format single file checkpoints for Flux
- Fixes a regression in loading LoRAs with bitsandbytes 4bit quantized Flux models
- Adds `unload_lora_weights` for Flux Control
- Fixes a bug that prevents Hunyuan Video from running with batch size > 1
- Allow Hunyuan Video to load LoRAs created from the original repository code
All commits
- [Single File] Fix loading Flux Dev finetunes with Comfy Prefix by @DN6 in #10545
- [CI] Update HF Token on Fast GPU Model Tests by @DN6 #10570
- [CI] Update HF Token in Fast GPU Tests by @DN6 #10568
- Fix batch > 1 in HunyuanVideo by @hlky in #10548
- Fix HunyuanVideo produces NaN on PyTorch<2.5 by @hlky in #10482
- Fix hunyuan video attention mask dim by @a-r-r-o-w in #10454
- [LoRA] Support original format loras for HunyuanVideo by @a-r-r-o-w in #10376
- [LoRA] feat: support loading loras into 4bit quantized Flux models. by @sayakpaul in #10578
- [LoRA] clean up `load_lora_into_text_encoder()` and `fuse_lora()` copied from by @sayakpaul in #10495
- [LoRA] feat: support `unload_lora_weights()` for Flux Control. by @sayakpaul in #10206
- Fix Flux multiple Lora loading bug by @maxs-kan in #10388
- [LoRA] fix: lora unloading when using expanded Flux LoRAs. by @sayakpaul in #10397
v0.32.1
TorchAO Quantizer fixes
This patch release fixes a few bugs related to the TorchAO Quantizer introduced in v0.32.0.
- Importing Diffusers would raise an error in PyTorch versions lower than 2.3.0. This should no longer be a problem.
- Device Map does not work as expected when using the quantizer. We now raise an error if it is used. Support for using device maps with different quantization backends will be added in the near future.
- Quantization was not performed due to faulty logic. This is now fixed and better tested.
Refer to our documentation to learn more about how to use different quantization backends.
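For context, a minimal sketch of the TorchAO backend usage this patch fixes (the quantization type string and checkpoint are illustrative; see the quantization docs for the supported options):

```python
import torch
from diffusers import FluxTransformer2DModel, TorchAoConfig

# int8 weight-only quantization; other quant types are listed in the TorchAO docs.
quant_config = TorchAoConfig("int8wo")
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
```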
All commits
- make style for #10368 by @yiyixuxu in #10370
- fix test pypi installation in the release workflow by @sayakpaul in #10360
- Fix TorchAO related bugs; revert device_map changes by @a-r-r-o-w in #10371
Diffusers 0.32.0: New video pipelines, new image pipelines, new quantization backends, new training scripts, and more
This release took a while, but it has many exciting updates. It contains several new pipelines for image and video generation, new quantization backends, and more.
Going forward, to provide more transparency to the community about ongoing developments and releases in Diffusers, we will be making use of a roadmap tracker.
New Video Generation Pipelines 📹
Open video generation models are on the rise, and we’re pleased to provide comprehensive integration support for all of them. The following video pipelines are bundled in this release:
Check out this section to learn more about the fine-tuning options available for these new video models.
New Image Generation Pipelines
- SANA
- Flux Control (including Control LoRA)
- Flux Redux
- Flux Fill Inpainting / Outpainting
- Flux RF-Inversion
- SD3.5 ControlNet
- ControlNet Union XL
- SD3.5 IP Adapter
- Flux IP adapter
Important Note about the new Flux Models
We can combine the regular Flux.1 Dev LoRAs with Flux Control LoRAs, Flux Control, and Flux Fill. For example, you can enable few-step inference with Flux Fill using:
from diffusers import FluxFillPipeline
from diffusers.utils import load_image
import torch
pipe = FluxFillPipeline.from_pretrained(
"black-forest-labs/FLUX.1-Fill-dev", torch_dtype=torch.bfloat16
).to("cuda")
adapter_id = "alimama-creative/FLUX.1-Turbo-Alpha"
pipe.load_lora_weights(adapter_id)
image = load_image("https://huggingface.co/datasets/diffusers/diffusers-images-docs/resolve/main/cup.png")
mask = load_image("https://huggingface.co/datasets/diffusers/diffusers-images-docs/resolve/main/cup_mask.png")
image = pipe(
prompt="a white paper cup",
image=image,
mask_image=mask,
height=1632,
width=1232,
guidance_scale=30,
num_inference_steps=8,
max_sequence_length=512,
generator=torch.Generator("cpu").manual_seed(0)
).images[0]
image.save("flux-fill-dev.png")To learn more, check out the documentation.
Note
SANA is a small model compared to models like Flux; Sana-0.6B can be deployed on a 16GB laptop GPU and takes less than 1 second to generate a 1024×1024 resolution image. We support LoRA fine-tuning of SANA. Check out this section for more details.
Acknowledgements
- Shoutout to @lawrence-cj and @chenjy2003 for contributing SANA in this PR. SANA also features a Deep Compression Autoencoder, which was contributed by @lawrence-cj in this PR.
- Shoutout to @guiyrt for contributing SD3.5 IP Adapter in this PR.
New Quantization Backends
Please be aware of the following caveats:
- TorchAO quantized checkpoints cannot currently be serialized in `safetensors`. This may change in the future.
- GGUF currently only supports loading pre-quantized checkpoints into models in this release. Support for saving models with GGUF quantization will be added in the future.
New training scripts
This release features many new training scripts for the community to play with:
All commits
- post-release 0.31.0 by @sayakpaul in #9742
- fix bug in `require_accelerate_version_greater` by @faaany in #9746
- [Official callbacks] SDXL Controlnet CFG Cutoff by @asomoza in #9311
- [SD3-5 dreambooth lora] update model cards by @linoytsaban in #9749
- config attribute not foud error for FluxImagetoImage Pipeline for multi controlnet solved by @rshah240 in #9586
- Some minor updates to the nightly and push workflows by @sayakpaul in #9759
- [Docs] fix docstring typo in SD3 pipeline by @shenzhiy21 in #9765
- [bugfix] bugfix for npu free memory by @leisuzz in #9640
- [research_projects] add flux training script with quantization by @sayakpaul in #9754
- Add a doc for AWS Neuron in Diffusers by @JingyaHuang in #9766
- [refactor] enhance readability of flux related pipelines by @Luciennnnnnn in #9711
- Added Support of Xlabs controlnet to FluxControlNetInpaintPipeline by @SahilCarterr in #9770
- [research_projects] Update README.md to include a note about NF5 T5-xxl by @sayakpaul in #9775
- [Fix] train_dreambooth_lora_flux_advanced ValueError: unexpected save model: <class 'transformers.models.t5.modeling_t5.T5EncoderModel'> by @rootonchair in #9777
- [Fix] remove setting lr for T5 text encoder when using prodigy in flux dreambooth lora script by @biswaroop1547 in #9473
- [SD 3.5 Dreambooth LoRA] support configurable training block & layers by @linoytsaban in #9762
- [flux dreambooth lora training] make LoRA target modules configurable + small bug fix by @linoytsaban in #9646
- adds the pipeline for pixart alpha controlnet by @raulc0399 in #8857
- [core] Allegro T2V by @a-r-r-o-w in #9736
- Allegro VAE fix by @a-r-r-o-w in #9811
- [CI] add new runner for testing by @sayakpaul in #9699
- [training] fixes to the quantization training script and add AdEMAMix optimizer as an option by @sayakpaul in #9806
- [training] use the lr when using 8bit adam. by @sayakpaul in #9796
- [Tests] clean up and refactor gradient checkpointing tests by @sayakpaul in #9494
- [CI] add a big GPU marker to run memory-intensive tests separately on CI by @sayakpaul in #9691
- [LoRA] fix: lora loading when using with a device_mapped model. by @sayakpaul in #9449
- Revert "[LoRA] fix: lora loading when using with a device_mapped mode… by @yiyixuxu in #9823
- [Model Card] standardize advanced diffusion training sd15 lora by @chiral-carbon in #7613
- NPU Adaption for FLUX by @leisuzz in #9751
- Fixes EMAModel "from_pretrained" method by @SahilCarterr in #9779
- Update train_controlnet_flux.py,Fix size mismatch issue in validation by @ScilenceForest in #9679
- Handling mixed precision for dreambooth flux lora training by @icsl-Jeon in #9565
- Reduce Memory Cost in Flux Training by @leisuzz in #9829
- Add Diffusion Policy for Reinforcement Learning by @DorsaRoh in #9824
- [feat] add `load_lora_adapter()` for compatible models by @sayakpaul in #9712
- Refac training utils.py by @RogerSinghChugh in #9815
- [core] Mochi T2V by @a-r-r-o-w in #9769
- [Fix] Test of sd3 lora by @SahilCarterr in #9843
- Fix: Remove duplicated comma in distributed_inference.md by @vahidaskari in #9868
- Add new community pipeline for 'Adaptive Mask Inpainting', introduced in [ECCV2024] ComA by @jellyheadandrew in #9228
- Updated _encode_prompt_with_clip and encode_prompt in train_dreamboth_sd3 by @SahilCarterr in #9800
- [Core] introduce `controlnet` module by @sayakpaul in #8768
- [Flux] reduce explicit device transfers and typecasting in flux. by @sayakpaul in #9817
- Improve downloads of sharded variants by @DN6 in #9869
- [fix] Replaced shutil.copy with shutil.copyfile by @SahilCarterr in #9885
- Enabling gradient checkpointing in eval() mode by @MikeTkachuk in #9878
- [FIX] Fix TypeError in DreamBooth SDXL when use_dora is False by @SahilCarterr in #9879
- [Advanced LoRA v1.5] fix: gradient unscaling problem by @sayakpaul in #7018
- Revert "[Flux] reduce explicit device transfers and typecasting in flux." by @sayakpaul in #9896
- Feature IP Adapter Xformers Attention Processor by @elismasilva in #9881
- Notebooks for Community Scripts Examples by @ParagEkbote in #9905
- Fix Progress Bar Updates in SD 1.5 PAG Img2Img pipeline by @painebenjamin in #9925
- Update pipeline_flux_img2img.py by @example-git in #9928
- add de...
