Fantasy AIGC Family is an open-source initiative exploring Human-centric AI, World Modeling, and Human-World Interaction, aiming to bridge perception, understanding, and generation in the real and digital worlds.
- 🏛 Jan 2026 – FantasyVLN has been accepted by CVPR 2026.
- 🏛 Jan 2026 – FantasyWorld has been accepted by ICLR 2026.
- 📢 Jan 2026 – We released the training and inference code and model weights of FantasyVLN.
- 🏆 Dec 2025 – FantasyWorld ranked 1st on the WorldScore Leaderboard (maintained by Stanford Prof. Fei-Fei Li's team), validating our approach against global state-of-the-art models.
- 🏛 Nov 2025 – Two papers from our family, FantasyTalking2 and FantasyHSI, have been accepted to AAAI 2026.
- 🏛 Jul 2025 – FantasyTalking has been accepted by ACM MM 2025.
- 📢 Apr 2025 – We released the inference code and model weights of FantasyTalking and FantasyID.
A unified multimodal Chain-of-Thought (CoT) reasoning framework that internalizes the inference capabilities of world models into the vision-and-language navigation (VLN) architecture, enabling efficient and precise navigation based on natural language instructions and visual observations.
Corresponds to the "Worlds" dimension. A unified world model integrating video priors and geometric grounding for synthesizing explorable and geometrically consistent 3D scenes. It emphasizes spatiotemporal consistency driven by Action and serves as a verifiable structural anchor for spatial intelligence.
The first Wan-based high-fidelity audio-driven avatar system that synchronizes facial expressions, lip motion, and body gestures in dynamic scenes through dual-stage audio-visual alignment and controllable motion modulation.
A novel Timestep-Layer Adaptive Multi-Expert Preference Optimization (TLPO) method that enhances the quality of audio-driven avatars along three dimensions: lip-sync, motion naturalness, and visual quality.
A novel expression-driven video-generation method that pairs emotion-enhanced learning with masked cross-attention, enabling the creation of high-quality, richly expressive animations for both single and multi-portrait scenarios.
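To make the masked cross-attention idea named above concrete, here is a minimal PyTorch sketch showing how each portrait's video tokens can be restricted to its own expression features. This is an illustration only, not the released implementation: the tensor shapes, the `region_mask` convention, and the omission of learned q/k/v projections are assumptions made for brevity.

```python
import torch
import torch.nn.functional as F


def masked_cross_attention(video_tokens, expr_tokens, region_mask, num_heads=8):
    """Cross-attention from video tokens (queries) to expression features
    (keys/values), restricted by a boolean mask.

    video_tokens: (B, Nq, D)   latent video tokens
    expr_tokens:  (B, Nk, D)   per-portrait expression features
    region_mask:  (B, Nq, Nk)  True where a query may attend to a key, e.g.
                               tokens of portrait i see only expression i.
    Assumes every query row has at least one allowed key.
    """
    B, Nq, D = video_tokens.shape
    Nk = expr_tokens.shape[1]
    d_head = D // num_heads

    # Split into heads; a real block would apply learned q/k/v projections first.
    q = video_tokens.view(B, Nq, num_heads, d_head).transpose(1, 2)  # (B, H, Nq, d)
    k = expr_tokens.view(B, Nk, num_heads, d_head).transpose(1, 2)   # (B, H, Nk, d)
    v = k

    scores = (q @ k.transpose(-2, -1)) / d_head ** 0.5               # (B, H, Nq, Nk)
    scores = scores.masked_fill(~region_mask[:, None], float("-inf"))
    out = F.softmax(scores, dim=-1) @ v                              # (B, H, Nq, d)
    return out.transpose(1, 2).reshape(B, Nq, D)


# Example: two portraits, each half of the video tokens attends to its own features.
video = torch.randn(1, 8, 64)
expr = torch.randn(1, 4, 64)
mask = torch.zeros(1, 8, 4, dtype=torch.bool)
mask[:, :4, :2] = True   # portrait 1
mask[:, 4:, 2:] = True   # portrait 2
fused = masked_cross_attention(video, expr, mask, num_heads=4)
```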
Corresponds to the "Interaction" dimension. A graph-based multi-agent framework that grounds video generation within 3D world dynamics. It unifies the action space with a broader interaction loop, transforming video generation from a content endpoint into a control channel for interactive systems.
A tuning-free text-to-video model that leverages 3D facial priors, multi-view augmentation, and layer-aware guidance injection to deliver dynamic, identity-preserving video generation.
- Giving Back to the Community: In our daily work, we benefit immensely from the resources, expertise, and support of the open source community, and we aim to give back by making our own projects open source.
- Attracting More Contributors: By open sourcing our code, we invite developers worldwide to collaborate—making our models smarter, our engineering more robust, and extending benefits to even more users.
- Building an Open Ecosystem: We believe that open source brings together diverse expertise to create a collaborative innovation platform—driving technological progress, industry growth, and broader societal impact.