Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision
Abstract
Vanast is a unified framework that generates garment-transferred human animation videos by combining image-based virtual try-on and pose-driven animation in a single process, addressing identity drift and garment distortion through synthetic triplet supervision and a dual-module architecture.
We present Vanast, a unified framework that generates garment-transferred human animation videos directly from a single human image, garment images, and a pose guidance video. Conventional two-stage pipelines treat image-based virtual try-on and pose-driven animation as separate processes, which often results in identity drift, garment distortion, and front-back inconsistency. Our model addresses these issues by performing the entire process in a single unified step to achieve coherent synthesis. To enable this setting, we construct large-scale triplet supervision. Our data generation pipeline includes generating identity-preserving human images in alternative outfits that differ from garment catalog images, capturing full upper and lower garment triplets to overcome the single-garment-posed video pair limitation, and assembling diverse in-the-wild triplets without requiring garment catalog images. We further introduce a Dual Module architecture for video diffusion transformers to stabilize training, preserve pretrained generative quality, and improve garment accuracy, pose adherence, and identity preservation while supporting zero-shot garment interpolation. Together, these contributions allow Vanast to produce high-fidelity, identity-consistent animation across a wide range of garment types.
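The triplet supervision described above pairs a human image, garment images, and a pose guidance video. A minimal sketch of one such supervision record follows; the class, field names, and file paths are illustrative assumptions, not the paper's actual data schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical sketch of one supervision triplet. Field names and the
# path-based representation are assumptions for illustration only.
@dataclass
class TryOnTriplet:
    person_image: str                   # identity-preserving human image (path)
    garment_images: List[str]           # upper and/or lower garment images
    pose_video: str                     # pose guidance video (path)
    target_video: Optional[str] = None  # ground-truth animation, if available

    def is_full_outfit(self) -> bool:
        # The abstract mentions full upper-and-lower garment triplets,
        # beyond single-garment pairs.
        return len(self.garment_images) >= 2

# Example: a full upper+lower triplet (hypothetical file names)
sample = TryOnTriplet(
    person_image="person_001.png",
    garment_images=["top_014.png", "skirt_007.png"],
    pose_video="dance_clip_03.mp4",
)
print(sample.is_full_outfit())  # True
```

A single-garment sample would simply carry one entry in `garment_images`, which is the limitation the paper's full-outfit triplets are meant to overcome.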
Community
Given a human image and one or more garment images, our method generates virtual try-on with human image animation conditioned on a pose video while preserving identity.
the dual module fusion in vanast—splitting garment transfer from pose-guided animation inside a video diffusion transformer—feels like a crisp way to keep pretrained generative quality while aligning garments to motion. i'd love to see an ablation where you remove the garment-transfer stream to quantify how much identity and garment fidelity actually come from the motion path versus the garment conditioning. the synthetic triplet supervision is bold, but i wonder how the approach handles tricky garments with non-rigid drape or accessories that weren’t well represented in the triplets. the arxivlens breakdown helped me parse the method details, especially the multi-level conditioning, and it's a nice reference if you’re planning a reproduction pass: https://arxivlens.com/PaperView/Details/vanast-virtual-try-on-with-human-image-animation-via-synthetic-triplet-supervision-942-3b31657a
awesome!
Vanast is a unified single-step framework for garment-transferred human animation that replaces the conventional two-stage pipeline of virtual try-on followed by animation. By constructing large-scale triplet supervision data and introducing a Dual Module architecture for video diffusion transformers, Vanast preserves identity and garment accuracy while supporting zero-shot garment interpolation.
Key Idea
Existing approaches treat virtual try-on and human animation as separate stages, leading to error accumulation and inconsistent results. Vanast unifies both tasks into a single forward pass through a video diffusion transformer, directly generating an animated video of a person wearing a target garment in a target pose sequence — without any intermediate try-on image.
Method / Approach
The method relies on two key components. First, a large-scale synthetic triplet dataset is constructed, where each sample contains a reference person image, a target garment, and a target pose sequence. This provides the dense supervision needed for end-to-end training. Second, a Dual Module architecture is integrated into the video diffusion transformer — one branch encodes identity and appearance, the other encodes garment detail — enabling the model to disentangle and faithfully reconstruct both in the output video.
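The two-branch conditioning described above can be sketched abstractly: one stream encodes identity and appearance, the other encodes garment detail, and both are fused into the backbone's token stream. The branch functions, dimensions, and the additive fusion rule below are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal pure-Python sketch of the Dual Module idea: two conditioning
# branches produce features that are fused into each backbone token.
# All names and the elementwise-add fusion are assumptions for illustration.

def identity_branch(person_feat):
    # stand-in for an identity/appearance encoder
    return [2.0 * x for x in person_feat]

def garment_branch(garment_feat):
    # stand-in for a garment-detail encoder
    return [x + 1.0 for x in garment_feat]

def dual_module_fuse(tokens, person_feat, garment_feat):
    """Add both conditioning streams onto every backbone token."""
    ident = identity_branch(person_feat)
    garm = garment_branch(garment_feat)
    return [
        [t + i + g for t, i, g in zip(tok, ident, garm)]
        for tok in tokens
    ]

tokens = [[0.0, 0.0], [1.0, 1.0]]  # two backbone tokens, dim 2
fused = dual_module_fuse(tokens, [0.5, 1.0], [1.0, 2.0])
print(fused)  # [[3.0, 5.0], [4.0, 6.0]]
```

Keeping the two streams separate until a late fusion point is one plausible reading of how the design disentangles identity from garment detail while leaving the pretrained backbone largely intact.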
Results
Vanast achieves state-of-the-art results on garment-transferred human animation benchmarks, outperforming two-stage baselines in both garment fidelity and motion quality. The unified design also enables zero-shot interpolation between garments not seen during training.
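One natural way to realize zero-shot garment interpolation is a linear blend between two garment conditioning embeddings before generation. The paper does not specify its mechanism, so the sketch below, including the embeddings and the lerp, is an assumption for illustration.

```python
# Hedged sketch: linearly interpolate between two garment conditioning
# embeddings. Embedding values are hypothetical.

def lerp(a, b, t):
    """Linear interpolation between embeddings a and b at weight t in [0, 1]."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]

emb_a = [1.0, 0.0, 2.0]   # hypothetical embedding of garment A
emb_b = [3.0, 4.0, 0.0]   # hypothetical embedding of garment B
mid = lerp(emb_a, emb_b, 0.5)
print(mid)  # [2.0, 2.0, 1.0]
```

Sweeping `t` from 0 to 1 would then condition the generator on a smooth path between the two garments, which matches the zero-shot interpolation behavior claimed above.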
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- PROMO: Promptable Outfitting for Efficient High-Fidelity Virtual Try-On (2026)
- OmniVTON++: Training-Free Universal Virtual Try-On with Principal Pose Guidance (2026)
- 3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model (2026)
- DISPLAY: Directable Human-Object Interaction Video Generation via Sparse Motion Guidance and Multi-Task Auxiliary (2026)
- Dress-ED: Instruction-Guided Editing for Virtual Try-On and Try-Off (2026)
- HVG-3D: Bridging Real and Simulation Domains for 3D-Conditional Hand-Object Interaction Video Synthesis (2026)
- Controllable Complex Human Motion Video Generation via Text-to-Skeleton Cascades (2026)
Get this paper in your agent:
hf papers read 2604.04934
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash


