3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model
Abstract
A novel 3D-aware video customization framework is presented that decouples spatial geometry from temporal motion using a 1-frame optimization approach and incorporates a visual conditioning module for enhanced texture generation.
Creating dynamic, view-consistent videos of customized subjects is highly sought after for a wide range of emerging applications, including immersive VR/AR, virtual production, and next-generation e-commerce. However, despite rapid progress in subject-driven video generation, existing methods predominantly treat subjects as 2D entities, focusing on transferring identity through single-view visual features or textual prompts. Because real-world subjects are inherently 3D, applying these 2D-centric approaches to 3D object customization reveals a fundamental limitation: they lack the comprehensive spatial priors necessary to reconstruct the 3D geometry. Consequently, when synthesizing novel views, they must rely on generating plausible but arbitrary details for unseen regions, rather than preserving the true 3D identity. Achieving genuine 3D-aware customization remains challenging due to the scarcity of multi-view video datasets. While one might attempt to fine-tune models on limited video sequences, this often leads to temporal overfitting. To resolve these issues, we introduce a novel framework for 3D-aware video customization, comprising 3DreamBooth and 3Dapter. 3DreamBooth decouples spatial geometry from temporal motion through a 1-frame optimization paradigm. By restricting updates to spatial representations, it effectively bakes a robust 3D prior into the model without the need for exhaustive video-based training. To enhance fine-grained textures and accelerate convergence, we incorporate 3Dapter, a visual conditioning module. Following single-view pre-training, 3Dapter undergoes multi-view joint optimization with the main generation branch via an asymmetrical conditioning strategy. This design allows the module to act as a dynamic selective router, querying view-specific geometric hints from a minimal reference set. Project page: https://ko-lani.github.io/3DreamBooth/
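The abstract's "1-frame optimization" idea — updating only spatial representations while leaving temporal parameters untouched — can be illustrated with a minimal sketch. All names here (`spatial_attn.w`, `temporal_attn.w`, the name-based split) are hypothetical stand-ins, not the paper's actual implementation; the point is only the parameter partition during the single-frame fine-tuning step.

```python
# Hedged sketch of 1-frame optimization: fine-tune only spatial
# parameters on single frames so the temporal motion prior of the
# pretrained video model is preserved. Parameter names and the
# name-based tagging convention are illustrative assumptions.

def partition_params(named_params):
    """Split parameters into spatial (to be updated) and temporal (frozen)."""
    spatial, temporal = {}, {}
    for name, value in named_params.items():
        if "temporal" in name:  # assumption: temporal layers are tagged by name
            temporal[name] = value
        else:
            spatial[name] = value
    return spatial, temporal

def one_frame_step(params, grads, lr=0.1):
    """Apply one gradient step to spatial params only (1-frame objective)."""
    spatial, temporal = partition_params(params)
    updated = {n: v - lr * grads[n] for n, v in spatial.items()}
    updated.update(temporal)  # temporal weights pass through unchanged
    return updated

params = {"spatial_attn.w": 1.0, "temporal_attn.w": 2.0}
grads = {"spatial_attn.w": 0.5, "temporal_attn.w": 0.5}
new_params = one_frame_step(params, grads)
# spatial weight moves; temporal weight is untouched
```

In a real diffusion backbone this partition would typically be expressed by setting `requires_grad = False` on temporal modules before fine-tuning, but the dictionary version above keeps the mechanism visible.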
Community
Key idea:
- 3DreamBooth treats subjects as 3D entities, not 2D — enabling multiview-consistent video generation by baking a spatial prior through 1-frame optimization.
Highlights:
- 3DreamBooth decouples 3D geometry from temporal motion via 1-frame spatial optimization
- 3Dapter accelerates convergence with multi-view joint attention using shared weights
- Outperforms single-view baselines (VACE, Phantom) on identity preservation and 3D geometric fidelity
- Model-agnostic: works on both HunyuanVideo and WanVideo 2.1
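The "dynamic selective router" behavior attributed to 3Dapter — querying view-specific geometric hints from a minimal reference set — resembles softmax-weighted retrieval over reference-view embeddings. The sketch below is an assumption-laden illustration of that retrieval pattern, not the paper's architecture: the query/key embeddings and hint vectors are invented placeholders.

```python
# Hedged sketch of selective routing over a small reference set:
# score the query view against each reference view's key, softmax the
# scores, and return the weighted mixture of per-view hints. This is
# an illustrative stand-in for 3Dapter's routing, not its real code.
import math

def route_view_hints(query, ref_keys, ref_hints):
    """Softmax-weighted retrieval of view-specific hints."""
    # Dot-product similarity between the query view and each reference key.
    scores = [sum(q * k for q, k in zip(query, key)) for key in ref_keys]
    # Numerically stable softmax over the reference set.
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    # Mix the hint vectors according to the routing weights.
    dim = len(ref_hints[0])
    return [sum(w * h[d] for w, h in zip(weights, ref_hints)) for d in range(dim)]

# A query aligned with the first reference view routes almost all
# weight to that view's hint.
hint = route_view_hints(
    query=[1.0, 0.0],
    ref_keys=[[10.0, 0.0], [0.0, 10.0]],
    ref_hints=[[1.0], [0.0]],
)
```

This toy also makes the commenter's concern below concrete: if no reference key is close to a novel query view, the softmax spreads weight over mismatched hints, which is exactly the degradation mode one would want an ablation to probe.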
Framing 3Dapter as a dynamic selective router that queries view-specific geometric hints from a small reference set is the standout idea here. The asymmetrical conditioning, with a single-view pre-training stage followed by multi-view joint optimization, feels like a clean split between learning 3D identity and motion priors. My concern is how sensitive this routing is to the reference set's coverage of viewpoints: would the method degrade gracefully if a novel angle is underrepresented? The arXivLens breakdown helped me parse the method details; I found a solid walkthrough here: https://arxivlens.com/PaperView/Details/3dreambooth-high-fidelity-3d-subject-driven-video-generation-model-5322-5abc5bbb. A small ablation showing what happens if 3Dapter is frozen during joint fine-tuning would also help isolate its contribution to geometry fidelity versus the diffusion backbone.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video Generation (2026)
- Tri-Prompting: Video Diffusion with Unified Control over Scene, Subject, and Motion (2026)
- Real-Time Human Frontal View Synthesis from a Single Image (2026)
- 3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation (2026)
- WildActor: Unconstrained Identity-Preserving Video Generation (2026)
- UCM: Unifying Camera Control and Memory with Time-aware Positional Encoding Warping for World Models (2026)
- HECTOR: Hybrid Editable Compositional Object References for Video Generation (2026)