3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model
Abstract
A novel 3D-aware video customization framework is presented that decouples spatial geometry from temporal motion using a 1-frame optimization approach and incorporates a visual conditioning module for enhanced texture generation.
Creating dynamic, view-consistent videos of customized subjects is highly sought after for a wide range of emerging applications, including immersive VR/AR, virtual production, and next-generation e-commerce. However, despite rapid progress in subject-driven video generation, existing methods predominantly treat subjects as 2D entities, focusing on transferring identity through single-view visual features or textual prompts. Because real-world subjects are inherently 3D, applying these 2D-centric approaches to 3D object customization reveals a fundamental limitation: they lack the comprehensive spatial priors necessary to reconstruct the 3D geometry. Consequently, when synthesizing novel views, they must rely on generating plausible but arbitrary details for unseen regions, rather than preserving the true 3D identity. Achieving genuine 3D-aware customization remains challenging due to the scarcity of multi-view video datasets. While one might attempt to fine-tune models on limited video sequences, this often leads to temporal overfitting. To resolve these issues, we introduce a novel framework for 3D-aware video customization, comprising 3DreamBooth and 3Dapter. 3DreamBooth decouples spatial geometry from temporal motion through a 1-frame optimization paradigm. By restricting updates to spatial representations, it effectively bakes a robust 3D prior into the model without the need for exhaustive video-based training. To enhance fine-grained textures and accelerate convergence, we incorporate 3Dapter, a visual conditioning module. Following single-view pre-training, 3Dapter undergoes multi-view joint optimization with the main generation branch via an asymmetrical conditioning strategy. This design allows the module to act as a dynamic selective router, querying view-specific geometric hints from a minimal reference set. Project page: https://ko-lani.github.io/3DreamBooth/
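The abstract's "1-frame optimization" idea — updating only spatial representations while leaving temporal parameters untouched — can be illustrated with a minimal sketch. All names here (`spatial_attn.w`, `temporal_attn.w`, the name-based split) are hypothetical stand-ins, not the paper's actual implementation; the point is only the parameter partition during the single-frame fine-tuning step.

```python
# Hedged sketch of 1-frame optimization: fine-tune only spatial
# parameters on single frames so the temporal motion prior of the
# pretrained video model is preserved. Parameter names and the
# name-based tagging convention are illustrative assumptions.

def partition_params(named_params):
    """Split parameters into spatial (to be updated) and temporal (frozen)."""
    spatial, temporal = {}, {}
    for name, value in named_params.items():
        if "temporal" in name:  # assumption: temporal layers are tagged by name
            temporal[name] = value
        else:
            spatial[name] = value
    return spatial, temporal

def one_frame_step(params, grads, lr=0.1):
    """Apply one gradient step to spatial params only (1-frame objective)."""
    spatial, temporal = partition_params(params)
    updated = {n: v - lr * grads[n] for n, v in spatial.items()}
    updated.update(temporal)  # temporal weights pass through unchanged
    return updated

params = {"spatial_attn.w": 1.0, "temporal_attn.w": 2.0}
grads = {"spatial_attn.w": 0.5, "temporal_attn.w": 0.5}
new_params = one_frame_step(params, grads)
# spatial weight moves; temporal weight is untouched
```

In a real diffusion backbone this partition would typically be expressed by setting `requires_grad = False` on temporal modules before fine-tuning, but the dictionary version above keeps the mechanism visible.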
Community
Key idea:
- 3DreamBooth treats subjects as 3D entities, not 2D — enabling multiview-consistent video generation by baking a spatial prior through 1-frame optimization.
Highlights:
- 3DreamBooth decouples 3D geometry from temporal motion via 1-frame spatial optimization
- 3Dapter accelerates convergence with multi-view joint attention using shared weights
- Outperforms single-view baselines (VACE, Phantom) on identity preservation and 3D geometric fidelity
- Model-agnostic: works on both HunyuanVideo and WanVideo 2.1
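The "dynamic selective router" behavior attributed to 3Dapter — querying view-specific geometric hints from a minimal reference set — resembles softmax-weighted retrieval over reference-view embeddings. The sketch below is an assumption-laden illustration of that retrieval pattern, not the paper's architecture: the query/key embeddings and hint vectors are invented placeholders.

```python
# Hedged sketch of selective routing over a small reference set:
# score the query view against each reference view's key, softmax the
# scores, and return the weighted mixture of per-view hints. This is
# an illustrative stand-in for 3Dapter's routing, not its real code.
import math

def route_view_hints(query, ref_keys, ref_hints):
    """Softmax-weighted retrieval of view-specific hints."""
    # Dot-product similarity between the query view and each reference key.
    scores = [sum(q * k for q, k in zip(query, key)) for key in ref_keys]
    # Numerically stable softmax over the reference set.
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    # Mix the hint vectors according to the routing weights.
    dim = len(ref_hints[0])
    return [sum(w * h[d] for w, h in zip(weights, ref_hints)) for d in range(dim)]

# A query aligned with the first reference view routes almost all
# weight to that view's hint.
hint = route_view_hints(
    query=[1.0, 0.0],
    ref_keys=[[10.0, 0.0], [0.0, 10.0]],
    ref_hints=[[1.0], [0.0]],
)
```

This toy also makes the commenter's concern below concrete: if no reference key is close to a novel query view, the softmax spreads weight over mismatched hints, which is exactly the degradation mode one would want an ablation to probe.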
Framing 3Dapter as a dynamic selective router that queries view-specific geometric hints from a small reference set is the standout idea here. The asymmetrical conditioning, with a single-view pre-training stage followed by multi-view joint optimization, feels like a clean split between learning 3D identity and motion priors. My concern is how sensitive this routing is to the reference set's coverage of viewpoints: would the method degrade gracefully if a novel angle is underrepresented? The arXivLens breakdown helped me parse the method details; I found a solid walkthrough here: https://arxivlens.com/PaperView/Details/3dreambooth-high-fidelity-3d-subject-driven-video-generation-model-5322-5abc5bbb. A small ablation showing what happens if 3Dapter is frozen during joint fine-tuning would also help isolate its contribution to geometry fidelity versus the diffusion backbone.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video Generation (2026)
- Tri-Prompting: Video Diffusion with Unified Control over Scene, Subject, and Motion (2026)
- Real-Time Human Frontal View Synthesis from a Single Image (2026)
- 3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation (2026)
- WildActor: Unconstrained Identity-Preserving Video Generation (2026)
- UCM: Unifying Camera Control and Memory with Time-aware Positional Encoding Warping for World Models (2026)
- HECTOR: Hybrid Editable Compositional Object References for Video Generation (2026)