3DreamBooth: 3D Video Generation from Object Photos
3DreamBooth demo: subject-driven, 3D-aware video generation from reference object photos
Researchers at Yonsei University and Sungkyunkwan University published 3DreamBooth on March 19, 2026. The framework generates videos of specific objects by treating them as three-dimensional shapes rather than flat images. Existing video customization methods train on 2D representations and lose consistency when the camera moves or the subject rotates. 3DreamBooth encodes the object's spatial geometry during training, so shape and texture hold across every frame.
The Problem with Flat Object References
Most video models that accept a reference image learn what a subject looks like from one angle. When the camera moves or the object is handled, the model has no spatial representation to draw from and the output degrades. This is especially visible in product videos, where an item needs to be shown from multiple sides or held by a person.
3DreamBooth's authors argue that subjects exist in three-dimensional space and that treating them as 2D entities is the root cause of view inconsistency in generated video. The system uses a token identifier written as [v] to represent the specific object. Users supply reference photos from several angles, train a compact LoRA adapter, and then write prompts using [v] in place of the object name.
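At prompt time, the [v] convention amounts to simple templating: the placeholder is swapped for whatever token string the LoRA adapter was trained to associate with the subject. A minimal sketch of that convention, assuming a hypothetical learned token "<v*>" (the source does not specify the actual token format):

```python
def bind_subject(prompt: str, subject_token: str = "<v*>") -> str:
    """Replace the [v] placeholder with the learned subject token.

    The token string "<v*>" is a hypothetical stand-in for whatever
    identifier the trained LoRA adapter recognizes.
    """
    return prompt.replace("[v]", subject_token)

# Example: the red-carpet prompt from the demos below
prompt = "An actress on a Hollywood red carpet raising a [v] bag"
print(bind_subject(prompt))
# An actress on a Hollywood red carpet raising a <v*> bag
```

Prompts without the placeholder pass through unchanged, so the same templating step can be applied uniformly to every prompt.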
How 3DreamBooth Works
The method separates training into two components. The first is spatial geometry optimization: a single-frame training paradigm that isolates the object's 3D structure without requiring large multiview video datasets. By restricting parameter updates to spatial representations only, the system avoids temporal overfitting on limited training clips.
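Restricting updates to spatial parameters can be sketched as a simple filter over a model's parameter names: spatial-attention weights receive gradients, temporal layers stay frozen. The naming scheme below is hypothetical and not taken from the paper's code; it only illustrates the freeze/train split.

```python
def select_trainable(param_names):
    """Keep only spatial-layer parameters trainable.

    Temporal layers stay frozen so that training on single frames
    cannot distort the model's learned motion behavior.
    Parameter names here follow a hypothetical convention.
    """
    return [n for n in param_names if "spatial" in n]

names = [
    "blocks.0.spatial_attn.q_proj",
    "blocks.0.temporal_attn.q_proj",
    "blocks.1.spatial_attn.k_proj",
]
print(select_trainable(names))
# ['blocks.0.spatial_attn.q_proj', 'blocks.1.spatial_attn.k_proj']
```

In a real training loop, the same predicate would set `requires_grad` per parameter rather than return a list.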
The second component is 3Dapter, a visual conditioning module that routes geometric information from multiview reference images into the generation process using shared attention weights. This asymmetric conditioning approach lets the model generalize from a single view during pretraining while drawing on multiple input angles at inference time.
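The key property of this asymmetric design is that shared key/value projection weights make the number of reference views a free variable: adding views only lengthens the key/value sequence the generator attends over. A minimal cross-attention sketch under that assumption (shapes and weight sharing are illustrative, not the paper's exact architecture):

```python
import numpy as np

def attend(query, ref_feats, Wk, Wv):
    """Cross-attend from generation features to reference-image features.

    Because Wk and Wv are shared across all views, a model trained
    with one reference view can consume several views at inference:
    extra views simply concatenate into a longer key/value sequence.
    """
    keys = ref_feats @ Wk                                   # (n_tokens, d)
    vals = ref_feats @ Wv
    scores = query @ keys.T / np.sqrt(keys.shape[-1])       # scaled dot-product
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)               # softmax over tokens
    return weights @ vals

rng = np.random.default_rng(0)
d = 4
Wk, Wv = rng.standard_normal((d, d)), rng.standard_normal((d, d))
query = rng.standard_normal((2, d))

one_view = rng.standard_normal((8, d))                      # 8 tokens, 1 view
three_views = np.concatenate([one_view] * 3, axis=0)        # 24 tokens, 3 views

# Same weights handle both cases; only the sequence length changes.
assert attend(query, one_view, Wk, Wv).shape == (2, d)
assert attend(query, three_views, Wk, Wv).shape == (2, d)
```

The output shape depends only on the query, which is why training on a single view does not cap the number of views usable at inference.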
The framework runs on Hunyuan Video as its primary base model. The researchers also demonstrated compatibility with WanVideo 2.1 at 720p resolution, confirming the method is not tied to one architecture.
Benchmark Results
The team introduced 3D-CustomBench, an evaluation suite measuring subject fidelity in generated video. 3DreamBooth scores 0.8871 on CLIP-I and 0.7420 on DINO-I for visual similarity. In human evaluation, raters scored shape accuracy at 4.80 out of 5, compared to 4.39 for the best single-view baseline. Color fidelity reached 4.53 versus 4.09, and fine-grained detail scored 4.04 versus 3.35.
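For readers unfamiliar with the metrics: CLIP-I and DINO-I both score visual similarity as the mean cosine similarity between embeddings of generated frames and reference images, differing only in the image encoder used (CLIP vs. DINO). A sketch of the score computation, with random stand-ins for real encoder outputs:

```python
import numpy as np

def image_similarity_score(gen_embs, ref_embs):
    """CLIP-I/DINO-I-style score: mean pairwise cosine similarity
    between generated-frame and reference-image embeddings.

    Real evaluation would obtain gen_embs and ref_embs from a CLIP
    or DINO image encoder; arrays here are illustrative stand-ins.
    """
    g = gen_embs / np.linalg.norm(gen_embs, axis=-1, keepdims=True)
    r = ref_embs / np.linalg.norm(ref_embs, axis=-1, keepdims=True)
    return float((g @ r.T).mean())

# Identical embeddings yield a perfect score of 1.0
e = np.ones((3, 8))
print(image_similarity_score(e, e))  # 1.0
```

Scores like 0.8871 (CLIP-I) therefore indicate high average agreement between generated frames and the reference photos in embedding space.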
Product Video Demos
The three examples below show 3DreamBooth applied to commercial product scenarios. Each shows the reference input photos, the prompt used to direct the scene, and the generated video output.
Silver Bag on the Red Carpet
Prompt: An actress on a Hollywood red carpet raising a [v] bag and peering over it into the camera.
Generated output: silver bag on red carpet scene
Digital Watch in an Autumn Setting
Prompt: A video of hands rotating a [v] watch over autumn leaves, rustic stone wall and red apples behind.
Generated output: watch rotating in autumn setting
Snack Bag in a Supermarket
Prompt: A woman reviewing a [v] snack bag in a bright supermarket aisle, filmed by a foreground cameraman.
Generated output: snack bag product review in supermarket
What This Means for Filmmakers and Brands
Product video is one of the highest volume use cases for AI video generation today. Most tools work from a single reference image and hold up for static shots, but degrade when a product needs to be held, rotated, or shown from multiple sides. 3DreamBooth's spatial encoding holds geometry across that motion, which matters for anything involving hands interacting with an object or camera moves around a product.
For virtual production, the same logic applies to props and set dressings that must appear consistently across multiple shots. A studio can generate spatially consistent prop video for previz, lookbooks, and client presentations without shipping physical assets. The approach also connects to related research in reference image video generation, such as BindWeave's character consistency method and the First Frame Go video customization technique, both of which tackle the problem of binding a specific visual identity to generated video output.
Generate product and creative videos with AI FILMS Studio.
Sources
arXiv: 3DreamBooth: Subject-Driven 3D-Aware Video Customization
GitHub: Ko-Lani/3DreamBooth
Project Page: ko-lani.github.io/3DreamBooth

