3DreamBooth: 3D Video Generation from Object Photos
3DreamBooth demo: subject-driven, 3D-aware video generation from reference object photos
Researchers at Yonsei University and Sungkyunkwan University published 3DreamBooth on March 19, 2026. The framework generates videos of specific objects by treating them as three-dimensional shapes rather than flat images. Existing video customization methods train on 2D representations and lose consistency when the camera moves or the subject rotates. 3DreamBooth encodes the object's spatial geometry during training, so shape and texture hold across every frame.
The Problem with Flat Object References
Most video models that accept a reference image learn what a subject looks like from one angle. When the camera moves or the object is handled, the model has no spatial representation to draw from and the output degrades. This is especially visible in product videos, where an item needs to be shown from multiple sides or held by a person.
3DreamBooth's authors argue that subjects exist in three-dimensional space and that treating them as 2D entities is the root cause of view inconsistency in generated video. The system uses a token identifier written as [v] to represent the specific object. Users supply reference photos from several angles, train a compact LoRA adapter, and then write prompts using [v] in place of the object name.
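At prompt time, the [v] convention amounts to simple templating: the placeholder is swapped for whatever token string the LoRA adapter was trained to associate with the subject. A minimal sketch of that convention, assuming a hypothetical learned token "<v*>" (the source does not specify the actual token format):

```python
def bind_subject(prompt: str, subject_token: str = "<v*>") -> str:
    """Replace the [v] placeholder with the learned subject token.

    The token string "<v*>" is a hypothetical stand-in for whatever
    identifier the trained LoRA adapter recognizes.
    """
    return prompt.replace("[v]", subject_token)

# Example: the red-carpet prompt from the demos below
prompt = "An actress on a Hollywood red carpet raising a [v] bag"
print(bind_subject(prompt))
# An actress on a Hollywood red carpet raising a <v*> bag
```

Prompts without the placeholder pass through unchanged, so the same templating step can be applied uniformly to every prompt.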
How 3DreamBooth Works
The method separates training into two components. The first is spatial geometry optimization: a single-frame training paradigm that isolates the object's 3D structure without requiring large multiview video datasets. By restricting parameter updates to spatial representations only, the system avoids temporal overfitting on limited training clips.
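Restricting updates to spatial parameters can be sketched as a simple filter over a model's parameter names: spatial-attention weights receive gradients, temporal layers stay frozen. The naming scheme below is hypothetical and not taken from the paper's code; it only illustrates the freeze/train split.

```python
def select_trainable(param_names):
    """Keep only spatial-layer parameters trainable.

    Temporal layers stay frozen so that training on single frames
    cannot distort the model's learned motion behavior.
    Parameter names here follow a hypothetical convention.
    """
    return [n for n in param_names if "spatial" in n]

names = [
    "blocks.0.spatial_attn.q_proj",
    "blocks.0.temporal_attn.q_proj",
    "blocks.1.spatial_attn.k_proj",
]
print(select_trainable(names))
# ['blocks.0.spatial_attn.q_proj', 'blocks.1.spatial_attn.k_proj']
```

In a real training loop, the same predicate would set `requires_grad` per parameter rather than return a list.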
The second component is 3Dapter, a visual conditioning module that routes geometric information from multiview reference images into the generation process using shared attention weights. This asymmetric conditioning approach lets the model generalize from a single view during pretraining while drawing on multiple input angles at inference time.
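The key property of this asymmetric design is that shared key/value projection weights make the number of reference views a free variable: adding views only lengthens the key/value sequence the generator attends over. A minimal cross-attention sketch under that assumption (shapes and weight sharing are illustrative, not the paper's exact architecture):

```python
import numpy as np

def attend(query, ref_feats, Wk, Wv):
    """Cross-attend from generation features to reference-image features.

    Because Wk and Wv are shared across all views, a model trained
    with one reference view can consume several views at inference:
    extra views simply concatenate into a longer key/value sequence.
    """
    keys = ref_feats @ Wk                                   # (n_tokens, d)
    vals = ref_feats @ Wv
    scores = query @ keys.T / np.sqrt(keys.shape[-1])       # scaled dot-product
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)               # softmax over tokens
    return weights @ vals

rng = np.random.default_rng(0)
d = 4
Wk, Wv = rng.standard_normal((d, d)), rng.standard_normal((d, d))
query = rng.standard_normal((2, d))

one_view = rng.standard_normal((8, d))                      # 8 tokens, 1 view
three_views = np.concatenate([one_view] * 3, axis=0)        # 24 tokens, 3 views

# Same weights handle both cases; only the sequence length changes.
assert attend(query, one_view, Wk, Wv).shape == (2, d)
assert attend(query, three_views, Wk, Wv).shape == (2, d)
```

The output shape depends only on the query, which is why training on a single view does not cap the number of views usable at inference.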
The framework runs on Hunyuan Video as its primary base model. The researchers also demonstrated compatibility with WanVideo 2.1 at 720p resolution, confirming the method is not tied to one architecture.
Benchmark Results
The team introduced 3D-CustomBench, an evaluation suite measuring subject fidelity in generated video. 3DreamBooth scores 0.8871 on CLIP-I and 0.7420 on DINO-I for visual similarity. In human evaluation, raters scored shape accuracy at 4.80 out of 5, compared to 4.39 for the best single-view baseline. Color fidelity reached 4.53 versus 4.09, and fine-grained detail scored 4.04 versus 3.35.
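For readers unfamiliar with the metrics: CLIP-I and DINO-I both score visual similarity as the mean cosine similarity between embeddings of generated frames and reference images, differing only in the image encoder used (CLIP vs. DINO). A sketch of the score computation, with random stand-ins for real encoder outputs:

```python
import numpy as np

def image_similarity_score(gen_embs, ref_embs):
    """CLIP-I/DINO-I-style score: mean pairwise cosine similarity
    between generated-frame and reference-image embeddings.

    Real evaluation would obtain gen_embs and ref_embs from a CLIP
    or DINO image encoder; arrays here are illustrative stand-ins.
    """
    g = gen_embs / np.linalg.norm(gen_embs, axis=-1, keepdims=True)
    r = ref_embs / np.linalg.norm(ref_embs, axis=-1, keepdims=True)
    return float((g @ r.T).mean())

# Identical embeddings yield a perfect score of 1.0
e = np.ones((3, 8))
print(image_similarity_score(e, e))  # 1.0
```

Scores like 0.8871 (CLIP-I) therefore indicate high average agreement between generated frames and the reference photos in embedding space.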
Product Video Demos
The three examples below show 3DreamBooth applied to commercial product scenarios. Each shows the reference input photos, the prompt used to direct the scene, and the generated video output.
Silver Bag on the Red Carpet
Prompt: An actress on a Hollywood red carpet raising a [v] bag and peering over it into the camera.
Generated output: silver bag on red carpet scene
Digital Watch in an Autumn Setting
Prompt: A video of hands rotating a [v] watch over autumn leaves, rustic stone wall and red apples behind.
Generated output: watch rotating in autumn setting
Snack Bag in a Supermarket
Prompt: A woman reviewing a [v] snack bag in a bright supermarket aisle, filmed by a foreground cameraman.
Generated output: snack bag product review in supermarket
What This Means for Filmmakers and Brands
Product video is one of the highest volume use cases for AI video generation today. Most tools work from a single reference image and hold up for static shots, but degrade when a product needs to be held, rotated, or shown from multiple sides. 3DreamBooth's spatial encoding holds geometry across that motion, which matters for anything involving hands interacting with an object or camera moves around a product.
For virtual production, the same logic applies to props and set dressings that must appear consistently across multiple shots. A studio can generate spatially consistent prop video for previz, lookbooks, and client presentations without shipping physical assets. The approach also connects to related research in reference image video generation, such as BindWeave's character consistency method and the First Frame Go video customization technique, both of which tackle the problem of binding a specific visual identity to generated video output.
Generate product and creative videos with AI FILMS Studio.
Sources
arXiv: 3DreamBooth: Subject-Driven 3D-Aware Video Customization
GitHub: Ko-Lani/3DreamBooth
Project Page: ko-lani.github.io/3DreamBooth

