DomainShuttle: Subject Driven Text-to-Video with In Domain and Cross Domain Generation

June 27, 2026

Researchers at the Hong Kong University of Science and Technology have released DomainShuttle, a text-to-video model that generates scenes featuring a specific subject from a single reference image. The model offers two distinct generation modes: one that preserves the subject's exact appearance across every frame, and one that intentionally allows the subject's style to shift while keeping its identity intact.

DomainShuttle: subject driven text-to-video generation overview

DomainShuttle is built on the Wan2.2-T2V-A14B backbone, an open video model with 14 billion active parameters. It is released under the Apache 2.0 license, which permits commercial use. The paper was submitted to arXiv on June 24, 2026, and the model received 62 HuggingFace upvotes within three days of release.

Two Modes, One Reference Image

Most subject-driven video generation models make a fixed trade-off: either the generated subject looks exactly like the reference (useful for cast consistency and product placement), or the model allows enough variation to produce stylized results (useful for fantasy sequences and animated segments). The two goals pull against each other: a model tuned to reproduce every detail of the reference resists restyling, while one loose enough to restyle the subject tends to drift from its identity.

DomainShuttle resolves this by separating the two as distinct operating modes. A filmmaker provides a single reference image and a text prompt, then selects which mode applies to the shot. The model handles both from the same architecture without retraining.

In Domain Generation: Exact Subject Fidelity

In-domain mode is designed for shots where the subject must look identical to the reference in every frame, regardless of background, lighting, or camera angle. It suits character consistency across a sequence of shots and product placement where brand fidelity is required.

In-domain: multiple objects preserved across frames

In-domain: human subject and object fidelity

The multi-object example shows two reference subjects maintained simultaneously across the generated frames, each retaining its specific features as the scene changes around them. The human-subject example preserves facial features and clothing texture through motion, the kind of consistency a believable performance needs across cuts.

Cross Domain Generation: Style Variation with Subject Consistency

Cross-domain mode lets the generated output sit in a different visual register from the reference image while the subject's core identity persists. The reference provides the subject's fundamental structure; the text prompt describes the stylistic world it should inhabit.

Cross-domain: fantasy reference to photorealistic output

Cross-domain: photorealistic reference to fantasy output

The first example converts a fantasy-styled reference into a photorealistic generated scene. The second reverses the direction, turning a photorealistic reference into a fantasy-rendered output. Both retain the subject's identity across the domain shift, a capability that prior open-source character-driven models have not offered within a single unified framework. Earlier work on character consistency, including approaches like BindWeave, focused on maintaining fidelity within a single visual register. Cross-domain transfer in both directions is DomainShuttle's distinguishing capability.

The Architecture

DomainShuttle introduces three technical components to achieve dual-mode generation on the Wan2.2 backbone.

The first is Domain MoT (Domain Mixture of Tasks). This component decouples domain-aware features from subject identity features during encoding. Instead of treating the reference image as a single unified signal, Domain MoT separates what the subject is from what visual world it belongs to, so the generation head can apply either or both independently.
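The article does not describe Domain MoT's internals, so the following is only a toy illustration of the decoupling idea, not the authors' implementation. It assumes a reference can be represented as two separate feature groups and that the conditioning passed onward depends on whether the reference's own visual domain should be kept. The `ReferenceFeatures` fields, the `build_conditioning` helper, and the zeroing rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReferenceFeatures:
    identity: List[float]  # "what the subject is" (hypothetical field)
    domain: List[float]    # "what visual world it belongs to" (hypothetical field)


def build_conditioning(ref: ReferenceFeatures, keep_reference_domain: bool) -> List[float]:
    """Always pass identity features; pass the reference's domain features only on request."""
    if keep_reference_domain:
        domain_part = list(ref.domain)
    else:
        # Assumed rule: blank out the domain slot so the text prompt can set the style.
        domain_part = [0.0] * len(ref.domain)
    return list(ref.identity) + domain_part


ref = ReferenceFeatures(identity=[0.7, 0.2], domain=[0.9, 0.4])
in_domain = build_conditioning(ref, keep_reference_domain=True)
cross_domain = build_conditioning(ref, keep_reference_domain=False)
print(in_domain)     # [0.7, 0.2, 0.9, 0.4]
print(cross_domain)  # [0.7, 0.2, 0.0, 0.0]
```

The point of the sketch is only that identity survives in both outputs while the domain slot is optional; a real model would learn this separation rather than slice a vector.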

The second is Video Reference DualRoPE, a positional encoding scheme that separates the reference image tokens from the video generation tokens in the attention mechanism. Standard positional encodings mix reference and generation frames spatially, which can cause the model to blur subject features into background context. DualRoPE keeps them distinct throughout the generation pass.
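One plausible reading of this separation, sketched below in plain Python, is to give reference tokens their own position range so they never share an index with a video frame. This is an assumption for illustration: the article does not specify how DualRoPE assigns positions, and the `assign_positions` helper and its `reference_offset` value are hypothetical. The `rope_rotate` function is a standard one-dimensional rotary embedding.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Apply a 1D rotary position embedding to an even-length vector."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Each consecutive pair of channels rotates at its own frequency.
        angle = position / (base ** (i / dim))
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_a - y * sin_a, x * sin_a + y * cos_a])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def assign_positions(num_video_frames, num_reference_tokens, reference_offset=1000):
    """Video frames use 0..T-1; reference tokens use a disjoint range (assumed offset)."""
    video = list(range(num_video_frames))
    reference = [reference_offset + i for i in range(num_reference_tokens)]
    return video, reference


video_pos, ref_pos = assign_positions(num_video_frames=4, num_reference_tokens=2)
print(video_pos, ref_pos)  # [0, 1, 2, 3] [1000, 1001]

# Rotary attention scores depend only on the difference between positions.
q = [1.0, 0.0, 0.5, -0.5]
k = [0.3, 0.7, -0.2, 0.1]
score_a = dot(rope_rotate(q, 2), rope_rotate(k, 0))
score_b = dot(rope_rotate(q, 5), rope_rotate(k, 3))
print(abs(score_a - score_b) < 1e-9)  # True
```

Because rotary scores depend on relative offsets, a reference token placed in a disjoint index range never appears to attention as an adjacent video frame, which is one way the two token groups could be kept distinct.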

The third is Cross Pair Consistent Loss, a training objective that extracts subject features from pairs of images with different lighting, backgrounds, and poses. By training on variation, the model learns to identify which features are intrinsic to the subject and which are incidental to the capture conditions.
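The article does not give the loss formula, so the sketch below uses cosine distance between two feature vectors as a stand-in for a pairwise consistency objective. The function names and the toy feature values are made up for illustration; in practice the features would come from an encoder applied to each image in the pair.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pair_consistency_loss(features_a, features_b):
    """1 - cosine similarity: near 0 when both captures yield the same direction."""
    return 1.0 - cosine_similarity(features_a, features_b)


# Toy subject features from two captures of the same subject (made-up numbers),
# plus features from an unrelated subject for comparison.
studio_capture = [0.9, 0.1, 0.4]
outdoor_capture = [0.8, 0.2, 0.5]
unrelated_subject = [-0.3, 0.9, -0.2]

same_subject_loss = pair_consistency_loss(studio_capture, outdoor_capture)
different_subject_loss = pair_consistency_loss(studio_capture, unrelated_subject)
print(same_subject_loss < different_subject_loss)  # True
```

Minimizing a term like this over pairs that differ in lighting, background, and pose pushes the encoder to keep what the two captures share and discard what differs between them.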

The Wan model family underpins the generation backbone. With 14 billion active parameters, Wan2.2-T2V-A14B is among the strongest open video model foundations available as of mid-2026, and DomainShuttle extends it for subject-driven use cases without modifying the core weights.

Available Under Apache 2.0

The codebase is available on GitHub from HKUST-C4G and the model weights are on HuggingFace under Apache 2.0, which permits commercial use as long as redistributed copies carry the license text and any existing notices. The paper gives 480p and 720p as supported inference resolutions.
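As a minimal sketch of fetching the weights, the snippet below uses `snapshot_download` from the `huggingface_hub` package with the repository ID given in the Sources list. The local directory name and the `resolve_height` helper are assumptions for illustration, and the layout of the downloaded files is not described in the article, so check the GitHub repository for the actual inference instructions.

```python
REPO_ID = "CNcreator0331/DomainShuttle_weight"  # repository ID from the Sources list
SUPPORTED_HEIGHTS = {"480p": 480, "720p": 720}  # resolutions the paper reports


def resolve_height(label: str) -> int:
    """Map a resolution label to a pixel height, rejecting unsupported labels."""
    if label not in SUPPORTED_HEIGHTS:
        raise ValueError(f"unsupported resolution: {label}")
    return SUPPORTED_HEIGHTS[label]


def download_weights(local_dir: str = "./DomainShuttle_weight") -> str:
    """Download the weight repository and return the local path (needs network access)."""
    # Imported lazily so this module loads even if huggingface_hub is not installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


print(resolve_height("720p"))  # 720
```

Calling `download_weights()` starts the actual download, so it is left uncalled here.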

The institutional backing matters for production teams assessing reliability. HKUST ranks consistently among the top computer science research institutions in Asia, and the 10-author team represents a sustained research effort rather than a solo release. The 62 HuggingFace upvotes in three days signal rapid pickup from the practitioner community.

Filmmakers building character-consistent sequences can experiment with text-to-video and image-to-video generation directly in AI FILMS Studio's video workspace.

Sources

arXiv: DomainShuttle: Revisiting Domain in Subject-Driven Video Generation
GitHub: HKUST-C4G/DomainShuttle
HuggingFace: CNcreator0331/DomainShuttle_weight
Project Page: cn-makers.github.io/DomainShuttle
