Mix Any Image and Video: BiCo Breaks Composition Barriers

Video demo: BiCo in action, showing how visual concepts from multiple sources compose into a single coherent output.
Combining a beagle with a bartender. Merging Minecraft landscapes with erupting volcanoes and flying butterflies. Putting a husky and Doberman into an urban rooftop scene. These aren't creative writing prompts. They're actual outputs from BiCo, a new visual concept composition system from Hong Kong University of Science and Technology.
Released as an arXiv preprint in December 2025, BiCo (Bind & Compose) introduces what the researchers call "concept prompt binding": a method that lets you extract visual concepts from multiple images and videos, then compose them into a single coherent video output. Unlike previous approaches that struggle with complex multi-source compositions, BiCo achieves this through a one-shot process that binds each visual concept directly to its corresponding text token.
The Problem: Concept Collision in Video Generation
Current text-to-video and image-to-video systems face a fundamental challenge when combining multiple visual elements. When you feed multiple concepts into these models, they tend to blend together unpredictably, or one concept dominates while the others fade into the background.
Try generating "a beagle dog mixing drinks at a bar with a cityscape window" with existing methods, and you might get a beagle near a bar, or a bar scene with dog-like elements, but rarely both concepts precisely represented with the beagle actually performing the bartending action from your reference video.
This happens because traditional methods lack mechanisms to decompose complex visual inputs into distinct, controllable concepts. The models process everything as a unified visual signal rather than separating "what is the subject" from "what is the action" from "what is the environment."
BiCo's Solution: Hierarchical Binding with Smart Training
The HKUST team—Xianghao Kong, Zeyu Zhang, Yuwei Guo, Zhuoran Zhao, Songchun Zhang, and Anyi Rao—designed BiCo around three core innovations that address these composition challenges.
Hierarchical Binder Structure
BiCo uses a hierarchical architecture for cross-attention conditioning in Diffusion Transformers. Rather than encoding an entire image or video as a single embedding, the system breaks down visual inputs into component concepts at multiple levels of abstraction.
When you provide an image of a beagle and a video of someone bartending, BiCo's hierarchical binders extract: the beagle's appearance (fur color, facial features, body structure), the bartending motion (arm movements, shaker manipulation, pouring gestures), and the environmental context (bar setting, cityscape window, lighting).
Each of these concepts gets bound to specific tokens in the target prompt. This binding happens through learned attention mechanisms that map visual features to their corresponding text descriptions.
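To make the binding idea concrete, here is a minimal PyTorch-style sketch of attaching one visual concept to selected prompt tokens through cross-attention. The class and argument names (ConceptBinder, token_mask) are illustrative assumptions, not BiCo's unreleased code.

```python
# Illustrative sketch only: binds one visual concept to selected prompt tokens
# via cross-attention. Names (ConceptBinder, token_mask) are assumptions.
import torch
import torch.nn as nn

class ConceptBinder(nn.Module):
    """Maps visual concept features onto the text tokens they are bound to."""
    def __init__(self, vis_dim: int, txt_dim: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(txt_dim, n_heads, kdim=vis_dim,
                                          vdim=vis_dim, batch_first=True)

    def forward(self, txt_tokens, vis_feats, token_mask):
        # txt_tokens: (B, T, txt_dim) prompt embeddings
        # vis_feats:  (B, V, vis_dim) patch features from the concept image/video
        # token_mask: (B, T), 1 where a token is bound to this concept (e.g. "beagle")
        bound, _ = self.attn(query=txt_tokens, key=vis_feats, value=vis_feats)
        # Only the bound tokens receive the visual concept; others pass through.
        return torch.where(token_mask.unsqueeze(-1).bool(),
                           txt_tokens + bound, txt_tokens)

# Hierarchical use: one binder per abstraction level (e.g. appearance details
# vs. semantic identity), each conditioning different DiT cross-attention layers.
appearance_binder = ConceptBinder(vis_dim=1024, txt_dim=768)
semantic_binder = ConceptBinder(vis_dim=1024, txt_dim=768)

txt = torch.randn(1, 16, 768)              # 16 prompt token embeddings
vis = torch.randn(1, 256, 1024)            # 256 visual patches of the beagle image
mask = torch.zeros(1, 16)
mask[0, 3] = 1                             # suppose token 3 is "beagle"
conditioned = appearance_binder(txt, vis, mask)
```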
Diversify and Absorb Mechanism
Training concept binding models presents a chicken-and-egg problem. To bind concepts accurately, you need precise text prompts that describe only the relevant concept. But real-world images contain many visual details that aren't part of your target concept.
BiCo solves this through its Diversify and Absorb Mechanism. During training, the system uses diversified prompts—multiple different text descriptions for the same visual concept. This helps the model learn which visual features consistently matter across different descriptions.
The "absorb" component introduces an extra token that captures concept irrelevant details. Think of it as a conceptual trash bin. When the model encounters visual elements that don't correspond to any token in the target prompt, those features get absorbed by this special token rather than corrupting the binding of actual concepts.
For example, when binding a beagle concept from an image that also contains grass, flowers, and sunlight, the Diversify and Absorb Mechanism ensures only the beagle's visual features bind to "beagle" tokens while background elements get absorbed away.
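Here is a rough sketch of how diversified prompts and an absorb token could be wired together during training, assuming a HuggingFace-style tokenizer and text encoder. The prompt_bank and absorb_embed names are hypothetical, not taken from the paper.

```python
# Hedged sketch of Diversify-and-Absorb conditioning. prompt_bank, absorb_embed,
# and the HuggingFace-style encoder interface are assumptions, not BiCo's code.
import random
import torch
import torch.nn as nn

class DiversifyAndAbsorb(nn.Module):
    def __init__(self, txt_dim, prompt_bank, tokenizer, text_encoder):
        super().__init__()
        self.prompt_bank = prompt_bank     # diversified descriptions of one concept
        self.tokenizer = tokenizer
        self.text_encoder = text_encoder
        # Learnable "trash bin" token for concept-irrelevant visual details.
        self.absorb_embed = nn.Parameter(torch.randn(1, 1, txt_dim) * 0.02)

    def sample_conditioning(self, batch_size):
        # Diversify: pick a different wording of the same concept each step.
        prompt = random.choice(self.prompt_bank)
        ids = self.tokenizer(prompt, return_tensors="pt").input_ids
        tokens = self.text_encoder(ids).last_hidden_state   # (1, T, txt_dim)
        tokens = tokens.expand(batch_size, -1, -1)
        # Absorb: append the extra token so grass, sunlight, etc. bind to it
        # instead of leaking into the "beagle" tokens.
        absorb = self.absorb_embed.expand(batch_size, -1, -1)
        return torch.cat([tokens, absorb], dim=1)

prompts = ["a beagle", "a small beagle dog", "a beagle with a collar"]
```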
Temporal Disentanglement Strategy
Composing video concepts presents unique challenges beyond static images. Video introduces temporal dynamics—motion, camera movement, object interactions over time. These temporal patterns need to be captured and transferred without losing the static appearance details.
BiCo's Temporal Disentanglement Strategy splits video concept training into two stages. The first stage uses a single-branch binder focused solely on capturing static appearance from video frames. The second stage adds a temporal branch that learns motion patterns while the appearance branch remains frozen.
This dual-branch structure allows BiCo to separately control appearance (what the subject looks like) and dynamics (how the subject moves). When composing multiple video sources, you can mix the static appearance from one video with the motion pattern from another.
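Here is a compact sketch of what that two-stage schedule could look like in PyTorch. The binder modules and the diffusion_loss callable are placeholders for components BiCo hasn't published.

```python
# Sketch of the two-stage Temporal Disentanglement schedule. The binder modules
# and diffusion_loss callable are placeholders, not BiCo's released training code.
import torch

def train_video_concept(appearance_binder, temporal_binder, video_frames,
                        diffusion_loss, steps_per_stage=500, lr=1e-4):
    # Stage 1: learn static appearance from individual frames only.
    opt = torch.optim.AdamW(appearance_binder.parameters(), lr=lr)
    for _ in range(steps_per_stage):
        idx = torch.randint(len(video_frames), (1,)).item()
        loss = diffusion_loss(appearance_binder, video_frames[idx])
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Stage 2: freeze appearance, learn motion with the temporal branch.
    for p in appearance_binder.parameters():
        p.requires_grad_(False)
    opt = torch.optim.AdamW(temporal_binder.parameters(), lr=lr)
    for _ in range(steps_per_stage):
        loss = diffusion_loss(temporal_binder, video_frames,
                              appearance=appearance_binder)
        opt.zero_grad()
        loss.backward()
        opt.step()
```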
Real-World Composition Examples
The BiCo project page demonstrates the system's capabilities across scenarios that would challenge traditional methods.
Example 1: Minecraft Butterfly Volcano
Inputs:
Image 1: Minecraft landscape with river, trees, waterfall, and blocky clouds
Image 2: Erupting volcano with vibrant red lava and dramatic ash cloud
Input Video: Butterfly resting on yellow flower, flapping wings softly
BiCo Output:
Result: Butterfly flapping wings in blocky Minecraft landscape with erupting volcano
The composed video preserves the butterfly's wing flapping motion from the input video, the distinctive pixelated aesthetic from the Minecraft image, and the dramatic lava eruption from the volcano image. All three concepts coexist in a single coherent scene.
Example 2: Beagle Bartender
Inputs:
Input Image: Beagle with collar standing on grass pathway
Input Video: Bartender skillfully mixing drink in shaker at bar
BiCo Output:
Result: Beagle dog mixing drink vigorously with shaker using paws at bar with cityscape window
This composition transfers the bartending motion to the beagle while maintaining the dog's distinctive appearance and the bar environment. The beagle's paws convincingly manipulate the cocktail shaker with the same vigorous motion from the source video.
Example 3: Two Dogs on a Rooftop
Inputs:
Image 1: Husky with striking blue eyes standing in snow
Image 2: Poised black Doberman Pinscher against green foliage
Input Video: Person in suit standing on rooftop as another appears behind, cityscape background
BiCo Output:
Result: Husky stands on rooftop as Doberman appears behind, cityscape with blue sky
BiCo successfully replaces both human subjects with the two distinct dogs while preserving the spatial relationships (foreground/background positioning), timing (when the second subject enters frame), and environmental context (rooftop, cityscape, cloud-dotted sky).
Example 4: Video-to-Video Composition
Inputs:
Input Video 1: Person playing guitar in underground tunnel with graffiti wall
Input Video 2: Person in mint green suit pointing upward while holding trumpet
BiCo Output:
Result: Guitarist and trumpet player performing together, maintaining distinct appearances and actions
This video-to-video composition combines two distinct subjects with different appearances, poses, and actions into a single coherent scene. The guitarist retains the underground tunnel environment while the trumpet player appears in the mint green suit, both performing simultaneously.
Technical Performance
BiCo's evaluation demonstrates advantages over existing approaches across three key metrics: concept consistency (how well visual concepts match the input sources), prompt fidelity (how accurately the output matches the text description), and motion quality (temporal coherence and naturalness).
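The project page names these metrics but not their formulas. A common proxy for the first two is CLIP similarity, sketched below with the HuggingFace transformers CLIP model; this reflects typical practice, not BiCo's published evaluation protocol.

```python
# Assumption: CLIP similarity as a stand-in for concept consistency and prompt
# fidelity. output_frames and reference_image are PIL images.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def concept_consistency(output_frames, reference_image):
    inputs = processor(images=output_frames + [reference_image], return_tensors="pt")
    feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    ref = feats[-1]
    return (feats[:-1] @ ref).mean().item()    # mean cosine similarity to the reference

def prompt_fidelity(output_frames, prompt):
    inputs = processor(text=[prompt], images=output_frames,
                       return_tensors="pt", padding=True)
    out = model(**inputs)
    return out.logits_per_image.mean().item()  # CLIP image-text score
```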
The project page shows direct comparisons with Textual Inversion, DB-LoRA, DreamVideo, and DualReal. In scenarios like "monkey putting on headphones in a forest" or "elephant swimming underwater," BiCo consistently maintains clearer concept boundaries and more accurate concept-to-action binding.
The system's hierarchical binder structure enables this by processing concepts at multiple abstraction levels simultaneously. Lower-level binders capture fine-grained appearance details (fur texture, eye color, body proportions) while higher-level binders encode semantic concepts (species identity, posture, activity).
Implementation Details
BiCo builds on Diffusion Transformers as its base architecture. The research team hasn't released the full codebase yet, but the GitHub repository (github.com/refkxh/bico) indicates future code availability.
The method requires training separate binders for each concept you want to use in composition. This training happens in a one-shot regime, meaning you only need a single image or video example of each concept to create a usable binder.
Once trained, these binders can be reused across different compositions. If you've trained a binder for a specific dog's appearance and another binder for a bartending motion, you can mix these with any new environment or additional concepts without retraining.
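Here is a toy, self-contained illustration of that reuse pattern: binders trained once get stored in a library and mixed into new prompts. The ToyBinder class and compose() helper are invented for this sketch, and random embeddings stand in for a real text encoder and the actual BiCo conditioning pipeline.

```python
# Toy illustration of reusing trained binders across compositions. ToyBinder and
# compose() are invented names; random embeddings replace a real text encoder.
import torch
import torch.nn as nn

class ToyBinder(nn.Module):
    """Stands in for a trained concept binder (one per concept, trained one-shot)."""
    def __init__(self, txt_dim: int = 768):
        super().__init__()
        self.concept = nn.Parameter(torch.randn(txt_dim) * 0.02)

    def forward(self, token_embed):
        return token_embed + self.concept

def compose(prompt: str, binders: dict, txt_dim: int = 768):
    words = prompt.split()
    tokens = torch.randn(len(words), txt_dim)   # placeholder prompt embeddings
    out = []
    for word, tok in zip(words, tokens):
        # Any word with a trained binder gets its visual concept injected.
        out.append(binders[word](tok) if word in binders else tok)
    return torch.stack(out)                     # conditioning for the video model

# Binders trained once can be mixed freely in new prompts without retraining.
library = {"beagle": ToyBinder(), "bartending": ToyBinder()}
cond = compose("a beagle bartending at a bar with a cityscape window", library)
```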
The Temporal Disentanglement Strategy's dual-branch structure means video concepts require two training stages, but this enables fine-grained control over appearance versus motion when composing the final output.
Practical Applications for AI Filmmaking
BiCo's concept binding approach opens possibilities that current text-to-video systems struggle with.
Character consistency across scenes: Train a binder on your protagonist's appearance, then compose that character into different scenarios by combining with various action and environment videos. The character maintains consistent appearance while adapting to new contexts.
Motion transfer: Extract a specific motion pattern (a dance move, a gesture, a way of walking) from one video, then apply that motion to different subjects or environments. This works for both camera motion and subject motion.
Style composition: Combine the visual aesthetic from one source (like the Minecraft pixel art style in the examples) with content from another. This goes beyond style transfer by preserving both style and content boundaries clearly.
Multi-subject orchestration: Place multiple distinct subjects into a single scene with precise control over each subject's appearance, position, and action. Traditional multi-subject generation often blends subjects together or loses details on secondary characters.
The one-shot training requirement means you can quickly create binders for new concepts without collecting large datasets or running lengthy training procedures. For filmmakers, this translates to rapid prototyping of visual ideas by mixing and matching conceptual building blocks.
Limitations and Considerations
BiCo's approach involves tradeoffs. The hierarchical binder structure and dual-branch temporal modeling increase computational requirements compared to simpler methods. Each concept binding adds processing overhead.
The method also requires explicit concept decomposition. You need to decide upfront which visual elements constitute distinct concepts worth binding separately. This differs from end-to-end approaches that attempt to handle everything implicitly.
Complex scenes with intricate spatial relationships between many objects may challenge the current system. The examples primarily show 1 to 3 main concepts in relatively simple spatial arrangements. Dense scenes with 5+ interacting objects would test the limits of concept prompt binding.
The project page doesn't provide specific information about video resolution limits, generation time, or memory requirements. These practical constraints will matter for production use.
Research Context
BiCo joins recent efforts to improve compositional control in generative video. VideoComposer (NeurIPS 2023) introduced decomposing videos into textual, spatial, and temporal conditions. Vico (July 2024) approached compositional video as a flow equalization problem. BlobGEN-Vid (CVPR 2025) uses blob representations for compositional generation.
BiCo's contribution focuses specifically on the concept binding mechanism rather than the conditioning architecture. This makes the approach potentially compatible with different base video generation models. The hierarchical structure and Diversify and Absorb training could enhance other composition methods.
The December 2025 arXiv release positions BiCo as part of the ongoing transition from monolithic text-to-video models toward more modular, compositional approaches that give creators finer control over each element in generated videos.
Access and Future Development
The BiCo paper is available as arXiv preprint 2512.09824. The project page at refkxh.github.io/BiCo_Webpage/ contains detailed examples and comparisons. The GitHub repository (github.com/refkxh/bico) currently hosts the paper PDF with code release expected.
The research comes from teams at the Hong Kong University of Science and Technology and the Chinese University of Hong Kong, led by Xianghao Kong, with contributions from Zeyu Zhang, Yuwei Guo, Zhuoran Zhao, Songchun Zhang, and Anyi Rao.
For filmmakers interested in compositional video generation, BiCo represents a significant step toward practical multi-source video synthesis with precise concept control. The ability to bind specific visual concepts to prompt tokens, then compose those bound concepts freely, moves beyond current text-to-video limitations.
Whether you're combining Minecraft aesthetics with wildlife footage or creating impossible creature interactions, BiCo's approach to visual concept composition provides tools that previous methods couldn't deliver. The hierarchical binding, smart training mechanisms, and temporal disentanglement address real technical barriers to flexible video composition.
Try exploring visual concept combinations using AI FILMS Studio's video generation workspace to experiment with multi source video creation.