Ditto and Editto: Text-Based Video Editing Reaches New Capability Level
Video editing through text instructions represents a practical efficiency gain for content creators, but the technology has lagged behind similar capabilities for images. Researchers from HKUST, Ant Group, Zhejiang University, and Northeastern University introduce Ditto, a framework that addresses this gap through synthetic data generation at scale.
The team spent more than 12,000 GPU-days building Ditto-1M, a dataset of one million high-quality video editing examples. Using this data, they trained Editto, a model that follows text instructions to edit videos with what the researchers describe as state-of-the-art accuracy.
The Data Scarcity Problem
Instruction-based image editing has reached practical usability through models like InstructPix2Pix and more recent commercial offerings. These systems can interpret text commands like "make the sky more dramatic" or "change the car to red" and apply those edits accurately to still images.
Video editing faces additional complexity. Edits must maintain temporal coherence across frames while following the instruction accurately. An object changed in one frame needs to remain consistently modified throughout the sequence. Lighting adjustments must feel natural across the entire clip.
Previous approaches to generating training data for these systems encountered persistent tradeoffs. Some methods relied on computationally expensive per-video optimization, making large-scale dataset creation impractical. Others used training-free propagation techniques that traded editing diversity or temporal consistency for scalability.
The result was too little high-quality training data to teach models to perform instruction-based video editing reliably across diverse scenarios.
How Ditto Generates Training Data
The Ditto framework combines multiple components to generate video editing examples systematically. The process starts with source videos and uses an intelligent agent to create diverse editing instructions that cover different types of modifications.
For each source video, the system extracts a representative frame and applies an image editing model to that frame according to the instruction. This edited frame serves as a visual reference showing what the instruction should accomplish.
The framework then uses a video generation model to propagate the edit across all frames, creating the edited video sequence. A temporal enhancer improves consistency between frames, reducing artifacts that can occur when generating video.
The system employs distilled model architectures to reduce computational cost while maintaining quality, making large-scale data generation economically viable. An automated quality control system filters the outputs, ensuring only high-quality examples enter the final dataset.
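Conceptually, the pipeline chains these components into a single loop per source video. The sketch below illustrates that flow; the function and component names are hypothetical stand-ins, not the authors' actual API.

```python
# Illustrative sketch of a Ditto-style data generation loop.
# All component names (instruction_agent, image_editor, etc.) are hypothetical.

def generate_training_example(source_video, instruction_agent, image_editor,
                              video_generator, temporal_enhancer, quality_filter):
    """Produce one (source video, instruction, edited video) training triplet."""
    instruction = instruction_agent.propose(source_video)      # diverse edit instruction
    key_frame = source_video.representative_frame()            # extract a reference frame
    edited_frame = image_editor.edit(key_frame, instruction)   # visual reference for the edit
    edited_video = video_generator.propagate(                  # spread the edit across frames
        source_video, edited_frame, instruction)
    edited_video = temporal_enhancer.refine(edited_video)      # reduce frame-to-frame artifacts
    if quality_filter.passes(source_video, edited_video, instruction):
        return source_video, instruction, edited_video
    return None                                                # discard low-quality outputs
```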
The Ditto-1M dataset covers two categories of editing: global transformations that affect entire videos, and local modifications that target specific regions or objects. Global edits include style transfers, color grading, and atmospheric adjustments. Local edits involve object replacement, selective enhancement, and regional modifications.
What Editto Can Do
Editto, the model trained on Ditto-1M, performs instruction-based video editing. Users provide a source video and a text instruction describing the desired modification; the model processes both inputs and generates an edited video that reflects the instruction.
The system handles commands like "add falling snow to the scene," "make it look like sunset," "change the car to a truck," or "add fog to the background." It interprets these natural language instructions and applies corresponding visual modifications.
The model maintains temporal coherence across frames, meaning edits appear consistent throughout the video rather than flickering or changing unexpectedly. This consistency matters for practical usability, as temporal artifacts quickly become noticeable to viewers.
Editto demonstrates what the researchers call synthetic-to-real capability. The model can take stylized or modified videos and map them back toward more photorealistic appearance, suggesting it has learned relationships between different visual domains.
Training Strategy: Curriculum Learning
The researchers employed curriculum learning to train Editto, starting with easier tasks and progressively increasing difficulty. This approach addresses a challenge inherent in their data generation process.
During data creation, the system uses edited reference images to guide video generation. However, during real-world use, only text instructions are available. The model must learn to interpret instructions without visual references.
The curriculum begins by providing both text instructions and edited reference images during training. This gives the model a visual scaffold showing what the instruction means. As training progresses, the system gradually reduces reliance on visual guidance, forcing the model to extract meaning from text alone.
By the end of training, the model can perform edits based solely on text instructions, having learned the mapping between language and visual modifications through the scaffolded approach.
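A minimal sketch of how such a curriculum might be scheduled, assuming a simple linear annealing of the probability of showing the edited reference frame (the paper's exact schedule may differ):

```python
import random

def visual_guidance_prob(step, total_steps, start=1.0, end=0.0):
    """Linearly anneal the chance of providing the edited reference frame."""
    frac = min(step / total_steps, 1.0)
    return start + (end - start) * frac

def build_conditioning(text_instruction, reference_frame, step, total_steps):
    """Early in training, condition on text plus the reference image;
    later, drop the image so the model learns the text-only mapping."""
    if random.random() < visual_guidance_prob(step, total_steps):
        return {"text": text_instruction, "image": reference_frame}
    return {"text": text_instruction, "image": None}
```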
Technical Architecture Details
Editto builds on diffusion based video generation architectures. The model processes source videos through a series of denoising steps, with text instructions conditioning the generation process at each step.
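At a high level, the generation loop resembles standard conditional diffusion sampling. The sketch below is a generic illustration rather than Editto's actual implementation; the model, scheduler, and conditioning interfaces are placeholders.

```python
def edit_by_denoising(model, scheduler, noisy_latents, text_embedding,
                      source_latents, num_steps):
    """Generic text-conditioned denoising loop (placeholder interfaces)."""
    x = noisy_latents
    for t in reversed(range(num_steps)):
        # Each step predicts noise conditioned on the instruction embedding
        # and the source-video latents, then the scheduler updates the sample.
        noise_pred = model(x, t, text_embedding, source_latents)
        x = scheduler.step(noise_pred, t, x)
    return x  # decoded later into the edited video frames
```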
The temporal enhancer component specifically addresses frame-to-frame consistency. Video generation models can produce individual frames that look correct in isolation but show discontinuities when played in sequence. The enhancer reduces these artifacts through architectural modifications that encourage temporal coherence.
The system uses distilled models for efficiency. Model distillation transfers knowledge from larger, more capable models into smaller architectures that run faster with fewer computational resources. This keeps the data generation pipeline practical at the scale required for one million examples.
Quality control during data generation involves automated assessment of several factors: instruction-video alignment, temporal consistency, visual quality, and aesthetic appeal. The system filters out examples that fail to meet thresholds on these metrics.
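A toy version of such a filter might look like the following; the metric names and thresholds are made up for illustration and do not come from the paper.

```python
def passes_quality_control(scores, thresholds=None):
    """Keep an example only if every automated metric clears its threshold."""
    thresholds = thresholds or {
        "instruction_alignment": 0.7,   # e.g., VLM-scored match to the instruction
        "temporal_consistency": 0.8,    # e.g., frame-to-frame stability score
        "visual_quality": 0.6,          # e.g., artifact/sharpness score
        "aesthetic_score": 0.5,
    }
    return all(scores.get(name, 0.0) >= cutoff for name, cutoff in thresholds.items())
```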
How AI Filmmakers Can Use This Technology
For content creators and filmmakers, instruction-based video editing offers practical workflow benefits. The technology enables rapid iteration on visual treatments without manual frame-by-frame editing.
Consider previsualization scenarios where you need to test different visual approaches quickly. Instead of manually color grading multiple versions, you can generate variations through text instructions: "make it feel like a thriller," "apply 1970s film aesthetic," "create a dreamlike atmosphere." Each instruction produces a complete edited version in minutes rather than hours.
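That kind of exploration is easy to script. The snippet below assumes a hypothetical edit_video function wrapping whatever inference entry point you run Editto through.

```python
# Hypothetical batch exploration: several looks generated from one reference clip.
looks = {
    "thriller": "make it feel like a tense thriller",
    "seventies": "apply a 1970s film aesthetic",
    "dreamlike": "create a dreamlike atmosphere",
}

for name, prompt in looks.items():
    edit_video("reference_clip.mp4", prompt, output=f"previz_{name}.mp4")  # assumed signature
```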
Location modification becomes more accessible. If you shot footage in one setting but want to test how it would look in different conditions, text instructions can add environmental effects: "add rain," "make it nighttime," "create morning mist." These modifications help evaluate creative decisions before committing to expensive reshoots or extensive VFX work.
Style transfer applications allow testing different visual treatments on existing footage. You can experiment with film stocks, lighting moods, or artistic styles through text descriptions. This capability supports creative exploration during the planning and early editing phases.
Object and element modifications enable practical adjustments. Changing specific items in scenes, adding or removing visual elements, or altering particular aspects of the composition become text-command operations rather than complex masking and compositing work.
Current Limitations and Practical Considerations
The technology shows specific strengths and limitations that affect practical deployment. Editto performs well on editing tasks similar to those in its training data but may struggle with novel modifications outside that distribution.
Complex instructions requiring multiple simultaneous changes can produce less reliable results than simple, focused commands. Breaking complex edits into sequential steps often works better than attempting everything in one instruction.
Temporal consistency improves significantly over earlier methods but still occasionally shows artifacts, particularly in longer videos or with rapid motion. The system works most reliably on clips of moderate length with relatively stable scenes.
The model requires computational resources appropriate to video generation tasks. Processing time scales with video length and resolution. For production use, this means planning processing time into workflows rather than expecting instant results.
Fine-grained control over specific visual details remains challenging through text instructions alone. While the system can apply broad modifications effectively, achieving precise artistic intent for every aspect may require multiple iterations or combination with traditional editing tools.
Comparing Ditto to Related Work
Several concurrent research efforts address instruction-based video editing from different angles. Understanding these approaches helps contextualize what Ditto contributes.
Some systems use zero-shot methods that adapt pre-trained models for video editing without requiring specialized training data. These approaches offer flexibility but often produce less consistent results than models trained specifically for editing tasks.
Other frameworks employ reinforcement learning to optimize editing quality through feedback signals. This can improve alignment between instructions and outputs but requires careful reward function design and substantial computational resources.
Ditto's contribution centers on scalable data generation. By creating a large, high-quality dataset through an automated pipeline, the researchers enable standard supervised learning to achieve strong results. This avoids some of the complexity of zero-shot or reinforcement learning methods while producing models that generalize well across editing scenarios.
The synthetic-to-real capability distinguishes Editto from pure generation models. Rather than only creating stylized outputs, the system learns mappings between different visual domains, enabling more versatile editing capabilities.
Implementation and Availability
The research team released Ditto-1M as an open dataset, providing access to the one million video editing examples. This enables other researchers to train models or develop new approaches using the same foundation.
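Since the dataset lives on the Hugging Face Hub, you can inspect its file layout before committing to a full download (one million videos is a lot of storage). The snippet below uses the standard huggingface_hub client; the file-pattern filter is only an example and should be adjusted to the repository's actual layout.

```python
from huggingface_hub import list_repo_files, snapshot_download

# List what the dataset repo contains before downloading anything large.
files = list_repo_files("QingyanBai/Ditto-1M", repo_type="dataset")
print(files[:20])

# Pull only metadata-style files first (pattern is illustrative, not the real layout).
local_dir = snapshot_download(
    repo_id="QingyanBai/Ditto-1M",
    repo_type="dataset",
    allow_patterns=["*.json", "*.csv", "*.md"],
)
print("Downloaded to", local_dir)
```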
The Editto model weights are available for download, allowing developers to integrate the editing capabilities into applications or build upon the architecture. The code repository includes training scripts, inference code, and implementation details.
The project page at editto.net provides examples demonstrating various editing capabilities, including global and local transformations across different video types. These examples help developers understand what the system can accomplish and inform decisions about integration.
For technical implementation, the researchers provide a ComfyUI workflow file that integrates Editto into the popular node-based interface used by many AI content creation tools. This lowers the barrier for creators who want to experiment with the technology without building custom applications.
Workflow Integration for Filmmakers
Instruction-based video editing fits into production workflows at several points. Understanding where this technology provides value helps determine appropriate use cases.
During pre-production and planning, rapid generation of visual variations supports creative decision making. Directors and cinematographers can test different treatments on reference footage, helping align creative vision before production begins.
In post-production, the technology enables efficient exploration of editorial choices. Editors can generate multiple versions of sequences with different visual treatments, accelerating the process of finding the right aesthetic approach.
For previsualization and animatics, text-based editing allows quick iteration on temporary footage. Creating placeholder versions of scenes with appropriate visual styling helps communicate creative intent to stakeholders without investing in finished VFX.
Marketing and promotional content benefits from the ability to generate variations efficiently. Creating different versions of trailers or promotional clips for different audiences or platforms becomes less labor-intensive when modifications can be specified through text instructions.
The technology works best as a complement to traditional editing tools rather than a replacement. Combining instruction-based editing for broad modifications with precise manual control for critical moments provides both flexibility and efficiency.
Technical Challenges Addressed
The Ditto framework solves several technical problems that previously limited instruction-based video editing. Understanding these solutions clarifies what the research contributes.
The diversity-fidelity tradeoff plagued earlier approaches: methods that produced diverse editing results often sacrificed visual quality, while systems that maintained high quality couldn't generate varied modifications. Ditto's pipeline combines strong image editing models with video generation, capturing diversity from the image domain while maintaining temporal coherence in video.
Cost was another barrier to large-scale data generation. Creating one million high-quality video editing examples would be prohibitively expensive with per-video optimization methods. Ditto's distilled architectures and automated quality control make this scale practical.
Instruction generation and quality filtering required intelligent automation. Manually writing diverse, high-quality instructions for one million videos would take an impractical amount of human effort, so the framework employs language models to generate varied instructions and automatically assess output quality at scale.
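As a rough illustration of what the instruction-generation side could look like, here is a hypothetical prompt template for a language model; the authors' actual agent prompts and categories are not reproduced here.

```python
EDIT_CATEGORIES = [
    "style transfer", "color grading", "atmospheric adjustment",
    "object replacement", "selective enhancement",
]

def build_instruction_prompt(scene_caption, edit_category):
    """Assemble a prompt asking an LLM for one concrete editing instruction."""
    return (
        "You write instructions for a video editing model.\n"
        f"Scene description: {scene_caption}\n"
        f"Edit category: {edit_category}\n"
        "Write one concise, concrete instruction an editor could follow, "
        "e.g. 'change the red car to a blue pickup truck'."
    )
```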
The visual-to-textual transition challenge emerged from the data generation method. Since the pipeline uses visual references during data creation but users provide only text during deployment, the model must learn to bridge these modalities. Curriculum learning provides a systematic way to teach this transition.
Performance Benchmarks and Comparisons
The researchers evaluated Editto against existing instruction-based video editing methods across multiple metrics. The results show improvements in instruction-following accuracy, temporal consistency, and visual quality.
Human evaluators assessed how well edited videos matched the intent of text instructions. Editto achieved higher preference ratings compared to baseline methods, indicating better instruction comprehension and execution.
Automated metrics measured temporal consistency by analyzing frame-to-frame differences and optical flow. Editto demonstrated reduced temporal artifacts compared to systems that lack specialized temporal enhancement components.
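For a rough sense of what such a check involves, the sketch below scores a clip by mean pixel change between consecutive grayscale frames using OpenCV. It is a crude proxy; the paper's evaluation relies on more sophisticated measures such as optical-flow-based warping error.

```python
import cv2
import numpy as np

def frame_difference_score(video_path):
    """Average absolute pixel change between consecutive frames (lower = steadier)."""
    cap = cv2.VideoCapture(video_path)
    prev, diffs = None, []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
        if prev is not None:
            diffs.append(float(np.mean(np.abs(gray - prev))))
        prev = gray
    cap.release()
    return float(np.mean(diffs)) if diffs else 0.0
```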
Visual quality assessments evaluated sharpness, color fidelity, and absence of generation artifacts. The model maintained quality comparable to or better than existing methods while improving instruction adherence.
The benchmarks included both global and local editing tasks across diverse video types. Performance remained consistent across different editing categories, suggesting the training data's breadth supports generalization.
Future Development Directions
Several research directions could extend Ditto's capabilities and address remaining limitations. The researchers and broader community will likely explore these areas.
Longer video support would enable editing of complete scenes or short films rather than just clips. This requires architectural changes to handle extended temporal dependencies and possibly hierarchical processing approaches.
Interactive refinement capabilities would allow users to provide feedback and iterate on results conversationally. Rather than generating a single output from an instruction, the system could support back-and-forth adjustments until the desired result is achieved.
Multi-modal conditioning that combines text instructions with sketches, reference images, or other guidance modalities might improve control and precision. This would give creators multiple ways to specify intent, using whichever communication method best captures their vision.
Real-time or near-real-time processing would expand use cases into interactive applications and live workflows. Current processing times suit batch workflows but limit real-time creative exploration.
Integration with traditional editing software through plugins or extensions would make the technology more accessible within existing creative workflows. Rather than requiring separate applications, instruction-based editing could become another tool in professional software packages.
Practical Advice for Filmmakers
If you want to experiment with instruction-based video editing using Editto or similar technologies, several practices improve results.
Start with clear, specific instructions. Vague commands like "make it better" produce unpredictable results. Descriptive instructions like "add warm golden hour lighting" or "create a cyberpunk aesthetic with neon highlights" work more reliably.
Test different phrasings of the same intent. Language models interpret instructions based on training data patterns. Trying multiple ways to describe the same edit often reveals which formulations produce better results.
Break complex modifications into sequential steps rather than combining everything into one instruction. Apply a style transfer first, then make object-level modifications, then adjust lighting. This sequential approach often yields more controlled results than attempting simultaneous changes, as the sketch below illustrates.
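A simple way to script that staging, again assuming a hypothetical edit_video wrapper around your inference setup, is to feed each stage's output into the next:

```python
stages = [
    "apply a warm 1970s film look",
    "replace the sedan with a pickup truck",
    "add soft golden-hour lighting",
]

clip = "source_clip.mp4"
for i, instruction in enumerate(stages, start=1):
    clip = edit_video(clip, instruction, output=f"stage_{i}.mp4")  # assumed signature
```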
Maintain realistic expectations about precision. The technology excels at applying general modifications but may struggle with pixel perfect control over specific details. Use it for broad strokes and refinements rather than expecting surgical precision.
Combine with traditional tools for best results. Use instruction-based editing for rapid exploration and major modifications, then employ conventional editing software for fine-tuning and critical adjustments where precise control matters.
Conclusion
Ditto and Editto represent measurable progress in instruction-based video editing. By addressing training data scarcity through systematic synthetic data generation, the researchers enabled a model that follows text instructions more reliably than previous systems.
The technology offers practical value for filmmakers and content creators who need to explore visual treatments efficiently or apply modifications at scale. While limitations remain, the improvements in temporal coherence and instruction-following accuracy make the system more viable for real-world workflows than earlier approaches.
The open release of the dataset, model weights, and code enables continued development by the broader research community. As the technology matures through further research and additional training data, instruction-based video editing will likely become a standard tool in content creation pipelines.
For AI filmmakers, this development signals that text-based control over video is becoming increasingly practical. The ability to modify footage through natural-language instructions reduces technical barriers and accelerates creative iteration, supporting more efficient workflows from previsualization through final delivery.
Explore how AI video tools can enhance your creative workflow at our AI Video Generator, and stay informed about emerging technologies like Ditto that expand what's possible in AI assisted filmmaking.
Resources:
- Project Page: https://editto.net
- Code Repository: https://github.com/EzioBy/Ditto
- Research Paper: https://huggingface.co/papers/2510.15742
- Dataset: https://huggingface.co/datasets/QingyanBai/Ditto-1M


