
URSA: AI Video Model Generates Any Resolution Without Retraining

October 29, 2025

Most AI video models lock you into fixed resolutions. Want vertical video for social media? You need a model trained specifically for that aspect ratio. Need widescreen cinematic footage? That requires a different model entirely. URSA changes this by generating video at any resolution or aspect ratio from a single unified system.

The Beijing Academy of Artificial Intelligence (BAAI) released URSA as an open-source model that handles everything from square Instagram posts to ultra-wide cinematic frames without requiring separate training runs or resolution-specific fine-tuning. The system uses what the researchers call "block-wise attention" to process videos of arbitrary dimensions while maintaining quality comparable to resolution-specific models.

The Resolution Problem

Current text-to-video models train on data at specific resolutions. A model trained at 1024x576 produces that exact output dimension. If you need 720x1280 for vertical video, 1920x1080 for standard HD, or 2048x864 for cinema-wide content, you either need a different model or must accept degraded quality from resizing.

This limitation creates practical problems for filmmakers and content creators. Social platforms demand different aspect ratios: Instagram favors 1:1 squares or 9:16 vertical, YouTube wants 16:9 horizontal, and cinema screens use 2.39:1 ultra-wide. Producing content for multiple platforms means either maintaining separate model versions or compromising on quality through cropping and scaling.

The technical reason for this constraint lies in how diffusion transformers process video data. These models use positional encodings that assume specific spatial dimensions. The attention mechanisms and positional information become coupled to training resolution, preventing effective generation at other sizes.

Previous attempts to solve this typically involve training multiple model versions at different resolutions or using techniques like progressive growing where models start at low resolution and gradually increase. Both approaches require extensive computational resources and don't provide true arbitrary resolution capability.

How URSA Achieves Universal Resolution

URSA addresses the resolution constraint through two architectural innovations: block-wise attention and decoupled positional embeddings. These changes allow the model to process videos of any dimension without the positional encoding issues that limit other systems.

Block-wise attention divides video data into fixed-size blocks rather than processing entire frames through single attention operations. Each block maintains the same internal dimensions regardless of overall video size. With the 32x32-pixel blocks described in the architecture section below, a 1024x576 frame divides into a 32x18 grid and a 1920x1080 frame into a 60x34 grid (padding any partial edge), but each individual block processes identically.

This block-based approach provides two benefits. First, it decouples attention computation from absolute spatial dimensions, since the attention operations only see fixed-size blocks. Second, it enables computational efficiency by processing blocks in parallel rather than requiring attention across entire frames.

The positional encoding system uses relative positions within blocks rather than absolute positions in frames. Instead of encoding that a pixel sits at position (512, 288) in a 1024x576 frame, the system encodes its relative position within its local block. This makes positional information resolution-agnostic.
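To make this concrete, here is a minimal PyTorch sketch of the idea as described above: frames split into fixed-size blocks, each block attends only over itself, and positions are encoded relative to the block rather than the frame, so the same module handles any resolution whose sides are block multiples. The block size, embedding width, and head count here are illustrative assumptions, not URSA's published hyperparameters.

```python
import torch
import torch.nn as nn

class BlockWiseAttention(nn.Module):
    """Dense self-attention applied independently inside fixed-size blocks.
    Illustrative sketch only; dimensions are assumptions, not URSA's."""

    def __init__(self, dim=256, heads=4, block=8):
        super().__init__()
        self.block = block  # block edge length, in latent patches
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # One learned embedding per position *within* a block, so positional
        # information is independent of the overall frame resolution.
        self.rel_pos = nn.Parameter(torch.zeros(block * block, dim))

    def forward(self, x):
        # x: (batch, height, width, dim); height and width must be
        # multiples of the block size (pad beforehand otherwise).
        b, h, w, d = x.shape
        s = self.block
        # Rearrange the frame into a batch of independent blocks:
        # (batch * num_blocks, block*block, dim).
        x = x.view(b, h // s, s, w // s, s, d).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(-1, s * s, d)
        x = x + self.rel_pos  # same encoding for every block, any resolution
        out, _ = self.attn(x, x, x)
        # Restore the original spatial layout.
        out = out.reshape(b, h // s, w // s, s, s, d).permute(0, 1, 3, 2, 4, 5)
        return out.reshape(b, h, w, d)

# The same module handles any block-multiple resolution without retraining:
layer = BlockWiseAttention()
for h, w in [(32, 56), (72, 40), (64, 64)]:   # 16:9-ish, vertical, square
    assert layer(torch.randn(2, h, w, 256)).shape == (2, h, w, 256)
```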

Training happens at a base resolution, but the architectural design ensures that learned patterns transfer to other resolutions. The model learns relationships between elements in blocks, and those relationships remain valid when blocks represent portions of differently sized frames.

Aspect Ratio Flexibility

Beyond absolute resolution, URSA handles arbitrary aspect ratios without quality degradation. The same model generates vertical 9:16 mobile content, horizontal 16:9 standard video, and ultra-wide 21:9 cinematic footage with equal effectiveness.

This aspect ratio flexibility matters for practical production workflows. Content creators no longer need to decide on target aspect ratio before generation or maintain separate models for different platforms. A single generation pass can produce the exact dimensions needed for any distribution channel.

The system maintains composition quality across aspect ratios. Wide shots in 21:9 use horizontal space effectively without awkward empty areas. Portrait 9:16 compositions frame subjects appropriately for vertical viewing. Square 1:1 content balances visual elements within constrained space.

Testing across common aspect ratios shows consistent quality. Standard 16:9 at 1920x1080 matches the visual fidelity of models trained specifically at that resolution. Vertical 9:16 at 720x1280 produces clean, mobile-ready content. Ultra-wide 2.39:1 at 2048x858 delivers cinematic framing without artifacts.

The aspect ratio capability extends beyond standard formats. Unconventional ratios like 4:3 for a vintage aesthetic, 2:1 for certain cinema formats, or custom dimensions for specific installations all generate without issues. This flexibility supports creative applications beyond standard platform requirements.
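For anyone scripting generations across platforms, target dimensions for an arbitrary ratio are easy to derive: fix a pixel budget, solve for height and width, and snap both to multiples of the block size (32 px, per the architecture section below). A small helper, with the pixel budget and rounding policy as my own assumptions:

```python
def dims_for_aspect(ratio_w, ratio_h, pixel_budget=1024 * 576, multiple=32):
    """Width/height for a target aspect ratio under a fixed pixel budget,
    rounded to block-size multiples. Budget and rounding are illustrative."""
    height = (pixel_budget * ratio_h / ratio_w) ** 0.5
    width = height * ratio_w / ratio_h
    snap = lambda v: max(multiple, round(v / multiple) * multiple)
    return snap(width), snap(height)

for label, (rw, rh) in {"16:9": (16, 9), "9:16": (9, 16),
                        "1:1": (1, 1), "2.39:1": (2.39, 1)}.items():
    print(label, dims_for_aspect(rw, rh))
# 16:9 (1024, 576) | 9:16 (576, 1024) | 1:1 (768, 768) | 2.39:1 (1184, 512)
```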

Performance Across Resolutions

Benchmark testing compares URSA against resolution-specific models at various dimensions. Results demonstrate that universal resolution capability doesn't sacrifice quality compared to models trained for single resolutions.

At 1024x576, URSA performs comparably to models specifically trained at that resolution on metrics including motion quality, temporal coherence, and prompt adherence. Visual inspection shows similar detail levels and artifact patterns to resolution-specific baselines.

Higher resolutions like 1920x1080 maintain quality without the degradation that typically occurs when using models beyond their training resolution. Text remains readable, fine details preserve clarity, and motion stays smooth. The block-wise processing prevents the quality loss that happens with naive resolution scaling.

Lower or unconventional resolutions also work effectively. Generating 640x640 square content or 512x896 extreme vertical formats produces clean results without the distortion or quality issues that arise when forcing content into non-native dimensions.

The computational cost scales reasonably with resolution. Processing time increases roughly linearly with pixel count rather than exhibiting the super-linear scaling that some architectures show. A 1920x1080 generation takes approximately 3.5x longer than 1024x576, matching the increased pixel count (1920×1080 / 1024×576 ≈ 3.5).
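That figure is simple to verify, assuming cost tracks pixel count as the benchmarks suggest:

```python
base = 1024 * 576      # 589,824 pixels
full_hd = 1920 * 1080  # 2,073,600 pixels
print(full_hd / base)  # 3.515625 -> roughly 3.5x the generation time
```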

Video Duration Capabilities

URSA generates videos up to 5 seconds at 24fps, providing 120 frames of content per generation. This duration proves sufficient for many practical applications while remaining computationally feasible for consumer and professional hardware.

Five-second clips cover common content needs including social media posts, B-roll footage, establishing shots, reaction moments, and product demonstrations. While this is longer than earlier text-to-video models that maxed out at 2-3 seconds, complete scenes still require multiple generations or different approaches.

The 120-frame count provides enough temporal context for smooth motion and meaningful action. Subjects can move through space, expressions can change naturally, and camera movements can complete meaningful trajectories within this window.

Temporal consistency remains strong across the full 5-second duration. Characters and objects maintain appearance from first frame to last. Background environments stay stable. Camera motion appears smooth without sudden jumps or inconsistencies.

The model handles both static shots with subtle motion and dynamic sequences with significant action. A person speaking to camera maintains consistent lip sync and facial features. A car driving through a scene moves smoothly with appropriate motion blur. Environmental changes like shifting light or moving clouds progress naturally.

Motion Quality and Physical Realism

Motion synthesis represents one of the most challenging aspects of video generation. URSA demonstrates solid motion quality with realistic physics for common scenarios while showing limitations in complex physical interactions.

Human motion captures natural movement patterns. Walking appears fluid with proper weight transfer and limb coordination. Facial expressions change smoothly without discontinuous jumps. Hand gestures maintain proper articulation and timing.

Camera motion produces cinematically appropriate results. Pans sweep smoothly across scenes. Dolly movements push in or pull out with consistent velocity. Static shots maintain proper stability without drift or jitter.

Object motion follows basic physical expectations. Falling objects accelerate appropriately. Moving vehicles maintain consistent speed and direction. Flowing water shows natural fluid dynamics.

Complex interactions show more limitations. Detailed hand manipulation of objects, rapid athletic movements, or intricate mechanical processes may produce artifacts or implausible physics. The model performs best with moderate complexity motion rather than highly detailed physical simulations.

Text Rendering Capabilities

Text rendering in video generation typically produces garbled or inconsistent results. URSA shows improved text handling, generating readable text in many scenarios though with continuing limitations in complex cases.

Simple text elements render clearly. Single words or short phrases on signs, screens, or graphic overlays appear legible and maintain consistency across frames. Font selection shows variety with both serif and sans-serif options appearing appropriately.

Longer text passages work less reliably. Extended sentences or paragraphs may show character substitutions, spacing issues, or inconsistent rendering across frames. The model performs best with short text elements rather than extensive written content.

Text integration into scenes respects perspective and placement. Text on building facades shows appropriate foreshortening and viewing angle. Screen displays render at correct scale and position within the frame. Graphic overlays maintain proper compositing with background elements.

Dynamic text showing motion or transformation presents additional challenges. Scrolling text, animated titles, or transforming letters may lose consistency. Static text works more reliably than animated text elements.

For filmmaking applications requiring text, these capabilities support simpler use cases like establishing location with place names, showing product names in commercial content, or creating title cards. Complex, text-heavy graphics still require traditional graphic design integration.

Prompt Understanding and Control

URSA processes natural language prompts to generate corresponding video content. The system demonstrates understanding of subject matter, visual style, camera movement, and compositional elements described in text.

Subject descriptions translate effectively to visual content. Prompting for specific people, animals, objects, or environments produces appropriate results. Character descriptions including clothing, appearance, and actions render as specified.

Style terminology works intuitively. Phrases like "cinematic lighting," "documentary style," or "animated aesthetic" influence the visual treatment appropriately. Color palette descriptions affect overall tone and mood.

Camera movement prompts guide how shots unfold. Requesting "slow zoom in," "pan across scene," or "static wide shot" produces corresponding camera motion. The model understands basic cinematographic terminology without requiring technical specifications.

Compositional instructions shape framing and layout. Describing foreground and background elements, requesting specific angles, or specifying subject positioning within frame influences composition as intended.

Limitations appear with highly specific or technical requirements. Precise timing of events, exact camera movements with specific parameters, or intricate scene choreography may not follow prompts exactly. The model works best with clear but not overly constrained descriptions.
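To illustrate the sweet spot the section describes, here is the shape of a prompt that layers subject, style, camera movement, and composition without over-constraining timing. The commented generation call is hypothetical; check the repository for the real entry point and parameter names.

```python
# Prompt structure following the guidance above: subject, style, camera
# movement, composition. The generation call is hypothetical -- consult
# the URSA repository for the actual entry point and parameters.
prompt = (
    "A fisherman repairing a net on a wooden dock at dawn, "      # subject
    "documentary style, soft golden-hour light, muted palette, "  # style
    "slow push-in from a wide shot, "                             # camera
    "boats softly blurred in the background"                      # composition
)
# video = model.generate(prompt, width=1920, height=1080, frames=120, fps=24)
```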

Comparing URSA to Commercial Models

URSA positions itself among open-source alternatives to commercial text-to-video platforms. Understanding comparative strengths helps determine appropriate use cases.

Against models like Runway Gen-3 or Pika 2.0, URSA offers the primary advantage of resolution flexibility. Commercial platforms typically generate at fixed resolutions, requiring post-generation scaling for different aspect ratios. URSA's universal resolution capability eliminates this workflow friction.

Visual quality at standard resolutions shows competitive results. URSA generates content comparable to commercial offerings in terms of detail, coherence, and realism. Specific advantages and limitations depend on content type and prompt complexity; neither side shows a consistent quality edge.

Motion quality compares favorably for moderate complexity scenarios. Commercial models may show advantages in highly dynamic or physically complex scenes. URSA performs well for common content needs including dialogue, moderate action, and standard camera movements.

The open-source nature provides benefits commercial platforms don't offer. Users can modify the model, integrate it into custom pipelines, fine-tune it for specific styles, and deploy without per-generation costs or usage restrictions. This flexibility matters for production workflows requiring customization or integration.

Commercial platforms offer advantages in ease of use, infrastructure, and support. They provide polished interfaces, handle computational requirements, and offer customer service. URSA requires more technical expertise for deployment and use.

Technical Architecture Details

Understanding URSA's architecture helps developers integrate the system or researchers extend its capabilities. The model builds on diffusion transformer foundations with specific modifications enabling universal resolution.

The block division strategy segments frames into 32x32-pixel blocks. This block size provides granular enough division for attention mechanisms while remaining computationally efficient. Each frame divides into a grid of blocks that attention operations process independently.

Cross-block attention mechanisms handle relationships between blocks. While within-block attention operates densely, cross-block attention uses sparse patterns to maintain consistency without full pairwise connections between all blocks. This sparsity enables efficient processing of high-resolution content.
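The article doesn't spell out URSA's exact sparsity pattern, but a common instantiation restricts each block to its spatial neighborhood. A sketch of such a mask, offered as one plausible scheme rather than the paper's:

```python
import torch

def neighbor_block_mask(rows, cols, radius=1):
    """Boolean attention mask over a rows x cols grid of blocks: each block
    may attend to blocks within `radius` grid steps. One plausible sparse
    pattern -- the article does not specify URSA's exact scheme."""
    r = torch.arange(rows)
    c = torch.arange(cols)
    rr, cc = torch.meshgrid(r, c, indexing="ij")
    pos = torch.stack([rr.flatten(), cc.flatten()], dim=1)  # (N, 2)
    diff = (pos[:, None, :] - pos[None, :, :]).abs()        # (N, N, 2)
    return (diff <= radius).all(dim=-1)                     # (N, N) bool

mask = neighbor_block_mask(18, 32)  # block grid for a 1024x576 frame
print(mask.float().mean())          # fraction of block pairs attended: ~1.5%
```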

Positional encoding uses learned embeddings that represent relative positions within blocks rather than absolute frame positions. This approach makes positional information resolution-agnostic since relative positions within blocks remain consistent regardless of overall frame dimensions.

The temporal dimension uses standard 3D attention mechanisms that extend spatial attention across time. Temporal attention operates at full density across the 120-frame sequence to maintain motion coherence and temporal consistency.

Training uses a curriculum that exposes the model to various resolutions during the learning process. Rather than training exclusively at one resolution, the training set includes examples at different dimensions, helping the model learn resolution-invariant features.
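One standard way to realize such a curriculum is aspect-ratio bucketing: group clips by shape so each batch is dimensionally uniform while the epoch as a whole spans many resolutions. The bucket list and sampling policy below are illustrative, not URSA's published recipe:

```python
import random
from collections import defaultdict

# Illustrative resolution buckets; URSA's actual training mix is not public.
BUCKETS = [(1024, 576), (576, 1024), (768, 768), (1280, 544)]

def bucket_batches(clips, batch_size=4):
    """Group (clip_id, width, height) records by nearest bucket so every
    batch has one shape, while successive batches span many shapes."""
    by_bucket = defaultdict(list)
    for clip_id, w, h in clips:
        best = min(BUCKETS, key=lambda b: abs(w / h - b[0] / b[1]))
        by_bucket[best].append(clip_id)
    batches = []
    for bucket, ids in by_bucket.items():
        random.shuffle(ids)
        for i in range(0, len(ids), batch_size):
            batches.append((bucket, ids[i:i + batch_size]))
    random.shuffle(batches)  # interleave shapes across the epoch
    return batches
```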

Training Data and Dataset Considerations

URSA trains on diverse video datasets spanning multiple resolutions and aspect ratios. The training strategy specifically targets universal resolution capability through exposure to varied dimensional formats.

Video sources include standard stock footage, user-generated content, and professionally produced media. This diversity ensures the model encounters various cinematographic styles, subject matter, and production qualities during training.

The resolution distribution in training data intentionally covers common output formats. Training includes 16:9 horizontal content at multiple resolutions, vertical mobile format videos, square social media content, and ultra-wide cinematic footage. This distribution helps the model learn to handle different aspect ratios effectively.

Quality filtering ensures training data meets minimum standards for resolution, motion smoothness, and visual clarity. Low quality or heavily compressed content is filtered out to prevent the model from learning artifacts or degradation patterns.

Caption quality receives particular attention. Each training video pairs with descriptive text that accurately describes visual content, motion, style, and compositional elements. High quality captions improve prompt adherence in generated content.

The dataset size reaches scales typical of current text-to-video models, with millions of video-text pairs. The exact dataset size remains undisclosed but falls within the range needed for competitive performance with commercial systems.

Practical Applications for Filmmakers

URSA's universal resolution capability solves real production problems across various content creation scenarios. Understanding these applications helps identify where the technology provides practical value.

Social media content creation benefits directly from aspect ratio flexibility. Producing vertical Stories for Instagram, horizontal posts for YouTube, and square content for feed posts no longer requires separate model versions or quality-degrading crops. Generate at target dimensions from the start.

Multi-platform distribution becomes more efficient. Create content once at each platform's optimal resolution rather than generating at one resolution and adapting. A single workflow produces platform-specific versions without quality compromise.

Cinema and broadcast work can generate previsualization or concept material at appropriate aspect ratios. Wide cinematic 2.39:1 footage for feature film previs, or standard 16:9 for television content, both generate from the same system.

Commercial production for various media benefits from resolution flexibility. Produce vertical content for mobile advertising, horizontal for web placement, and square for certain social platforms, all from consistent source generation.

B-roll and establishing shot creation gains efficiency when resolution matches final delivery requirements. Generate stock-style footage at target resolution rather than shooting or licensing separate versions for different aspect ratio needs.

Computational Requirements and Performance

Running URSA requires understanding hardware demands and performance characteristics. The model's efficiency impacts practical deployment for various user types.

GPU memory requirements scale with output resolution. Generating 1024x576 content uses approximately 16GB of VRAM. Higher resolutions like 1920x1080 require proportionally more memory, reaching 40-50GB for full HD generation. This puts high-resolution generation beyond consumer GPUs for now.

Generation time varies with resolution and hardware. A 1024x576, 5-second video takes roughly 2-3 minutes on an A100 GPU. Higher resolutions extend generation time proportionally. Consumer GPUs like the RTX 4090 can handle lower resolutions but struggle with full HD generation.

The block-wise architecture provides better scalability than naive full-frame attention. Computational complexity grows linearly with resolution rather than quadratically, making high-resolution generation more feasible than with standard transformer architectures.
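A back-of-the-envelope comparison shows why, assuming 16-pixel latent patches and 8x8-patch blocks (both illustrative numbers, not URSA's):

```python
def full_attention_cost(tokens):
    return tokens ** 2                  # all-pairs interactions

def block_attention_cost(tokens, block_tokens=64):
    blocks = tokens / block_tokens      # e.g. 8x8-patch blocks
    return blocks * block_tokens ** 2   # dense only within each block

for w, h in [(1024, 576), (1920, 1080)]:
    t = (w // 16) * (h // 16)           # tokens, assuming 16px latent patches
    print((w, h), full_attention_cost(t), block_attention_cost(t))
# ~3.5x the pixels costs ~12x with full attention but only ~3.5x with
# block-wise attention (sparse cross-block terms add a modest extra, omitted).
```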

Batch processing allows generating multiple videos efficiently. The model can process several prompts in parallel if memory permits, improving throughput for production workflows requiring many generations.

Cloud deployment options exist for users without local GPU resources. Services like Hugging Face Spaces host URSA with accessible interfaces, though with usage limits and queue times during high demand.

Fine-Tuning and Customization

The open-source nature enables fine-tuning for specific visual styles, subject matter, or production requirements. Understanding customization approaches helps advanced users adapt URSA.

Style fine-tuning adjusts the model to prefer particular aesthetic treatments. Training on curated datasets of specific visual styles, from documentary realism to animated looks, shifts the model's default output toward that style.

Subject-specific fine-tuning improves generation of particular content types. Fine-tuning on product shots improves commercial content generation. Training on nature footage enhances landscape and wildlife content. Character-focused training benefits narrative content with consistent character appearance.

Technical fine-tuning can emphasize particular camera techniques, lighting styles, or compositional approaches. Training on specific cinematographic styles transfers those characteristics to generated content.

The fine-tuning process requires modest computational resources compared to full training. A few thousand examples and several hours on a single GPU can significantly shift model behavior toward target characteristics.
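In outline, such a fine-tune is a short continuation of diffusion training on the curated set. The sketch below shows the shape of that loop; `model`, `loader`, and `add_noise` are placeholders for URSA's actual network, your dataset, and the noise scheduler, and the hyperparameters are guesses rather than a published recipe.

```python
import torch
from itertools import cycle

def finetune(model, loader, add_noise, steps=2000, lr=1e-5, device="cuda"):
    """Placeholder loop: `model`, `loader` (yielding latent/condition
    pairs), and `add_noise` (the scheduler) come from the real codebase."""
    model.train().to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    batches = cycle(loader)
    for step in range(steps):
        latents, cond = [t.to(device) for t in next(batches)]
        noise = torch.randn_like(latents)
        t = torch.randint(0, 1000, (latents.size(0),), device=device)
        noisy = add_noise(latents, noise, t)   # scheduler-specific mixing
        pred = model(noisy, t, cond)           # model predicts the noise
        loss = torch.nn.functional.mse_loss(pred, noise)
        opt.zero_grad()
        loss.backward()
        opt.step()
```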

Integration fine-tuning adapts the model for pipeline compatibility. Adjusting output formats, metadata inclusion, or generation parameters makes URSA fit better within specific production workflows.

Limitations and Current Constraints

Understanding URSA's limitations helps set appropriate expectations and identify areas requiring alternative approaches or future development.

The 5-second duration limit constrains certain applications. Complete scenes, extended actions, or longer narrative moments require multiple generations or different tools. The model serves best for shorter clips and moments.

Complex motion and physical interactions show artifacts more frequently than simpler content. Detailed hand movements, intricate mechanical processes, or rapid athletic action may not render realistically. Moderate-complexity motion produces the best results.

Text rendering, while improved over earlier models, still struggles with longer passages or complex typography. Simple text elements work well, but extensive written content or intricate graphic design should use traditional tools.

Character consistency across separate generations remains limited. Multiple 5-second generations of the same character may show appearance variations. For content requiring strong character consistency, this constraint necessitates alternative approaches.

Camera control lacks the precision of traditional CGI or game engine rendering. While the model understands camera movement prompts, exact trajectories and timing require manual specification that text prompts can't provide.

Prompt interpretation shows variability. The same prompt may generate different interpretations across runs. This stochastic nature means multiple generations are often needed to achieve desired results.

Workflow Integration Strategies

Integrating URSA into production workflows requires understanding where it fits within existing processes and how it complements traditional tools.

Pre-production serves as a natural application area. Generate concept videos showing how scenes might look, test different visual approaches, or create references for crew. URSA supports visualization before committing to production resources.

Storyboarding evolution moves beyond static images. Generate short clips showing motion and timing rather than individual frames. This dynamic storyboarding provides clearer communication of intended results.

Previsualization for complex sequences benefits from quick generation. Test different camera angles, lighting conditions, or compositional approaches rapidly. Iterate on visual concepts before finalizing production plans.

B-roll generation provides supplementary footage for editing. Create establishing shots, cutaways, or transitional content that complements principal photography. Use generated content to fill gaps in coverage.

Placeholder creation for effects-heavy sequences gives editors working material. Generate temporary versions of scenes pending final effects work, allowing editors to establish pacing and transitions.

The technology works best complementing rather than replacing traditional production. Use URSA for rapid exploration and supplementary content while maintaining traditional approaches for hero shots and critical moments.

Open Source Advantages

URSA's availability as open-source software provides benefits beyond its technical capabilities. Understanding these advantages helps evaluate the system for various use cases.

Transparency allows understanding exactly how the model works. Researchers and developers can examine architecture details, training procedures, and implementation choices. This transparency supports academic use and enables informed decisions about deployment.

Customization enables adaptation to specific needs. Modify the architecture, adjust parameters, or fine-tune for particular styles without depending on vendors. This flexibility matters for specialized applications or production pipelines.

Integration into existing systems becomes straightforward when code is accessible. Build custom interfaces, connect to other tools, or embed within larger pipelines. Closed commercial systems limit integration to provided APIs.

Cost structure differs from commercial platforms. Open-source deployment incurs infrastructure costs but no per-generation fees or subscription charges. For high-volume generation, this can represent significant savings.

Community development benefits all users. Improvements, extensions, and bug fixes from the community enhance the system for everyone. This collaborative development model accelerates progress.

Privacy and control matter for sensitive applications. Self-hosted deployment ensures content remains on controlled infrastructure. No third parties see prompts or generated content.

Future Development Directions

Several research directions could extend URSA's capabilities and address current limitations. The open-source nature enables community contribution to these developments.

Extended duration beyond 5 seconds would support longer-form content. Architectural modifications enabling 10-second, 30-second, or minute-long generation would expand practical applications significantly.

Higher resolution support up to 4K or beyond would serve professional production requirements. The universal resolution architecture should extend to these dimensions with adequate computational resources.

Improved character consistency across generations would enable better narrative content. Mechanisms maintaining character appearance across multiple 5-second clips would support story-driven applications.

Enhanced motion quality for complex physical interactions would broaden use cases. Better handling of detailed hand movements, athletic actions, or intricate mechanical processes would reduce limitations.

Multi-shot awareness similar to HoloCine would combine narrative coherence with resolution flexibility. This hybrid approach could generate multi-shot sequences at arbitrary aspect ratios.

Real-time or near-real-time generation would enable interactive applications. Significant speedups through architecture optimization, quantization, or hardware acceleration could support live workflows.

Accessing and Using URSA

URSA is available through multiple channels supporting different user needs and technical capabilities. Understanding access options helps select appropriate deployment methods.

The GitHub repository at github.com/baaivision/URSA provides complete source code, model weights, and documentation. Developers can clone the repository, set up local environments, and run the model on their own hardware.

Hugging Face hosts both the model weights and an interactive demo space at huggingface.co/spaces/BAAI/nova-d48w1024-osp480. The demo interface allows testing URSA through a web browser without local installation.

The project website at bitterdhg.github.io/URSA_page provides comprehensive documentation, examples, and technical details. Video samples demonstrate capabilities across various resolutions and aspect ratios.

Installation requires Python environment setup with appropriate dependencies including PyTorch, transformers, and video processing libraries. The repository includes requirements files and setup instructions.
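Weights can be fetched with the standard Hugging Face tooling; the repository id below is a placeholder, so take the real one from the project's README.

```python
from huggingface_hub import snapshot_download

# Placeholder repo id -- use the weight repository named in the URSA README.
local_dir = snapshot_download(repo_id="BAAI/URSA")
print("weights downloaded to", local_dir)
```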

Running URSA locally requires GPUs with sufficient memory for target resolutions. Lower resolutions work on consumer hardware while higher resolutions need professional GPUs or cloud resources.

Cloud platforms like Google Colab or AWS provide alternative deployment options for users without local GPU access. These services offer temporary GPU instances suitable for experimentation or moderate use.

Conclusion

URSA solves a fundamental limitation of text-to-video generation through universal resolution capability. The ability to generate content at arbitrary dimensions and aspect ratios from a single model eliminates workflow friction and quality compromises that plague current approaches.

The block-wise attention architecture and decoupled positional encodings enable this flexibility without sacrificing generation quality. Performance at various resolutions compares favorably to models trained specifically for single dimensions.

For filmmakers and content creators, URSA provides practical tools for multi-platform production, previsualization at appropriate aspect ratios, and flexible content generation across distribution channels. The open-source availability enables customization and integration that commercial platforms don't support.

Current limitations around duration, motion complexity, and character consistency across generations indicate areas for future development. As the technology matures, these constraints will likely ease through continued research and community contribution.

The universal resolution paradigm that URSA introduces will likely influence future video generation systems. The ability to produce content at any dimension without specialized training represents a more efficient and flexible approach than maintaining separate models for different resolutions.

URSA demonstrates that architectural innovations can overcome constraints that previously seemed fundamental to video generation. As similar approaches address other limitations like duration, multi-shot consistency, and character continuity, AI-assisted video production will continue moving toward practical workflows supporting complete creative visions.

Explore our AI Video Generator to experiment with various text-to-video tools, and stay informed about developments like URSA that expand capabilities for AI filmmaking and content creation.
