InfinityStar: 10x Faster Video Generation Through Autoregressive Modeling

FoundationVision released InfinityStar on November 7, 2025. The unified spacetime autoregressive framework generates 720p video approximately 10 times faster than diffusion-based methods while scoring 83.74 on VBench.
Autoregressive Approach
InfinityStar uses discrete autoregressive modeling to jointly capture spatial and temporal dependencies. This differs from diffusion models by treating visual generation as sequence prediction, similar to language model architectures.
The system generates content incrementally. Each visual token depends on previously generated tokens, creating coherent sequences. For images, this happens spatially from left to right and top to bottom. For video, the process extends into the temporal dimension with frame-by-frame generation.
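To make the sequential dependency concrete, the sketch below shows a generic next-token sampling loop over discrete visual tokens. It is a minimal illustration of autoregressive generation in PyTorch, not InfinityStar's actual inference code; the model's forward signature and the decoding step are assumptions.

```python
import torch

@torch.no_grad()
def generate_visual_tokens(model, text_embedding, num_tokens, temperature=1.0):
    """Minimal autoregressive sampling loop: each new visual token is drawn
    conditioned on the text embedding and every previously generated token.
    The model forward signature here is hypothetical."""
    tokens = torch.empty(1, 0, dtype=torch.long)              # sequence generated so far
    for _ in range(num_tokens):
        logits = model(text_embedding, tokens)                 # predict the next-token distribution
        probs = torch.softmax(logits[:, -1] / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)   # sample one token
        tokens = torch.cat([tokens, next_token], dim=1)        # it becomes context for the next step
    return tokens                                              # decode to pixels with the visual tokenizer
```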
Architecture components:
- 8 billion parameters
- Flan-T5-XL text encoder
- FlexAttention for training acceleration
- Requires PyTorch 2.5.1 or higher
The spacetime pyramid modeling decomposes videos into sequential clips: static appearance is encoded in the first frame, and the remaining duration is distributed equally across the subsequent clips. This separation enables efficient processing of both spatial and temporal information.
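As a rough illustration of this decomposition, the sketch below splits a video tensor into a leading appearance clip plus equal-length follow-up clips. The clip lengths and frame count are made-up values for illustration, not the settings used by InfinityStar.

```python
import torch

def split_into_clips(video, first_clip_len=1, clip_len=8):
    """Decompose a (T, C, H, W) video into a leading clip carrying static
    appearance and equal-length clips carrying the remaining duration.
    Clip lengths here are illustrative placeholders."""
    first = video[:first_clip_len]                              # static appearance
    rest = video[first_clip_len:]
    clips = [rest[i:i + clip_len] for i in range(0, rest.shape[0], clip_len)]
    return [first] + clips

video = torch.randn(81, 3, 720, 1280)                   # e.g. an 81-frame 720p clip
print([c.shape[0] for c in split_into_clips(video)])    # [1, 8, 8, ..., 8]
```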
Performance Metrics
VBench evaluation measures five key dimensions: text-video consistency, visual quality, structural stability, motion effects, and frame aesthetics.
InfinityStar achieves 83.74, outperforming all autoregressive competitors and surpassing HunyuanVideo (83.24). The model accomplishes this while generating 5-second 720p video in approximately 58 seconds.
Speed comparison: 10× faster than leading diffusion-based methods without additional optimizations. This advantage stems from the autoregressive architecture's sequential prediction versus diffusion's iterative denoising.
Benchmark results:
- VBench score: 83.74
- Resolution support: 480p and 720p
- Video duration: 5 seconds (720p), 5-10 seconds (480p)
- Model size: ~35GB checkpoint
- First discrete autoregressive model for industrial-level 720p video
The performance demonstrates that discrete approaches can compete with continuous diffusion models in video generation tasks. Prior autoregressive attempts struggled with resolution and temporal coherence. InfinityStar overcomes these limitations through unified spacetime modeling.
Generation Capabilities
The unified architecture supports multiple generation modes without separate models.
Text-to-image: Sequential token prediction builds images from text descriptions. The model generates visual tokens that represent image regions, assembling complete scenes through autoregressive processing.
Text-to-video: Extends text-to-image with temporal autoregression. Each frame depends on previous frames and text conditioning, maintaining consistency across the sequence.
Image-to-video: Animates static images by using the input as the first frame. The model generates subsequent frames that maintain visual consistency with the source image while introducing motion.
Video continuation: Extends existing videos by treating them as initial frames. The system generates additional frames that preserve style and content continuity.
Long interactive video: 480p mode supports interactive generation with multiple prompts. Users can guide content evolution through sequential text inputs during generation.
The 720p model focuses on 5-second video at high quality. The 480p model offers flexibility with 5- and 10-second durations, and is optimized for image-to-video and video-to-video tasks rather than text-to-video.
Technical Implementation
The framework uses knowledge inheritance from continuous video tokenizers. This strategy addresses two challenges: training from scratch converges slowly, and image-pretrained weights are not optimized for video reconstruction.
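A generic version of this weight-inheritance pattern might look like the sketch below: copy whatever tensors match in name and shape from a pretrained continuous tokenizer checkpoint, and let the new discrete components train from scratch. The checkpoint format and module names are assumptions, not the paper's exact procedure.

```python
import torch

def inherit_from_continuous_tokenizer(discrete_tokenizer, continuous_ckpt_path):
    """Copy weights that match in name and shape from a pretrained continuous
    tokenizer checkpoint; newly added components (e.g. the quantizer) keep
    their fresh initialization. Illustrative only."""
    pretrained = torch.load(continuous_ckpt_path, map_location="cpu")
    own_state = discrete_tokenizer.state_dict()
    inherited = {
        name: tensor for name, tensor in pretrained.items()
        if name in own_state and own_state[name].shape == tensor.shape
    }
    own_state.update(inherited)
    discrete_tokenizer.load_state_dict(own_state)
    print(f"inherited {len(inherited)} of {len(own_state)} tensors")
```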
Stochastic Quantizer Depth alleviates imbalanced information distribution across scales during tokenizer training. This ensures comprehensive learning of visual details at different resolution levels.
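One plausible reading of the idea, sketched below under that assumption: randomly sample how many residual quantization scales to apply at each training step, so that coarse and fine scales all receive learning signal. The per-scale quantizer interface is hypothetical.

```python
import random

def quantize_with_stochastic_depth(latent, scale_quantizers):
    """Illustrative stochastic-depth residual quantization: sample a depth for
    this training step and apply only that many scales, so no single scale
    dominates training. `scale_quantizers` is a hypothetical list of per-scale
    quantizer callables returning (quantized, codes)."""
    depth = random.randint(1, len(scale_quantizers))   # sampled depth for this step
    residual, all_codes = latent, []
    for quantizer in scale_quantizers[:depth]:
        quantized, codes = quantizer(residual)          # quantize the current residual
        all_codes.append(codes)
        residual = residual - quantized                 # pass the remainder to the next scale
    return all_codes, residual
```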
Semantic Scales Repetition refines the predictions of early semantic scales within video sequences. The technique significantly enhances fine-grained details and complex motion in generated content.
The dual-stream-to-single-stream architecture processes video and text tokens independently before fusion. Early Transformer blocks let each modality learn its own modulation mechanisms without interference; later blocks concatenate the tokens for multimodal information integration.
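The sketch below shows the general layout under assumed, much smaller dimensions: modality-specific Transformer blocks first, then concatenation into shared blocks. Layer counts, widths, and module choices are placeholders rather than InfinityStar's configuration.

```python
import torch
import torch.nn as nn

class DualToSingleStream(nn.Module):
    """Illustrative dual-stream-to-single-stream stack: video and text tokens
    pass through separate Transformer blocks, then are concatenated and
    processed jointly. Dimensions and depths are placeholders."""
    def __init__(self, dim=512, heads=8, dual_layers=2, shared_layers=2):
        super().__init__()
        make_block = lambda: nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.video_blocks = nn.ModuleList(make_block() for _ in range(dual_layers))
        self.text_blocks = nn.ModuleList(make_block() for _ in range(dual_layers))
        self.shared_blocks = nn.ModuleList(make_block() for _ in range(shared_layers))

    def forward(self, video_tokens, text_tokens):
        for v_block, t_block in zip(self.video_blocks, self.text_blocks):
            video_tokens = v_block(video_tokens)    # video-only processing
            text_tokens = t_block(text_tokens)      # text-only processing
        fused = torch.cat([text_tokens, video_tokens], dim=1)   # fuse along the sequence axis
        for s_block in self.shared_blocks:
            fused = s_block(fused)                  # joint multimodal blocks
        return fused
```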
FlexAttention accelerates training by improving computational efficiency during attention operations. This mechanism enables faster processing of long sequences and reduces training time.
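For reference, generic FlexAttention usage in PyTorch 2.5+ looks like the snippet below, here with a simple causal mask; this is standard library usage, not InfinityStar's attention code, and the tensor shapes are arbitrary.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

# A mask predicate replaces a dense attention mask; FlexAttention can skip
# fully masked blocks, which is where the efficiency gain comes from.
def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

B, H, S, D = 1, 8, 1024, 64                      # arbitrary demo shapes
q, k, v = (torch.randn(B, H, S, D, device="cuda") for _ in range(3))
block_mask = create_block_mask(causal, B=B, H=H, Q_LEN=S, KV_LEN=S)
out = flex_attention(q, k, v, block_mask=block_mask)   # (B, H, S, D)
```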
Installation and Setup
The system requires specific computational resources and software versions.
Requirements:
- PyTorch 2.5.1 or higher (FlexAttention support)
- CUDA-compatible GPU with sufficient VRAM
- ~35GB storage for model checkpoint
- Python 3.8 or higher
Installation process involves cloning the repository, installing dependencies through pip, and downloading model checkpoints. The team provides comprehensive workflows for data organization, feature extraction, and training scripts.
Available models:
- 720p checkpoint for 5-second video generation
- 480p checkpoint for variable-length generation (5-10 seconds)
- Interactive generation checkpoint for 480p
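Assuming the checkpoints are distributed through Hugging Face, a download might look like the sketch below; the repository ids are hypothetical placeholders, so check the GitHub README for the actual locations and expect roughly 35GB per checkpoint.

```python
from huggingface_hub import snapshot_download

# Hypothetical repo ids -- verify the real checkpoint locations in the
# FoundationVision GitHub README before downloading (~35GB each).
CHECKPOINTS = {
    "720p": "FoundationVision/InfinityStar-720p",
    "480p": "FoundationVision/InfinityStar-480p",
}

local_dir = snapshot_download(repo_id=CHECKPOINTS["720p"])
print("checkpoint downloaded to", local_dir)
```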
Inference Modes
Three primary inference scripts handle different generation scenarios.
720p video generation (tools/infer_video_720p.py): Produces 5-second videos at 720p resolution. Because of high training costs, the released model is optimized for this specific duration. The script supports image-to-video by specifying an image path.
480p variable-length generation (tools/infer_video_480p.py): Creates videos of 5 or 10 seconds at 480p. Edit the generation_duration variable to specify the length. It is not optimized for text-to-video, but recommended for image-to-video and video-to-video modes. Supports video continuation by providing a video path.
480p long interactive generation (tools/infer_interact_480p.py): Enables interactive video creation. Provide a reference video and multiple prompts. The model generates content interactively based on sequential instructions.
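A thin Python wrapper around these scripts might look like the sketch below. The script paths come from the repository, but the command-line flags are invented placeholders; check each script's argument parser or the README for the real options.

```python
import subprocess
import sys

SCRIPTS = {
    "720p": "tools/infer_video_720p.py",           # 5-second 720p generation
    "480p": "tools/infer_video_480p.py",           # 5- or 10-second 480p generation
    "interactive": "tools/infer_interact_480p.py", # multi-prompt interactive mode
}

def run_inference(mode, prompt, image_path=None):
    """Dispatch to the appropriate inference script. The --prompt and
    --image_path flags are hypothetical placeholders."""
    cmd = [sys.executable, SCRIPTS[mode], "--prompt", prompt]
    if image_path:                                 # image-to-video: seed with a first frame
        cmd += ["--image_path", image_path]
    subprocess.run(cmd, check=True)

run_inference("720p", "a sailboat crossing a stormy sea at dusk")
```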
For filmmakers exploring AI video generation workflows, try AI FILMS Studio's video generation tools to experiment with multiple text-to-video and image-to-video models across different approaches.
Limitations and Considerations
The 720p model limits generation to 5 seconds due to training costs. Longer sequences require the 480p model, which offers flexibility at lower resolution with different quality characteristics.
Model size requires substantial storage. The 35GB checkpoint demands significant disk space and VRAM during inference, which may restrict deployment in resource-constrained environments.
The 480p model's lack of text-to-video optimization affects direct text prompting quality. Image-to-video and video-to-video modes perform better, requiring different workflows than pure text-based generation.
Training complexity poses barriers to finetuning. The system requires large-scale video corpora, substantial computational resources, and extended training time. Custom adaptation demands significant infrastructure investment.
The FlexAttention dependency restricts PyTorch versions. Older installations must upgrade to 2.5.1 or higher, potentially causing compatibility issues with existing codebases.
What This Means for Video Production
The autoregressive approach proves competitive with diffusion methods. Speed advantages create practical benefits for iterative workflows where rapid generation enables faster experimentation.
Sequential prediction mirrors language models, suggesting potential for unified architectures across modalities. This convergence could simplify future development by applying similar techniques to different generation tasks.
The 10× speed improvement changes iteration economics. Projects requiring multiple generation attempts benefit from reduced wait times, enabling more exploration within time constraints.
Open source availability under MIT license permits commercial use. Studios and developers can integrate the technology into production pipelines without licensing restrictions. The research community can build extensions and improvements.
Industrial 720p capability establishes autoregressive models as viable alternatives to diffusion. Previous autoregressive attempts struggled with resolution and quality. InfinityStar demonstrates these limitations aren't fundamental to the approach.
The unified architecture simplifies deployment. A single model supports multiple generation modes, reducing infrastructure complexity compared to maintaining separate systems for different tasks.
NeurIPS 2025 Oral acceptance indicates research community recognition. This validation suggests the approach represents meaningful advancement rather than incremental improvement.
Open Source and Licensing
Both code and models are released under the MIT License. Commercial use is permitted without restrictions.
The team provides complete training and inference code. Researchers can reproduce results, modify architectures, and extend capabilities. This transparency supports further development and adaptation.
Available resources:
- Complete inference implementation
- Training workflows and scripts
- Model checkpoints for 720p and 480p
- Data organization guidelines
- Feature extraction tools
The release strategy prioritizes accessibility. FoundationVision makes all components publicly available to foster research in efficient video generation.
Sources:
- Project Website: https://infinitystar.org/
- GitHub Repository: https://github.com/FoundationVision/InfinityStar
- ArXiv Paper: https://arxiv.org/abs/2511.04675


