InfinityStar: 10x Faster Video Generation Through Autoregressive Modeling

FoundationVision released InfinityStar on November 7, 2025. The unified spacetime autoregressive framework generates 720p video approximately 10 times faster than diffusion-based methods while scoring 83.74 on VBench.
Autoregressive Approach
InfinityStar uses discrete autoregressive modeling to jointly capture spatial and temporal dependencies. This differs from diffusion models by treating visual generation as sequence prediction, similar to language model architectures.
The system generates content incrementally. Each visual token depends on previously generated tokens, creating coherent sequences. For images, this happens spatially from left to right and top to bottom. For video, the process extends into the temporal dimension with frame-by-frame generation.
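To make the sequential dependency concrete, the sketch below shows a generic next-token sampling loop over discrete visual tokens. It is a minimal illustration of autoregressive generation in PyTorch, not InfinityStar's actual inference code; the model's forward signature and the decoding step are assumptions.

```python
import torch

@torch.no_grad()
def generate_visual_tokens(model, text_embedding, num_tokens, temperature=1.0):
    """Minimal autoregressive sampling loop: each new visual token is drawn
    conditioned on the text embedding and every previously generated token.
    The model forward signature here is hypothetical."""
    tokens = torch.empty(1, 0, dtype=torch.long)              # sequence generated so far
    for _ in range(num_tokens):
        logits = model(text_embedding, tokens)                 # predict the next-token distribution
        probs = torch.softmax(logits[:, -1] / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)   # sample one token
        tokens = torch.cat([tokens, next_token], dim=1)        # it becomes context for the next step
    return tokens                                              # decode to pixels with the visual tokenizer
```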
Architecture components:
- 8 billion parameters
- Flan-T5-XL text encoder
- FlexAttention for training acceleration
- Requires PyTorch 2.5.1 or higher
The spacetime pyramid modeling decomposes videos into sequential clips: static appearance is encoded in the first frame, and the remaining duration is distributed equally across the subsequent clips. This separation enables efficient processing of both spatial and temporal information.
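As a rough illustration of this decomposition, the sketch below splits a video tensor into a leading appearance clip plus equal-length follow-up clips. The clip lengths and frame count are made-up values for illustration, not the settings used by InfinityStar.

```python
import torch

def split_into_clips(video, first_clip_len=1, clip_len=8):
    """Decompose a (T, C, H, W) video into a leading clip carrying static
    appearance and equal-length clips carrying the remaining duration.
    Clip lengths here are illustrative placeholders."""
    first = video[:first_clip_len]                              # static appearance
    rest = video[first_clip_len:]
    clips = [rest[i:i + clip_len] for i in range(0, rest.shape[0], clip_len)]
    return [first] + clips

video = torch.randn(81, 3, 720, 1280)                   # e.g. an 81-frame 720p clip
print([c.shape[0] for c in split_into_clips(video)])    # [1, 8, 8, ..., 8]
```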
Performance Metrics
VBench evaluation measures five key dimensions: text-video consistency, visual quality, structural stability, motion effects, and frame aesthetics.
InfinityStar achieves 83.74, outperforming all autoregressive competitors and surpassing HunyuanVideo (83.24). The model accomplishes this while generating 5-second 720p video in approximately 58 seconds.
Speed comparison: 10× faster than leading diffusion-based methods without additional optimizations. This advantage stems from the autoregressive architecture's sequential prediction versus diffusion's iterative denoising.
Benchmark results:
- VBench score: 83.74
- Resolution support: 480p and 720p
- Video duration: 5 seconds (720p), 5-10 seconds (480p)
- Model size: ~35GB checkpoint
- First discrete autoregressive model for industrial-level 720p video
The performance demonstrates that discrete approaches can compete with continuous diffusion models in video generation tasks. Prior autoregressive attempts struggled with resolution and temporal coherence. InfinityStar overcomes these limitations through unified spacetime modeling.
Generation Capabilities
The unified architecture supports multiple generation modes without separate models.
Text-to-image: Sequential token prediction builds images from text descriptions. The model generates visual tokens that represent image regions, assembling complete scenes through autoregressive processing.
Text-to-video: Extends text-to-image with temporal autoregression. Each frame depends on previous frames and text conditioning, maintaining consistency across the sequence.
Image-to-video: Animates static images by using the input as the first frame. The model generates subsequent frames that maintain visual consistency with the source image while introducing motion.
Video continuation: Extends existing videos by treating them as initial frames. The system generates additional frames that preserve style and content continuity.
Long interactive video: 480p mode supports interactive generation with multiple prompts. Users can guide content evolution through sequential text inputs during generation.
The 720p model focuses on 5-second video at high quality. The 480p model offers flexibility with 5- and 10-second durations, and is optimized for image-to-video and video-to-video tasks rather than text-to-video.
Technical Implementation
The framework uses knowledge inheritance from continuous video tokenizers. This strategy addresses two challenges: training from scratch converges slowly, and image-pretrained weights are not optimized for video reconstruction.
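A generic version of this weight-inheritance pattern might look like the sketch below: copy whatever tensors match in name and shape from a pretrained continuous tokenizer checkpoint, and let the new discrete components train from scratch. The checkpoint format and module names are assumptions, not the paper's exact procedure.

```python
import torch

def inherit_from_continuous_tokenizer(discrete_tokenizer, continuous_ckpt_path):
    """Copy weights that match in name and shape from a pretrained continuous
    tokenizer checkpoint; newly added components (e.g. the quantizer) keep
    their fresh initialization. Illustrative only."""
    pretrained = torch.load(continuous_ckpt_path, map_location="cpu")
    own_state = discrete_tokenizer.state_dict()
    inherited = {
        name: tensor for name, tensor in pretrained.items()
        if name in own_state and own_state[name].shape == tensor.shape
    }
    own_state.update(inherited)
    discrete_tokenizer.load_state_dict(own_state)
    print(f"inherited {len(inherited)} of {len(own_state)} tensors")
```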
Stochastic Quantizer Depth alleviates imbalanced information distribution across scales during tokenizer training. This ensures comprehensive learning of visual details at different resolution levels.
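One plausible reading of the idea, sketched below under that assumption: randomly sample how many residual quantization scales to apply at each training step, so that coarse and fine scales all receive learning signal. The per-scale quantizer interface is hypothetical.

```python
import random

def quantize_with_stochastic_depth(latent, scale_quantizers):
    """Illustrative stochastic-depth residual quantization: sample a depth for
    this training step and apply only that many scales, so no single scale
    dominates training. `scale_quantizers` is a hypothetical list of per-scale
    quantizer callables returning (quantized, codes)."""
    depth = random.randint(1, len(scale_quantizers))   # sampled depth for this step
    residual, all_codes = latent, []
    for quantizer in scale_quantizers[:depth]:
        quantized, codes = quantizer(residual)          # quantize the current residual
        all_codes.append(codes)
        residual = residual - quantized                 # pass the remainder to the next scale
    return all_codes, residual
```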
Semantic Scales Repetition refines the predictions of early semantic scales within video sequences. The technique significantly enhances fine-grained details and complex motion in generated content.
The dual-stream-to-single-stream architecture processes video and text tokens independently before fusion. Early Transformer blocks let each modality learn its own modulation mechanisms without interference; later blocks concatenate the tokens for multimodal information integration.
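The sketch below shows the general layout under assumed, much smaller dimensions: modality-specific Transformer blocks first, then concatenation into shared blocks. Layer counts, widths, and module choices are placeholders rather than InfinityStar's configuration.

```python
import torch
import torch.nn as nn

class DualToSingleStream(nn.Module):
    """Illustrative dual-stream-to-single-stream stack: video and text tokens
    pass through separate Transformer blocks, then are concatenated and
    processed jointly. Dimensions and depths are placeholders."""
    def __init__(self, dim=512, heads=8, dual_layers=2, shared_layers=2):
        super().__init__()
        make_block = lambda: nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.video_blocks = nn.ModuleList(make_block() for _ in range(dual_layers))
        self.text_blocks = nn.ModuleList(make_block() for _ in range(dual_layers))
        self.shared_blocks = nn.ModuleList(make_block() for _ in range(shared_layers))

    def forward(self, video_tokens, text_tokens):
        for v_block, t_block in zip(self.video_blocks, self.text_blocks):
            video_tokens = v_block(video_tokens)    # video-only processing
            text_tokens = t_block(text_tokens)      # text-only processing
        fused = torch.cat([text_tokens, video_tokens], dim=1)   # fuse along the sequence axis
        for s_block in self.shared_blocks:
            fused = s_block(fused)                  # joint multimodal blocks
        return fused
```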
FlexAttention accelerates training by improving computational efficiency during attention operations. This mechanism enables faster processing of long sequences and reduces training time.
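For reference, generic FlexAttention usage in PyTorch 2.5+ looks like the snippet below, here with a simple causal mask; this is standard library usage, not InfinityStar's attention code, and the tensor shapes are arbitrary.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

# A mask predicate replaces a dense attention mask; FlexAttention can skip
# fully masked blocks, which is where the efficiency gain comes from.
def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

B, H, S, D = 1, 8, 1024, 64                      # arbitrary demo shapes
q, k, v = (torch.randn(B, H, S, D, device="cuda") for _ in range(3))
block_mask = create_block_mask(causal, B=B, H=H, Q_LEN=S, KV_LEN=S)
out = flex_attention(q, k, v, block_mask=block_mask)   # (B, H, S, D)
```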
Installation and Setup
The system requires specific computational resources and software versions.
Requirements:
- PyTorch 2.5.1 or higher (FlexAttention support)
- CUDA-compatible GPU with sufficient VRAM
- ~35GB storage for model checkpoint
- Python 3.8 or higher
Installation process involves cloning the repository, installing dependencies through pip, and downloading model checkpoints. The team provides comprehensive workflows for data organization, feature extraction, and training scripts.
Available models:
- 720p checkpoint for 5-second video generation
- 480p checkpoint for variable-length generation (5-10 seconds)
- Interactive generation checkpoint for 480p
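Assuming the checkpoints are distributed through Hugging Face, a download might look like the sketch below; the repository ids are hypothetical placeholders, so check the GitHub README for the actual locations and expect roughly 35GB per checkpoint.

```python
from huggingface_hub import snapshot_download

# Hypothetical repo ids -- verify the real checkpoint locations in the
# FoundationVision GitHub README before downloading (~35GB each).
CHECKPOINTS = {
    "720p": "FoundationVision/InfinityStar-720p",
    "480p": "FoundationVision/InfinityStar-480p",
}

local_dir = snapshot_download(repo_id=CHECKPOINTS["720p"])
print("checkpoint downloaded to", local_dir)
```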
Inference Modes
Three primary inference scripts handle different generation scenarios.
720p video generation (tools/infer_video_720p.py): Produces 5-second videos at 720p resolution. Because of high training costs, the released model is optimized for this specific duration. The script supports image-to-video by specifying an image path.
480p variable-length generation (tools/infer_video_480p.py): Creates videos of 5 or 10 seconds at 480p. Edit the generation_duration variable to specify the length. It is not optimized for text-to-video, but recommended for image-to-video and video-to-video modes. Supports video continuation by providing a video path.
480p long interactive generation (tools/infer_interact_480p.py): Enables interactive video creation. Provide a reference video and multiple prompts. The model generates content interactively based on sequential instructions.
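A thin Python wrapper around these scripts might look like the sketch below. The script paths come from the repository, but the command-line flags are invented placeholders; check each script's argument parser or the README for the real options.

```python
import subprocess
import sys

SCRIPTS = {
    "720p": "tools/infer_video_720p.py",           # 5-second 720p generation
    "480p": "tools/infer_video_480p.py",           # 5- or 10-second 480p generation
    "interactive": "tools/infer_interact_480p.py", # multi-prompt interactive mode
}

def run_inference(mode, prompt, image_path=None):
    """Dispatch to the appropriate inference script. The --prompt and
    --image_path flags are hypothetical placeholders."""
    cmd = [sys.executable, SCRIPTS[mode], "--prompt", prompt]
    if image_path:                                 # image-to-video: seed with a first frame
        cmd += ["--image_path", image_path]
    subprocess.run(cmd, check=True)

run_inference("720p", "a sailboat crossing a stormy sea at dusk")
```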
For filmmakers exploring AI video generation workflows, try AI FILMS Studio's video generation tools to experiment with multiple text-to-video and image-to-video models across different approaches.
Limitations and Considerations
The 720p model limits generation to 5 seconds due to training costs. Longer sequences require the 480p model, which offers flexibility at lower resolution with different quality characteristics.
Model size requires substantial storage. The 35GB checkpoint demands significant disk space and VRAM during inference, which may restrict deployment in resource-constrained environments.
The 480p model's lack of text-to-video optimization affects direct text prompting quality. Image-to-video and video-to-video modes perform better, requiring different workflows than pure text-based generation.
Training complexity poses barriers to finetuning. The system requires large-scale video corpora, substantial computational resources, and extended training time. Custom adaptation demands significant infrastructure investment.
The FlexAttention dependency restricts PyTorch versions. Older installations must upgrade to 2.5.1 or higher, potentially causing compatibility issues with existing codebases.
What This Means for Video Production
The autoregressive approach proves competitive with diffusion methods. Speed advantages create practical benefits for iterative workflows where rapid generation enables faster experimentation.
Sequential prediction mirrors language models, suggesting potential for unified architectures across modalities. This convergence could simplify future development by applying similar techniques to different generation tasks.
The 10× speed improvement changes iteration economics. Projects requiring multiple generation attempts benefit from reduced wait times, enabling more exploration within time constraints.
Open source availability under MIT license permits commercial use. Studios and developers can integrate the technology into production pipelines without licensing restrictions. The research community can build extensions and improvements.
Industrial 720p capability establishes autoregressive models as viable alternatives to diffusion. Previous autoregressive attempts struggled with resolution and quality. InfinityStar demonstrates these limitations aren't fundamental to the approach.
The unified architecture simplifies deployment. A single model supports multiple generation modes, reducing infrastructure complexity compared to maintaining separate systems for different tasks.
NeurIPS 2025 Oral acceptance indicates research community recognition. This validation suggests the approach represents meaningful advancement rather than incremental improvement.
Open Source and Licensing
Both code and models are released under the MIT License. Commercial use is permitted without restrictions.
The team provides complete training and inference code. Researchers can reproduce results, modify architectures, and extend capabilities. This transparency supports further development and adaptation.
Available resources:
- Complete inference implementation
- Training workflows and scripts
- Model checkpoints for 720p and 480p
- Data organization guidelines
- Feature extraction tools
The release strategy prioritizes accessibility. FoundationVision makes all components publicly available to foster research in efficient video generation.
Sources:
- Project Website: https://infinitystar.org/
- GitHub Repository: https://github.com/FoundationVision/InfinityStar
- ArXiv Paper: https://arxiv.org/abs/2511.04675


