SwiftI2V: 2K Image-to-Video Generation at 202x Lower Cost
Researchers at HKUST have published SwiftI2V, a framework that generates native 2K image-to-video on a single GPU in roughly two minutes. The core claim: 202 times fewer GPU seconds than comparable end-to-end methods, at the same output quality.
The paper was submitted to NeurIPS 2026 by a team from HKUST, CUHK, HKU, Joy Future Academy, and Huawei Research. Code and model weights are not yet released, but the project page and arXiv preprint detail the approach in full.
The Cost Problem with 2K Video
Generating video at 2K resolution (2560x1408) with existing methods is expensive. CineScale, a leading end-to-end image-to-video model, requires 22,400 GPU seconds per 81-frame clip at that resolution. That is roughly 6.2 GPU hours per video.
The alternative, a video super-resolution pipeline, is faster but introduces different tradeoffs: upscaling separately from generation tends to drift away from the original input image, reducing background consistency and fine detail.
SwiftI2V targets a third path. It matches end-to-end quality while cutting generation time to 111 seconds on a single H800 GPU.
Two Stage Architecture
SwiftI2V separates motion planning from detail synthesis across two stages.
Stage I generates a low resolution (360P) motion reference using a large Diffusion Transformer backbone. This stage decides how the scene moves and changes over time, operating at a resolution where computation is cheap.
Stage II takes that motion reference and the original input image, then synthesizes the full 2K output. The core innovation is Conditional Segment-wise Generation (CSG). Instead of processing all 81 frames as one block, CSG divides the temporal sequence into bounded segments. Anchor blocks derived from the input image and neighboring blocks attend bidirectionally within each processing window, keeping temporal coherence without rigid autoregressive constraints.
The practical result: memory stays bounded regardless of total frame count, enabling generation of up to 241 frames with near-linear time scaling.
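The paper describes CSG at the level of anchor blocks and attention windows; no reference code has been released. The sketch below illustrates only the control flow, and every name in it (SEGMENT_LEN, encode_anchor, denoise_segment) is a placeholder rather than the authors' API.

```python
# Illustrative sketch of Conditional Segment-wise Generation (CSG).
# All names here are hypothetical placeholders; the paper describes the
# scheme conceptually, and the official implementation may differ.
from typing import Callable, List

SEGMENT_LEN = 16  # assumed number of frames per processing window


def generate_csg(input_image,
                 motion_ref_frames: List,
                 encode_anchor: Callable,
                 denoise_segment: Callable) -> List:
    """Synthesize high-resolution frames one bounded segment at a time.

    input_image       -- the original still image, conditioning every segment
    motion_ref_frames -- low-resolution Stage I frames (the motion reference)
    encode_anchor     -- turns the input image into an anchor block
    denoise_segment   -- Stage II call: bidirectional attention within one
                         window, conditioned on the anchor and the neighboring
                         (previously generated) block
    """
    anchor = encode_anchor(input_image)
    outputs, prev_segment = [], None

    # Peak memory depends on SEGMENT_LEN, not on the total frame count,
    # so runtime grows roughly linearly with the number of segments.
    for start in range(0, len(motion_ref_frames), SEGMENT_LEN):
        ref = motion_ref_frames[start:start + SEGMENT_LEN]
        segment = denoise_segment(ref, anchor=anchor, neighbor=prev_segment)
        outputs.extend(segment)
        prev_segment = segment  # becomes the neighboring block for the next window

    return outputs
```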
SwiftI2V output at 2K resolution
Image-to-video from a static photograph
Benchmark Results
SwiftI2V achieves the highest total VBench score (6.4244) among methods tested at 2K resolution.
| Model | VBench Total | Dynamic Degree | GPU Time (s) |
|---|---|---|---|
| SwiftI2V | 6.4244 | 0.3008 | 111 |
| CineScale | 6.3638 | 0.1667 | 22,400 |
| LTX-2 | 6.3579 | 0.0488 | ~160 |
| Stream-DiffVSR | 6.4240 | 0.3374 | N/A |
The comparison with LTX-2 is worth noting. SwiftI2V runs 1.37 times faster and produces roughly six times more dynamic output (dynamic degree 0.3008 vs. 0.0488), while staying competitive on image fidelity metrics.
Against CineScale, SwiftI2V delivers the same quality with a 202 times reduction in compute. That gap comes from Stage I operating at 360P, with Stage II handling spatial detail only.
Background consistency reaches 0.9975, the highest among tested models. Video super-resolution approaches, which generate low-resolution content and then upscale, score 0.9933. Explicit image conditioning in Stage II preserves input detail that post-hoc upsampling loses.
2K I2V Gallery
Diverse subjects, including human portraits, animals, landscapes, and dynamic scenes
Running SwiftI2V Locally
Code and model checkpoints have not yet been released. The GitHub repository currently hosts the project page only. Monitor the repository for the official release.
When the code ships, the expected setup will follow the standard pattern for Diffusion Transformer-based video models.
Expected minimum hardware:
- GPU: NVIDIA RTX 4090 (24GB VRAM)
- System RAM: 32GB recommended
- CUDA 12.x, Python 3.10+
Expected installation steps:

    git clone https://github.com/hkust-longgroup/SwiftI2V
    cd SwiftI2V
    pip install -r requirements.txt
Expected inference workflow:
Stage I generates a 360P motion clip from the input image. Stage II uses that clip together with the original image to produce the full 2K output. The paper reports 4-step sampling across both stages combined. A minimal inference call will likely resemble:
    python inference.py \
        --input_image path/to/image.jpg \
        --output_dir ./outputs \
        --num_frames 81 \
        --resolution 2560x1408
The team plans to release both Stage I and Stage II checkpoints. Exact commands will be confirmed in the README at code release.
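If the released code follows the two-stage split described in the paper, chaining those checkpoints might look roughly like the sketch below. The class interfaces (stage1.generate, stage2.generate), the 640x360 Stage I resolution, and the argument names are assumptions for illustration, not the published API.

```python
# Hypothetical wrapper around the two SwiftI2V stages; every class and
# method name below is a placeholder until the official code is released.
from PIL import Image


def swift_i2v(image_path: str, stage1, stage2, num_frames: int = 81):
    """Stage I plans motion at 360P; Stage II renders the native 2K frames."""
    image = Image.open(image_path).convert("RGB")

    # Stage I: cheap low-resolution motion reference (decides how the scene moves)
    motion_ref = stage1.generate(image, num_frames=num_frames, resolution=(640, 360))

    # Stage II: conditional segment-wise synthesis at 2K, conditioned on both
    # the motion reference and the original input image
    return stage2.generate(image, motion_ref, resolution=(2560, 1408))
```

The 4-step sampling reported in the paper would be a setting inside each stage; nothing in this sketch depends on it.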
Open Source and License
The GitHub repository is public. As of May 2026, no license file has been published and no model checkpoints are available for download.
The author team comes from academic institutions with Huawei Research as a collaborator. License terms and commercial use conditions will become clear when the code ships.
The paper and project page contain full technical details on the architecture and training strategy in the meantime.
For image-to-video generation available now, AI FILMS Studio offers models including LTX-2, Kling, and Wan with no installation required. For a broader look at how researchers are approaching variable resolution video, see URSA: AI Video Model Generates Any Resolution Without Retraining.
Sources
- arXiv: SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation
- Project Page: hkust-longgroup.github.io/SwiftI2V
- GitHub: hkust-longgroup/SwiftI2V