SwiftI2V: 2K Image-to-Video Generation at 202x Lower Cost
Researchers at HKUST have published SwiftI2V, a framework that generates native 2K image-to-video on a single GPU in roughly two minutes. The core claim: 202 times fewer GPU seconds than comparable end-to-end methods, at the same output quality.
The paper was submitted to NeurIPS 2026 by a team from HKUST, CUHK, HKU, Joy Future Academy, and Huawei Research. Code and model weights are not yet released, but the project page and arXiv preprint detail the approach in full.
The Cost Problem with 2K Video
Generating video at 2K resolution (2560x1408) with existing methods is expensive. CineScale, a leading end-to-end image-to-video model, requires 22,400 GPU seconds per 81-frame clip at that resolution. That is roughly 6.2 GPU hours per video.
The alternative, a video super-resolution pipeline, is faster but introduces different tradeoffs: upscaling separately from generation tends to drift away from the original input image, reducing background consistency and fine detail.
SwiftI2V targets a third path. It matches end-to-end quality while cutting generation time to 111 seconds on a single H800 GPU.
Two Stage Architecture
SwiftI2V separates motion planning from detail synthesis across two stages.
Stage I generates a low resolution (360P) motion reference using a large Diffusion Transformer backbone. This stage decides how the scene moves and changes over time, operating at a resolution where computation is cheap.
Stage II takes that motion reference and the original input image, then synthesizes the full 2K output. The core innovation is Conditional Segment-wise Generation (CSG). Instead of processing all 81 frames as one block, CSG divides the temporal sequence into bounded segments. Anchor blocks derived from the input image and neighboring blocks attend bidirectionally within each processing window, keeping temporal coherence without rigid autoregressive constraints.
The practical result: memory stays bounded regardless of total frame count, enabling generation of up to 241 frames with near-linear time scaling.
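The paper describes CSG at the level of anchor blocks and attention windows; no reference code has been released. The sketch below illustrates only the control flow, and every name in it (SEGMENT_LEN, encode_anchor, denoise_segment) is a placeholder rather than the authors' API.

```python
# Illustrative sketch of Conditional Segment-wise Generation (CSG).
# All names here are hypothetical placeholders; the paper describes the
# scheme conceptually, and the official implementation may differ.
from typing import Callable, List

SEGMENT_LEN = 16  # assumed number of frames per processing window


def generate_csg(input_image,
                 motion_ref_frames: List,
                 encode_anchor: Callable,
                 denoise_segment: Callable) -> List:
    """Synthesize high-resolution frames one bounded segment at a time.

    input_image       -- the original still image, conditioning every segment
    motion_ref_frames -- low-resolution Stage I frames (the motion reference)
    encode_anchor     -- turns the input image into an anchor block
    denoise_segment   -- Stage II call: bidirectional attention within one
                         window, conditioned on the anchor and the neighboring
                         (previously generated) block
    """
    anchor = encode_anchor(input_image)
    outputs, prev_segment = [], None

    # Peak memory depends on SEGMENT_LEN, not on the total frame count,
    # so runtime grows roughly linearly with the number of segments.
    for start in range(0, len(motion_ref_frames), SEGMENT_LEN):
        ref = motion_ref_frames[start:start + SEGMENT_LEN]
        segment = denoise_segment(ref, anchor=anchor, neighbor=prev_segment)
        outputs.extend(segment)
        prev_segment = segment  # becomes the neighboring block for the next window

    return outputs
```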
SwiftI2V output at 2K resolution
Image-to-video from a static photograph
Benchmark Results
SwiftI2V achieves the highest total VBench score (6.4244) among methods tested at 2K resolution.
| Model | VBench Total | Dynamic Degree | GPU Time (s) |
|---|---|---|---|
| SwiftI2V | 6.4244 | 0.3008 | 111 |
| CineScale | 6.3638 | 0.1667 | 22,400 |
| LTX-2 | 6.3579 | 0.0488 | ~160 |
| Stream-DiffVSR | 6.4240 | 0.3374 | N/A |
The comparison with LTX-2 is worth noting. SwiftI2V runs 1.37 times faster and produces roughly six times more dynamic output (dynamic degree 0.3008 vs. 0.0488), while staying competitive on image fidelity metrics.
Against CineScale, SwiftI2V delivers the same quality with a 202 times reduction in compute. That gap comes from Stage I operating at 360P, with Stage II handling spatial detail only.
Background consistency reaches 0.9975, the highest among tested models. Video super-resolution approaches, which generate low-resolution content and then upscale, score 0.9933. Explicit image conditioning in Stage II preserves input detail that post-hoc upsampling loses.
2K I2V Gallery
Diverse subjects, including human portraits, animals, landscapes, and dynamic scenes
Running SwiftI2V Locally
Code and model checkpoints have not yet been released. The GitHub repository currently hosts the project page only. Monitor the repository for the official release.
When the code ships, the expected setup will follow the standard pattern for Diffusion Transformer-based video models.
Expected minimum hardware:
- GPU: NVIDIA RTX 4090 (24GB VRAM)
- System RAM: 32GB recommended
- CUDA 12.x, Python 3.10+
Expected installation steps:

    git clone https://github.com/hkust-longgroup/SwiftI2V
    cd SwiftI2V
    pip install -r requirements.txt
Expected inference workflow:
Stage I generates a 360P motion clip from the input image. Stage II uses that clip together with the original image to produce the full 2K output. The paper reports 4-step sampling across both stages combined. A minimal inference call will likely resemble:
    python inference.py \
        --input_image path/to/image.jpg \
        --output_dir ./outputs \
        --num_frames 81 \
        --resolution 2560x1408
The team plans to release both Stage I and Stage II checkpoints. Exact commands will be confirmed in the README at code release.
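If the released code follows the two-stage split described in the paper, chaining those checkpoints might look roughly like the sketch below. The class interfaces (stage1.generate, stage2.generate), the 640x360 Stage I resolution, and the argument names are assumptions for illustration, not the published API.

```python
# Hypothetical wrapper around the two SwiftI2V stages; every class and
# method name below is a placeholder until the official code is released.
from PIL import Image


def swift_i2v(image_path: str, stage1, stage2, num_frames: int = 81):
    """Stage I plans motion at 360P; Stage II renders the native 2K frames."""
    image = Image.open(image_path).convert("RGB")

    # Stage I: cheap low-resolution motion reference (decides how the scene moves)
    motion_ref = stage1.generate(image, num_frames=num_frames, resolution=(640, 360))

    # Stage II: conditional segment-wise synthesis at 2K, conditioned on both
    # the motion reference and the original input image
    return stage2.generate(image, motion_ref, resolution=(2560, 1408))
```

The 4-step sampling reported in the paper would be a setting inside each stage; nothing in this sketch depends on it.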
Open Source and License
The GitHub repository is public. As of May 2026, no license file has been published and no model checkpoints are available for download.
The author team comes from academic institutions with Huawei Research as a collaborator. License terms and commercial use conditions will become clear when the code ships.
The paper and project page contain full technical details on the architecture and training strategy in the meantime.
For image-to-video generation available now, AI FILMS Studio offers models including LTX-2, Kling, and Wan with no installation required. For a broader look at how researchers are approaching variable resolution video, see URSA: AI Video Model Generates Any Resolution Without Retraining.
Sources
- arXiv: SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation
- Project Page: hkust-longgroup.github.io/SwiftI2V
- GitHub: hkust-longgroup/SwiftI2V