SCAIL-2 Animates Any Character From a Driving Video Without Skeleton Extraction

June 10, 2026

Z.ai (Zhipu AI) released SCAIL-2 on June 9, 2026. It is an end-to-end character animation model that transfers motion from any driving video to any reference character in a single model pass. It requires no skeleton estimation, no pose extraction, and no other structured intermediate representation at any stage. The model is released under Apache 2.0, which permits commercial use.

SCAIL-2 handles four animation scenarios with one model: single-character animation, cross-identity replacement, multi-character scenes, and zero-shot generalization to animals and nonstandard figures. The paper was published on arXiv and the weights on HuggingFace on the same day, June 9.

SCAIL-2 teaser: multi-character motion transfer across a group scene from a single driving video

Latent Conditioning, Not Skeleton Extraction

Conventional character animation models work in two stages: extract skeleton keypoints from the reference character, then map motion from a driving video onto those keypoints. That approach fails when the reference is a 2D illustration, a cartoon, or an animal, because the keypoint extractor assumes a humanoid body structure it can parse and remap. When the body is stylized, nonstandard, or nonhumanoid, keypoint extraction either fails or produces distorted results.
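
To make that failure mode concrete, here is a toy Python sketch of a skeleton-based retargeting step. It is not taken from any particular baseline. The joint names borrow from the COCO keypoint convention, while `retarget`, `SchemaMismatch`, and the hip-centred offset rule are hypothetical simplifications invented for illustration.

```python
from typing import Dict, Tuple

Point = Tuple[float, float]

# Illustrative humanoid schema (joint names follow the COCO keypoint convention).
HUMANOID_JOINTS = (
    "nose", "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
)


class SchemaMismatch(ValueError):
    """Raised when a pose lacks joints that the humanoid schema requires."""


def hip_centre(pose: Dict[str, Point]) -> Point:
    """Midpoint between the two hip joints, used as the body origin."""
    (lx, ly), (rx, ry) = pose["left_hip"], pose["right_hip"]
    return ((lx + rx) / 2, (ly + ry) / 2)


def retarget(reference: Dict[str, Point], driving: Dict[str, Point]) -> Dict[str, Point]:
    """Toy retargeting: place the driving pose around the reference's hip centre."""
    for name, pose in (("reference", reference), ("driving", driving)):
        missing = [j for j in HUMANOID_JOINTS if j not in pose]
        if missing:
            # A cartoon, creature, or animal typically ends up here.
            raise SchemaMismatch(f"{name} lacks joints: {missing}")
    rcx, rcy = hip_centre(reference)
    dcx, dcy = hip_centre(driving)
    return {
        j: (rcx + driving[j][0] - dcx, rcy + driving[j][1] - dcy)
        for j in HUMANOID_JOINTS
    }
```

A pose dictionary for a fish, with joints such as `head`, `dorsal_fin`, and `tail`, never gets past the schema check. That is the structural limitation described above, reduced to a few lines.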

SCAIL-2 replaces that pipeline entirely. Motion from the driving video is encoded directly into latent conditions, and those conditions are applied to the reference character in latent space. No skeleton or pose keypoints are extracted at any stage. The model is trained on MotionPair-60K, a purpose-built dataset of 60,000 end-to-end motion transfer pairs. From those pairs it learns to transfer motion across styles, species, and body structures without any structured intermediate representation.

Single-Character Animation

The three comparison videos below show SCAIL-2 transferring motion from driving footage to reference characters. Each video compares SCAIL-2 output against a skeleton-based baseline on a different test sequence.

Single-character animation, challenging reference (SCAIL-2 vs. baseline)

Single-character animation comparison (SCAIL-2 vs. baseline)

Single-character animation, additional comparison sequence

The paper reports that SCAIL-2 "substantially outperforms existing state of the art approaches across all character animation tasks", including on temporally challenging sequences where skeleton-based methods accumulate error across frames. The model outputs video at 512p and 704p.

Cross-Identity Replacement and Multi-Character Scenes

Cross-identity replacement transfers the motion of one character onto the appearance of a different character. The replacement comparison video below shows SCAIL-2 swapping character appearance while preserving the motion from the driving sequence. The multi-character comparison shows the model handling simultaneous motion transfer across a group scene.

Cross-identity replacement: motion transferred to a different character's appearance

Multi-character scene comparison (SCAIL-2 vs. baseline)

Latent conditioning is what lets SCAIL-2 perform cross-identity replacement without explicit skeleton matching. Because the model does not depend on extracting keypoints from either the source or the target character, the two characters do not need to share the same body structure or proportions.

Zero-Shot Generalization

The zero-shot generalization video shows SCAIL-2 animating character types it was not explicitly trained on. The ground truth (GT) overlay compares SCAIL-2 output against the ground truth motion, showing how the model generalizes to animal figures and nonstandard body types.

Zero-shot generalization to nonstandard characters with ground truth overlay

This is where the skeleton-free approach matters most in practice. An indie production working with a creature design, a 2D graphic novel character, or illustrated source material can drive SCAIL-2 directly from live reference footage. No rigging pass is required. Skeleton-based methods require the reference to match a human body schema before motion transfer is possible, which makes them incompatible with any reference that does not conform to that schema.

MotionPair-60K and Benchmark Results

SCAIL-2 is trained on MotionPair-60K, a synthetic dataset of 60,000 motion transfer pairs built specifically for end-to-end motion transfer rather than single-frame pose matching. The distinction matters for temporal consistency. MotionPair-60K pairs full motion sequences, not isolated frames, which is why SCAIL-2 maintains coherence across movement and occlusion rather than drifting between frames.
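
The difference between sequence pairs and frame pairs can be sketched as a data structure. The dataset's real schema is not described here, so `MotionTransferPair`, its field names, and `as_frame_pairs` below are hypothetical. They only illustrate what "pairing full motion sequences" could mean in practice.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class MotionTransferPair:
    """Hypothetical sample: one reference image plus two frame-aligned sequences."""

    reference_image: str       # path to the reference character image
    driving_frames: List[str]  # driving-video frame paths, in temporal order
    target_frames: List[str]   # animated-character frame paths, same order

    def __post_init__(self) -> None:
        if not self.driving_frames:
            raise ValueError("a sequence pair needs at least one frame")
        if len(self.driving_frames) != len(self.target_frames):
            raise ValueError("driving and target sequences must be frame-aligned")


def as_frame_pairs(pair: MotionTransferPair) -> List[Tuple[str, str]]:
    """Flatten a sequence pair into isolated (driving, target) frame pairs.

    A single-frame pose-matching setup would train on items like these, each
    without access to its temporal neighbours.
    """
    return list(zip(pair.driving_frames, pair.target_frames))
```

A model trained on whole `MotionTransferPair` samples sees how each frame relates to the ones around it. A model trained on the flattened output of `as_frame_pairs` sees every frame in isolation.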

The model is released under Apache 2.0, which permits commercial use. Weights are available on HuggingFace, and inference code and installation instructions are in the GitHub repository.

For consistent character appearance across different shots, BindWeave addresses the appearance consistency layer that SCAIL-2 does not handle. For full character performance including dialogue, LongCat Avatar 1.5 handles lip sync and can be applied after SCAIL-2 drives the body motion. For 3D character creation as upstream input, 3DreamBooth generates 3D subjects from reference images before animation.
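
For readers who want to try the model locally, fetching the weights might look like the sketch below. It assumes the `huggingface_hub` package is installed and that the weights live under the repository id listed in the Sources section. The `checkpoints` folder layout and both helper names are arbitrary choices for this example, not something the release prescribes.

```python
from pathlib import Path

# Repository id as listed under Sources; assumed to host the released weights.
REPO_ID = "zai-org/SCAIL-2"


def default_weights_dir(root: str = "checkpoints") -> Path:
    """Local folder for the snapshot (an arbitrary layout chosen for this sketch)."""
    return Path(root) / REPO_ID.replace("/", "--")


def download_weights(root: str = "checkpoints") -> Path:
    """Download the full repository snapshot and return the local folder."""
    # Imported here so the path helper above works without the package installed.
    from huggingface_hub import snapshot_download

    target = default_weights_dir(root)
    snapshot_download(repo_id=REPO_ID, local_dir=str(target))
    return target
```

Calling `download_weights()` would place the files under `checkpoints/zai-org--SCAIL-2`. The inference entry points themselves are documented in the GitHub repository and are not reproduced here.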

Filmmakers working on character animation can build these workflows in the AI FILMS Studio video workspace.


Sources

arXiv: SCAIL-2: End-to-End Character Animation Without Pose and Skeleton Estimation
Project Page: teal024.github.io/SCAIL-2
GitHub: zai-org/SCAIL-2
HuggingFace: zai-org/SCAIL-2