Real-Time AI World Generation: Tencent's HY-World 1.5 Runs at 24 FPS

On December 17, 2025, Tencent's Hunyuan AI team released HY-World 1.5 (WorldPlay), the first open-source world model that achieves both real-time generation (24 FPS) and long-term geometric consistency. Unlike previous systems that force users to choose between speed and quality, WorldPlay delivers interactive video generation while maintaining scene coherence when revisiting locations—a fundamental breakthrough for AI filmmaking, game development, and virtual environment creation.
The release includes the complete training framework, inference code, and model weights. Built on HunyuanVideo 1.5, WorldPlay processes user keyboard and mouse inputs to generate streaming video in real time, making it possible to explore AI-generated worlds with the same fluidity as traditional 3D engines.
The Real-Time Consistency Problem
World models face a fundamental tradeoff. Fast generation systems like Oasis and Matrix-Game 2.0 achieve real-time speeds, but scenes change inconsistently when you return to previous locations. Memory-based systems like WorldMem and VMem maintain geometric consistency but require lengthy processing that prevents real-time interaction.
WorldPlay solves this by combining four technical innovations: dual action representation for precise control, reconstituted context memory for long-term consistency, reinforcement learning post-training for action accuracy, and context forcing distillation that enables real-time speeds without sacrificing memory capabilities.
The result: streaming video generation at 24 FPS with stable geometry across hundreds of frames. Scenes remain consistent when revisiting locations, camera movements stay accurate, and physics remain plausible—all while responding instantly to user input.
How WorldPlay Achieves Real-Time Consistency
Dual Action Representation
WorldPlay accepts both discrete keyboard inputs (W, A, S, D) and continuous camera poses. Discrete inputs enable physically plausible movement that adapts to scene scale—moving forward in a small room versus an open field feels naturally different. Continuous camera poses provide precise spatial coordinates for memory retrieval, ensuring the system can locate and recall exact viewing positions.
This dual approach solves a training stability problem. Using only discrete actions makes precise location memory difficult. Using only camera poses causes training instability due to scale variance in training data (some scenes are tiny rooms, others are vast landscapes). Combining both achieves robust control across diverse scenes while maintaining accurate spatial memory.
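To make the idea concrete, here is a minimal sketch of how the two control signals might be bundled per generation step. The field names, key set, and encoding are illustrative assumptions, not WorldPlay's actual interface.

```python
from dataclasses import dataclass
import numpy as np

# Illustrative sketch: field names and key layout are assumptions, not WorldPlay's API.
@dataclass
class DualAction:
    keys: set                 # discrete inputs, e.g. {"W"} or {"W", "A"}
    camera_pose: np.ndarray   # continuous camera extrinsics (4x4 matrix)

    def discrete_vector(self) -> np.ndarray:
        """Binary encoding of keyboard state for the conditioning branch."""
        layout = ["W", "A", "S", "D"]
        return np.array([1.0 if k in self.keys else 0.0 for k in layout])

# Example: move forward while strafing left, with a known camera pose.
action = DualAction(keys={"W", "A"}, camera_pose=np.eye(4))
print(action.discrete_vector())  # [1. 1. 0. 0.]
```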
Reconstituted Context Memory
Rather than storing all past frames (computationally intractable) or keeping only a simple recent-frame history (which loses long-term consistency), WorldPlay dynamically rebuilds the memory context for each new frame generation. The system maintains two memory types:
Temporal memory comprises the most recent frames to ensure smooth motion continuity. Spatial memory samples from non-adjacent past frames based on geometric relevance—FOV overlap and camera distance determine which distant frames might be relevant to the current view.
The innovation is temporal reframing: instead of using absolute temporal indices that grow unbounded, WorldPlay reassigns positional encodings to "pull" important past frames closer in perceived time. A frame from 200 steps ago that's geometrically relevant gets treated as if it's recent, preventing the model's attention from weakening on distant but crucial context.
This enables robust extrapolation for long-term consistency while keeping computational costs manageable.
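A simplified sketch of that reconstitution step, assuming a plain list of past frames and camera positions: keep the last few frames as temporal memory, rank older frames by camera proximity as a stand-in for geometric relevance (the real system also uses FOV overlap), and reassign compact positional indices so distant but relevant frames appear recent to the attention layers.

```python
import numpy as np

def build_context(frames, cam_positions, current_position, n_recent=4, n_spatial=4):
    """Reconstitute the memory context for the next frame (illustrative sketch).

    Returns the selected frames plus a reassigned position index per frame,
    mimicking temporal reframing: absolute frame indices are discarded.
    """
    n = len(frames)
    recent_ids = list(range(max(0, n - n_recent), n))            # temporal memory
    older_ids = [i for i in range(n) if i < n - n_recent]

    # Spatial memory: older frames whose cameras sit closest to the current view.
    scored = sorted(older_ids,
                    key=lambda i: np.linalg.norm(np.asarray(cam_positions[i]) -
                                                 np.asarray(current_position)))
    spatial_ids = sorted(scored[:n_spatial])

    selected = spatial_ids + recent_ids
    reassigned = {frame_id: pos for pos, frame_id in enumerate(selected)}
    return [frames[i] for i in selected], reassigned
```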
WorldCompass Reinforcement Learning
Standard training relies on pixel-level supervision from raw video, which only implicitly teaches action following. This leads to performance plateaus in complex scenarios such as combined actions (moving forward while turning) or long-horizon interactions.
WorldCompass applies reinforcement learning specifically designed for autoregressive video generation. It introduces:
- Clip-level rollout strategy: Instead of full-trajectory rollouts (computationally expensive), the system generates shorter clips, which boosts efficiency while still providing granular reward signals.
- Complementary reward functions: Separate rewards for action-following accuracy and visual quality prevent reward hacking, where the model exploits one metric at the expense of the other.
- Efficient RL algorithm: DiffusionNFT guides the model toward desired behavior without the massive computational overhead typical of video generation RL.
The result is significantly improved action accuracy under complex conditions and reduced visual artifacts in challenging scenarios.
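As a rough illustration of the complementary-reward idea (not the paper's actual objective or the DiffusionNFT update), a clip-level reward might combine an action-following score with an independent visual-quality score so that neither term can be gamed in isolation:

```python
import numpy as np

def clip_reward(pred_positions, target_positions, quality_score,
                w_action=0.5, w_quality=0.5):
    """Combine action-following and visual-quality rewards for one rollout clip.

    pred_positions / target_positions: per-frame camera positions for the clip.
    quality_score: output of a separate visual-quality scorer in [0, 1] (placeholder).
    Weights are illustrative; the actual formulation is in the technical report.
    """
    errors = [np.linalg.norm(np.asarray(p) - np.asarray(t))
              for p, t in zip(pred_positions, target_positions)]
    action_reward = float(np.exp(-np.mean(errors)))  # approaches 1.0 when trajectories match
    return w_action * action_reward + w_quality * float(quality_score)
```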
Context Forcing Distillation
The final technical piece enables real-time generation. WorldPlay's memory-aware autoregressive student model learns from a bidirectional teacher model, but standard distillation fails because their memory contexts differ: the bidirectional teacher sees past and future, while the autoregressive student only sees the past.
Context forcing aligns their memory contexts during distillation. For frames the student generates in its own rollouts, the teacher's context is constructed by masking those same frames from the student's memory context. This alignment makes distribution matching effective, enabling real-time speeds (4 denoising steps instead of 50+) while preserving long-term consistency and mitigating error accumulation.
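A conceptual sketch of that alignment step, under the assumption that the memory context is a mapping from frame id to frame: the teacher's context is simply the student's context with the student's self-rollout frames masked out, so both models condition on the same information when the distillation loss is computed.

```python
def align_teacher_context(student_context, self_rollout_ids):
    """Build the teacher's context by masking the student's self-rollout frames.

    student_context: dict of frame_id -> frame tensor (illustrative layout).
    self_rollout_ids: ids of frames the student generated itself in this segment.
    With matched contexts, distribution matching between teacher and student is
    well posed despite the teacher being bidirectional.
    """
    masked = set(self_rollout_ids)
    return {fid: frame for fid, frame in student_context.items() if fid not in masked}
```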
Technical Performance
WorldPlay was trained on 320,000 curated video clips across four categories:
- AAA game recordings (170K clips, 53%): First- and third-person gameplay with complex agent behaviors and physics
- Real-world 3D (60K clips, 19%): DL3DV dataset with 3D reconstruction and simulated camera trajectories
- Synthetic 4D (50K clips, 16%): Unreal Engine renders with precise ground truth annotations
- Natural video (40K clips, 12.5%): Sekai dataset with dynamic motion and realistic interactions
The data underwent rigorous filtering: watermark detection, compression artifact removal, optical flow motion analysis, and camera trajectory validation. Dynamic frame rates enhance diversity across playback speeds and motion intensities.
Quantitative Results: In long-term consistency tests (250+ frames), WorldPlay achieves 18.94 PSNR (visual quality), 0.585 SSIM (structural similarity), 0.371 LPIPS (perceptual quality), 0.332 rotation error, and 0.797 translation error. These metrics significantly outperform baseline methods that sacrifice either consistency for speed or speed for consistency.
VBench evaluation shows superior performance across temporal consistency, motion smoothness, subject consistency, and aesthetic quality compared to Gen3C, ViewCrafter, GameCraft, and Matrix-Game 2.0.
Human evaluation reveals 72.9% preference over Gen3C, 92.1% over ViewCrafter, 78.4% over Matrix-Game 2.0, and 88.5% over GameCraft. Even against its own non-distilled bidirectional teacher model, WorldPlay achieves a 48.1% preference rate while running at real-time speeds.
AI Filmmaking Applications
Pre-Visualization and Concept Development
WorldPlay transforms how filmmakers approach pre-visualization. Instead of expensive pre-vis teams or static storyboards, directors can generate interactive worlds from concept art or text descriptions, then explore camera angles and blocking in real time.
The system supports both photorealistic and stylized aesthetics, making it suitable for everything from documentary-style realism to animated features. Directors can walk through generated environments at 24 FPS, testing different camera movements and identifying optimal shot compositions before physical production begins.
For independent filmmakers without pre-vis budgets, this democratizes access to tools previously available only to major studios. A solo director can generate and explore multiple location options, camera setups, and visual treatments in an afternoon—work that traditionally required weeks and tens of thousands of dollars.
Virtual Location Scouting
Filmmakers can generate specific environments—1920s jazz club, Icelandic lava fields, cyberpunk street market—then navigate through them interactively to assess lighting conditions, spatial relationships, and practical shooting constraints.
The first-person and third-person perspectives allow both director and cinematographer viewpoints. A DP can evaluate lens choices and lighting setups, while a director assesses blocking and actor movement within the space. The long-term consistency ensures that scouting decisions made in one area remain valid when revisiting it from different angles.
Because WorldPlay maintains geometric consistency, measurements and spatial relationships discovered during virtual scouting translate accurately to production design decisions. A director who finds an optimal camera height and distance during virtual exploration can communicate those exact specifications to their DP for physical production.
Dynamic Shot Generation and Camera Planning
WorldPlay's real-time generation enables dynamic camera planning. Cinematographers can execute complex camera movements (dolly shots, crane moves, tracking shots) and see the results instantly. The dual action representation ensures precise control, while the memory system maintains spatial consistency across the entire move.
The system supports promptable events triggered by text commands during generation. A director exploring a generated scene can type "car crash" or "sudden rain" to see how dramatic events affect the space and lighting. This capability allows real-time experimentation with narrative beats and their visual impact.
For action sequences, directors can choreograph movement and camera simultaneously. The reinforcement learning training ensures accurate action following even with complex combined movements—tracking a running character while panning and dollying—something previous systems struggled to execute correctly.
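As a sketch of how such a combined move could be scripted against the dual action interface, the snippet below builds a per-frame stream of keyboard state plus a camera yaw that increments each frame; the dictionary format is an assumption for illustration, not the repository's actual input schema.

```python
import numpy as np

def tracking_pan_actions(n_frames=48, yaw_per_frame_deg=0.5):
    """Script a combined move: push forward ('W') while panning right each frame."""
    actions = []
    yaw = 0.0
    for _ in range(n_frames):
        yaw += np.deg2rad(yaw_per_frame_deg)
        # Camera orientation as a rotation about the vertical axis.
        rotation = np.array([[np.cos(yaw), 0.0, np.sin(yaw)],
                             [0.0,         1.0, 0.0],
                             [-np.sin(yaw), 0.0, np.cos(yaw)]])
        actions.append({"keys": {"W"}, "rotation": rotation})
    return actions

plan = tracking_pan_actions()          # 48 frames = 2 seconds of motion at 24 FPS
print(len(plan), "action steps scripted")
```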
Background Plate Generation and Set Extension
Filmmakers shooting on green screen can use WorldPlay to generate consistent background plates. The system's long-term consistency ensures that backgrounds remain stable across multiple takes and camera angles—critical for believable compositing.
The third-person capability enables full-CG character animation previews within generated environments, helping directors visualize how animated or VFX characters will interact with spaces before expensive render time is committed.
3D Reconstruction and Asset Creation
WorldPlay's geometric consistency enables direct integration with 3D reconstruction pipelines. Filmmakers can generate multi-view observations of a space, then use reconstruction tools like WorldMirror to extract point clouds and 3D scene representations.
This workflow is particularly valuable for:
- Generating 3D reference models for VFX teams
- Creating virtual production volumes without physical construction
- Extracting specific props or architectural elements for detailed modeling
- Developing game environments from cinematic concepts
The reconstruction quality depends on WorldPlay's consistent geometry—inconsistent video makes reliable 3D reconstruction impossible. The system's longterm consistency specifically addresses this requirement.
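A minimal sketch of the export side of that workflow, assuming you already have generated frames and their camera poses in memory: write each view and its pose to disk so a downstream multi-view reconstruction tool can consume them. The directory layout and file names here are arbitrary choices, not a required format.

```python
import json
from pathlib import Path
import numpy as np

def export_views(frames, poses, out_dir="scout_export"):
    """Save generated frames and camera poses for a reconstruction pipeline.

    frames: list of image arrays; poses: list of 4x4 camera matrices.
    Produces frame_0000.npy, frame_0001.npy, ... plus a poses.json sidecar.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, frame in enumerate(frames):
        np.save(out / f"frame_{i:04d}.npy", np.asarray(frame))
    with open(out / "poses.json", "w") as f:
        json.dump([np.asarray(p).tolist() for p in poses], f, indent=2)
```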
Video Continuation and Extension
WorldPlay can extend existing footage with spatially and temporally consistent continuations. A filmmaker with a 10-second drone shot can use WorldPlay to generate additional footage that maintains the same motion, lighting, and visual style. This enables creative extensions of expensive or difficult-to-capture footage.
The system preserves appearance, lighting, and motion patterns, making extensions visually seamless. For establishing shots, aerial footage, or location sequences where reshoots are impractical, this capability offers significant production value.
Independent and Low Budget Production
For independent filmmakers, WorldPlay provides capabilities that level the playing field with studio productions:
- No location costs: Generate any environment without travel, permits, or location fees
- Instant iteration: Test dozens of visual approaches in the time a traditional scout takes
- Risk reduction: Preview complex shots before committing production resources
- Creative experimentation: Explore visual ideas that would be financially impractical to test physically
A microbudget feature can afford Hollywood-level pre-visualization. A documentary filmmaker can preview how narration and graphics will integrate with location footage. A music video director can test surreal visual concepts without building expensive practical sets.
Real-Time Creative Direction
The 24 FPS generation speed enables real-time creative collaboration. Directors and cinematographers can experiment with visual ideas during production meetings, seeing results instantly rather than waiting for renders. This immediacy accelerates creative decision-making and enables more iterative refinement.
For commercials and branded content with tight turnaround times, real-time generation compresses the concept-to-delivery timeline. A creative director can generate client presentation materials in hours rather than days, and changes based on client feedback can be implemented immediately.
Limitations and Practical Considerations
WorldPlay currently generates 16-frame chunks (approximately 0.67 seconds per chunk at 24 FPS). While chunks can be chained for longer sequences, filmmakers should plan around this chunk structure for optimal results.
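Shot length maps directly to chunk count, so planning is simple arithmetic; the helper below (illustrative, not part of the release) makes it explicit.

```python
import math

def chunks_for_shot(duration_s, fps=24, chunk_frames=16):
    """Number of 16-frame chunks needed to cover a shot of the given length."""
    total_frames = math.ceil(duration_s * fps)
    return math.ceil(total_frames / chunk_frames)

print(chunks_for_shot(10))  # a 10-second shot at 24 FPS needs 15 chunks
```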
The system requires specific hardware (GPUs with sufficient VRAM) for local deployment, though Tencent will likely offer cloud-based access. Generation quality depends on the training data distribution—scenarios far outside the training set may produce less reliable results.
Promptable events are powerful but require careful prompt engineering. Filmmakers should treat text-triggered events as creative starting points rather than precise specifications, similar to working with traditional AI image generation.
Open Source and Commercial Availability
HY-World 1.5 is fully open source under the Tencent Hunyuan Community License. The release includes:
- Complete training framework covering data processing, pre-training, middle-training, RL post-training, and distillation
- Inference code for both bidirectional and autoregressive models
- Model weights on Hugging Face
- Engineering optimizations for reduced latency
- Documentation and demo scripts
Commercial Use: Based on licensing patterns across Tencent's Hunyuan model family, commercial use is generally permitted with standard conditions:
- Free for projects under 100 million monthly active users
- Larger deployments require additional licensing from Tencent
- Cannot use outputs to train competing AI models (except other Hunyuan models)
- Must comply with acceptable use policies prohibiting harmful content
Filmmakers, game developers, and content creators can integrate WorldPlay into commercial projects, build tools and services around it, and deploy it for client work. The open-source release provides full transparency into the architecture and training, enabling custom modifications and domain-specific fine-tuning.
Dependencies: WorldPlay builds on HunyuanVideo 1.5 (Apache 2.0 license), requiring that base model for inference. The complete stack remains open source and commercially viable.
Hardware Requirements: Local deployment requires GPUs with substantial VRAM. The technical report documents optimizations including quantization, efficient attention mechanisms, and streaming deployment architectures to reduce computational requirements while maintaining quality.
GitHub and Model Access
- GitHub Repository: github.com/Tencent-Hunyuan/HY-WorldPlay
- Hugging Face Models: huggingface.co/tencent/HY-WorldPlay
- Technical Report: HYWorld_1.5_Tech_Report.pdf
- Project Website: 3d-models.hunyuan.tencent.com/world
The repository includes installation instructions, environment setup, model download scripts, and demo code for a quick start. Flash Attention integration is supported for faster inference and reduced memory consumption.
Industry Context and Competition
WorldPlay enters a competitive landscape where major players either keep models closed or release systems with significant limitations:
- OpenAI's Sora: Closed system, no real-time interaction, limited availability
- Google Veo 2: API-only access, no open weights, expensive at scale
- Runway Gen-4: Commercial service, no model access, per-minute pricing
- Kling 1.5: Faster than Sora but still closed, subscription-based
- Oasis (Decart/Etched): Open but sacrifices consistency for speed
- Matrix-Game 2.0: Open and fast but limited to specific scenarios
- Gen3C: Good quality but requires lengthy offline processing
WorldPlay is the first system to achieve real-time interaction, long-term consistency, and full open-source availability simultaneously. This combination makes it particularly valuable for filmmakers who need both creative control and production flexibility.
For the AI filmmaking community, the open-source release means:
- No per-minute generation costs that accumulate quickly during experimentation
- Ability to fine-tune on specific visual styles or domains
- Integration into custom pipelines and workflows
- Transparency into model behavior and limitations
- Community-driven improvements and extensions
Technical Specifications Summary
Architecture: Diffusion Transformer (DiT) with 3D causal VAE, rotary positional embeddings extended to temporal dimension, flow matching training in latent space
Generation Speed: 24 FPS real-time streaming (4 denoising steps after distillation)
Resolution: Built on HunyuanVideo 1.5 base (768×512 native, super-resolution to 1920×1080)
Memory System: Reconstituted context memory with temporal and spatial components, temporal reframing for long-range consistency
Control: Dual action representation (discrete keyboard/mouse + continuous camera poses)
Training Data: 320K curated video clips (games, real-world 3D, synthetic renders, natural video)
Action Space: Continuous + discrete (solving limitations of pure discrete or pure continuous approaches)
Unique Features: Real-time latency + long-term consistency, first-person and third-person perspectives, promptable events, 3D reconstruction capability
Future Directions and Roadmap
The technical report notes several areas for future development:
- Extended video generation beyond current chunk lengths
- Multi-agent interaction within generated worlds
- More complex physical dynamics and object interactions
- Improved prompt-to-event accuracy for dynamic scene changes
- Integration with text-to-3D and image-to-3D pipelines
Tencent's Hunyuan team has a track record of rapid iteration (HunyuanWorld 1.0 to 1.1 to 1.5 in months), suggesting these capabilities may arrive quickly.
For filmmakers, the roadmap indicates growing capability that could eventually enable full AI-generated sequences with complex interactions, multi-character scenes, and sophisticated physics—all while maintaining the real-time exploration and consistency that make WorldPlay unique.
Conclusion
HY-World 1.5 (WorldPlay) solves the fundamental real-time consistency problem that has limited interactive world models. By combining dual action representation, reconstituted context memory, reinforcement learning post-training, and context forcing distillation, Tencent has created a system that runs at 24 FPS while maintaining geometric stability across hundreds of frames.
For AI filmmakers, this represents a practical tool rather than a research curiosity. Real-time generation enables interactive creative exploration. Long-term consistency makes generated content suitable for actual production use. Open-source availability means no usage restrictions, subscription fees, or API limitations.
The applications span pre-visualization, virtual location scouting, dynamic shot planning, background generation, 3D reconstruction, and creative experimentation. Independent filmmakers gain access to capabilities previously limited to major studios. Commercial productions can compress pre-production timelines and reduce costs. Creative experimentation becomes instantly accessible rather than prohibitively expensive.
This isn't the end of traditional filmmaking—it's a new tool that expands what's possible with available resources. Just as digital cameras didn't replace cinematography but made it more accessible, real-time AI world generation augments filmmaking capabilities rather than replacing human creativity.
WorldPlay's technical achievements—solving the speed-memory tradeoff, achieving stable long-horizon generation, enabling real-time interaction—make it a milestone in AI world modeling. The open-source release ensures those capabilities benefit the entire filmmaking community rather than remaining locked behind corporate APIs.
Filmmakers now have a tool that generates explorable worlds in real time. What they create with it will define the next chapter in AI-assisted production.


