JoyAI-Image-Edit-Plus: Instruction Guided Image Editing with Multi Image Composition

JD Open Source, the open source division of JD.com, released JoyAI-Image-Edit-Plus on June 23, 2026. The model combines an 8 billion parameter Multimodal Large Language Model with a 16 billion parameter Multimodal Diffusion Transformer to handle instruction guided image editing, including a new capability that sets it apart from the base variant: multi image composition from multiple reference inputs.
The Plus Variant: Multiple Reference Inputs
The base JoyAI-Image-Edit model, released in April 2026, accepts a single image and a text instruction. Edit-Plus extends this: users can provide multiple reference images, and the model produces a coherent edited output that draws from all of them simultaneously.
In practice, this means a filmmaker can supply a background reference, a character reference, and a style reference as separate inputs, and the model composes them into a single edited scene. Standard instruction guided editors fail at this task because they treat the prompt as a caption for one image. Edit-Plus treats all of the inputs as a shared semantic context from the start of generation.
How the Architecture Works
The MLLM and MDT are not separate modules connected by an adapter. The 8 billion parameter language model provides continuous semantic grounding for the diffusion trajectory at every denoising step, rather than a single conditioning embedding computed once at the start.
Standard instruction guided image editors typically use a frozen CLIP encoder or a lightweight adapter to inject text context into the diffusion model. Because CLIP compresses text into a fixed embedding, complex instructions involving spatial relationships, typographic constraints, or consistency across multiple views are partially lost before generation begins. In JoyAI-Image-Edit-Plus, the full language model's contextual understanding remains active during generation. That is the structural reason the model handles the dense typography and complex spatial composition that defeat most single-component editors.
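The difference between encoding the instruction once and grounding every denoising step can be shown with a toy loop. This is only an illustrative sketch and not JoyAI-Image code: the embedding function, the update rule, and every name here (toy_embed, fixed_conditioner, per_step_conditioner, denoise) are assumptions made up for the example. The point is the control flow: one conditioner ignores the current latent, the other is consulted with the current latent at every step.

```python
from typing import Callable, List

Vector = List[float]
# A conditioner maps (current latent, step index) to a guidance vector.
Conditioner = Callable[[Vector, int], Vector]


def toy_embed(instruction: str) -> Vector:
    """Hypothetical stand-in for a text encoder (not a real model)."""
    codes = [ord(ch) for ch in instruction] or [0]
    return [sum(codes) / (len(codes) * 128.0), min(len(codes), 100) / 100.0]


def fixed_conditioner(instruction: str) -> Conditioner:
    """Encode the instruction once; every step reuses the same embedding."""
    cached = toy_embed(instruction)
    return lambda latent, step: cached


def per_step_conditioner(instruction: str, calls: List[int]) -> Conditioner:
    """Recompute guidance at every step from the instruction AND the latent."""
    target = toy_embed(instruction)

    def conditioner(latent: Vector, step: int) -> Vector:
        calls.append(step)  # record that the "language model" ran at this step
        # Toy rule: push harder the further the latent is from the target.
        return [t + 0.5 * (t - x) for t, x in zip(target, latent)]

    return conditioner


def denoise(latent: Vector, conditioner: Conditioner,
            steps: int = 20, rate: float = 0.3) -> Vector:
    """Toy denoising loop: move the latent toward the guidance each step."""
    current = list(latent)
    for step in range(steps):
        guidance = conditioner(current, step)
        current = [x + rate * (g - x) for x, g in zip(current, guidance)]
    return current
```

Calling denoise([0.0, 0.0], fixed_conditioner("add a red hat")) uses one embedding for the whole trajectory, while passing per_step_conditioner("add a red hat", calls) fills calls with one entry per step, which mirrors the "active during generation" behaviour described above in miniature.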
What the Model Handles Well
The JD Open Source README identifies six areas where the model performs strongly: spatial reasoning, long text rendering, multi panel comic generation, dense multilingual typography, multi view generation, and controllable editing.
The typography strength is notable. Most diffusion models degrade on scenes with dense text because text rendering requires precise spatial placement of individual character strokes, not just approximate semantic fidelity. Edit-Plus is specifically optimized for multi line dense layouts, handwritten styles, and mixed language typography within the same image.
Running JoyAI-Image-Edit-Plus
The model is available on HuggingFace as jdopensource/JoyAI-Image-Edit-Plus-Diffusers. The Apache 2.0 license permits research and commercial use. The full codebase, inference scripts, and variant documentation are on GitHub under the jd-opensource/JoyAI-Image repository.
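As a minimal sketch of fetching the weights, the snippet below uses the snapshot_download function from the huggingface_hub library with the repository name given above. The helper name download_weights and the local directory name are assumptions for illustration. How to load and call the pipeline after downloading is not covered here; the inference scripts in the GitHub repository are the place to check for that.

```python
from pathlib import Path

# Repository name as stated in the article.
REPO_ID = "jdopensource/JoyAI-Image-Edit-Plus-Diffusers"


def download_weights(local_dir: str = "joyai-image-edit-plus") -> Path:
    """Download every file in the model repository and return the local path.

    The default directory name is a hypothetical choice for this example.
    """
    # Imported lazily so this module can be imported without the dependency.
    from huggingface_hub import snapshot_download

    path = snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
    return Path(path)
```

Calling download_weights() requires network access and enough disk space. If the 8 billion and 16 billion parameter components are both stored in 16-bit precision, the weights alone would come to roughly 48 GB, so it is worth checking the file listing on the model page first.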
The JoyAI-Image family includes three released variants: JoyAI-Image-Und (image understanding), JoyAI-Image-Edit (single image instruction editing, April 2026), and JoyAI-Image-Edit-Plus (multi image composition, June 23, 2026). All three share the same underlying architectural approach.
For broader context on open source image tools, Krea-2 Raw Turbo takes a different direction, focused on text to image generation speed rather than editing. Qwen Image Edit is the closest architectural comparison: also instruction guided, also open source, but using a different model family and design.
Filmmakers who want AI image tools without a local setup can use the AI FILMS Studio image workspace directly in the browser.
Sources
GitHub: jd-opensource/JoyAI-Image
HuggingFace: jdopensource/JoyAI-Image-Edit-Plus-Diffusers
Project Page: joyai-image.com
