JoyAI-Image-Edit-Plus: Instruction Guided Image Editing with Multi Image Composition

June 28, 2026

JD Open Source, the open-source division of JD.com, released JoyAI-Image-Edit-Plus on June 23, 2026. The model combines an 8-billion-parameter Multimodal Large Language Model (MLLM) with a 16-billion-parameter Multimodal Diffusion Transformer (MDT) to handle instruction-guided image editing. It also adds a capability that sets it apart from the base variant: multi-image composition from multiple reference inputs.

JoyAI-Image-Edit-Plus output demonstrating instruction guided image editing and composition
From the JD Open Source GitHub repository

The Plus Variant: Multiple Reference Inputs

The base JoyAI-Image-Edit model, released in April 2026, accepts a single image and a text instruction. Edit-Plus extends this: users can provide multiple reference images, and the model produces a coherent edited output that draws from all of them simultaneously.

In practice, this means a filmmaker can supply a background reference, a character reference, and a style reference as separate inputs, and the model composes them into a single edited scene. Standard instruction-guided editors fail at this task because they treat the prompt as a caption for one image. Edit-Plus treats multiple inputs as a shared semantic context from the start of generation.

JoyAI-Image-Edit-Plus example showing multi image composition from reference inputs
From the JD Open Source GitHub repository

How the Architecture Works

The MLLM and MDT are not separate modules connected by an adapter. Instead of computing a single conditioning embedding once at the start, the 8-billion-parameter language model provides continuous semantic grounding for the diffusion trajectory at every denoising step.

Standard instruction-guided image editors typically use a frozen CLIP encoder or a lightweight adapter to inject text context into the diffusion model. Because CLIP compresses text into a fixed embedding, complex instructions involving spatial relationships, typographic constraints, or consistency across multiple views are partially lost before generation begins. In JoyAI-Image-Edit-Plus, the full language model's contextual understanding remains active during generation. That is the structural reason the model handles dense typography and complex spatial composition that defeat most single-component editors.
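
The difference between conditioning once and conditioning at every step can be illustrated with a toy loop. This is only a sketch of the control flow, not the JoyAI-Image code: the `CountingEncoder` class, the `denoise` update rule, and the two wrapper functions are all hypothetical names invented for this example, and the "latent" is a plain list of floats rather than an image tensor.

```python
from typing import Callable, List

Vector = List[float]


class CountingEncoder:
    """Hypothetical stand-in for a conditioning model; it counts how often it is queried."""

    def __init__(self, target: Vector) -> None:
        self.target = target
        self.calls = 0

    def __call__(self, latent: Vector, step: int) -> Vector:
        # A real model would read the instruction and the current latent here.
        self.calls += 1
        return list(self.target)


def denoise(latent: Vector, steps: int,
            condition: Callable[[Vector, int], Vector]) -> Vector:
    """Toy denoising loop: each step moves the latent part-way toward the conditioning."""
    for step in range(steps, 0, -1):
        cond = condition(latent, step)
        latent = [x + (c - x) / step for x, c in zip(latent, cond)]
    return latent


def one_shot(encoder: CountingEncoder, latent: Vector, steps: int) -> Vector:
    """Fixed-embedding style: the encoder runs once, before the loop starts."""
    fixed = encoder(latent, steps)
    return denoise(latent, steps, lambda _latent, _step: fixed)


def per_step(encoder: CountingEncoder, latent: Vector, steps: int) -> Vector:
    """Continuous-grounding style: the encoder sees the current latent at every step."""
    return denoise(latent, steps, encoder)
```

In the one-shot variant the encoder never sees an intermediate latent, so it cannot adjust its guidance as the image takes shape. In the per-step variant it is queried once per denoising step, which is the structural property the article attributes to Edit-Plus.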

JoyAI-Image-Edit-Plus output showing multilingual text rendering and spatial editing capabilities
From the JD Open Source GitHub repository

What the Model Handles Well

The JD Open Source README identifies six areas where the model performs strongly: spatial reasoning, long-text rendering, multi-panel comic generation, dense multilingual typography, multi-view generation, and controllable editing.

The typography strength is notable. Most diffusion models degrade on scenes with dense text because text rendering requires precise spatial placement of individual character strokes, not just approximate semantic fidelity. Edit-Plus is specifically optimized for dense multi-line layouts, handwritten styles, and mixed-language typography within the same image.

Running JoyAI-Image-Edit-Plus

The model is available on Hugging Face as jdopensource/JoyAI-Image-Edit-Plus-Diffusers. The Apache 2.0 license permits research and commercial use. The full codebase, inference scripts, and variant documentation are on GitHub in the jd-opensource/JoyAI-Image repository.
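
As a starting point, the weights can be fetched with the `huggingface_hub` library's `snapshot_download` function. The sketch below assumes `huggingface_hub` is installed and that the repository is publicly accessible. The helper names and the default local directory are illustrative choices, not part of the JoyAI-Image tooling. It only downloads files; for the actual editing pipeline and its arguments, follow the inference scripts in the GitHub repository.

```python
REPO_ID = "jdopensource/JoyAI-Image-Edit-Plus-Diffusers"


def hub_url(repo_id: str = REPO_ID) -> str:
    """Return the model page URL on the Hugging Face Hub."""
    return f"https://huggingface.co/{repo_id}"


def download_weights(local_dir: str = "./JoyAI-Image-Edit-Plus") -> str:
    """Download every file in the model repository and return the local path.

    The default local_dir is an arbitrary example location.
    """
    # Imported lazily so this module loads even when huggingface_hub is absent.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
```

Calling `download_weights()` starts the download. Given the combined parameter count described above, expect it to need substantial disk space and time.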

The JoyAI-Image family includes three released variants: JoyAI-Image-Und (image understanding), JoyAI-Image-Edit (single-image instruction editing, April 2026), and JoyAI-Image-Edit-Plus (multi-image composition, June 23, 2026). All three share the same underlying architectural approach.

For broader context on open-source image tools, Krea-2 Raw Turbo takes a different direction, focusing on text-to-image generation speed rather than editing. Qwen Image Edit is the closest architectural comparison: it is also instruction-guided and open source, but it uses a different model family and design.

Filmmakers who want AI image tools without a local setup can use the AI FILMS Studio image workspace directly in the browser.


Sources

GitHub: jd-opensource/JoyAI-Image
Hugging Face: jdopensource/JoyAI-Image-Edit-Plus-Diffusers
Project Page: joyai-image.com