JoyAI-Image-Edit-Plus: Instruction Guided Image Editing with Multi Image Composition

June 28, 2026

JD Open Source, the open source division of JD.com, released JoyAI-Image-Edit-Plus on June 23, 2026. The model combines an 8 billion parameter Multimodal Large Language Model with a 16 billion parameter Multimodal Diffusion Transformer to handle instruction guided image editing, including a new capability that sets it apart from the base variant: multi image composition from multiple reference inputs.

JoyAI-Image-Edit-Plus output demonstrating instruction guided image editing and composition — From the JD Open Source GitHub repository

The Plus Variant: Multiple Reference Inputs

The base JoyAI-Image-Edit model, released in April 2026, accepts a single image and a text instruction. Edit-Plus extends this: users can provide multiple reference images, and the model produces a coherent edited output that draws from all of them simultaneously.

In practice, this means a filmmaker can supply a background reference, a character reference, and a style reference as separate inputs, and the model composes them into a single edited scene. Standard instruction guided editors break down on this task because they treat the prompt as an instruction for a single image. Edit-Plus treats multiple inputs as a shared semantic context from the start of generation.

JoyAI-Image-Edit-Plus example showing multi image composition from reference inputs — From the JD Open Source GitHub repository

How the Architecture Works

The MLLM and MDT are not separate modules connected by an adapter. The 8 billion parameter language model provides continuous semantic grounding for the diffusion trajectory at every denoising step, rather than through a single conditioning embedding computed once at the start.

Standard instruction guided image editors typically use a frozen CLIP encoder or a lightweight adapter to inject text context into the diffusion model. Because CLIP compresses text into a fixed embedding, complex instructions involving spatial relationships, typographic constraints, or consistency across multiple views are partially lost before generation begins. In JoyAI-Image-Edit-Plus, the full language model's contextual understanding remains active during generation. That is the structural reason the model handles dense typography and complex spatial composition that defeat most editors built around a single fixed text embedding.
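The information loss described above can be illustrated with a deliberately simplified toy. This is not CLIP and not the JoyAI-Image code: the vocabulary, the vectors, and the function names are all hypothetical, and mean pooling stands in for any encoder that squeezes an instruction into one fixed vector. The sketch only shows why a single pooled embedding can erase a spatial relationship that a per-token context still carries.

```python
from typing import Dict, List

# Hypothetical toy vocabulary: each word maps to a small hand-picked vector.
TOY_VOCAB: Dict[str, List[float]] = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "left": [0.0, 0.0, 1.0],
    "of": [0.5, 0.5, 0.0],
}


def embed_tokens(text: str) -> List[List[float]]:
    """Per-token context: one vector per word, with word order preserved."""
    return [TOY_VOCAB[word] for word in text.lower().split()]


def pooled_embedding(text: str) -> List[float]:
    """Fixed-size context: average the token vectors into a single vector."""
    tokens = embed_tokens(text)
    dim = len(tokens[0])
    return [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]


instruction_a = "cat left of dog"
instruction_b = "dog left of cat"

# The pooled vectors are identical, so the spatial relation is gone.
pooled_collide = pooled_embedding(instruction_a) == pooled_embedding(instruction_b)

# The token sequences differ, so a model attending over them can still tell
# the two instructions apart.
tokens_collide = embed_tokens(instruction_a) == embed_tokens(instruction_b)

print("pooled embeddings identical:", pooled_collide)
print("token sequences identical:", tokens_collide)
```

Real text encoders are order-sensitive and lose far less than this toy does, but the trade-off has the same shape: the more an instruction is compressed before generation starts, the less of its structure is left for the diffusion model to use.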

JoyAI-Image-Edit-Plus output showing multilingual text rendering and spatial editing capabilities — From the JD Open Source GitHub repository

What the Model Handles Well

The JD Open Source README identifies six areas where the model performs strongly: spatial reasoning, long text rendering, multi panel comic generation, dense multilingual typography, multi view generation, and controllable editing.

The typography strength is notable. Most diffusion models degrade on scenes with dense text because text rendering requires precise spatial placement of individual character strokes, not just approximate semantic fidelity. Edit-Plus is specifically optimized for dense multi line layouts, handwritten styles, and mixed language typography within the same image.

Running JoyAI-Image-Edit-Plus

The model is available on HuggingFace as jdopensource/JoyAI-Image-Edit-Plus-Diffusers. The Apache 2.0 license permits research and commercial use. The full codebase, inference scripts, and variant documentation are on GitHub under the jd-opensource/JoyAI-Image repository.
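As a minimal sketch of fetching the weights, the snippet below uses the `snapshot_download` function from the `huggingface_hub` package with the repository id given above. The helper name `fetch_checkpoint` and the local directory are assumptions made for illustration. The sketch stops at the download; the actual inference entry point is documented in the GitHub repository and is not guessed at here.

```python
# Repository id as named in the article.
REPO_ID = "jdopensource/JoyAI-Image-Edit-Plus-Diffusers"


def fetch_checkpoint(local_dir: str = "./JoyAI-Image-Edit-Plus") -> str:
    """Download the model repository and return the local path.

    `fetch_checkpoint` and the default `local_dir` are hypothetical names
    chosen for this sketch. The import is done lazily so that this module
    can be loaded even where huggingface_hub is not installed.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


# Example call (commented out because it triggers a large download):
# checkpoint_path = fetch_checkpoint()
# print(checkpoint_path)
```

With an 8 billion parameter language model and a 16 billion parameter diffusion transformer, the download is large, so check available disk space before running it.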

The JoyAI-Image family includes three released variants: JoyAI-Image-Und (image understanding), JoyAI-Image-Edit (single image instruction editing, April 2026), and JoyAI-Image-Edit-Plus (multi image composition, June 23, 2026). All three share the same underlying architectural approach.

For broader context on open source image tools, Krea-2 Raw Turbo takes a different direction, focused on text to image generation speed rather than editing. Qwen Image Edit is the closest architectural comparison: also instruction guided, also open source, but using a different model family and design.

Filmmakers who want AI image tools without a local setup can use the AI FILMS Studio image workspace directly in the browser.

Sources

GitHub: jd-opensource/JoyAI-Image
HuggingFace: jdopensource/JoyAI-Image-Edit-Plus-Diffusers
Project Page: joyai-image.com
