Qwen-Image-Layered-Control

Model Introduction

This model was trained from the base model Qwen/Qwen-Image-Layered on the artplus/PrismLayersPro dataset, enabling text-controlled extraction of individual image layers.

For more details about training strategies and implementation, feel free to check our technical blog.

Usage Tips

  • The model architecture has been changed from multi-image output to single-image output, producing only the layer relevant to the provided text description.
  • The model was trained exclusively on English text, but retains Chinese language understanding capabilities inherited from the base model.
  • The native training resolution is 1024x1024; however, inference at other resolutions is supported.
  • The model struggles to separate multiple entities that are heavily occluded or overlapping, such as the cartoon skeleton head and hat in the examples.
  • The model excels at decomposing poster-like graphics but performs poorly on photographic images, especially those involving complex lighting and shadows.
  • The model supports negative prompts: content to be excluded from the extracted layer can be described in the negative prompt (see the sketch after this list).
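
The following is a minimal sketch of a negative-prompt call. It assumes pipe and input_image are prepared exactly as in the Inference Code section below, and that QwenImagePipeline accepts a negative_prompt argument alongside cfg_scale (as other DiffSynth-Studio pipelines do); the prompt texts here are illustrative only:

# A hedged sketch: `pipe` and `input_image` are assumed to be set up as in the
# Inference Code section below; `negative_prompt` is assumed to be supported by
# QwenImagePipeline, and the prompt texts are illustrative.
images = pipe(
    prompt="Text 'TRICK OR TREAT'",      # layer to extract
    negative_prompt="cloud, skeleton",   # content to keep out of the extracted layer
    seed=0,
    num_inference_steps=30, cfg_scale=4,
    height=1024, width=1024,
    layer_input_image=input_image,
    layer_num=0,
)
images[0].save("layer_text.png")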

Demo Examples

Some images contain white text on light backgrounds. ModelScope users should click the "☀︎" icon in the top-right corner to switch to dark mode for better visibility.

Example 1

Input Image

Extraction prompts (one output layer per prompt):
  • A solid, uniform color with no distinguishable features or objects
  • Text 'TRICK'
  • Cloud
  • Text 'TRICK OR TREAT'
  • A cartoon skeleton character wearing a purple hat and holding a gift box
  • Text 'TRICK OR'
  • A purple hat and a head
  • A gift box

Example 2

Input Image

Extraction prompts (one output layer per prompt):
  • Blue sky, white clouds, a garden with colorful flowers
  • Colorful, intricate floral wreath
  • Girl, wreath, kitten
  • Girl, kitten

Example 3

Input Image

Extraction prompts (one output layer per prompt):
  • A clear blue sky and a turbulent sea
  • Text "The Life I Long For"
  • A seagull
  • Text "Life"

Inference Code

Install DiffSynth-Studio:

git clone https://github.com/modelscope/DiffSynth-Studio.git  
cd DiffSynth-Studio
pip install -e .

Model inference:

from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig
from PIL import Image
import torch, requests

# Load the transformer weights from this model, and the text encoder, VAE, and
# processor from the corresponding Qwen-Image repositories.
pipe = QwenImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="DiffSynth-Studio/Qwen-Image-Layered-Control", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image-Layered", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
    ],
    processor_config=ModelConfig(model_id="Qwen/Qwen-Image-Edit", origin_file_pattern="processor/"),
)

# Text description of the layer to extract.
prompt = "A cartoon skeleton character wearing a purple hat and holding a gift box"

# Download the input image, convert it to RGBA, and resize it to the native training resolution.
input_image = requests.get("https://modelscope.oss-cn-beijing.aliyuncs.com/resource/images/trick_or_treat.png", stream=True).raw
input_image = Image.open(input_image).convert("RGBA").resize((1024, 1024))
input_image.save("image_input.png")

# The pipeline returns a list of images; with this model it holds the single extracted layer.
images = pipe(
    prompt,
    seed=0,
    num_inference_steps=30, cfg_scale=4,
    height=1024, width=1024,
    layer_input_image=input_image,
    layer_num=0,
)
images[0].save("image.png")
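
Because the model returns a single layer per prompt, a full decomposition of one image can be obtained by running the pipeline once per layer description. The following is a minimal sketch that reuses pipe and input_image from the code above; the prompt list is taken from Example 1 and the output filenames are arbitrary:

# Extract several layers from the same input, one pipeline call per prompt.
# Reuses `pipe` and `input_image` from the code above; prompts are from Example 1.
layer_prompts = [
    "A solid, uniform color with no distinguishable features or objects",  # background
    "Text 'TRICK OR TREAT'",
    "A cartoon skeleton character wearing a purple hat and holding a gift box",
    "A gift box",
]
for i, layer_prompt in enumerate(layer_prompts):
    layer = pipe(
        layer_prompt,
        seed=0,
        num_inference_steps=30, cfg_scale=4,
        height=1024, width=1024,
        layer_input_image=input_image,
        layer_num=0,
    )[0]
    layer.save(f"layer_{i}.png")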