Qwen-Image-Layered-Control
Model Introduction
This model is trained from the base model Qwen/Qwen-Image-Layered on the artplus/PrismLayersPro dataset, enabling extraction of individual image layers specified by a text prompt.
For more details on the training strategy and implementation, see our technical blog.
Usage Tips
- The model architecture has been changed from multi-image output to single-image output, producing only the layer relevant to the provided text description.
- The model was trained exclusively on English text, but retains Chinese language understanding capabilities inherited from the base model.
- The native training resolution is 1024x1024; however, inference at other resolutions is supported.
- The model struggles to separate multiple entities that are heavily occluded or overlapping, such as the cartoon skeleton head and hat in the examples.
- The model excels at decomposing poster-like graphics but performs poorly on photographic images, especially those involving complex lighting and shadows.
- The model supports negative prompts; content to exclude can be described in the negative prompt (see the example after the inference code below).
Demo Examples
Some images contain white text on light backgrounds. ModelScope users should click the "☀︎" icon in the top-right corner to switch to dark mode for better visibility.
Example 1
Example 2
Example 3
Inference Code
Install DiffSynth-Studio:
git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio
pip install -e .
Model inference:
from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig
from PIL import Image
import torch, requests

# Load the layered-control transformer together with the text encoder, VAE, and processor.
pipe = QwenImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="DiffSynth-Studio/Qwen-Image-Layered-Control", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image-Layered", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
    ],
    processor_config=ModelConfig(model_id="Qwen/Qwen-Image-Edit", origin_file_pattern="processor/"),
)

# Describe the layer to extract and prepare the input image at the native 1024x1024 resolution.
prompt = "A cartoon skeleton character wearing a purple hat and holding a gift box"
input_image = requests.get("https://modelscope.oss-cn-beijing.aliyuncs.com/resource/images/trick_or_treat.png", stream=True).raw
input_image = Image.open(input_image).convert("RGBA").resize((1024, 1024))
input_image.save("image_input.png")

# Generate the single layer matching the prompt.
images = pipe(
    prompt,
    seed=0,
    num_inference_steps=30, cfg_scale=4,
    height=1024, width=1024,
    layer_input_image=input_image,
    layer_num=0,
)
images[0].save("image.png")
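As mentioned in the usage tips, the model also accepts negative prompts. The original example above does not use one; the following sketch shows how it could be supplied, assuming the pipeline accepts a negative_prompt argument alongside cfg_scale, as other DiffSynth-Studio pipelines do. It reuses the pipe and input_image objects from the snippet above.

# Extract the same layer while excluding the gift box via a negative prompt.
# Note: `negative_prompt` is assumed by analogy with other DiffSynth-Studio pipelines;
# it is not part of the original example above.
images = pipe(
    "A cartoon skeleton character wearing a purple hat",
    negative_prompt="gift box",
    seed=0,
    num_inference_steps=30, cfg_scale=4,
    height=1024, width=1024,
    layer_input_image=input_image,
    layer_num=0,
)
images[0].save("image_negative_prompt.png")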
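To visually check the extracted layer, it can be composited back over the input image with PIL. This is a minimal post-processing sketch, not part of the model's pipeline, and it assumes the saved layer is an RGBA image with transparency.

from PIL import Image

# Overlay the extracted RGBA layer on the original input to inspect alignment and coverage.
background = Image.open("image_input.png").convert("RGBA")
layer = Image.open("image.png").convert("RGBA").resize(background.size)
Image.alpha_composite(background, layer).save("layer_over_input.png")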