---
library_name: diffusers
tags:
- modular-diffusers
- diffusers
- qwenimage-layered
- text-to-image
---
This is a modular diffusion pipeline built with 🧨 Diffusers' modular pipeline framework.

**Pipeline Type**: QwenImageLayeredAutoBlocks

**Description**: Auto Modular pipeline for layered denoising tasks using QwenImage-Layered.

This pipeline uses a 4-block architecture that can be customized and extended.

## Example Usage

[TODO]

## Pipeline Architecture

This modular pipeline is composed of the following blocks:

1. **text_encoder** (`QwenImageLayeredTextEncoderStep`)
   - Text encoder step for QwenImage-Layered that encodes the text prompt. If no prompt is provided, one is generated from the input image.
2. **vae_encoder** (`QwenImageLayeredVaeEncoderStep`)
   - VAE encoder step that encodes the image inputs into their latent representations.
3. **denoise** (`QwenImageLayeredCoreDenoiseStep`)
   - Core denoising workflow for QwenImage-Layered img2img task.
4. **decode** (`QwenImageLayeredDecoderStep`)
   - Decode unpacked latents (B, C, layers+1, H, W) into layer images. 
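
The block order above can be read as a sequential data flow over a shared state. The following is a toy sketch in plain Python, not the Diffusers implementation: the four functions, the `run_blocks` helper, and the dictionary keys are hypothetical stand-ins that only record what each step would contribute. Whether every frame along the `layers+1` axis is returned as an image is not stated here, so the toy simply maps each frame to one output.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]


def text_encoder(state: State) -> State:
    # Stand-in: the real step encodes the prompt, or derives one from the image.
    prompt = state.get("prompt") or "<prompt generated from image>"
    return {**state, "prompt": prompt, "prompt_embeds": f"embeds({prompt})"}


def vae_encoder(state: State) -> State:
    # Stand-in for encoding the input image into latents.
    return {**state, "image_latents": f"latents({state['image']})"}


def denoise(state: State) -> State:
    # Latents are described as (B, C, layers+1, H, W); model only the frame axis.
    layers = int(state.get("layers", 4))
    return {**state, "latents": [f"frame_{i}" for i in range(layers + 1)]}


def decode(state: State) -> State:
    # Assumption: one decoded image per frame (see the note above the code).
    return {**state, "images": [f"image({f})" for f in state["latents"]]}


BLOCKS: List[Tuple[str, Callable[[State], State]]] = [
    ("text_encoder", text_encoder),
    ("vae_encoder", vae_encoder),
    ("denoise", denoise),
    ("decode", decode),
]


def run_blocks(inputs: State) -> State:
    """Run each block in order, passing the accumulated state along."""
    state = dict(inputs)
    for _name, block in BLOCKS:
        state = block(state)
    return state


result = run_blocks({"image": "input.png", "layers": 4})
print(result["images"])
```

Because each block only reads from and adds to the shared state, a block can be swapped or a new one appended without touching the others, which is the sense in which the architecture can be customized and extended.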

## Model Components

1. image_resize_processor (`VaeImageProcessor`)
2. text_encoder (`Qwen2_5_VLForConditionalGeneration`)
3. processor (`Qwen2VLProcessor`)
4. tokenizer (`Qwen2Tokenizer`): The tokenizer to use
5. guider (`ClassifierFreeGuidance`)
6. image_processor (`VaeImageProcessor`)
7. vae (`AutoencoderKLQwenImage`)
8. pachifier (`QwenImageLayeredPachifier`)
9. scheduler (`FlowMatchEulerDiscreteScheduler`)
10. transformer (`QwenImageTransformer2DModel`) 

## Input/Output Specification

**Inputs:**

- `image` (`Image | list`): Reference image(s) for denoising. Can be a single image or list of images.
- `resolution` (`int`, *optional*, defaults to `640`): The target area to resize the image to. Can be `1024` or `640`.
- `prompt` (`str`, *optional*): The prompt or prompts to guide image generation.
- `use_en_prompt` (`bool`, *optional*, defaults to `False`): Whether to use the English prompt template.
- `negative_prompt` (`str`, *optional*): The prompt or prompts describing what to avoid in the generated image.
- `max_sequence_length` (`int`, *optional*, defaults to `1024`): Maximum sequence length for prompt encoding.
- `generator` (`Generator`, *optional*): Torch generator for deterministic generation.
- `num_images_per_prompt` (`int`, *optional*, defaults to `1`): The number of images to generate per prompt.
- `latents` (`Tensor`, *optional*): Pre-generated noisy latents for image generation.
- `layers` (`int`, *optional*, defaults to `4`): Number of layers to extract from the image.
- `num_inference_steps` (`int`, *optional*, defaults to `50`): The number of denoising steps.
- `sigmas` (`list`, *optional*): Custom sigmas for the denoising process.
- `attention_kwargs` (`dict`, *optional*): Additional kwargs for attention processors.
- `**denoiser_input_fields` (`None`, *optional*): Conditional model inputs for the denoiser, e.g. `prompt_embeds`, `negative_prompt_embeds`.
- `output_type` (`str`, *optional*, defaults to `pil`): Output format: 'pil', 'np', 'pt'.
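
As a rough illustration of the input list above, the sketch below collects the documented defaults into a dictionary and checks the two inputs whose allowed values are spelled out (`resolution` and `output_type`). The helper `build_inputs` and the constants are hypothetical and not part of Diffusers; whether the pipeline performs the same checks itself is not stated here.

```python
from typing import Any, Dict

# Defaults copied from the input list above.
DEFAULTS: Dict[str, Any] = {
    "resolution": 640,
    "use_en_prompt": False,
    "max_sequence_length": 1024,
    "num_images_per_prompt": 1,
    "layers": 4,
    "num_inference_steps": 50,
    "output_type": "pil",
}

ALLOWED_RESOLUTIONS = (640, 1024)
ALLOWED_OUTPUT_TYPES = ("pil", "np", "pt")


def build_inputs(image: Any, **overrides: Any) -> Dict[str, Any]:
    """Merge user overrides with the documented defaults and validate them."""
    if image is None:
        raise ValueError("`image` is required")

    # Extra keys pass through, mirroring `**denoiser_input_fields`.
    kwargs: Dict[str, Any] = {**DEFAULTS, **overrides, "image": image}

    if kwargs["resolution"] not in ALLOWED_RESOLUTIONS:
        raise ValueError(f"resolution must be one of {ALLOWED_RESOLUTIONS}")
    if kwargs["output_type"] not in ALLOWED_OUTPUT_TYPES:
        raise ValueError(f"output_type must be one of {ALLOWED_OUTPUT_TYPES}")
    if kwargs["layers"] < 1:
        raise ValueError("layers must be a positive integer")
    return kwargs


inputs = build_inputs("input.png", layers=3, prompt="a poster with a logo")
print(inputs["resolution"], inputs["layers"], inputs["output_type"])
```

The resulting dictionary could then be unpacked into a pipeline call with `**inputs` once the pipeline object is loaded.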

**Outputs:**

- `images` (`list`): Generated images.