This is a modular diffusion pipeline built with 🧨 Diffusers' modular pipeline framework.
Pipeline Type: HeliosPyramidAutoBlocks
Description: Auto Modular pipeline for pyramid progressive generation (T2V/I2V/V2V) using Helios.
This pipeline uses a 4-block architecture that can be customized and extended.
Example Usage
[TODO]
Pipeline Architecture
This modular pipeline is composed of the following blocks:
- text_encoder (HeliosTextEncoderStep): Text encoder step that generates text embeddings to guide the video generation.
- vae_encoder (HeliosPyramidAutoVaeEncoderStep): Encoder step that encodes video or image inputs. This is an auto pipeline block.
- denoise (HeliosPyramidAutoCoreDenoiseStep): Pyramid core denoise step that selects the appropriate denoising block.
- decode (HeliosDecodeStep): Decodes all chunk latents with the VAE, concatenates them, trims to the target frame count, and postprocesses into the final video output.
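The decode step's concatenate-and-trim behavior can be illustrated with some frame-count arithmetic. This is a sketch only: the 4x temporal compression factor below is an assumption based on the Wan VAE family (the pipeline uses AutoencoderKLWan); the real pipeline derives it from the loaded VAE.

```python
import math

# Assumed Wan-style temporal compression: one latent frame covers 4 pixel
# frames, with the first latent frame mapping to a single pixel frame.
# This factor is an assumption, not taken from the pipeline itself.
TEMPORAL_COMPRESSION = 4

def decode_plan(num_frames: int, num_latent_frames_per_chunk: int) -> dict:
    """Sketch of how many chunks get decoded and how the result is trimmed."""
    # Latent frames needed to cover the requested pixel frames.
    latent_frames = math.ceil((num_frames - 1) / TEMPORAL_COMPRESSION) + 1
    # Round up to whole temporal chunks.
    num_chunks = math.ceil(latent_frames / num_latent_frames_per_chunk)
    # Pixel frames produced by decoding every chunk, before trimming.
    decoded_frames = (num_chunks * num_latent_frames_per_chunk - 1) * TEMPORAL_COMPRESSION + 1
    return {
        "num_chunks": num_chunks,
        "decoded_frames": decoded_frames,
        "trimmed_frames": min(decoded_frames, num_frames),
    }
```

With the pipeline defaults (num_frames=132, 9 latent frames per chunk) and the assumed 4x factor, this sketch decodes 4 chunks to 141 frames and trims back to 132.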
Model Components
- text_encoder (UMT5EncoderModel)
- tokenizer (AutoTokenizer)
- guider (ClassifierFreeGuidance)
- vae (AutoencoderKLWan)
- video_processor (VideoProcessor)
- transformer (HeliosTransformer3DModel)
- scheduler (HeliosScheduler)
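The guider component applies classifier-free guidance when combining the denoiser's conditional and unconditional predictions. A minimal numeric sketch of the standard CFG combination (the actual ClassifierFreeGuidance class supports additional options such as rescaling):

```python
def cfg_combine(noise_pred_uncond: list[float],
                noise_pred_cond: list[float],
                guidance_scale: float) -> list[float]:
    """Standard classifier-free guidance: push the prediction away from the
    unconditional branch, in the direction of the prompt-conditioned branch."""
    return [u + guidance_scale * (c - u)
            for u, c in zip(noise_pred_uncond, noise_pred_cond)]
```

A guidance_scale of 1.0 reproduces the conditional prediction exactly; larger values weight the prompt more strongly.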
Workflow Input Specification
text2video
- prompt (str): The prompt or prompts to guide video generation.
image2video
- prompt (str): The prompt or prompts to guide video generation.
- image (Image | list): Reference image(s) for denoising. Can be a single image or a list of images.
video2video
- prompt (str): The prompt or prompts to guide video generation.
- video: Input video for video-to-video generation.
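The three workflows differ only in which conditioning input accompanies the prompt. A schematic of the call kwargs for each mode (pipeline loading and invocation are omitted; the media values here are hypothetical placeholders, in practice PIL images or video frames):

```python
prompt = "a timelapse of clouds drifting over mountains"

# Hypothetical placeholder conditioning media.
reference_image = "reference.png"  # placeholder; normally a PIL.Image
input_video = "input.mp4"          # placeholder; normally decoded video frames

# text2video: prompt only.
t2v_inputs = {"prompt": prompt}

# image2video: prompt plus a reference image (or list of images).
i2v_inputs = {"prompt": prompt, "image": reference_image}

# video2video: prompt plus an input video.
v2v_inputs = {"prompt": prompt, "video": input_video}
```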
Input/Output Specification
Inputs:
- prompt (str): The prompt or prompts to guide video generation.
- negative_prompt (str, optional): The prompt or prompts not to guide the video generation.
- max_sequence_length (int, optional, defaults to 512): Maximum sequence length for prompt encoding.
- video (optional): Input video for video-to-video generation.
- height (int, optional, defaults to 384): The height in pixels of the generated video.
- width (int, optional, defaults to 640): The width in pixels of the generated video.
- num_latent_frames_per_chunk (int, optional, defaults to 9): Number of latent frames per temporal chunk.
- generator (Generator, optional): Torch generator for deterministic generation.
- image (Image | list, optional): Reference image(s) for denoising. Can be a single image or a list of images.
- num_videos_per_prompt (int, optional, defaults to 1): Number of videos to generate per prompt.
- image_latents (Tensor, optional): Image latents used to guide generation. Can be produced by the vae_encoder step.
- video_latents (Tensor, optional): Encoded video latents for V2V generation.
- image_noise_sigma_min (float, optional, defaults to 0.111): Minimum sigma for image latent noise.
- image_noise_sigma_max (float, optional, defaults to 0.135): Maximum sigma for image latent noise.
- video_noise_sigma_min (float, optional, defaults to 0.111): Minimum sigma for video latent noise.
- video_noise_sigma_max (float, optional, defaults to 0.135): Maximum sigma for video latent noise.
- num_frames (int, optional, defaults to 132): Total number of video frames to generate.
- history_sizes (list): Sizes of the long/mid/short history buffers for temporal context.
- keep_first_frame (bool, optional, defaults to True): Whether to keep the first frame as a prefix in history.
- pyramid_num_inference_steps_list (list, optional, defaults to [10, 10, 10]): Number of denoising steps per pyramid stage.
- latents (Tensor, optional): Pre-generated noisy latents for generation.
- **denoiser_input_fields (optional): Conditional model inputs for the denoiser, e.g. prompt_embeds, negative_prompt_embeds, etc.
- attention_kwargs (dict, optional): Additional kwargs for attention processors.
- fake_image_latents (Tensor, optional): Fake image latents used as a history seed for I2V generation.
- output_type (str, optional, defaults to "np"): Output format: 'pil', 'np', or 'pt'.
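Several of the inputs above interact: generation proceeds chunk by chunk, each chunk is refined through the pyramid stages (each stage with its own step count from pyramid_num_inference_steps_list), and finished chunks feed the history buffers sized by history_sizes. A schematic loop, where the denoise call is a stand-in counter and the history handling is simplified to sliding windows:

```python
from collections import deque

def run_schematic(num_chunks: int,
                  pyramid_num_inference_steps_list: list[int],
                  history_sizes: list[int]) -> int:
    """Count denoiser calls for a chunked pyramid schedule (sketch only)."""
    # Long/mid/short sliding windows over previously generated chunk latents.
    histories = [deque(maxlen=size) for size in history_sizes]
    total_steps = 0
    for chunk_idx in range(num_chunks):
        # Each pyramid stage runs its own number of denoising steps.
        for num_steps in pyramid_num_inference_steps_list:
            for _ in range(num_steps):
                total_steps += 1  # stand-in for one transformer denoise call
        # Store this chunk's result as temporal context for later chunks.
        for history in histories:
            history.append(chunk_idx)  # stand-in for storing chunk latents
    return total_steps
```

For example, 4 chunks under the default [10, 10, 10] schedule amount to 120 denoise calls in this sketch.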
Outputs:
videos (list): The generated videos.