This is a modular diffusion pipeline built with 🧨 Diffusers' modular pipeline framework.

Pipeline Type: HeliosPyramidAutoBlocks

Description: Auto Modular pipeline for pyramid progressive generation (T2V/I2V/V2V) using Helios.

This pipeline uses a 4-block architecture that can be customized and extended.

Example Usage

[TODO]
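Pending an official example, here is a minimal sketch. It assumes the Modular Diffusers API (`ModularPipeline.from_pretrained`, `load_default_components`) and uses a placeholder repo id; argument names follow the input specification below, and exact method names may vary with your Diffusers version.

```python
import torch
from diffusers.modular_pipelines import ModularPipeline

# Placeholder repo id; replace with this model's actual Hub id.
pipe = ModularPipeline.from_pretrained("<repo-id>", trust_remote_code=True)
pipe.load_default_components(torch_dtype=torch.bfloat16)
pipe.to("cuda")

# text2video: with only `prompt` given, the auto blocks select the T2V path.
videos = pipe(
    prompt="A red fox running through fresh snow",
    num_frames=132,
    height=384,
    width=640,
    output="videos",
)
```

For image2video or video2video, additionally pass `image=` or `video=` as described in the workflow input specification below.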

Pipeline Architecture

This modular pipeline is composed of the following blocks:

  1. text_encoder (HeliosTextEncoderStep)
    • Text Encoder step that generates text embeddings to guide the video generation
  2. vae_encoder (HeliosPyramidAutoVaeEncoderStep)
    • Encoder step that encodes video or image inputs. This is an auto pipeline block.
  3. denoise (HeliosPyramidAutoCoreDenoiseStep)
    • Pyramid core denoise step that selects the appropriate denoising block.
  4. decode (HeliosDecodeStep)
    • Decodes all chunk latents with the VAE, concatenates them, trims to the target frame count, and postprocesses into the final video output.
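The decode step's chunk handling can be sketched in plain Python. This is illustrative only: `decode_chunk` stands in for the VAE decode, and the 4x frame expansion in the demo is an arbitrary stand-in, not the model's actual temporal compression ratio.

```python
def decode_and_trim(chunk_latents, num_frames, decode_chunk):
    """Decode each chunk of latents, concatenate along time, trim to num_frames.

    chunk_latents: list of per-chunk latent sequences
    decode_chunk:  callable mapping one latent chunk to a list of frames
    """
    frames = []
    for chunk in chunk_latents:
        frames.extend(decode_chunk(chunk))  # VAE decode, one chunk at a time
    return frames[:num_frames]              # drop padding frames past the target


# Stand-in decoder: each latent frame expands to 4 pixel frames.
fake_decode = lambda chunk: [f"frame_from_{z}" for z in chunk for _ in range(4)]
video = decode_and_trim([[0, 1], [2, 3]], 10, fake_decode)  # 16 frames, trimmed to 10
```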

Model Components

  1. text_encoder (UMT5EncoderModel)
  2. tokenizer (AutoTokenizer)
  3. guider (ClassifierFreeGuidance)
  4. vae (AutoencoderKLWan)
  5. video_processor (VideoProcessor)
  6. transformer (HeliosTransformer3DModel)
  7. scheduler (HeliosScheduler)

Workflow Input Specification

text2video
  • prompt (str): The prompt or prompts to guide video generation.
image2video
  • prompt (str): The prompt or prompts to guide video generation.
  • image (Image | list): Reference image(s) for denoising. Can be a single image or a list of images.
video2video
  • prompt (str): The prompt or prompts to guide video generation.
  • video: Input video for video-to-video generation.
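The auto blocks choose among these workflows based on which inputs are present. A rough sketch of that dispatch (illustrative, not the actual block logic):

```python
def select_workflow(prompt=None, image=None, video=None):
    """Pick the generation path the way the auto blocks do: by input presence."""
    if prompt is None:
        raise ValueError("`prompt` is required for every workflow")
    if video is not None:
        return "video2video"   # video input takes the V2V path
    if image is not None:
        return "image2video"   # reference image takes the I2V path
    return "text2video"        # prompt alone falls back to T2V
```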

Input/Output Specification

Inputs:

  • prompt (str): The prompt or prompts to guide video generation.
  • negative_prompt (str, optional): The prompt or prompts not to guide the video generation.
  • max_sequence_length (int, optional, defaults to 512): Maximum sequence length for prompt encoding.
  • video (optional): Input video for video-to-video generation.
  • height (int, optional, defaults to 384): The height in pixels of the generated video.
  • width (int, optional, defaults to 640): The width in pixels of the generated video.
  • num_latent_frames_per_chunk (int, optional, defaults to 9): Number of latent frames per temporal chunk.
  • generator (Generator, optional): Torch generator for deterministic generation.
  • image (Image | list, optional): Reference image(s) for denoising. Can be a single image or list of images.
  • num_videos_per_prompt (int, optional, defaults to 1): Number of videos to generate per prompt.
  • image_latents (Tensor, optional): Image latents used to guide the video generation; can be produced by the vae_encoder step.
  • video_latents (Tensor, optional): Encoded video latents for V2V generation.
  • image_noise_sigma_min (float, optional, defaults to 0.111): Minimum sigma for image latent noise.
  • image_noise_sigma_max (float, optional, defaults to 0.135): Maximum sigma for image latent noise.
  • video_noise_sigma_min (float, optional, defaults to 0.111): Minimum sigma for video latent noise.
  • video_noise_sigma_max (float, optional, defaults to 0.135): Maximum sigma for video latent noise.
  • num_frames (int, optional, defaults to 132): Total number of video frames to generate.
  • history_sizes (list): Sizes of long/mid/short history buffers for temporal context.
  • keep_first_frame (bool, optional, defaults to True): Whether to keep the first frame as a prefix in history.
  • pyramid_num_inference_steps_list (list, optional, defaults to [10, 10, 10]): Number of denoising steps per pyramid stage.
  • latents (Tensor, optional): Pre-generated noisy latents for video generation.
  • **denoiser_input_fields (optional): Conditional model inputs for the denoiser, e.g. prompt_embeds, negative_prompt_embeds.
  • attention_kwargs (dict, optional): Additional kwargs for attention processors.
  • fake_image_latents (Tensor, optional): Fake image latents used as history seed for I2V generation.
  • output_type (str, optional, defaults to 'np'): Output format: 'pil', 'np', or 'pt'.
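The `*_noise_sigma_min`/`*_noise_sigma_max` inputs bound a per-call noise level applied to the conditioning latents. A minimal pure-Python sketch (function and variable names are illustrative, not the pipeline's internals):

```python
import random

def noise_conditioning_latents(latents, sigma_min=0.111, sigma_max=0.135, rng=None):
    """Perturb conditioning latents with Gaussian noise whose scale is drawn
    uniformly from [sigma_min, sigma_max]."""
    rng = rng or random.Random()
    sigma = rng.uniform(sigma_min, sigma_max)   # one noise level per call
    noisy = [x + sigma * rng.gauss(0.0, 1.0) for x in latents]
    return noisy, sigma
```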
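`pyramid_num_inference_steps_list` gives one step count per pyramid stage, so the default [10, 10, 10] runs 30 denoising steps across three stages. Conceptually (with `denoise_step` as a stand-in for one scheduler/transformer update; the real loop lives in the denoise block):

```python
def run_pyramid(latents, steps_per_stage=(10, 10, 10), denoise_step=None):
    """Run each pyramid stage in order, stepping the denoiser num_steps times."""
    denoise_step = denoise_step or (lambda z, stage, step: z)
    for stage, num_steps in enumerate(steps_per_stage):
        for step in range(num_steps):
            latents = denoise_step(latents, stage, step)
    return latents
```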

Outputs:

  • videos (list): The generated videos.