DreamX-World-5B-Cam


Project Page GitHub

Model Description

DreamX-World is a general-purpose world model for interactive world simulation. It generates diverse, high-fidelity worlds that users can explore, control, and transform with event prompts.

DreamX-World-5B-Cam is the 5B-parameter camera-control variant, built on top of Wan2.2-TI2V-5B. Given a single input image, a text description, and a sequence of camera action commands, it generates video that follows the requested camera trajectory. Camera conditioning uses PRoPE (Projective Position Encoding).

Key Features

  • Camera-Controllable Video Generation: Precise 6-DoF camera control via action commands (forward, backward, left, right, up, down, tilt, pan, etc.)
  • Realistic & Fantasy Worlds: Generates indoor, urban, natural, architectural, game-like, sci-fi, and stylized environments
  • Flexible Resolution & Duration: Generates videos at 704×1280 resolution, either 5s at 24 FPS (121 frames) or 5s at 16 FPS (81 frames), with support for up to 7.5s at 16 FPS
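
The durations above map to frame counts with simple arithmetic. The sketch below assumes the convention implied by the documented settings (121 frames for 5s at 24 FPS, 81 frames for 5s at 16 FPS), namely one initial frame plus `fps` frames per second. The function names are illustrative and are not part of the DreamX-World code.

```python
def frame_count(duration_s: float, fps: int) -> int:
    """Frames in a clip, assuming one initial frame plus fps frames per second."""
    return int(round(duration_s * fps)) + 1


def duration_seconds(frames: int, fps: int) -> float:
    """Inverse of frame_count under the same assumption."""
    return (frames - 1) / fps


print(frame_count(5, 24))         # 121
print(frame_count(5, 16))         # 81
print(duration_seconds(121, 16))  # 7.5
```

Under this convention, 7.5s at 16 FPS comes out to the same 121 frames as 5s at 24 FPS, which may be why 7.5s is the stated upper bound at 16 FPS.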

How to Use

Requirements

pip install -r requirements.txt

Key dependencies:

  • torch==2.5.1
  • diffusers>=0.30.1
  • transformers>=4.46.2
  • xfuser==0.4.1
  • flash_attn==2.8.3

Prerequisites

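Download the base model weights (Wan2.2-TI2V-5B) and place them at the path given by MODEL_NAME in the inference settings below (./Wan2.2-TI2V-5B by default):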

Inference

  1. Prepare your input JSON file (see configs/dreamx/eval.json for examples):
{
  "image_path": "./demo/your_image.png",
  "caption": "Style: Photorealistic. A description of the scene...",
  "action_seq": ["w", "wj"],
  "action_speed_list": [4, 6]
}
  2. Run inference:
# ======================== Model Path ========================
MODEL_NAME="./Wan2.2-TI2V-5B"
CONFIG_PATH="./configs/wan2.2/wan_ti2v_5b.yaml"
TRANSFORMER_PATH="./Dreamx-5b/"

# ====================== Basic Settings ======================
INPUT_DIR="./configs/dreamx/eval.json"
OUTPUT_DIR="./outputs/"
SAMPLE_HEIGHT=704
SAMPLE_WIDTH=1280
VIDEO_LENGTH=121       # 121 frames = 5s @ 24fps, 81 frames = 5s @ 16fps
FPS=24
GUIDANCE_SCALE=3.0
NUM_INFERENCE_STEPS=50
SEED=42

# ====================== Camera Control ======================
CAM_METHOD="prope"
ADD_CONTROL_ADAPTER="--add_control_adapter"

# ======================== Multi-GPU ========================
WEIGHT_DTYPE="bfloat16"
ULYSSES_DEGREE=8
RING_DEGREE=1
CUDA_DEVICES="0,1,2,3,4,5,6,7"

sh inference_dreamx_5b.sh
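
The multi-GPU settings above use 8 devices with ULYSSES_DEGREE=8 and RING_DEGREE=1. A minimal sanity check is sketched below. It assumes that the product of the two degrees should equal the number of listed GPUs, which is the usual convention for Ulysses plus Ring sequence parallelism but is not stated in this model card. `check_parallel_config` is a hypothetical helper, not part of the repository or of xfuser.

```python
def check_parallel_config(ulysses_degree: int, ring_degree: int, cuda_devices: str) -> int:
    """Return the sequence-parallel world size, or raise if it does not match the GPU list.

    Assumption: ulysses_degree * ring_degree must equal the number of visible GPUs.
    """
    devices = [d.strip() for d in cuda_devices.split(",") if d.strip()]
    world_size = ulysses_degree * ring_degree
    if world_size != len(devices):
        raise ValueError(
            f"ulysses_degree * ring_degree = {world_size}, "
            f"but {len(devices)} GPU(s) are listed: {devices}"
        )
    return world_size


# Values taken from the settings shown above.
print(check_parallel_config(8, 1, "0,1,2,3,4,5,6,7"))  # 8
```

Whether the script supports other device counts is not documented here, so treat any configuration other than the one shown as untested.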

Camera Action Commands

| Action | Description   |
|--------|---------------|
| w      | Move forward  |
| s      | Move backward |
| a      | Move left     |
| d      | Move right    |
| j      | Tilt down     |
| k      | Tilt up       |
| l      | Pan right     |
| h      | Pan left      |

Actions can be composed (e.g., wj = move forward + tilt down, dj = move right + tilt down).
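
To make the composition rule concrete, the sketch below splits a composed action into its single-key parts and assembles an input entry in the JSON shape shown under Inference. It rests on two assumptions that the model card does not confirm: each character in an action string is one key from the table, and `action_speed_list` holds one speed per entry of `action_seq` (both lists have two items in the example). The meaning and units of the speed values are not documented, so the code only checks lengths. `ACTIONS`, `describe_action`, and `build_entry` are illustrative names.

```python
import json

# Key-to-description mapping copied from the action table.
ACTIONS = {
    "w": "move forward",
    "s": "move backward",
    "a": "move left",
    "d": "move right",
    "j": "tilt down",
    "k": "tilt up",
    "l": "pan right",
    "h": "pan left",
}


def describe_action(action: str) -> list:
    """Split a composed action such as 'wj' into per-key descriptions."""
    unknown = [key for key in action if key not in ACTIONS]
    if not action or unknown:
        raise ValueError(f"invalid action {action!r}, unknown keys: {unknown}")
    return [ACTIONS[key] for key in action]


def build_entry(image_path: str, caption: str, action_seq: list, action_speed_list: list) -> dict:
    """Assemble one input entry after basic validation."""
    if len(action_seq) != len(action_speed_list):
        raise ValueError("action_seq and action_speed_list differ in length")
    for action in action_seq:
        describe_action(action)  # raises on unknown keys
    return {
        "image_path": image_path,
        "caption": caption,
        "action_seq": action_seq,
        "action_speed_list": action_speed_list,
    }


print(describe_action("wj"))  # ['move forward', 'tilt down']
entry = build_entry(
    "./demo/your_image.png",
    "Style: Photorealistic. A description of the scene...",
    ["w", "wj"],
    [4, 6],
)
print(json.dumps(entry, indent=2))
```

Check configs/dreamx/eval.json for the exact file layout before writing entries to disk, since this sketch only builds a single object.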

Technical Specifications

| Attribute      | Value                                                                    |
|----------------|--------------------------------------------------------------------------|
| Architecture   | Transformer-based DiT (Diffusion Transformer)                            |
| Parameters     | ~5B                                                                      |
| Base Model     | Wan2.2-TI2V-5B                                                           |
| Camera Control | PRoPE (Projective Position Encoding)                                     |
| VAE            | AutoencoderKLWan3_8 (temporal compression 4×, spatial compression 16×)  |
| Text Encoder   | UMT5-XXL                                                                 |
| Scheduler      | Flow Matching Euler Discrete                                             |
| Precision      | BFloat16                                                                 |
| Max Resolution | 704 × 1280                                                               |
| Frame Count    | 121 (5s@24fps) / 81 (5s@16fps), up to 7.5s@16fps                         |
| Multi-GPU      | Ulysses + Ring parallelism via xfuser                                    |
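
The VAE compression factors determine the size of the latent grid the transformer operates on. The sketch below applies the 4× temporal and 16× spatial factors from the table. It assumes the first frame is encoded on its own, so that the temporal size is (frames - 1) / 4 + 1; this is common for causal video VAEs but is not stated in this model card. `latent_shape` is an illustrative helper, and latent channels and any transformer patching are left out.

```python
def latent_shape(frames: int, height: int, width: int,
                 temporal: int = 4, spatial: int = 16) -> tuple:
    """Latent (frames, height, width) for a video, under the stated assumption."""
    if (frames - 1) % temporal or height % spatial or width % spatial:
        raise ValueError("input size is not compatible with the compression factors")
    return ((frames - 1) // temporal + 1, height // spatial, width // spatial)


print(latent_shape(121, 704, 1280))  # (31, 44, 80)
print(latent_shape(81, 704, 1280))   # (21, 44, 80)
```

Both documented frame counts (121 and 81) satisfy the divisibility check, as does the 704 × 1280 resolution.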

License

This model is released under the MIT License.

Acknowledgement

We thank the Wan Team for open-sourcing their code and models.
