
Video Lab: Wan2.2-Animate-14B FP4 Quantization & Inference

Naive FP4 quantization for Wan2.2-Animate-14B, enabling 17B-parameter video generation on a single consumer GPU (24-32 GB VRAM). Includes distill LoRA integration for 6-step inference.

What This Does

Takes a reference character image + a driving dance video → generates a video of the character performing the dance moves. Like Kling 3's motion control, but open-source.

Wan2.2-Animate-14B is Alibaba's open-source counterpart to Kuaishou's Kling 3. It has 17.3B transformer parameters (marketed as "27B MoE", but the MoE is at pipeline level, not weight level: the transformer itself has no expert routing).

Key Results

| Method | Time (81 frames) | VRAM | Steps |
|--------|------------------|------|-------|
| BF16 (sequential CPU offload) | 385 s | 5.5 GB | 20 |
| Naive FP4 (model CPU offload) | 272 s | 15.6 GB | 20 |
| FP4 + Distill LoRA | 113 s | 15.6 GB | 6 |

  • 480x832 portrait resolution, 81 frames at 16 fps (~5 seconds of video)
  • RTX 5090 (32 GB VRAM), 62 GB RAM
  • FP4 output is visually indistinguishable from the BF16 baseline

Quick Start

Prerequisites

pip install diffusers transformers accelerate safetensors sentencepiece
pip install torch  # >= 2.4.0
pip install decord onnxruntime loguru moviepy opencv-python-headless matplotlib ftfy

1. Download the BF16 model (one-time, ~82 GB)

huggingface-cli download Wan-AI/Wan2.2-Animate-14B-Diffusers --local-dir Wan2.2-Animate-14B-Diffusers

2. Download preprocessing checkpoints (one-time)

# From our repo (animation mode only: det + pose2d, ~2.5 GB):
huggingface-cli download treksis/video-lab --include "process_checkpoint/**" --local-dir .

# Or from official source (includes SAM2 for replacement mode, ~4 GB total):
# huggingface-cli download Wan-AI/Wan2.2-Animate-14B --include "process_checkpoint/**" --local-dir Wan2.2-Animate-14B-orig

3. Download pre-quantized weights (or re-quantize yourself)

# Option A: Download pre-quantized transformer (9.9 GB) + text encoder (6.66 GB)
huggingface-cli download treksis/video-lab --include "Wan2.2-Animate-14B-FP4/**" --local-dir .

# Option B: Re-quantize from BF16 (takes ~30 seconds on GPU)
python quantize_naive_fp4.py --model-dir Wan2.2-Animate-14B-Diffusers/transformer --output-dir Wan2.2-Animate-14B-FP4

4. Download distill LoRA (optional, for 6-step inference)

# From our repo (mirrored):
huggingface-cli download treksis/video-lab --include "distill_lora/**" --local-dir .

# Or from original source:
# huggingface-cli download Kijai/WanVideo_comfy Lightx2v/lightx2v_I2V_14B_480p_cfg_step_distill_rank64_bf16.safetensors --local-dir distill_lora

5. Clone Wan2.2 repo (for preprocessing scripts)

git clone --depth 1 https://github.com/Wan-Video/Wan2.2.git

6. Preprocess your driving video

python preprocess_animate.py \
  --ckpt-path Wan2.2-Animate-14B-orig/process_checkpoint \
  --video-path your_dance_video.mp4 \
  --refer-path your_character_image.jpg \
  --save-path preprocess_output \
  --width 480 --height 832 --fps 16

If you downloaded the checkpoints from our repo in step 2 (first command), pass --ckpt-path process_checkpoint instead.

With retargeting (adapts skeleton proportions to the reference; use a full-body reference image!):

python preprocess_animate.py \
  --ckpt-path Wan2.2-Animate-14B-orig/process_checkpoint \
  --video-path your_dance_video.mp4 \
  --refer-path your_character_fullbody.jpg \
  --save-path preprocess_output \
  --width 480 --height 832 --fps 16 \
  --retarget

Note on retargeting: The retargeted output matches the framing of the reference image. If your reference is a close-up portrait, the output will be zoomed in. Use a full-body standing photo for full-body dance output.

7. Run inference

FP4 + Distill LoRA (recommended, 6 steps, ~113s):

python inference_animate_distill.py

FP4 only (20 steps, ~272s):

python inference_animate_fp4.py

BF16 baseline (20 steps, ~385s, needs 60+ GB RAM):

python inference_animate.py

File Reference

Scripts

| File | Description |
|------|-------------|
| quantize_naive_fp4.py | FP4 RTN quantization. GPU-accelerated, takes ~30 s. Also contains the FP4Linear class used at inference. |
| preprocess_animate.py | Preprocessing: extracts pose skeleton + face crops from the driving video. Uses YOLOv10 + ViTPose from Wan's official checkpoint. Supports optional pose retargeting. |
| inference_animate.py | BF16 baseline inference with sequential CPU offloading. |
| inference_animate_fp4.py | Naive FP4 inference with model-level CPU offloading. |
| inference_animate_distill.py | Best version. Merges the lightx2v distill LoRA, then FP4-quantizes, then runs 6-step inference. |
| inference_animate_nunchaku.py | Nunchaku SVDQW4A4Linear attempt. Fast (1 s/step) but the output is noise due to incorrect weight packing; needs the DeepCompressor converter for the MMA-aligned format. |
| preprocess_dwpose.py | DWPose attempt (incomplete; needs mmcv/mmpose/mmdet). |

Weights

| File | Size | Description |
|------|------|-------------|
| Wan2.2-Animate-14B-FP4/quantized_transformer.safetensors | 9.9 GB | Pre-quantized FP4 transformer (distill LoRA not merged). 525 linear layers quantized, group_size=32. |
| Wan2.2-Animate-14B-FP4/quantized_text_encoder.safetensors | 6.66 GB | INT4 AWQ-quantized UMT5 text encoder. 168 linear layers quantized, group_size=128. Config embedded in the safetensors metadata. |
| Wan2.2-Animate-14B-FP4/quantize_config.json | <1 KB | Quantization metadata. |
| distill_lora/lightx2v_I2V_14B_480p_cfg_step_distill_rank64_bf16.safetensors | 704 MB | Distillation LoRA for 6-step inference. Originally from Kijai/WanVideo_comfy. ComfyUI format (lora_down/lora_up + diff_b). |
| process_checkpoint/det/yolov10m.onnx | 59 MB | YOLOv10m person detector (ONNX). Used by preprocess_animate.py. |
| process_checkpoint/pose2d/vitpose_h_wholebody.onnx/ | 2.4 GB | ViTPose-H wholebody pose estimator (ONNX, external-data format: 394 files). Used by preprocess_animate.py. |

Documentation

| File | Description |
|------|-------------|
| README.md | This file. |
| RESEARCH_NOTES.md | Detailed research notes: architecture analysis, Wan2.2 model variants, LTX-2.3 analysis, Kling 3 relationship, performance benchmarks. |

Architecture Notes

Wan2.2-Animate-14B Transformer

  • 17.3B parameters (not 27B: the MoE routing is at pipeline level)
  • 40 transformer blocks, each: self-attention + cross-attention + FFN
  • 525 quantizable linear layers
  • inner_dim = 5120 (40 heads x 128 dim)
  • ffn_dim = 13824
  • Text encoder: UMT5 (~10.7 GB, NOT the same as FLUX.1's T5-XXL)
  • Image encoder: CLIPVisionModel (~1.2 GB)
  • VAE: AutoencoderKLWan (~0.5 GB)
  • Pipeline class: WanAnimatePipeline (diffusers)

VRAM Budget

| Component | BF16 | FP4 |
|-----------|------|-----|
| Transformer | 34.5 GB | ~14 GB (9.9 GB weights + dequant overhead) |
| Text encoder (UMT5) | 10.7 GB | 10.7 GB (or ~3 GB with INT4 AWQ) |
| Image encoder (CLIP) | 1.2 GB | 1.2 GB |
| VAE | 0.5 GB | 0.5 GB |
| Total | ~47 GB | ~16-26 GB |

At ~47 GB, BF16 doesn't fit on any consumer GPU and needs sequential CPU offloading (slow). FP4 fits in 24-32 GB of VRAM, so model-level offloading, or none at all, is enough (fast).
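As a sanity check on the budget above, the 9.9 GB transformer file is roughly what 17.3B parameters at 4 bits per weight plus one BF16 scale per 32-weight group predicts; the small remainder comes from layers kept in BF16 (a back-of-envelope sketch, not measured from the checkpoint):

```python
# Back-of-envelope size check for the FP4 transformer checkpoint.
params = 17.3e9                  # transformer parameters
weight_bytes = params * 4 / 8    # 4-bit FP4 codes, two per byte
scale_bytes = params / 32 * 2    # one BF16 (2-byte) scale per group of 32
total_gb = (weight_bytes + scale_bytes) / 1e9
print(f"{total_gb:.1f} GB")      # ~9.7 GB; non-quantized layers make up the rest
```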

Naive FP4 Quantization

  • Round-To-Nearest (RTN) to FP4 E2M1 format
  • Per-group scales with group_size=32
  • Packed as uint8 (2 x 4-bit values per byte)
  • At inference time: dequantize to BF16 before matmul (fake quantization)
  • No calibration data needed; quantizes in ~30 seconds on GPU
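A minimal PyTorch sketch of the scheme above (illustrative names; the real implementation lives in quantize_naive_fp4.py and additionally packs two 4-bit codes per uint8 byte):

```python
import torch

# The 8 non-negative magnitudes representable in FP4 E2M1
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_rtn(w: torch.Tensor, group_size: int = 32):
    """Round-to-nearest FP4 with one absmax scale per group (no calibration)."""
    g = w.float().reshape(-1, group_size)
    scale = g.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 6.0  # absmax -> 6.0
    x = g / scale
    mag = (x.abs().unsqueeze(-1) - FP4_GRID).abs().argmin(dim=-1)    # 3-bit magnitude
    codes = ((x < 0).to(torch.uint8) << 3) | mag.to(torch.uint8)     # + 1 sign bit
    # (packing omitted: codes[:, 0::2] << 4 | codes[:, 1::2] -> one uint8 per pair)
    return codes, scale

def dequantize_fp4(codes: torch.Tensor, scale: torch.Tensor, shape):
    """Fake quantization: expand back to full precision before the matmul."""
    mag = FP4_GRID[(codes & 0b0111).long()]
    sign = torch.where((codes >> 3).bool(), -1.0, 1.0)
    return (sign * mag * scale).reshape(shape)
```

At inference, an FP4Linear layer along these lines dequantizes to BF16 and runs a normal matmul, trading a little compute for the ~4x memory saving over BF16 weights.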

Distill LoRA (lightx2v)

  • Trained for I2V but works on Animate (same base architecture)
  • ComfyUI format with lora_down/lora_up + diff_b (bias diffs)
  • Merged into BF16 weights before FP4 quantization
  • Scheduler: FlowMatchEulerDiscreteScheduler with shift=5.0
  • Reduces 20 steps to 6 with comparable quality
  • Key mapping: ComfyUI self_attn.q → diffusers attn1.to_q, etc.
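The merge itself is the standard low-rank fold, sketched below with a hypothetical helper (the real key-remapping table also covers k/v/o, FFN, and cross-attention projections):

```python
import torch

# One illustrative ComfyUI -> diffusers key mapping (the full table is larger).
KEY_MAP = {"self_attn.q": "attn1.to_q"}

def merge_lora(weight, lora_down, lora_up, alpha=None, bias=None, diff_b=None):
    """Fold a rank-r LoRA into the base weight: W += (alpha / r) * up @ down.

    diff_b is the ComfyUI-style bias delta, added to the base bias if present.
    """
    rank = lora_down.shape[0]
    scale = (alpha / rank) if alpha is not None else 1.0
    merged_w = weight + scale * (lora_up @ lora_down)
    merged_b = bias + diff_b if (bias is not None and diff_b is not None) else bias
    return merged_w, merged_b
```

Because the fold happens before FP4 quantization, the 6-step behaviour is baked into the quantized checkpoint and inference needs no LoRA machinery at all.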

Preprocessing Pipeline

Animation mode (what we use):

  1. YOLOv10m detects person bounding box
  2. ViTPose-H extracts wholebody keypoints (body + hands + face)
  3. Draw skeleton on black canvas → src_pose.mp4
  4. Crop face region → src_face.mp4
  5. (Optional) Retarget pose proportions to reference image
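Step 3 can be pictured with a small numpy-only sketch (the real script rasterizes the full wholebody bone list with OpenCV; the bone indices and colors here are dummies):

```python
import numpy as np

BONES = [(0, 1), (1, 2)]  # illustrative subset of the wholebody skeleton

def draw_skeleton(keypoints, width=480, height=832):
    """Rasterize one frame's keypoints as line segments on a black canvas."""
    canvas = np.zeros((height, width, 3), dtype=np.uint8)
    for a, b in BONES:
        # Sample points along the bone (stand-in for cv2.line, to stay numpy-only)
        for t in np.linspace(0.0, 1.0, 256):
            x, y = (1.0 - t) * keypoints[a] + t * keypoints[b]
            canvas[int(round(y)), int(round(x))] = (0, 255, 0)
    return canvas
```

One such frame per input frame, encoded at 16 fps, yields src_pose.mp4.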

Does NOT need (only for replacement mode):

  • SAM2 (segmentation)
  • FluxKontext (only for retarget+edit mode)

ComfyUI Workflow Analysis (ICY WAN ANIMATE)

The ICY ComfyUI workflow uses these tricks, which we replicated:

| Feature | ComfyUI | Our Implementation |
|---------|---------|--------------------|
| Model | Wan2.2-Animate-14B FP8 | Same model, naive FP4 |
| Distill LoRA | lightx2v rank64/128 | lightx2v rank64, merged before quantization |
| Steps | 6 | 6 |
| Sampler | res_2s | FlowMatchEulerDiscrete (closest equivalent) |
| Scheduler | beta57 | shift=5.0 (approximate) |
| Face swap | Klein 9B post-processing | Not implemented (future work) |
| Enhancer | WanAnimateModelEnhancer | Not implemented |

Known Issues & Future Work

  1. Nunchaku packing: The SVDQW4A4Linear version produces noise because our naive packing doesn't match Nunchaku's MMA-aligned format. Fix: use convert_to_nunchaku_w4x4y16_linear_state_dict from DeepCompressor (same approach as FLUX.2-Klein).

  2. UMT5 quantization: Currently the text encoder stays in BF16 (10.7 GB). Can be quantized with W4Linear AWQ to ~3 GB (code exists in inference_animate_nunchaku.py). This would eliminate the need for any CPU offloading.

  3. Retargeting + full-body reference: Retargeting improves quality significantly but requires a full-body reference image (close-ups cause zooming).

  4. RIFE frame interpolation: Output is 16fps. Adding RIFE would double to 32fps for smoother video.

  5. Klein face swap: The ComfyUI workflow runs FLUX.2-Klein on output frames to improve face quality. We have the model but didn't implement this step.

  6. LTX-2.3: Deferred. 22B params, ComfyUI-only format, community diffusers conversion available at dg845/LTX-2.3-Diffusers.

