# Video Lab – Wan2.2-Animate-14B FP4 Quantization & Inference
Naive FP4 quantization for Wan2.2-Animate-14B, enabling 17B-parameter video generation on a single consumer GPU (24-32 GB VRAM). Includes distill LoRA integration for 6-step inference.
## What This Does
Takes a reference character image + a driving dance video → generates a video of the character performing the dance moves. Like Kling 3's motion control, but open-source.
Wan2.2-Animate-14B is Alibaba's open-source equivalent of Kuaishou's Kling 3. It has 17.3B transformer parameters (marketed as "27B MoE", but the MoE is pipeline-level, not weight-level – the transformer itself has no expert routing).
## Key Results
| Method | Time (81 frames) | VRAM | Steps |
|---|---|---|---|
| BF16 (sequential CPU offload) | 385s | 5.5 GB | 20 |
| Naive FP4 (model CPU offload) | 272s | 15.6 GB | 20 |
| FP4 + Distill LoRA | 113s | 15.6 GB | 6 |
- 480×832 portrait resolution, 81 frames at 16 fps (~5 seconds of video)
- Hardware: RTX 5090 (32 GB VRAM), 62 GB system RAM
- FP4 output is visually identical to BF16
## Quick Start

### Prerequisites

```bash
pip install diffusers transformers accelerate safetensors sentencepiece
pip install torch  # >= 2.4.0
pip install decord onnxruntime loguru moviepy opencv-python-headless matplotlib ftfy
```
### 1. Download the BF16 model (one-time, ~82 GB)

```bash
huggingface-cli download Wan-AI/Wan2.2-Animate-14B-Diffusers --local-dir Wan2.2-Animate-14B-Diffusers
```
### 2. Download preprocessing checkpoints (one-time)

```bash
# From our repo (animation mode only: det + pose2d, ~2.5 GB):
huggingface-cli download treksis/video-lab --include "process_checkpoint/**" --local-dir .

# Or from the official source (includes SAM2 for replacement mode, ~4 GB total):
# huggingface-cli download Wan-AI/Wan2.2-Animate-14B --include "process_checkpoint/**" --local-dir Wan2.2-Animate-14B-orig
```
### 3. Download pre-quantized weights (or re-quantize yourself)

```bash
# Option A: download the pre-quantized transformer (9.9 GB) + text encoder (6.66 GB)
huggingface-cli download treksis/video-lab --include "Wan2.2-Animate-14B-FP4/**" --local-dir .

# Option B: re-quantize from BF16 (takes ~30 seconds on GPU)
python quantize_naive_fp4.py --model-dir Wan2.2-Animate-14B-Diffusers/transformer --output-dir Wan2.2-Animate-14B-FP4
```
### 4. Download the distill LoRA (optional, for 6-step inference)

```bash
# From our repo (mirrored):
huggingface-cli download treksis/video-lab --include "distill_lora/**" --local-dir .

# Or from the original source:
# huggingface-cli download Kijai/WanVideo_comfy Lightx2v/lightx2v_I2V_14B_480p_cfg_step_distill_rank64_bf16.safetensors --local-dir distill_lora
```
### 5. Clone the Wan2.2 repo (for preprocessing scripts)

```bash
git clone --depth 1 https://github.com/Wan-Video/Wan2.2.git
```
6. Preprocess your driving video
python preprocess_animate.py \
--ckpt-path Wan2.2-Animate-14B-orig/process_checkpoint \
--video-path your_dance_video.mp4 \
--refer-path your_character_image.jpg \
--save-path preprocess_output \
--width 480 --height 832 --fps 16
With retargeting (adapts skeleton proportions to the reference – use a full-body reference image!):

```bash
python preprocess_animate.py \
    --ckpt-path Wan2.2-Animate-14B-orig/process_checkpoint \
    --video-path your_dance_video.mp4 \
    --refer-path your_character_fullbody.jpg \
    --save-path preprocess_output \
    --width 480 --height 832 --fps 16 \
    --retarget
```
**Note on retargeting:** The retargeted output matches the framing of the reference image. If your reference is a close-up portrait, the output will be zoomed in. Use a full-body standing photo for full-body dance output.
### 7. Run inference

FP4 + distill LoRA (recommended, 6 steps, ~113 s):

```bash
python inference_animate_distill.py
```

FP4 only (20 steps, ~272 s):

```bash
python inference_animate_fp4.py
```

BF16 baseline (20 steps, ~385 s, needs 60+ GB RAM):

```bash
python inference_animate.py
```
## File Reference

### Scripts

| File | Description |
|---|---|
| `quantize_naive_fp4.py` | FP4 RTN quantization. GPU-accelerated, takes ~30 s. Also contains the `FP4Linear` class for inference. |
| `preprocess_animate.py` | Preprocessing: extracts pose skeleton + face crops from the driving video. Uses YOLOv10 + ViTPose from Wan's official checkpoint. Supports optional pose retargeting. |
| `inference_animate.py` | BF16 baseline inference with sequential CPU offloading. |
| `inference_animate_fp4.py` | Naive FP4 inference with model-level CPU offloading. |
| `inference_animate_distill.py` | Best version. Merges the lightx2v distill LoRA → FP4 quantization → 6-step inference. |
| `inference_animate_nunchaku.py` | Nunchaku `SVDQW4A4Linear` attempt. Fast (1 s/step) but the output is noise due to incorrect weight packing. Needs the DeepCompressor converter for the correct MMA-aligned format. |
| `preprocess_dwpose.py` | DWPose attempt (incomplete – needs mmcv/mmpose/mmdet). |
### Weights

| File | Size | Description |
|---|---|---|
| `Wan2.2-Animate-14B-FP4/quantized_transformer.safetensors` | 9.9 GB | Pre-quantized FP4 transformer (distill LoRA not merged). 525 linear layers quantized, group_size=32. |
| `Wan2.2-Animate-14B-FP4/quantized_text_encoder.safetensors` | 6.66 GB | INT4 AWQ-quantized UMT5 text encoder. 168 linear layers quantized, group_size=128. Config embedded in metadata. |
| `Wan2.2-Animate-14B-FP4/quantize_config.json` | <1 KB | Quantization metadata. |
| `distill_lora/lightx2v_I2V_14B_480p_cfg_step_distill_rank64_bf16.safetensors` | 704 MB | Distillation LoRA for 6-step inference. Originally from Kijai/WanVideo_comfy. ComfyUI format (`lora_down`/`lora_up` + `diff_b`). |
| `process_checkpoint/det/yolov10m.onnx` | 59 MB | YOLOv10m person detector (ONNX). Used by `preprocess_animate.py`. |
| `process_checkpoint/pose2d/vitpose_h_wholebody.onnx/` | 2.4 GB | ViTPose-H wholebody pose estimator (ONNX, external-data format – 394 files). Used by `preprocess_animate.py`. |
### Documentation

| File | Description |
|---|---|
| `README.md` | This file. |
| `RESEARCH_NOTES.md` | Detailed research notes: architecture analysis, Wan2.2 model variants, LTX-2.3 analysis, Kling 3 relationship, performance benchmarks. |
## Architecture Notes

### Wan2.2-Animate-14B Transformer

- 17.3B parameters (not 27B – the MoE routing is at pipeline level)
- 40 transformer blocks, each: self-attention + cross-attention + FFN
- 525 quantizable linear layers
- `inner_dim` = 5120 (40 heads × 128 dim)
- `ffn_dim` = 13824
- Text encoder: UMT5 (~10.7 GB, NOT the same as FLUX.1's T5-XXL)
- Image encoder: CLIPVisionModel (~1.2 GB)
- VAE: AutoencoderKLWan (~0.5 GB)
- Pipeline class: `WanAnimatePipeline` (diffusers)
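A quick sanity check on those dimensions: tallying just the large linear layers per block (QKVO for self- and cross-attention plus the two FFN matrices). This is deliberately incomplete – biases, norms, modulation, and image-conditioning projections make up the rest of the 17.3B:

```python
inner_dim = 5120            # 40 heads × 128 dims per head
ffn_dim = 13824
num_blocks = 40

attn = 4 * inner_dim**2                 # q, k, v, out projection matrices
ffn = 2 * inner_dim * ffn_dim           # up + down FFN matrices
per_block = 2 * attn + ffn              # self-attention + cross-attention + FFN
total = num_blocks * per_block
print(f"{total / 1e9:.2f}B")            # ~14.05B in the block linears alone
```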
### VRAM Budget
| Component | BF16 | FP4 |
|---|---|---|
| Transformer | 34.5 GB | ~14 GB (9.9 GB weights + dequant overhead) |
| Text encoder (UMT5) | 10.7 GB | 10.7 GB (or ~3 GB with INT4 AWQ) |
| Image encoder (CLIP) | 1.2 GB | 1.2 GB |
| VAE | 0.5 GB | 0.5 GB |
| Total | ~47 GB | ~16-26 GB |
BF16 doesn't fit on any consumer GPU → needs sequential CPU offloading (slow). FP4 fits in 24–32 GB VRAM → model-level offloading or no offloading (fast).
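The FP4 transformer figure is consistent with back-of-envelope arithmetic – 4 bits per weight plus one 16-bit scale per group of 32 weights (the on-disk 9.9 GB is slightly larger because some layers stay unquantized):

```python
params = 17.3e9                          # transformer parameters
packed = params * 0.5                    # two FP4 weights per byte
scales = (params / 32) * 2               # one bf16 scale per 32-weight group
total_gb = (packed + scales) / 1e9
print(f"{total_gb:.2f} GB")              # ~9.73 GB
```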
### Naive FP4 Quantization

- Round-to-nearest (RTN) to the FP4 E2M1 format
- Per-group scales with group_size=32
- Packed as uint8 (2 × 4-bit values per byte)
- At inference time: dequantize to BF16 before the matmul (fake quantization)
- No calibration data needed – quantizes in ~30 seconds on GPU
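A minimal self-contained sketch of the same scheme (RTN to E2M1, per-group scales, uint8 packing). Function and variable names are illustrative, not the ones in `quantize_naive_fp4.py`:

```python
import torch

# FP4 E2M1 magnitudes; a 4-bit code is 1 sign bit + 3 bits indexing this table.
E2M1 = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_rtn(w: torch.Tensor, group_size: int = 32):
    """Round-to-nearest FP4 with one scale per group; returns packed uint8 + scales."""
    flat = w.float().reshape(-1, group_size)
    scales = (flat.abs().amax(dim=1, keepdim=True) / 6.0).clamp(min=1e-8)
    x = flat / scales                                        # now in [-6, 6]
    idx = (x.abs().unsqueeze(-1) - E2M1).abs().argmin(dim=-1).to(torch.uint8)
    sign = (x < 0).to(torch.uint8)
    codes = (sign << 3) | idx                                # 4-bit code per weight
    packed = (codes[:, ::2] << 4) | codes[:, 1::2]           # two codes per byte
    return packed, scales

def dequantize_fp4(packed: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    codes = torch.stack([packed >> 4, packed & 0x0F], dim=-1).reshape(packed.shape[0], -1)
    mags = E2M1[(codes & 0x07).long()]
    vals = torch.where((codes & 0x08) > 0, -mags, mags)
    return vals * scales
```

At inference, `dequantize_fp4` reverses the packing and the matmul then runs in BF16 – the fake-quantization path.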
### Distill LoRA (lightx2v)

- Trained for I2V but works on Animate (same base architecture)
- ComfyUI format with `lora_down`/`lora_up` + `diff_b` (bias diffs)
- Merged into the BF16 weights before FP4 quantization
- Scheduler: `FlowMatchEulerDiscreteScheduler` with `shift=5.0`
- Reduces 20 steps → 6 steps with comparable quality
- Key mapping: ComfyUI `self_attn.q` → diffusers `attn1.to_q`, etc.
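The merge itself is plain LoRA algebra, applied once per layer before quantization. A hedged sketch – the key map is abbreviated and the `alpha` handling simplified relative to the real script:

```python
import torch

# Abbreviated ComfyUI -> diffusers key map (the full map covers k/v/o, FFN, etc.).
COMFY_TO_DIFFUSERS = {
    "self_attn.q": "attn1.to_q",
    "self_attn.k": "attn1.to_k",
    "cross_attn.q": "attn2.to_q",
}

def merge_lora_layer(weight, lora_down, lora_up, diff_b=None, bias=None, alpha=1.0):
    """Fold W += alpha * (up @ down) and the bias diff into the base weights."""
    delta = alpha * (lora_up.float() @ lora_down.float())
    weight = weight + delta.to(weight.dtype)
    if diff_b is not None and bias is not None:
        bias = bias + diff_b.to(bias.dtype)
    return weight, bias
```

After every mapped layer is merged, the resulting BF16 state dict goes through the FP4 quantizer as usual.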
### Preprocessing Pipeline

Animation mode (what we use):

1. YOLOv10m detects the person bounding box
2. ViTPose-H extracts wholebody keypoints (body + hands + face)
3. Draw the skeleton on a black canvas → `src_pose.mp4`
4. Crop the face region → `src_face.mp4`
5. (Optional) Retarget pose proportions to the reference image

Does NOT need (only for replacement mode):

- SAM2 (segmentation)
- FluxKontext (only for retarget + edit mode)
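The skeleton-on-black-canvas step can be sketched with plain NumPy (the real script draws the ViTPose keypoints with OpenCV; the joint coordinates and bone list below are made up for illustration):

```python
import numpy as np

BONES = [(0, 1), (1, 2), (1, 3)]  # hypothetical joint connectivity

def draw_skeleton(keypoints, h=832, w=480):
    """Rasterize bones as 1-px green lines on a black canvas (one src_pose.mp4 frame)."""
    canvas = np.zeros((h, w, 3), dtype=np.uint8)
    for a, b in BONES:
        (x0, y0), (x1, y1) = keypoints[a], keypoints[b]
        n = int(max(abs(x1 - x0), abs(y1 - y0))) + 1
        xs = np.linspace(x0, x1, n).round().astype(int).clip(0, w - 1)
        ys = np.linspace(y0, y1, n).round().astype(int).clip(0, h - 1)
        canvas[ys, xs] = (0, 255, 0)
    return canvas
```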
## ComfyUI Workflow Analysis (ICY WAN ANIMATE)

The ICY ComfyUI workflow uses these tricks, which we replicated:
| Feature | ComfyUI | Our Implementation |
|---|---|---|
| Model | Wan2.2-Animate-14B FP8 | Same model, naive FP4 |
| Distill LoRA | lightx2v rank64/128 | lightx2v rank64, merged before quant |
| Steps | 6 | 6 |
| Sampler | res_2s | FlowMatchEulerDiscrete (closest equivalent) |
| Scheduler | beta57 | shift=5.0 (approximate) |
| Face swap | Klein 9B post-processing | Not implemented (future work) |
| Enhancer | WanAnimateModelEnhancer | Not implemented |
## Known Issues & Future Work

- **Nunchaku packing:** The `SVDQW4A4Linear` version produces noise because our naive packing doesn't match Nunchaku's MMA-aligned format. Fix: use `convert_to_nunchaku_w4x4y16_linear_state_dict` from DeepCompressor (same approach as FLUX.2-Klein).
- **UMT5 quantization:** The text encoder currently stays in BF16 (10.7 GB). It can be quantized with W4Linear AWQ to ~3 GB (code exists in `inference_animate_nunchaku.py`). This would eliminate the need for any CPU offloading.
- **Retargeting + full-body reference:** Retargeting improves quality significantly but requires a full-body reference image (close-ups cause zooming).
- **RIFE frame interpolation:** Output is 16 fps. Adding RIFE would double it to 32 fps for smoother video.
- **Klein face swap:** The ComfyUI workflow runs FLUX.2-Klein on output frames to improve face quality. We have the model but didn't implement this step.
- **LTX-2.3:** Deferred. 22B params, ComfyUI-only format; a community diffusers conversion is available at `dg845/LTX-2.3-Diffusers`.