# Video Lab – Wan2.2-Animate-14B FP4 Quantization & Inference
Naive FP4 quantization for Wan2.2-Animate-14B, enabling 17B-parameter video generation on a single consumer GPU (24-32 GB VRAM). Includes distill LoRA integration for 6-step inference.
## What This Does
Takes a reference character image + a driving dance video → generates a video of the character performing the dance moves. Like Kling 3's motion control, but open-source.
Wan2.2-Animate-14B is Alibaba's open-source equivalent of Kuaishou's Kling 3. It has 17.3B transformer parameters (marketed as "27B MoE", but the MoE is pipeline-level, not weight-level – the transformer itself has no expert routing).
## Key Results
| Method | Time (81 frames) | VRAM | Steps |
|---|---|---|---|
| BF16 (sequential CPU offload) | 385s | 5.5 GB | 20 |
| Naive FP4 (model CPU offload) | 272s | 15.6 GB | 20 |
| FP4 + Distill LoRA | 113s | 15.6 GB | 6 |
- 480×832 portrait resolution, 81 frames at 16 fps (~5 seconds of video)
- Hardware: RTX 5090 (32 GB VRAM), 62 GB system RAM
- FP4 output is visually identical to BF16
## Quick Start

### Prerequisites

```bash
pip install diffusers transformers accelerate safetensors sentencepiece
pip install torch  # >= 2.4.0
pip install decord onnxruntime loguru moviepy opencv-python-headless matplotlib ftfy
```
### 1. Download the BF16 model (one-time, ~82 GB)

```bash
huggingface-cli download Wan-AI/Wan2.2-Animate-14B-Diffusers --local-dir Wan2.2-Animate-14B-Diffusers
```
### 2. Download preprocessing checkpoints (one-time)

```bash
# From our repo (animation mode only: det + pose2d, ~2.5 GB):
huggingface-cli download treksis/video-lab --include "process_checkpoint/**" --local-dir .

# Or from the official source (includes SAM2 for replacement mode, ~4 GB total):
# huggingface-cli download Wan-AI/Wan2.2-Animate-14B --include "process_checkpoint/**" --local-dir Wan2.2-Animate-14B-orig
```
### 3. Download pre-quantized weights (or re-quantize yourself)

```bash
# Option A: download the pre-quantized transformer (9.9 GB) + text encoder (6.66 GB)
huggingface-cli download treksis/video-lab --include "Wan2.2-Animate-14B-FP4/**" --local-dir .

# Option B: re-quantize from BF16 (takes ~30 seconds on GPU)
python quantize_naive_fp4.py --model-dir Wan2.2-Animate-14B-Diffusers/transformer --output-dir Wan2.2-Animate-14B-FP4
```
### 4. Download the distill LoRA (optional, for 6-step inference)

```bash
# From our repo (mirrored):
huggingface-cli download treksis/video-lab --include "distill_lora/**" --local-dir .

# Or from the original source:
# huggingface-cli download Kijai/WanVideo_comfy Lightx2v/lightx2v_I2V_14B_480p_cfg_step_distill_rank64_bf16.safetensors --local-dir distill_lora
```
### 5. Clone the Wan2.2 repo (for preprocessing scripts)

```bash
git clone --depth 1 https://github.com/Wan-Video/Wan2.2.git
```
6. Preprocess your driving video
python preprocess_animate.py \
--ckpt-path Wan2.2-Animate-14B-orig/process_checkpoint \
--video-path your_dance_video.mp4 \
--refer-path your_character_image.jpg \
--save-path preprocess_output \
--width 480 --height 832 --fps 16
With retargeting (adapts skeleton proportions to the reference – use a full-body reference image!):

```bash
python preprocess_animate.py \
    --ckpt-path Wan2.2-Animate-14B-orig/process_checkpoint \
    --video-path your_dance_video.mp4 \
    --refer-path your_character_fullbody.jpg \
    --save-path preprocess_output \
    --width 480 --height 832 --fps 16 \
    --retarget
```
**Note on retargeting:** The retargeted output matches the framing of the reference image. If your reference is a close-up portrait, the output will be zoomed in. Use a full-body standing photo for full-body dance output.
### 7. Run inference

FP4 + distill LoRA (recommended, 6 steps, ~113 s):

```bash
python inference_animate_distill.py
```

FP4 only (20 steps, ~272 s):

```bash
python inference_animate_fp4.py
```

BF16 baseline (20 steps, ~385 s, needs 60+ GB RAM):

```bash
python inference_animate.py
```
## File Reference

### Scripts

| File | Description |
|---|---|
| `quantize_naive_fp4.py` | FP4 RTN quantization. GPU-accelerated, takes ~30 s. Also contains the `FP4Linear` class for inference. |
| `preprocess_animate.py` | Preprocessing: extracts pose skeleton + face crops from the driving video. Uses YOLOv10 + ViTPose from Wan's official checkpoint. Supports optional pose retargeting. |
| `inference_animate.py` | BF16 baseline inference with sequential CPU offloading. |
| `inference_animate_fp4.py` | Naive FP4 inference with model-level CPU offloading. |
| `inference_animate_distill.py` | Best version. Merges the lightx2v distill LoRA → FP4 quantization → 6-step inference. |
| `inference_animate_nunchaku.py` | Nunchaku `SVDQW4A4Linear` attempt. Fast (1 s/step) but the output is noise due to incorrect weight packing. Needs the DeepCompressor converter for the correct MMA-aligned format. |
| `preprocess_dwpose.py` | DWPose attempt (incomplete – needs mmcv/mmpose/mmdet). |
### Weights

| File | Size | Description |
|---|---|---|
| `Wan2.2-Animate-14B-FP4/quantized_transformer.safetensors` | 9.9 GB | Pre-quantized FP4 transformer (distill LoRA not merged). 525 linear layers quantized, group_size=32. |
| `Wan2.2-Animate-14B-FP4/quantized_text_encoder.safetensors` | 6.66 GB | INT4 AWQ-quantized UMT5 text encoder. 168 linear layers quantized, group_size=128. Config embedded in metadata. |
| `Wan2.2-Animate-14B-FP4/quantize_config.json` | <1 KB | Quantization metadata. |
| `distill_lora/lightx2v_I2V_14B_480p_cfg_step_distill_rank64_bf16.safetensors` | 704 MB | Distillation LoRA for 6-step inference. Originally from Kijai/WanVideo_comfy. ComfyUI format (`lora_down`/`lora_up` + `diff_b`). |
| `process_checkpoint/det/yolov10m.onnx` | 59 MB | YOLOv10m person detector (ONNX). Used by `preprocess_animate.py`. |
| `process_checkpoint/pose2d/vitpose_h_wholebody.onnx/` | 2.4 GB | ViTPose-H wholebody pose estimator (ONNX, external-data format – 394 files). Used by `preprocess_animate.py`. |
### Documentation

| File | Description |
|---|---|
| `README.md` | This file. |
| `RESEARCH_NOTES.md` | Detailed research notes: architecture analysis, Wan2.2 model variants, LTX-2.3 analysis, Kling 3 relationship, performance benchmarks. |
## Architecture Notes

### Wan2.2-Animate-14B Transformer

- 17.3B parameters (not 27B – the MoE routing is at pipeline level)
- 40 transformer blocks, each: self-attention + cross-attention + FFN
- 525 quantizable linear layers
- `inner_dim` = 5120 (40 heads × 128 dim)
- `ffn_dim` = 13824
- Text encoder: UMT5 (~10.7 GB, NOT the same as FLUX.1's T5-XXL)
- Image encoder: CLIPVisionModel (~1.2 GB)
- VAE: AutoencoderKLWan (~0.5 GB)
- Pipeline class: `WanAnimatePipeline` (diffusers)
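A quick sanity check on those dimensions: tallying just the large linear layers per block (QKVO for self- and cross-attention plus the two FFN matrices). This is deliberately incomplete – biases, norms, modulation, and image-conditioning projections make up the rest of the 17.3B:

```python
inner_dim = 5120            # 40 heads × 128 dims per head
ffn_dim = 13824
num_blocks = 40

attn = 4 * inner_dim**2                 # q, k, v, out projection matrices
ffn = 2 * inner_dim * ffn_dim           # up + down FFN matrices
per_block = 2 * attn + ffn              # self-attention + cross-attention + FFN
total = num_blocks * per_block
print(f"{total / 1e9:.2f}B")            # ~14.05B in the block linears alone
```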
### VRAM Budget
| Component | BF16 | FP4 |
|---|---|---|
| Transformer | 34.5 GB | ~14 GB (9.9 GB weights + dequant overhead) |
| Text encoder (UMT5) | 10.7 GB | 10.7 GB (or ~3 GB with INT4 AWQ) |
| Image encoder (CLIP) | 1.2 GB | 1.2 GB |
| VAE | 0.5 GB | 0.5 GB |
| Total | ~47 GB | ~16-26 GB |
BF16 doesn't fit on any consumer GPU → needs sequential CPU offloading (slow). FP4 fits in 24–32 GB VRAM → model-level offloading or no offloading (fast).
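The FP4 transformer figure is consistent with back-of-envelope arithmetic – 4 bits per weight plus one 16-bit scale per group of 32 weights (the on-disk 9.9 GB is slightly larger because some layers stay unquantized):

```python
params = 17.3e9                          # transformer parameters
packed = params * 0.5                    # two FP4 weights per byte
scales = (params / 32) * 2               # one bf16 scale per 32-weight group
total_gb = (packed + scales) / 1e9
print(f"{total_gb:.2f} GB")              # ~9.73 GB
```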
### Naive FP4 Quantization

- Round-to-nearest (RTN) to the FP4 E2M1 format
- Per-group scales with group_size=32
- Packed as uint8 (2 × 4-bit values per byte)
- At inference time: dequantize to BF16 before the matmul (fake quantization)
- No calibration data needed – quantizes in ~30 seconds on GPU
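A minimal self-contained sketch of the same scheme (RTN to E2M1, per-group scales, uint8 packing). Function and variable names are illustrative, not the ones in `quantize_naive_fp4.py`:

```python
import torch

# FP4 E2M1 magnitudes; a 4-bit code is 1 sign bit + 3 bits indexing this table.
E2M1 = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_rtn(w: torch.Tensor, group_size: int = 32):
    """Round-to-nearest FP4 with one scale per group; returns packed uint8 + scales."""
    flat = w.float().reshape(-1, group_size)
    scales = (flat.abs().amax(dim=1, keepdim=True) / 6.0).clamp(min=1e-8)
    x = flat / scales                                        # now in [-6, 6]
    idx = (x.abs().unsqueeze(-1) - E2M1).abs().argmin(dim=-1).to(torch.uint8)
    sign = (x < 0).to(torch.uint8)
    codes = (sign << 3) | idx                                # 4-bit code per weight
    packed = (codes[:, ::2] << 4) | codes[:, 1::2]           # two codes per byte
    return packed, scales

def dequantize_fp4(packed: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    codes = torch.stack([packed >> 4, packed & 0x0F], dim=-1).reshape(packed.shape[0], -1)
    mags = E2M1[(codes & 0x07).long()]
    vals = torch.where((codes & 0x08) > 0, -mags, mags)
    return vals * scales
```

At inference, `dequantize_fp4` reverses the packing and the matmul then runs in BF16 – the fake-quantization path.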
### Distill LoRA (lightx2v)

- Trained for I2V but works on Animate (same base architecture)
- ComfyUI format with `lora_down`/`lora_up` + `diff_b` (bias diffs)
- Merged into the BF16 weights before FP4 quantization
- Scheduler: `FlowMatchEulerDiscreteScheduler` with `shift=5.0`
- Reduces 20 steps → 6 steps with comparable quality
- Key mapping: ComfyUI `self_attn.q` → diffusers `attn1.to_q`, etc.
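The merge itself is plain LoRA algebra, applied once per layer before quantization. A hedged sketch – the key map is abbreviated and the `alpha` handling simplified relative to the real script:

```python
import torch

# Abbreviated ComfyUI -> diffusers key map (the full map covers k/v/o, FFN, etc.).
COMFY_TO_DIFFUSERS = {
    "self_attn.q": "attn1.to_q",
    "self_attn.k": "attn1.to_k",
    "cross_attn.q": "attn2.to_q",
}

def merge_lora_layer(weight, lora_down, lora_up, diff_b=None, bias=None, alpha=1.0):
    """Fold W += alpha * (up @ down) and the bias diff into the base weights."""
    delta = alpha * (lora_up.float() @ lora_down.float())
    weight = weight + delta.to(weight.dtype)
    if diff_b is not None and bias is not None:
        bias = bias + diff_b.to(bias.dtype)
    return weight, bias
```

After every mapped layer is merged, the resulting BF16 state dict goes through the FP4 quantizer as usual.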
### Preprocessing Pipeline

Animation mode (what we use):

1. YOLOv10m detects the person bounding box
2. ViTPose-H extracts wholebody keypoints (body + hands + face)
3. Draw the skeleton on a black canvas → `src_pose.mp4`
4. Crop the face region → `src_face.mp4`
5. (Optional) Retarget pose proportions to the reference image

Does NOT need (only for replacement mode):

- SAM2 (segmentation)
- FluxKontext (only for retarget + edit mode)
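The skeleton-on-black-canvas step can be sketched with plain NumPy (the real script draws the ViTPose keypoints with OpenCV; the joint coordinates and bone list below are made up for illustration):

```python
import numpy as np

BONES = [(0, 1), (1, 2), (1, 3)]  # hypothetical joint connectivity

def draw_skeleton(keypoints, h=832, w=480):
    """Rasterize bones as 1-px green lines on a black canvas (one src_pose.mp4 frame)."""
    canvas = np.zeros((h, w, 3), dtype=np.uint8)
    for a, b in BONES:
        (x0, y0), (x1, y1) = keypoints[a], keypoints[b]
        n = int(max(abs(x1 - x0), abs(y1 - y0))) + 1
        xs = np.linspace(x0, x1, n).round().astype(int).clip(0, w - 1)
        ys = np.linspace(y0, y1, n).round().astype(int).clip(0, h - 1)
        canvas[ys, xs] = (0, 255, 0)
    return canvas
```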
## ComfyUI Workflow Analysis (ICY WAN ANIMATE)

The ICY ComfyUI workflow uses these tricks, which we replicated:
| Feature | ComfyUI | Our Implementation |
|---|---|---|
| Model | Wan2.2-Animate-14B FP8 | Same model, naive FP4 |
| Distill LoRA | lightx2v rank64/128 | lightx2v rank64, merged before quant |
| Steps | 6 | 6 |
| Sampler | res_2s | FlowMatchEulerDiscrete (closest equivalent) |
| Scheduler | beta57 | shift=5.0 (approximate) |
| Face swap | Klein 9B post-processing | Not implemented (future work) |
| Enhancer | WanAnimateModelEnhancer | Not implemented |
## Known Issues & Future Work

- **Nunchaku packing:** The `SVDQW4A4Linear` version produces noise because our naive packing doesn't match Nunchaku's MMA-aligned format. Fix: use `convert_to_nunchaku_w4x4y16_linear_state_dict` from DeepCompressor (same approach as FLUX.2-Klein).
- **UMT5 quantization:** The text encoder currently stays in BF16 (10.7 GB). It can be quantized with W4Linear AWQ to ~3 GB (code exists in `inference_animate_nunchaku.py`). This would eliminate the need for any CPU offloading.
- **Retargeting + full-body reference:** Retargeting improves quality significantly but requires a full-body reference image (close-ups cause zooming).
- **RIFE frame interpolation:** Output is 16 fps. Adding RIFE would double it to 32 fps for smoother video.
- **Klein face swap:** The ComfyUI workflow runs FLUX.2-Klein on output frames to improve face quality. We have the model but didn't implement this step.
- **LTX-2.3:** Deferred. 22B params, ComfyUI-only format; a community diffusers conversion is available at `dg845/LTX-2.3-Diffusers`.