Add installation, inference instructions, and citation

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +77 -3
README.md CHANGED
@@ -1,9 +1,12 @@
  ---
  license: mit
- base_model:
- - Wan-AI/Wan2.1-T2V-1.3B
  pipeline_tag: text-to-video
  ---
  <p align="center">
  <h1 align="center">HiAR</h1>
  <h3 align="center">Hierarchical Autoregressive Video Generation with Pipelined Parallel Inference</h3>
@@ -16,6 +19,77 @@ pipeline_tag: text-to-video
  HiAR proposes **hierarchical denoising** for autoregressive video diffusion models, a paradigm shift from conventional block-first to **step-first** denoising order. By conditioning each block on context at a matched noise level, HiAR maximally attenuates error propagation while preserving temporal causality, achieving **state-of-the-art long video generation** (20s+) with significantly reduced quality drift.

  ## Discussion & Limitations
- In essence, this method allows autoregressive video generation to mimic a bidirectional attention video denoising paradigm. For instance, the high-noise denoising stages only require coarse-grained context information. This design maximally reduces error accumulation while theoretically retaining sufficient information to maintain continuity. By scaling the training budget under the constraint of the Forward KL loss, we can achieve near-zero degradation in most scenarios, even enabling infinite generation (e.g., over 200 minutes). However, in some dynamic scenes, inter-frame jumping may still occur. We believe this is not an inherent limitation of the hierarchical denoising paradigm itself, but rather an issue of insufficient capacity in the 1.3B base model, as this denoising paradigm is considerably more challenging. We plan to further validate this paradigm on more powerful base models in the future.
---
base_model: Wan-AI/Wan2.1-T2V-1.3B
license: mit
pipeline_tag: text-to-video
tags:
- video-generation
- autoregressive
---

<p align="center">
<h1 align="center">HiAR</h1>
<h3 align="center">Hierarchical Autoregressive Video Generation with Pipelined Parallel Inference</h3>
 
HiAR proposes **hierarchical denoising** for autoregressive video diffusion models, a paradigm shift from conventional block-first to **step-first** denoising order. By conditioning each block on context at a matched noise level, HiAR maximally attenuates error propagation while preserving temporal causality, achieving **state-of-the-art long video generation** (20s+) with significantly reduced quality drift.

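The step-first reordering can be sketched in a few lines. This is an illustrative toy with hypothetical block and step counts, not the actual HiAR denoiser; each pair `(b, s)` means "run denoising stage `s` on video block `b`":

```python
# Toy illustration of denoising orders (not the actual HiAR code).

NUM_BLOCKS = 3  # autoregressive video blocks (hypothetical count)
NUM_STEPS = 4   # denoising stages, from high noise to low noise

def block_first_order():
    # Conventional order: fully denoise block b before starting b+1,
    # so each block conditions on fully denoised (drift-prone) context.
    return [(b, s) for b in range(NUM_BLOCKS) for s in range(NUM_STEPS)]

def step_first_order():
    # Hierarchical (step-first) order: sweep stage s across all blocks
    # before moving to stage s+1, so every block is conditioned on
    # context held at the same, matched noise level.
    return [(b, s) for s in range(NUM_STEPS) for b in range(NUM_BLOCKS)]
```

Both orders visit the same (block, step) pairs; only the conditioning context changes, which is what attenuates error propagation.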
## Installation
Create a conda environment and install the dependencies:
```bash
conda create -n hiar python=3.10 -y
conda activate hiar
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```

## Quick Start
### 1. Download checkpoints
```bash
# Download the Wan2.1 base model
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir-use-symlinks False --local-dir wan_models/Wan2.1-T2V-1.3B

# Download the HiAR checkpoint
mkdir -p ckpts
wget -O ckpts/hiar.pt https://huggingface.co/jackyhate/HiAR/resolve/main/hiar.pt
```

### 2. Inference
Generate short videos (~5 seconds, 21 latent frames):
```bash
python inference.py \
    --config_path configs/hiar.yaml \
    --checkpoint_path ckpts/hiar.pt \
    --data_path data/prompts.txt \
    --output_folder outputs/ \
    --num_output_frames 21 \
    --use_ema \
    --inference_method timestep_first
```

Generate long videos (~20 seconds, 81 latent frames):
```bash
python inference.py \
    --config_path configs/hiar.yaml \
    --checkpoint_path ckpts/hiar.pt \
    --data_path data/prompts.txt \
    --output_folder outputs/ \
    --num_output_frames 81 \
    --use_ema \
    --inference_method timestep_first
```

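`--data_path` points at a prompt list; the format is assumed here to be one plain-text prompt per line (an assumption about the file layout, not confirmed by the repository). For example, a small prompt file could be written like this:

```python
from pathlib import Path

# Hypothetical prompt file layout: one text-to-video prompt per line.
prompts = [
    "A cat sitting on a windowsill watching the rain fall outside",
    "A timelapse of clouds drifting over a mountain lake",
]
Path("data").mkdir(exist_ok=True)
Path("data/prompts.txt").write_text("\n".join(prompts) + "\n", encoding="utf-8")
```
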
67
+ ### Pipelined Parallel Inference
68
+ HiAR's hierarchical denoising structure naturally admits **pipelined parallel inference**: each GPU handles one denoising stage, and blocks flow through a diagonal pipeline schedule.
69
+
70
+ ```bash
71
+ # Requires exactly N GPUs for N denoising steps (default: 4)
72
+ torchrun --nproc_per_node=4 \
73
+ scripts/pipeline_parallel_inference.py \
74
+ --config_path configs/hiar.yaml \
75
+ --checkpoint_path ckpts/hiar.pt \
76
+ --prompt "A cat sitting on a windowsill watching the rain fall outside" \
77
+ --output_path outputs/pipeline_output.mp4 \
78
+ --num_output_frames 81
79
+ ```
80
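The diagonal schedule can be sketched as follows. This is a minimal model of the idea with illustrative tick and block counts, not the real scheduler: at tick `t`, the GPU owning stage `s` works on block `t - s`, so once the pipeline fills, all stages are busy on different blocks.

```python
# Toy diagonal pipeline schedule (illustrative, not the HiAR scheduler).
# Stage s lives on GPU s; at tick t it processes block t - s, so blocks
# "flow" diagonally through the denoising stages.

NUM_STAGES = 4  # denoising stages == GPUs (matches --nproc_per_node=4)
NUM_BLOCKS = 6  # video blocks flowing through the pipeline (hypothetical)

def schedule():
    ticks = []
    for t in range(NUM_BLOCKS + NUM_STAGES - 1):
        # Every stage holding a valid block at this tick runs in parallel.
        active = [(s, t - s) for s in range(NUM_STAGES) if 0 <= t - s < NUM_BLOCKS]
        ticks.append(active)
    return ticks
```

After a fill latency of `NUM_STAGES - 1` ticks, every tick advances one block per stage, which is where the near-linear speedup over sequential stage execution comes from.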
 
## Discussion & Limitations
In essence, this method allows autoregressive video generation to mimic a bidirectional-attention video denoising paradigm: for instance, the high-noise denoising stages require only coarse-grained context information. This design maximally reduces error accumulation while, in theory, retaining sufficient information to maintain continuity. By scaling the training budget under the constraint of the Forward KL loss, we can achieve near-zero degradation in most scenarios, even enabling effectively infinite generation (e.g., over 200 minutes). However, inter-frame jumps may still occur in some highly dynamic scenes. We believe this is not an inherent limitation of the hierarchical denoising paradigm itself, but rather a matter of insufficient capacity in the 1.3B base model, since this denoising paradigm is considerably more challenging to learn. We plan to further validate this paradigm on more powerful base models in the future.

## Citation
```bibtex
@misc{zou2026hiarefficientautoregressivelong,
      title={HiAR: Efficient Autoregressive Long Video Generation via Hierarchical Denoising},
      author={Kai Zou and Dian Zheng and Hongbo Liu and Tiankai Hang and Bin Liu and Nenghai Yu},
      year={2026},
      eprint={2603.08703},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.08703},
}
```