Github | Habr article | Project Page | Technical Report (soon)

KVAE 2.0: Video tokenizers

The KVAE-3D-2.0-t4s16 model has 4x temporal compression and 16x16 spatial compression.
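A quick way to see what 4x16x16 compression means for latent sizes. This is a hedged sketch: it assumes the causal-3D-VAE convention (used by comparable tokenizers such as Wan and HunyuanVideo) where the first frame is encoded alone, so T frames map to 1 + (T - 1) / 4 latent frames; the latent channel count of 16 is an illustrative assumption, not taken from this model card.

```python
def latent_shape(num_frames, height, width, t=4, s=16, channels=16):
    """Latent grid size for a t4s16 video tokenizer.

    Assumes the causal-VAE convention (first frame encoded alone):
    T -> 1 + (T - 1) / t. The channel count is a placeholder.
    """
    assert (num_frames - 1) % t == 0, "need T = 1 + k*t frames"
    assert height % s == 0 and width % s == 0
    return (channels, 1 + (num_frames - 1) // t, height // s, width // s)

# e.g. a 121-frame 1280x720 clip
print(latent_shape(121, 720, 1280))  # (16, 31, 45, 80)
```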

Evaluation of reconstruction

For evaluation, the open datasets MCL-JCV (1280x720 video) and BVI-DVC were used. Wan-2.2 and HunyuanVideo-1.5 were considered as alternatives in the 4x16x16 format. For the HunyuanVideo model, tiling (with default parameters) was used because of its full-attention block. Below are the results of a comparison using the PSNR, SSIM, and LPIPS metrics (with AlexNet features).
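Of the three metrics above, PSNR is simple enough to sketch without dependencies (SSIM and LPIPS require windowed statistics and a pretrained AlexNet, e.g. via the lpips package). A minimal per-frame PSNR on flat 8-bit pixel sequences:

```python
import math

def psnr(ref, rec, max_val=255.0):
    """PSNR in dB between two equal-length flat pixel sequences.

    Higher is better; identical frames give infinity.
    """
    mse = sum((a - b) ** 2 for a, b in zip(ref, rec)) / len(ref)
    return float("inf") if mse == 0 else 10.0 * math.log10(max_val ** 2 / mse)

# a uniform error of 16 gray levels gives MSE = 256 -> ~24.05 dB
print(psnr([0, 0, 0, 0], [16, 16, 16, 16]))
```

In practice the metric is averaged over all frames of each test clip, then over the dataset.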

Reconstruction comparison of KVAE 2.0, Hunyuan 1.5 and Wan 2.2

Evaluation of latent-space quality for the generative model

The purpose of a tokenizer is to provide a latent space for a generative model, so its superiority can only be established by evaluating the quality of the resulting generations. To do this, we ran direct side-by-side (SBS) comparisons with several users. Each user was shown pairs of images generated from the same prompt and rated each pair on three criteria: prompt adherence, visual quality, and semantic quality. A sufficiently large number of annotated pairs establishes a better/worse relation between a pair of models. Fairness of the comparison is ensured by fixing the generative model's training dataset, its architecture, and the training strategy (optimizer parameters, number of steps, batch size, and other hyperparameters). Below are the results of two SBS comparisons with KVAE-2.0 4x16x16:
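The aggregation of per-pair votes into a better/worse relation can be sketched as a simple tally. This is a hypothetical aggregation scheme: the card does not specify how ties are handled, so here they are split evenly between the two models.

```python
from collections import Counter

def sbs_summary(votes):
    """Aggregate side-by-side votes ('A', 'B', or 'tie') into win rates.

    Assumption: ties contribute half a win to each side; the actual
    tie-handling rule used in the evaluation is not specified.
    """
    counts = Counter(votes)
    n = len(votes)
    win_a = (counts["A"] + counts["tie"] / 2) / n
    return {"A": win_a, "B": 1 - win_a, "n": n}

print(sbs_summary(["A", "A", "B", "tie"]))  # {'A': 0.625, 'B': 0.375, 'n': 4}
```

In the real evaluation, a separate tally like this would be kept per criterion (prompt adherence, visual quality, semantic quality).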
