PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation
Paper • 2601.16210 • Published
PyraTok has been officially accepted to CVPR 2026!
This repository contains the pretrained weights and model implementation for the Language-aligned Pyramidal Tokenizer.
PyraTok is a state-of-the-art video tokenizer that bridges the gap between video understanding and generation. Unlike traditional VAEs that operate at a single visual scale, PyraTok introduces a Language-aligned Pyramidal Quantization (LaPQ) module.
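To make the contrast between single-scale and pyramidal quantization concrete, here is a toy sketch in plain Python. It is not the LaPQ implementation from this repository: the language-alignment part is omitted entirely, and the function names (`nearest_code`, `avg_pool`, `pyramid_quantize`), the shared codebook, and the average pooling are all hypothetical choices made for illustration. It only shows the general idea of turning the same features into discrete tokens at several resolutions instead of one.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def nearest_code(vec: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def sq_dist(code: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(vec, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def avg_pool(features: List[Vector], factor: int) -> List[List[float]]:
    """Average consecutive groups of `factor` feature vectors."""
    pooled = []
    for start in range(0, len(features), factor):
        group = features[start:start + factor]
        dim = len(group[0])
        pooled.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return pooled


def pyramid_quantize(
    features: List[Vector],
    codebook: Sequence[Vector],
    factors: Sequence[int] = (1, 2, 4),
) -> Dict[int, List[int]]:
    """Quantize the same features at several resolutions.

    Hypothetical illustration only: one shared codebook, average pooling,
    and no language alignment. A single-scale tokenizer would correspond
    to factors=(1,).
    """
    return {
        f: [nearest_code(v, codebook) for v in avg_pool(features, f)]
        for f in factors
    }


# Four 2-D feature vectors and a three-entry codebook (made-up values).
features = [[0.0, 0.1], [0.9, 1.0], [1.0, 0.9], [0.1, 0.0]]
codebook = [[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]]
tokens = pyramid_quantize(features, codebook)
print(tokens)
```

The result maps each pooling factor to a token sequence, so the finest level keeps one token per position while coarser levels summarize larger spans with fewer tokens.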
@inproceedings{susladkar2026pyratok,
title={PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation},
author={Susladkar, Onkar and Prakash, Tushar and Juvekar, Adheesh and Nguyen, Kiet A. and Jang, Dong-Hwan and Dhillon, Inderjit S. and Lourentzou, Ismini},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2026}
}