Leveraging Temporal Contextualization for Video Action Recognition
Paper: arXiv:2404.09490
We present Temporally Contextualized CLIP (TC-CLIP), a novel video understanding framework that leverages holistic video information within its encoding process.
This repository contains all model checkpoints used in our experiments.
We use CLIP ViT-B/16 for all experiments below.
Zero-shot action recognition (top-1 accuracy, %):

| Model | HMDB-51 | UCF-101 | Kinetics-600 | Ckpt |
|---|---|---|---|---|
| TC-CLIP | 54.2 ± 0.7 | 82.9 ± 0.6 | 75.8 ± 0.5 | Link |
| TC-CLIP (LLM) | 56.0 ± 0.3 | 85.4 ± 0.8 | 78.1 ± 1.0 | Link |
Few-shot action recognition (top-1 accuracy, %; K = number of training shots per class):

| Model | HMDB-51 (K=2/4/8/16) | UCF-101 (K=2/4/8/16) | SSv2 (K=2/4/8/16) | Ckpt |
|---|---|---|---|---|
| TC-CLIP | 57.3 / 62.3 / 67.3 / 68.6 | 85.9 / 89.9 / 92.5 / 94.6 | 7.3 / 8.6 / 9.3 / 14.0 | Link |
| TC-CLIP (LLM) | 58.6 / 63.3 / 65.5 / 68.8 | 86.8 / 90.1 / 92.0 / 94.3 | 7.3 / 8.6 / 9.3 / 14.0 | Link |
| TC-CLIP (P) | 65.3 / 68.5 / 71.4 / 73.0 | 94.1 / 95.6 / 96.6 / 97.3 | 8.7 / 10.1 / 12.1 / 15.2 | Link |
Base-to-novel generalization (top-1 accuracy, %; HM = harmonic mean of base and novel accuracy):

| Model | K-400 (Base/Novel/HM) | HMDB-51 (Base/Novel/HM) | UCF-101 (Base/Novel/HM) | SSv2 (Base/Novel/HM) | Ckpt |
|---|---|---|---|---|---|
| TC-CLIP | 78.9 / 63.6 / 70.4 | 73.3 / 54.1 / 62.2 | 95.5 / 78.0 / 85.9 | 17.5 / 13.4 / 15.2 | Link |
| TC-CLIP (LLM) | 79.1 / 65.4 / 71.6 | 73.3 / 59.1 / 65.5 | 95.4 / 81.6 / 88.0 | 17.5 / 13.4 / 15.2 | Link |
| TC-CLIP (P) | N/A | 79.4 / 58.3 / 67.2 | 97.5 / 84.5 / 90.5 | 19.6 / 15.6 / 17.4 | Link |
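The HM values in the table above are the harmonic mean of the base- and novel-class accuracies. A minimal sketch of that computation (the function name is ours, not from the paper), checked against the TC-CLIP K-400 row:

```python
def harmonic_mean(base: float, novel: float) -> float:
    """Harmonic mean of base- and novel-class accuracies (the HM column)."""
    return 2 * base * novel / (base + novel)

# TC-CLIP on K-400: Base 78.9, Novel 63.6
print(round(harmonic_mean(78.9, 63.6), 1))  # 70.4, matching the table
```

The harmonic mean penalizes a large gap between base and novel accuracy more than a plain average would, which is why it is the standard summary metric for base-to-novel generalization.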
If you find TC-CLIP useful in your research, please consider citing our paper:
```bibtex
@inproceedings{kim2024tcclip,
  title={Leveraging Temporal Contextualization for Video Action Recognition},
  author={Kim, Minji and Han, Dongyoon and Kim, Taekyung and Han, Bohyung},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2024}
}
```
Base model: openai/clip-vit-base-patch16