[ECCV 2024] Leveraging Temporal Contextualization for Video Action Recognition

Official model checkpoints for TC-CLIP
If you like our project, please give us a star ⭐ on GitHub to follow the latest updates.

Introduction

We present Temporally Contextualized CLIP (TC-CLIP), a novel video understanding framework that leverages holistic video information within its encoding process.

  1. Temporal Contextualization (TC): Unlike prior approaches that access only a limited number of tokens, TC allows global interactions by summarizing informative tokens from the entire video into context tokens and leveraging them during the feature encoding process.
  2. Video-conditional Prompting (VP): Based on the summarized context tokens from the visual domain, VP generates instance-level textual prompts that compensate for the lack of textual semantics in action recognition datasets.
  3. Solid performance: TC-CLIP achieves state-of-the-art performance across zero-shot, few-shot, base-to-novel, and fully-supervised settings on five video action recognition benchmarks.
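The idea behind Temporal Contextualization can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the function names, the externally supplied `scores` used to pick informative tokens, and the chunk-averaging used to form context tokens are all hypothetical stand-ins chosen for brevity. The sketch only shows the general pattern of summarizing tokens from the whole video into a few context tokens and letting each frame attend to them alongside its own tokens.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def summarize_tokens(tokens, scores, num_keep, num_context):
    """Summarize video tokens into a few context tokens (illustrative only).

    tokens: (T, N, D) patch tokens for T frames.
    scores: (T, N) informativeness scores (assumed to be given).
    Returns an array of shape (num_context, D).
    """
    T, N, D = tokens.shape
    flat_tokens = tokens.reshape(T * N, D)
    flat_scores = scores.reshape(T * N)
    # Keep the highest-scoring tokens from the entire video.
    keep = np.argsort(flat_scores)[-num_keep:]
    keep.sort()  # restore temporal order of the kept tokens
    # Stand-in summarization: average contiguous chunks of kept tokens.
    groups = np.array_split(flat_tokens[keep], num_context)
    return np.stack([g.mean(axis=0) for g in groups])


def attend_with_context(frame_tokens, context_tokens):
    """Single-head attention where keys/values include the context tokens.

    frame_tokens: (N, D), context_tokens: (K, D). Returns (N, D).
    """
    D = frame_tokens.shape[-1]
    kv = np.concatenate([frame_tokens, context_tokens], axis=0)
    attn = softmax(frame_tokens @ kv.T / np.sqrt(D))
    return attn @ kv
```

In this sketch each frame still processes only its own queries, but the appended context tokens give it access to information from every other frame.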

This repository contains all model checkpoints used in our experiments.
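A minimal sketch for fetching the checkpoints locally, assuming the `huggingface_hub` package is installed and that the repository ID is `byminji/TC-CLIP` as shown on the model page. The helper name `download_checkpoints` and the default target directory are hypothetical; the layout and file names inside the repository are not specified here, so the sketch downloads the whole repository.

```python
REPO_ID = "byminji/TC-CLIP"  # repository ID as shown on the model page


def download_checkpoints(local_dir="checkpoints/tc_clip"):
    """Download every file in the repository and return the local path.

    `local_dir` is an arbitrary example location, not a required path.
    """
    # Imported lazily so the dependency is only needed when downloading.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
```

Calling `download_checkpoints()` requires network access and returns the directory containing the downloaded files.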

Models

We use CLIP ViT-B/16 for all experiments below.

  • (LLM) denotes models that use LLM-rephrased category names from FROSTER. Note that experiments on the SSv2 dataset do not involve LLM rephrasing.
  • (P) denotes that the models are first pretrained on Kinetics-400 and subsequently fine-tuned on each dataset. Otherwise, models are directly fine-tuned from CLIP. See Appendix A in the paper.

Zero-shot action recognition

| Scripts | HMDB-51 | UCF-101 | Kinetics-600 | Ckpt |
|---|---|---|---|---|
| TC-CLIP | 54.2 ± 0.7 | 82.9 ± 0.6 | 75.8 ± 0.5 | Link |
| TC-CLIP (LLM) | 56.0 ± 0.3 | 85.4 ± 0.8 | 78.1 ± 1.0 | Link |

Few-shot action recognition

| Scripts | HMDB-51 (K=2 / K=4 / K=8 / K=16) | UCF-101 (K=2 / K=4 / K=8 / K=16) | SSv2 (K=2 / K=4 / K=8 / K=16) | Ckpt |
|---|---|---|---|---|
| TC-CLIP | 57.3 / 62.3 / 67.3 / 68.6 | 85.9 / 89.9 / 92.5 / 94.6 | 7.3 / 8.6 / 9.3 / 14.0 | Link |
| TC-CLIP (LLM) | 58.6 / 63.3 / 65.5 / 68.8 | 86.8 / 90.1 / 92.0 / 94.3 | 7.3 / 8.6 / 9.3 / 14.0 | Link |
| TC-CLIP (P) | 65.3 / 68.5 / 71.4 / 73.0 | 94.1 / 95.6 / 96.6 / 97.3 | 8.7 / 10.1 / 12.1 / 15.2 | Link |

Base-to-novel generalization

| Scripts | K-400 (Base / Novel / HM) | HMDB-51 (Base / Novel / HM) | UCF-101 (Base / Novel / HM) | SSv2 (Base / Novel / HM) | Ckpt |
|---|---|---|---|---|---|
| TC-CLIP | 78.9 / 63.6 / 70.4 | 73.3 / 54.1 / 62.2 | 95.5 / 78.0 / 85.9 | 17.5 / 13.4 / 15.2 | Link |
| TC-CLIP (LLM) | 79.1 / 65.4 / 71.6 | 73.3 / 59.1 / 65.5 | 95.4 / 81.6 / 88.0 | 17.5 / 13.4 / 15.2 | Link |
| TC-CLIP (P) | N/A | 79.4 / 58.3 / 67.2 | 97.5 / 84.5 / 90.5 | 19.6 / 15.6 / 17.4 | Link |
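
The HM column is not defined on this page. In base-to-novel evaluation it conventionally denotes the harmonic mean of the base and novel accuracies, and the short sketch below assumes that reading. The function name is illustrative. Recomputing HM from the rounded table entries may differ from the reported value in the last digit, presumably because the reported values were computed before rounding.

```python
def harmonic_mean(base, novel):
    """Harmonic mean of base and novel accuracy (assumed meaning of HM)."""
    if base + novel == 0:
        return 0.0
    return 2 * base * novel / (base + novel)


# K-400 row for TC-CLIP: Base 78.9, Novel 63.6
print(round(harmonic_mean(78.9, 63.6), 1))  # -> 70.4
```

The harmonic mean penalizes a large gap between the two accuracies, so a model cannot score well by doing well on base classes alone.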

Fully-supervised action recognition

| Scripts | K-400 (Top-1) | K-400 (Top-5) | Ckpt |
|---|---|---|---|
| TC-CLIP | 85.2 | 96.9 | Link |

Citation

If you find TC-CLIP useful in your research, please consider citing our paper:

@article{kim2024tcclip,
  title={Leveraging Temporal Contextualization for Video Action Recognition},
  author={Kim, Minji and Han, Dongyoon and Kim, Taekyung and Han, Bohyung},
  journal={European Conference on Computer Vision (ECCV)},
  year={2024}
}