Abstract:
Transferring contrastive vision-language pretrained models, such as contrastive language-image pretraining (CLIP), to video recognition tasks has attracted much attention. Recent studies in this area apply prompt learning within either the text or the vision branch, or fine-tune CLIP end to end. However, these methods do not fully exploit the learning potential of the two branches and may compromise zero-shot generalization. In this work, we present TransCLIP, a multimodal framework that adapts vision-language models by integrating both adapter and prompt tuning techniques into the vision and text encoders. Specifically, we incorporate learnable prompt tokens into the input of each transformer encoder layer in the vision and text branches, and integrate lightweight adapters into the key and value matrices of the multi-head self-attention modules, enhancing the model's ability to capture video-specific features. To effectively leverage temporal information in videos, we implement a temporal difference attention module (TemDiff attn) that explicitly computes differences between adjacent frame embeddings and performs difference-level attention to encode motion-related temporal dependencies. In addition, a coarse-and-fine contrastive learning strategy is employed to better align the video and text branches, enhancing the learning capability of the whole framework. Across different evaluation settings, our model consistently outperforms previous state-of-the-art methods on several video action recognition benchmarks.
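To make the two mechanisms named in the abstract concrete, below is a minimal PyTorch sketch, not the authors' implementation: the module names (KVAdapter, TemDiffAttention), the bottleneck width, the head count, and the (B, T, D) frame-embedding shape are assumptions for illustration only.

```python
# Minimal sketch of a key/value bottleneck adapter and a temporal difference
# attention block, assuming per-frame embeddings of shape (B, T, D).
import torch
import torch.nn as nn


class KVAdapter(nn.Module):
    """Hypothetical lightweight adapter added to the key/value projections of a
    frozen multi-head self-attention block (bottleneck down-up with residual)."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        # Zero-init the up-projection so training starts from the frozen CLIP behavior.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, kv: torch.Tensor) -> torch.Tensor:
        return kv + self.up(torch.relu(self.down(kv)))


class TemDiffAttention(nn.Module):
    """Hypothetical temporal difference attention: self-attention over the
    differences between adjacent frame embeddings to encode motion cues."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, D) frame embeddings from the vision encoder
        diffs = frames[:, 1:, :] - frames[:, :-1, :]   # (B, T-1, D) adjacent-frame differences
        diffs = self.norm(diffs)
        motion, _ = self.attn(diffs, diffs, diffs)     # difference-level self-attention
        # Fuse motion cues back into the frame stream (first frame has no predecessor).
        return torch.cat([frames[:, :1, :], frames[:, 1:, :] + motion], dim=1)
```

As a design note, zero-initializing the adapter's up-projection and adding the motion branch residually are common ways to keep the frozen CLIP features intact at the start of training; whether TransCLIP uses these exact choices is not stated in the abstract.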
Source: IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS
ISSN: 1551-3203
Year: 2025
Impact Factor: 11.700 (JCR@2023)