
Author:

Wang, Wen [1] | Su, Yanzhou [2] | Gu, Jason [3]

Indexed by:

Scopus, SCIE

Abstract:

Transferring contrastive vision-language pretrained models, such as contrastive language-image pretraining (CLIP), to video recognition tasks has attracted much attention. Recent studies in this area have used prompt learning within either the text or the vision branch, or have fine-tuned CLIP end to end. However, these methods do not fully leverage the learning potential of the two branches and may compromise zero-shot generalization. In this work, we present TransCLIP, a multimodal framework that adapts vision-language models by integrating both adapter and prompt tuning techniques into the vision and text encoders. Specifically, we incorporate learnable prompt tokens into the input of each transformer encoder layer in the vision and text branches, and integrate lightweight adapters into the key and value matrices of the multi-head self-attention modules, enhancing the model's capability to capture more relevant video-specific features. To effectively leverage temporal information in videos, we implement a temporal difference attention module (TemDiff attn) that explicitly computes differences between adjacent frame embeddings and conducts difference-level attention to encode motion-related temporal dependencies in videos. In addition, a coarse-and-fine contrastive learning strategy is employed to better align the video and text branches, enhancing the learning capability of the whole framework. Across different evaluation settings, our model consistently outperforms previous state-of-the-art methods on several video action recognition benchmarks.
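
The temporal difference idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' TemDiff attn implementation: the function names are hypothetical, the attention is single-head, and the query, key, and value projections are assumed to be identity maps (no learned weights), so it only shows the general pattern of differencing adjacent frame embeddings and then attending over those differences.

```python
import math


def temporal_differences(frames):
    """Differences between adjacent frame embeddings: d_t = f_{t+1} - f_t."""
    return [
        [b - a for a, b in zip(prev, nxt)]
        for prev, nxt in zip(frames, frames[1:])
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def difference_attention(frames):
    """Scaled dot-product self-attention over the frame differences.

    Assumption (not from the paper): identity Q/K/V projections, one head.
    Returns one attended vector per adjacent-frame difference.
    """
    diffs = temporal_differences(frames)
    if not diffs:  # fewer than two frames: nothing to attend over
        return []
    dim = len(diffs[0])
    scale = math.sqrt(dim)
    outputs = []
    for query in diffs:
        # Similarity of this difference vector to every difference vector.
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale for key in diffs
        ]
        weights = softmax(scores)
        # Weighted sum of the difference vectors (values).
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, diffs)) for j in range(dim)]
        )
    return outputs
```

For a clip of T frame embeddings, the sketch yields T - 1 attended difference vectors; a real module would add learned projections and combine the result with the frame-level features.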

Keyword:

Action recognition; contrastive vision-language models (VLMs); parameter-efficient tuning

Affiliations:

  • [ 1 ] [Wang, Wen]Tianmushan Lab, Hangzhou 311115, Peoples R China
  • [ 2 ] [Su, Yanzhou]Fuzhou Univ, Fuzhou 350108, Peoples R China
  • [ 3 ] [Gu, Jason]Dalhousie Univ, Elect & Comp Engn, Halifax, NS B3H 4R2, Canada

Reprint Author's Address:

  • [Wang, Wen]Tianmushan Lab, Hangzhou 311115, Peoples R China
