
Authors:

Wang, W. [1] | Su, Y. [2] | Gu, J. [3]

Indexed by:

Scopus

Abstract:

Transferring contrastive vision–language pretrained models, such as contrastive language–image pretraining (CLIP), to video recognition tasks has attracted much attention. Recent studies in this area have applied prompt learning within either the text or the vision branch, or have fine-tuned CLIP end to end. However, these methods do not fully leverage the learning potential of the two branches and may compromise zero-shot generalization. In this work, we present a multimodal framework, TransCLIP, which adapts vision–language models by integrating both adapter and prompt tuning techniques into the vision and text encoders. Specifically, we incorporate learnable prompt tokens into the input of each transformer encoder layer in the vision and text branches, and integrate lightweight adapters into the key and value matrices of the multi-head self-attention modules, enhancing the model’s capability to capture more relevant video-specific features. To effectively leverage temporal information in videos, we implement a temporal difference attention module (TemDiff attn) that explicitly computes differences between adjacent frame embeddings and applies difference-level attention to encode motion-related temporal dependencies. In addition, a coarse-and-fine contrastive learning strategy is employed to better align the video and text branches, enhancing the learning capability of the whole framework. Across different evaluation settings, our model consistently outperforms previous state-of-the-art methods on several video action recognition benchmarks. © 2005-2012 IEEE.
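
The temporal difference idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`temporal_differences`, `difference_attention`) are hypothetical, the attention uses no learned query/key/value projections, and frame embeddings are plain Python lists. It only shows the two steps the abstract names, namely differencing adjacent frame embeddings and then attending over those differences.

```python
import math

def temporal_differences(frames):
    """Differences between adjacent frame embeddings: d_t = f_{t+1} - f_t."""
    return [[b - a for a, b in zip(prev, nxt)]
            for prev, nxt in zip(frames, frames[1:])]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def difference_attention(diffs):
    """Scaled dot-product self-attention over the difference vectors.

    Assumption: queries, keys, and values are the raw differences;
    a real module would apply learned projections first.
    """
    dim = len(diffs[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in diffs:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in diffs]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, diffs))
                    for j in range(dim)])
    return out

# Four toy frame embeddings of dimension 2 (illustrative values only).
frames = [[0.0, 0.0], [1.0, 0.0], [1.0, 2.0], [1.0, 2.0]]
diffs = temporal_differences(frames)      # three difference vectors
attended = difference_attention(diffs)    # one output per difference
```

A clip with T frames yields T - 1 difference vectors, so the attended sequence is one step shorter than the frame sequence. A static pair of frames produces a zero difference, which in this sketch attends uniformly to all differences.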

Keyword:

Action recognition; contrastive vision–language models (VLMs); parameter-efficient tuning

Affiliations:

  • [ 1 ] [Wang W.]Tianmushan Laboratory, Hangzhou, 311115, China
  • [ 2 ] [Su Y.]Fuzhou University, Fuzhou, 350108, China
  • [ 3 ] [Gu J.]Dalhousie University, Electrical and Computer Engineering, Halifax, B3H 4R2, Canada

Source:

IEEE Transactions on Industrial Informatics

ISSN: 1551-3203

Year: 2025

11.700 (JCR@2023)
