TransCLIP: Transferring Vision–Language Models for Efficient Video Action Recognition

Authors:

Wang, W. [1] | Su, Y. [2] | Gu, J. [3]

Indexed by：

Scopus

Abstract：

Transferring contrastive vision–language pretrained models, such as contrastive language–image pretraining (CLIP), to video recognition tasks has attracted much attention. Recent studies in this area have applied prompt learning within either the text or the vision branch, or have fine-tuned CLIP end to end. However, these methods do not fully leverage the learning potential of the two branches and may compromise zero-shot generalization. In this work, we present TransCLIP, a multimodal framework that adapts vision–language models by integrating both adapter and prompt tuning techniques for the vision and text encoders. Specifically, we incorporate learnable prompt tokens into the input of each transformer encoder layer in the vision and text branches, and integrate lightweight adapters into the key and value matrices of the multi-head self-attention modules, enhancing the model's capability to capture more relevant video-specific features. To effectively leverage temporal information in videos, we implement a temporal difference attention module (TemDiff attn) that explicitly computes differences between adjacent frame embeddings and conducts difference-level attention to encode motion-related temporal dependency in videos. In addition, a coarse-and-fine contrastive learning strategy is employed to better align the video and text branches, enhancing the learning capability of the whole framework. Across different evaluation settings, our model consistently outperforms previous state-of-the-art methods on several video action recognition benchmarks. © 2005-2012 IEEE.
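The abstract describes the temporal difference attention step only in words. The following is a minimal sketch of that idea under stated assumptions, not the authors' implementation: it uses a single attention head with no learned query, key, or value projections, and the function name `temporal_difference_attention` is hypothetical. It computes differences between adjacent frame embeddings, then applies scaled dot-product attention over those differences.

```python
import math


def temporal_difference_attention(frames):
    """Illustrative sketch only (assumed design, not the paper's module).

    frames: list of T per-frame embeddings, each a list of D floats.
    Returns T-1 vectors, one per adjacent-frame difference.
    """
    if len(frames) < 2:
        raise ValueError("need at least two frames")

    # Step 1: explicit differences between adjacent frame embeddings.
    diffs = [[b - a for a, b in zip(prev, cur)]
             for prev, cur in zip(frames, frames[1:])]

    dim = len(diffs[0])
    scale = 1.0 / math.sqrt(dim)
    attended = []
    for query in diffs:
        # Step 2: scaled dot-product scores against every difference vector.
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in diffs]
        # Numerically stable softmax over the scores.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Step 3: attention output is a weighted sum of difference vectors.
        attended.append([sum(w * v[d] for w, v in zip(weights, diffs))
                         for d in range(dim)])
    return attended


if __name__ == "__main__":
    # Three toy 2-D frame embeddings (made-up values for illustration).
    video = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
    for row in temporal_difference_attention(video):
        print(row)
```

In this sketch a static clip (identical frames) yields all-zero outputs, which reflects the intent of attending to motion rather than appearance. A trained module would presumably add learned projections and multiple heads.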

Keywords:

Action recognition; contrastive vision–language models (VLMs); parameter-efficient tuning

Affiliations:

[1] [Wang W.] Tianmushan Laboratory, Hangzhou, 311115, China
[2] [Su Y.] Fuzhou University, Fuzhou, 350108, China
[3] [Gu J.] Dalhousie University, Electrical and Computer Engineering, Halifax, B3H 4R2, Canada

Related Articles:

TransCLIP: Transferring Vision-Language Models for Efficient Video Action Recognition
2025，IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS
Temporal-masked skeleton-based action recognition with supervised contrastive learning
2023，SIGNAL IMAGE AND VIDEO PROCESSING
Human action recognition via multi-task learning base on spatial-temporal feature
2015，INFORMATION SCIENCES

Source ：

IEEE Transactions on Industrial Informatics

ISSN： 1551-3203

Year： 2025

Impact Factor: 11.700 (JCR@2023)
