Abstract:
Transferring contrastive vision-language pretrained models, such as contrastive language-image pretraining (CLIP), to video recognition tasks has attracted much attention. Recent studies in this area apply prompt learning within either the text or the vision branch, or fine-tune CLIP end to end. However, these methods do not fully exploit the learning potential of the two branches and may compromise zero-shot generalization. In this work, we present TransCLIP, a multimodal framework that adapts vision-language models by integrating both adapter and prompt tuning techniques into the vision and text encoders. Specifically, we incorporate learnable prompt tokens into the input of each transformer encoder layer in the vision and text branches, and integrate lightweight adapters into the key and value matrices of the multi-head self-attention modules, enhancing the model's ability to capture video-specific features. To effectively leverage temporal information in videos, we implement a temporal difference attention module (TemDiff attn) that explicitly computes differences between adjacent frame embeddings and performs difference-level attention to encode motion-related temporal dependencies. In addition, a coarse-and-fine contrastive learning strategy is employed to better align the video and text branches, enhancing the learning capability of the whole framework. Across different evaluation settings, our model consistently outperforms previous state-of-the-art methods on several video action recognition benchmarks.
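To make the two mechanisms named in the abstract concrete, below is a minimal PyTorch sketch, not the authors' implementation: the module names (KVAdapter, TemDiffAttention), the bottleneck width, the head count, and the (B, T, D) frame-embedding shape are assumptions for illustration only.

```python
# Minimal sketch of a key/value bottleneck adapter and a temporal difference
# attention block, assuming per-frame embeddings of shape (B, T, D).
import torch
import torch.nn as nn


class KVAdapter(nn.Module):
    """Hypothetical lightweight adapter added to the key/value projections of a
    frozen multi-head self-attention block (bottleneck down-up with residual)."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        # Zero-init the up-projection so training starts from the frozen CLIP behavior.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, kv: torch.Tensor) -> torch.Tensor:
        return kv + self.up(torch.relu(self.down(kv)))


class TemDiffAttention(nn.Module):
    """Hypothetical temporal difference attention: self-attention over the
    differences between adjacent frame embeddings to encode motion cues."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, D) frame embeddings from the vision encoder
        diffs = frames[:, 1:, :] - frames[:, :-1, :]   # (B, T-1, D) adjacent-frame differences
        diffs = self.norm(diffs)
        motion, _ = self.attn(diffs, diffs, diffs)     # difference-level self-attention
        # Fuse motion cues back into the frame stream (first frame has no predecessor).
        return torch.cat([frames[:, :1, :], frames[:, 1:, :] + motion], dim=1)
```

As a design note, zero-initializing the adapter's up-projection and adding the motion branch residually are common ways to keep the frozen CLIP features intact at the start of training; whether TransCLIP uses these exact choices is not stated in the abstract.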
Source: IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS
ISSN: 1551-3203
Year: 2025
Impact Factor: 11.700 (JCR@2023)