RViT: Robust Fusion Vision Transformer with Variational Hierarchical Denoising Process for Image Classification

author：

Lin, Z. ^[1] | Wu, Y. ^[2] | Chen, J. ^[3] | Wang, S. ^[4] (Scholars：王石平)

Indexed by：

Scopus

Abstract：

Transformers, originally designed for natural language processing, have recently been explored for computer vision. Various Vision Transformers (ViTs) play an increasingly important role in image-related fields such as computer vision, multimodal fusion and multimedia analysis. However, to obtain promising performance, most existing ViTs rely on manually filtered high-quality images, which may still carry inherent noise. Moreover, such well-constructed images are not always available. To this end, we propose a Robust ViT (RViT) that focuses on learning relevant and robust representations for image classification tasks. Specifically, we first develop a novel Denoising VTUnet module, in which we conceptualize the nonrobust noise as uncertainty under variational conditions. Furthermore, we design a fusion transformer backbone with a tailored fusion attention mechanism to perform image classification effectively based on the extracted robust representations. To demonstrate the superiority of our model, comparative experiments are conducted on several popular datasets. Benefiting from the sequence regularity of the Transformer and the captured robust features, the proposed method outperforms the compared Transformer-based models on visual tasks. © 2024 Technical Committee on Guidance, Navigation and Control, CSAA.

Keyword：

fusion attention; image classification; robust representation learning; variational inference; vision transformer

Affiliation：

[ 1 ] [Lin Z.]College of Computer and Data Science, Fuzhou University, Fuzhou, 350108, China
[ 2 ] [Lin Z.]Key Laboratory of Intelligent Metro, Fujian Province University, Fuzhou, 350108, China
[ 3 ] [Wu Y.]College of Computer and Data Science, Fuzhou University, Fuzhou, 350108, China
[ 4 ] [Wu Y.]Key Laboratory of Intelligent Metro, Fujian Province University, Fuzhou, 350108, China
[ 5 ] [Chen J.]College of Computer and Data Science, Fuzhou University, Fuzhou, 350108, China
[ 6 ] [Chen J.]Key Laboratory of Intelligent Metro, Fujian Province University, Fuzhou, 350108, China
[ 7 ] [Wang S.]College of Computer and Data Science, Fuzhou University, Fuzhou, 350108, China
[ 8 ] [Wang S.]Key Laboratory of Intelligent Metro, Fujian Province University, Fuzhou, 350108, China

Related Keywords：

RViT: Robust Fusion Vision Transformer with Variational Hierarchical Denoising Process for Image Classification
2024，GUIDANCE NAVIGATION AND CONTROL
Deep Image Annotation and Classification by Fusing Multi-Modal Semantic Topics
2018，KSII TRANSACTIONS ON INTERNET AND INFORMATION SYSTEMS
RA-ViT: Patch-wise Radially-Accumulate Module for ViT in Hyperspectral Image Classification
2022，2022 6th International Conference on Machine Vision and Information Technology, CMVIT 2022
CVT-HNet: a fusion model for recognizing perianal fistulizing Crohn’s disease based on CNN and ViT
2025，BMC Medical Imaging

Source ：

Guidance, Navigation and Control

ISSN： 2737-4807

Year： 2024

Issue： 3

Volume： 4

ESI Highly Cited Papers on the List： 0

Affiliated Colleges：

College of Computer and Data Science / College of Software; data not explicitly attributed within this college/department
