Abstract:
With the continuing advance of computer vision research, deep learning has been applied ever more widely, and its introduction into object detection has greatly improved detection performance. To push detection accuracy further, however, networks have grown steadily deeper and wider, yielding models with large parameter counts and complex structures that are difficult to deploy in practice. To address the high computational cost, high memory consumption, and difficulty of efficient edge deployment of visible-infrared dual-modal fusion detection models, this paper proposes a lightweight pedestrian and vehicle detection model based on the fusion of visible and infrared modalities. The multimodal input greatly improves the stability of the algorithm in all-weather operation, so the detection task can be accomplished well under snow, fog, or low illumination.
The proposed model replaces the YOLOv7-tiny backbone with the lightweight MobileNetV2 network. MobileNetV2 employs an inverted residual structure and a linear bottleneck layer, both of which strengthen its feature representation and learning capability, and it relies on depthwise separable convolution, which, unlike conventional convolution, splits the operation into depthwise and pointwise components. This paper also proposes a differential modal fusion module inspired by the principle of differential amplification circuits: it takes the difference between the two modal feature maps, separates the differential-mode and common-mode information, and amplifies the differential-mode information so that the complementary strengths of the visible and infrared modalities are fully exploited. Because infrared and visible images affect model performance differently under low illumination and adverse weather, an illumination-aware module is also introduced; it dynamically assigns weights to the visible and infrared features according to the illumination conditions, making the most of the feature information in each modality. (Illustrative sketches of these three components are given after the abstract.)
Experiments are conducted on three public datasets: FLIR ADAS, LLVIP, and KAIST. The proposed model is compared with lightweight single-modal algorithms such as YOLOv5s and YOLOv7-tiny and with dual-modal algorithms such as ICAFusion and CFT. The results show that the bimodal detection models outperform the unimodal detection models. On the FLIR ADAS dataset, the proposed model improves detection accuracy by 11.6% and 15.3% over the unimodal YOLOv5s and YOLOv7-tiny models with RGB input, and by 3.3% and 4.9% over the same models with infrared input. Compared with the baseline model, accuracy improves by 3.8%; compared with the bimodal models ICAFusion and CFT, it improves by 3.8% and 1.9%, respectively. On the LLVIP dataset, the proposed model improves detection accuracy by 6.9% and 1.1% over YOLOv7-tiny with visible and infrared inputs, and by 5.4% and 2.0% over YOLOv5s with visible and infrared inputs. Compared with the baseline model, accuracy improves by 1.3%; compared with the bimodal models ICAFusion and SLBAF, it improves by 9.6% and 1.5%, respectively. On the KAIST dataset, the proposed model improves detection accuracy by 26.9% and 7.8% over YOLOv7-tiny with visible and infrared inputs, respectively, and by 3.4% over the baseline model. Among the bimodal models compared, it achieves the highest detection accuracy, reaching 76.2%. In terms of inference speed, the proposed model reaches 208 FPS, 103 FPS, and 113 FPS on the FLIR ADAS, LLVIP, and KAIST datasets, respectively. These results show that the proposed model offers clear advantages in detection accuracy, speed, and robustness. © 2025, Chinese Optical Society. All rights reserved.
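The abstract's description of the MobileNetV2 backbone (inverted residual structure, linear bottleneck, depthwise separable convolution) can be made concrete with a minimal PyTorch sketch. This is a generic MobileNetV2-style block, not the paper's exact configuration; the channel counts, expansion ratio, and class name are illustrative assumptions.

```python
# Minimal sketch of a MobileNetV2-style inverted residual block built from a
# depthwise separable convolution. Channel sizes and the expansion ratio are
# illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn


class InvertedResidual(nn.Module):
    """Pointwise expansion (1x1) -> depthwise (3x3) -> linear projection (1x1)."""

    def __init__(self, in_ch: int, out_ch: int, stride: int = 1, expand_ratio: int = 6):
        super().__init__()
        hidden = in_ch * expand_ratio
        self.use_residual = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            # Pointwise expansion mixes channels and widens the representation.
            nn.Conv2d(in_ch, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # Depthwise 3x3: groups == channels, so each channel is filtered independently.
            nn.Conv2d(hidden, hidden, 3, stride, 1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # Linear bottleneck: pointwise projection with no activation afterwards.
            nn.Conv2d(hidden, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.block(x)
        return x + out if self.use_residual else out


if __name__ == "__main__":
    x = torch.randn(1, 32, 80, 80)            # dummy feature map
    print(InvertedResidual(32, 32)(x).shape)  # torch.Size([1, 32, 80, 80])
```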
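The differential modal fusion module is described only through its analogy with a differential amplifier: split the two modal features into common-mode and differential-mode parts, then amplify the differential part. The sketch below is a hypothetical reading of that idea, not the paper's implementation; the gain branch, the recombination by concatenation, and all names are assumptions.

```python
# Hypothetical sketch of differential-style fusion of visible and infrared features.
# The common-mode term carries shared information, the differential-mode term carries
# complementary information, and a learned per-channel gain amplifies the latter.
import torch
import torch.nn as nn


class DifferentialFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Learned per-channel "gain" applied to the differential component (assumed design).
        self.gain = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, 1),
            nn.Sigmoid(),
        )
        # Recombine the two components into a single fused feature map.
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, f_rgb: torch.Tensor, f_ir: torch.Tensor) -> torch.Tensor:
        common = 0.5 * (f_rgb + f_ir)          # common-mode (shared) information
        diff = 0.5 * (f_rgb - f_ir)            # differential-mode (complementary) information
        diff = diff * (1.0 + self.gain(diff))  # amplify the differential component
        return self.fuse(torch.cat([common, diff], dim=1))
```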
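The illumination-aware module weights the two modalities according to lighting conditions. Below is a hypothetical sketch under assumed design choices: a small gating network fed by a downsampled visible image predicts softmax weights that scale the visible and infrared feature maps. The paper's actual gate may differ.

```python
# Hypothetical illumination-aware gate: predict per-image weights for the visible
# and infrared branches from the visible input, then blend the feature maps.
import torch
import torch.nn as nn


class IlluminationGate(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.AdaptiveAvgPool2d(8),                 # coarse 8x8 summary of the RGB image
            nn.Flatten(),
            nn.Linear(3 * 8 * 8, 32),
            nn.ReLU(inplace=True),
            nn.Linear(32, 2),                        # logits for (visible, infrared)
        )

    def forward(self, rgb_image: torch.Tensor,
                f_rgb: torch.Tensor, f_ir: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.net(rgb_image), dim=1)  # (B, 2), weights sum to 1
        w_rgb = w[:, 0].view(-1, 1, 1, 1)
        w_ir = w[:, 1].view(-1, 1, 1, 1)
        return w_rgb * f_rgb + w_ir * f_ir             # illumination-weighted blend
```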
Source: Acta Photonica Sinica
ISSN: 1004-4213
Year: 2025
Issue: 6
Volume: 54
Impact Factor (JCR@2023): 0.600