Abstract:
With the continuing advance of computer vision research, deep learning has been applied ever more widely, and its introduction into object detection has greatly improved detection performance. To push detection accuracy further, however, networks have grown steadily deeper and wider, yielding models with large parameter counts and complex structures that are difficult to deploy in practice. To address the high computational cost, high memory consumption, and difficulty of efficient edge deployment of visible-infrared dual-modal fusion detection models, this paper proposes a lightweight pedestrian and vehicle detection model based on the fusion of visible and infrared modalities. The multimodal input greatly improves the stability of the algorithm in all-weather operation, so the detection task can be accomplished well under snow, fog, or low illumination.
The proposed model replaces the YOLOv7-tiny backbone with the lightweight MobileNetV2 network. MobileNetV2 employs an inverted residual structure and a linear bottleneck layer, both of which strengthen its feature representation and learning capability, and it relies on depthwise separable convolution, which, unlike conventional convolution, splits the operation into depthwise and pointwise components. This paper also proposes a differential modal fusion module inspired by the principle of differential amplification circuits: it takes the difference between the two modal feature maps, separates the differential-mode and common-mode information, and amplifies the differential-mode information so that the complementary strengths of the visible and infrared modalities are fully exploited. Because infrared and visible images affect model performance differently under low illumination and adverse weather, an illumination-aware module is also introduced; it dynamically assigns weights to the visible and infrared features according to the illumination conditions, making the most of the feature information in each modality. (Illustrative sketches of these three components are given after the abstract.)
Experiments are conducted on three public datasets: FLIR ADAS, LLVIP, and KAIST. The proposed model is compared with lightweight single-modal algorithms such as YOLOv5s and YOLOv7-tiny and with dual-modal algorithms such as ICAFusion and CFT. The results show that the bimodal detection models outperform the unimodal detection models. On the FLIR ADAS dataset, the proposed model improves detection accuracy by 11.6% and 15.3% over the unimodal YOLOv5s and YOLOv7-tiny models with RGB input, and by 3.3% and 4.9% over the same models with infrared input. Compared with the baseline model, accuracy improves by 3.8%; compared with the bimodal models ICAFusion and CFT, it improves by 3.8% and 1.9%, respectively. On the LLVIP dataset, the proposed model improves detection accuracy by 6.9% and 1.1% over YOLOv7-tiny with visible and infrared inputs, and by 5.4% and 2.0% over YOLOv5s with visible and infrared inputs. Compared with the baseline model, accuracy improves by 1.3%; compared with the bimodal models ICAFusion and SLBAF, it improves by 9.6% and 1.5%, respectively. On the KAIST dataset, the proposed model improves detection accuracy by 26.9% and 7.8% over YOLOv7-tiny with visible and infrared inputs, respectively, and by 3.4% over the baseline model. Among the bimodal models compared, it achieves the highest detection accuracy, reaching 76.2%. In terms of inference speed, the proposed model reaches 208 FPS, 103 FPS, and 113 FPS on the FLIR ADAS, LLVIP, and KAIST datasets, respectively. These results show that the proposed model offers clear advantages in detection accuracy, speed, and robustness. © 2025, Chinese Optical Society. All rights reserved.
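The abstract's description of the MobileNetV2 backbone (inverted residual structure, linear bottleneck, depthwise separable convolution) can be made concrete with a minimal PyTorch sketch. This is a generic MobileNetV2-style block, not the paper's exact configuration; the channel counts, expansion ratio, and class name are illustrative assumptions.

```python
# Minimal sketch of a MobileNetV2-style inverted residual block built from a
# depthwise separable convolution. Channel sizes and the expansion ratio are
# illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn


class InvertedResidual(nn.Module):
    """Pointwise expansion (1x1) -> depthwise (3x3) -> linear projection (1x1)."""

    def __init__(self, in_ch: int, out_ch: int, stride: int = 1, expand_ratio: int = 6):
        super().__init__()
        hidden = in_ch * expand_ratio
        self.use_residual = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            # Pointwise expansion mixes channels and widens the representation.
            nn.Conv2d(in_ch, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # Depthwise 3x3: groups == channels, so each channel is filtered independently.
            nn.Conv2d(hidden, hidden, 3, stride, 1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # Linear bottleneck: pointwise projection with no activation afterwards.
            nn.Conv2d(hidden, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.block(x)
        return x + out if self.use_residual else out


if __name__ == "__main__":
    x = torch.randn(1, 32, 80, 80)            # dummy feature map
    print(InvertedResidual(32, 32)(x).shape)  # torch.Size([1, 32, 80, 80])
```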
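The differential modal fusion module is described only through its analogy with a differential amplifier: split the two modal features into common-mode and differential-mode parts, then amplify the differential part. The sketch below is a hypothetical reading of that idea, not the paper's implementation; the gain branch, the recombination by concatenation, and all names are assumptions.

```python
# Hypothetical sketch of differential-style fusion of visible and infrared features.
# The common-mode term carries shared information, the differential-mode term carries
# complementary information, and a learned per-channel gain amplifies the latter.
import torch
import torch.nn as nn


class DifferentialFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Learned per-channel "gain" applied to the differential component (assumed design).
        self.gain = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, 1),
            nn.Sigmoid(),
        )
        # Recombine the two components into a single fused feature map.
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, f_rgb: torch.Tensor, f_ir: torch.Tensor) -> torch.Tensor:
        common = 0.5 * (f_rgb + f_ir)          # common-mode (shared) information
        diff = 0.5 * (f_rgb - f_ir)            # differential-mode (complementary) information
        diff = diff * (1.0 + self.gain(diff))  # amplify the differential component
        return self.fuse(torch.cat([common, diff], dim=1))
```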
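The illumination-aware module weights the two modalities according to lighting conditions. Below is a hypothetical sketch under assumed design choices: a small gating network fed by a downsampled visible image predicts softmax weights that scale the visible and infrared feature maps. The paper's actual gate may differ.

```python
# Hypothetical illumination-aware gate: predict per-image weights for the visible
# and infrared branches from the visible input, then blend the feature maps.
import torch
import torch.nn as nn


class IlluminationGate(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.AdaptiveAvgPool2d(8),                 # coarse 8x8 summary of the RGB image
            nn.Flatten(),
            nn.Linear(3 * 8 * 8, 32),
            nn.ReLU(inplace=True),
            nn.Linear(32, 2),                        # logits for (visible, infrared)
        )

    def forward(self, rgb_image: torch.Tensor,
                f_rgb: torch.Tensor, f_ir: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.net(rgb_image), dim=1)  # (B, 2), weights sum to 1
        w_rgb = w[:, 0].view(-1, 1, 1, 1)
        w_ir = w[:, 1].view(-1, 1, 1, 1)
        return w_rgb * f_rgb + w_ir * f_ir             # illumination-weighted blend
```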
Source: Acta Photonica Sinica
ISSN: 1004-4213
Year: 2025
Issue: 6
Volume: 54
Impact Factor (JCR@2023): 0.600