Abstract:
[Background] Traditional methods, owing to their static receptive-field design, struggle to adapt to the large scale differences among cars, pedestrians, and cyclists in urban autonomous-driving scenes, and cross-scale feature fusion often introduces hierarchical interference. [Methodology] To address the key challenge of cross-scale representation consistency in 3D object detection for multi-class, multi-scale objects in autonomous-driving scenarios, this study proposes a novel method named VoxTNT. VoxTNT leverages an equalized receptive field and a local-global collaborative attention mechanism to enhance detection performance. At the local level, a PointSetFormer module is introduced, incorporating an Induced Set Attention Block (ISAB) that aggregates fine-grained geometric features from high-density point clouds via induced cross-attention, overcoming the information loss typical of traditional voxel mean pooling. At the global level, a VoxelFormerFFN module is designed, which abstracts the non-empty voxels into a super-point set and applies cross-voxel ISAB interactions to capture long-range contextual dependencies. This reduces the computational complexity of global feature learning from O(N²) to O(M²), where M is the number of non-empty voxels and M ≪ N, the number of raw points, thereby avoiding the high cost of applying full Transformers directly to raw point clouds. This dual-domain coupled architecture achieves a dynamic balance between local fine-grained perception and global semantic association, effectively mitigating the modeling bias caused by fixed receptive fields and multi-scale fusion. [Results] Experiments show that the proposed method achieves a single-stage detection Average Precision (AP) of 59.56% for moderate-level pedestrian detection on the KITTI dataset, an improvement of approximately 12.4% over the SECOND baseline.
For two-stage detection, it achieves a mean Average Precision (mAP) of 66.54%, outperforming the second-best method, BSAODet, at 66.10%. Validation on the Waymo Open Dataset (WOD) further confirms the method's effectiveness, achieving 66.09% mAP and outperforming the SECOND and PointPillars baselines by 7.7% and 8.5%, respectively. Ablation studies show that the proposed equalized local-global receptive-field mechanism significantly improves detection accuracy for small objects: on KITTI, removing all proposed components dropped AP by 10.8% and 10.0% for moderate-level pedestrian and cyclist detection, respectively, while performance on large objects remained stable. [Conclusions] This study presents a novel approach to multi-scale object detection in autonomous-driving scenarios. Future work will focus on optimizing the model architecture to further improve efficiency. © 2025 Science Press. All rights reserved.
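The ISAB mechanism the abstract refers to originates in the Set Transformer literature: a small set of m inducing points first summarizes the n inputs, and the summary is then broadcast back, so attention cost scales as O(nm) rather than O(n²). The following is a minimal single-head numpy sketch under those generic assumptions (no layer norm, feed-forward, or learned projections; shapes are illustrative), not the paper's actual implementation:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    # Scaled dot-product attention: (nq,d),(nk,d),(nk,d) -> (nq,d).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def isab(x, inducing):
    # Induced set attention (sketch): the m inducing points attend to
    # all n inputs to form a compact summary (cost O(n*m)), then each
    # input attends back to that summary (another O(n*m)), avoiding
    # the O(n^2) cost of full pairwise self-attention.
    h = attend(inducing, x, x)   # (m, d) summary of the set
    return attend(x, h, h)       # (n, d) updated point features

rng = np.random.default_rng(0)
x = rng.normal(size=(512, 16))    # n = 512 points, feature dim d = 16
ind = rng.normal(size=(8, 16))    # m = 8 inducing points (m << n)
y = isab(x, ind)
print(y.shape)                    # (512, 16)
```

In the same spirit, the paper's VoxelFormerFFN applies this kind of interaction across non-empty voxels treated as a super-point set, which is where the stated O(N²) to O(M²) reduction comes from.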
Source: Journal of Geo-Information Science
ISSN: 1560-8999
Year: 2025
Volume: 27
Issue: 6
Pages: 1361-1380