Abstract:
[Objectives] High-resolution remote sensing image segmentation provides essential data support for urban planning, land use, and land cover analysis by accurately extracting terrain information. However, traditional methods struggle to predict object categories at the pixel level because of the high computational cost of processing high-resolution images. Current approaches often divide a remote sensing image into a series of standard blocks and perform multi-scale local segmentation, which captures semantic information at different granularities. Yet these methods exhibit weak feature interaction between blocks because they do not consider contextual prior knowledge, which ultimately reduces local segmentation performance.

[Methods] To address this issue, this paper proposes a high-resolution remote sensing image segmentation framework named CATrans (Cross-scale Attention Transformer), which combines cross-scale attention with a semantic-based visual Transformer. CATrans first predicts the segmentation results of local blocks and then merges them to produce the final global segmentation, introducing contextual prior knowledge to enhance local feature representation. Specifically, a cross-scale attention mechanism integrates contextual semantic information with multi-level features. The multi-branch parallel structure of the cross-scale attention module sharpens the focus on objects of varying granularities by analyzing shallow-deep and local-global dependencies. This mechanism aggregates cross-spatial information across various dimensions and weights multi-scale kernels to strengthen multi-level feature representations, allowing the model to avoid deep stacking and long sequential processing. In addition, a semantic-based visual Transformer couples multi-level contextual semantic information, with spatial attention reinforcing these semantic representations. The multi-level contextual information is grouped into abstract semantic concepts, which are then fed into the Transformer for sequence modeling. The self-attention mechanism within the Transformer captures dependencies between different positions in the input sequence, strengthening the correlation between contextual semantics and spatial positions. Finally, enhanced contextual semantics are generated through feature mapping.

[Results] Comparative experiments were conducted on the DeepGlobe, Inria Aerial, and LoveDA datasets. The results show that CATrans outperforms existing segmentation methods, including the Discrete Wavelet Smooth Network (WSDNet) and the Integrating Shallow and Deep Network (ISDNet). CATrans achieves a Mean Intersection over Union (mIoU) of 76.2%, 79.2%, and 54.2%, and a Mean F1 Score (mF1) of 86.5%, 87.8%, and 66.8%, with inference speeds of 38.1 FPS, 13.2 FPS, and 95.22 FPS on the respective datasets. Compared to the best-performing existing method, WSDNet, CATrans improves segmentation performance across all classes, with mIoU gains of 2.1%, 4.0%, and 5.3%, and mF1 gains of 1.3%, 1.8%, and 5.6%.

[Conclusions] These findings show that the proposed CATrans framework significantly improves high-resolution remote sensing image segmentation by incorporating contextual prior knowledge to strengthen local feature representation, achieving an effective balance between segmentation performance and computational efficiency.

© 2025 Science Press. All rights reserved.
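The block-wise pipeline the abstract describes (segment local blocks, then merge into a global map) can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the `model` callable, the 512-pixel tile size, and the non-overlapping tiling are placeholders rather than values from the paper, and the sketch omits the contextual prior knowledge that CATrans shares across blocks.

```python
# Minimal sketch of tile-then-merge inference for a high-resolution
# image. Tile size and `model` are hypothetical; edge tiles may be
# smaller than `tile`, so the model is assumed to accept any size and
# to preserve spatial dimensions, as segmentation networks typically do.
import torch

@torch.no_grad()
def tiled_segmentation(model, image, tile=512):
    """image: (C, H, W) tensor; returns (num_classes, H, W) logits."""
    _, h, w = image.shape
    out = None
    for top in range(0, h, tile):
        for left in range(0, w, tile):
            block = image[:, top:top + tile, left:left + tile]
            logits = model(block.unsqueeze(0)).squeeze(0)
            if out is None:  # allocate once num_classes is known
                out = image.new_zeros(logits.shape[0], h, w)
            out[:, top:top + block.shape[1],
                left:left + block.shape[2]] = logits
    return out
```

A naive merge like this is exactly where the weak inter-block interaction the abstract criticizes shows up; practical pipelines overlap tiles and blend logits at seams, while CATrans instead injects contextual priors into each block's features.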
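For the cross-scale attention of [Methods], a minimal PyTorch sketch follows, assuming parallel depthwise branches with kernel sizes 3/5/7 and a softmax-weighted fusion driven by global context. The branch count, kernel sizes, and fusion rule are illustrative guesses; the abstract does not specify the exact architecture.

```python
# Hypothetical sketch of a multi-branch cross-scale attention block:
# parallel multi-scale kernels capture local-global dependencies, and
# global context weights the branches, avoiding deep stacking.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossScaleAttention(nn.Module):
    def __init__(self, channels, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # Parallel depthwise convolutions attend to objects of
        # different granularities.
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes
        )
        # Per-branch scalar weights derived from pooled global context.
        self.weight_fc = nn.Linear(channels, len(kernel_sizes))

    def forward(self, shallow, deep):
        # Align deep (low-resolution, semantic) features with shallow
        # (high-resolution, detailed) ones, then mix them.
        deep = F.interpolate(deep, size=shallow.shape[-2:],
                             mode="bilinear", align_corners=False)
        x = shallow + deep
        # Global average pooling -> softmax weights over branches.
        context = x.mean(dim=(2, 3))                   # (B, C)
        w = self.weight_fc(context).softmax(dim=-1)    # (B, K)
        # Weighted sum of the multi-scale branch outputs.
        return sum(w[:, i, None, None, None] * b(x)
                   for i, b in enumerate(self.branches))

csa = CrossScaleAttention(channels=64)
y = csa(torch.randn(2, 64, 128, 128), torch.randn(2, 64, 32, 32))
```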
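The semantic-based visual Transformer step (grouping multi-level context into abstract semantic concepts via spatial attention, sequence modeling with self-attention, then feature mapping back) might look roughly like the sketch below; the token count, encoder depth, and attention-pooling tokenizer are assumptions rather than the paper's design.

```python
# Hypothetical sketch: spatial attention pools pixel features into a few
# abstract semantic tokens, a Transformer encoder models dependencies
# among them, and the enhanced tokens are mapped back onto pixels.
import torch
import torch.nn as nn

class SemanticTransformer(nn.Module):
    def __init__(self, channels, num_tokens=16, depth=2, heads=4):
        super().__init__()
        # 1x1 conv scores each pixel's affinity to every semantic token.
        self.token_attn = nn.Conv2d(channels, num_tokens, kernel_size=1)
        layer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, feats):
        b, c, h, w = feats.shape
        pixels = feats.flatten(2).transpose(1, 2)             # (B, HW, C)
        # Spatial attention: softmax over pixels, then pool per token.
        attn = self.token_attn(feats).flatten(2).softmax(-1)  # (B, T, HW)
        tokens = attn @ pixels                                # (B, T, C)
        # Self-attention captures dependencies between semantic tokens.
        tokens = self.encoder(tokens)
        # Feature mapping: project enhanced semantics back to pixels.
        out = attn.transpose(1, 2) @ tokens                   # (B, HW, C)
        return out.transpose(1, 2).reshape(b, c, h, w) + feats

st = SemanticTransformer(channels=64)
y = st(torch.randn(2, 64, 32, 32))  # -> (2, 64, 32, 32)
```

Note that `d_model` (here, the channel count) must be divisible by the head count; the residual connection keeps the original multi-level features alongside the enhanced semantics.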
Source: Journal of Geo-Information Science
ISSN: 1560-8999
Year: 2025
Issue: 7
Volume: 27
Page: 1624-1637