Research Output Search
HSENet: Hierarchical semantic-enriched network for multi-modal image fusion (EI, SCIE, Scopus)
Journal article | 2026, 170 | PATTERN RECOGNITION

Abstract :

In this paper, we propose HSENet, a hierarchical semantic-enriched network capable of generating high-quality fused images with robust global semantic consistency and excellent local detail representation. The core innovation of HSENet lies in its hierarchical enrichment of semantic information through semantic gathering, distribution, and injection. Specifically, the network begins by balancing global information exchange via multi-scale feature aggregation and redistribution while dynamically bridging fusion and segmentation tasks. Following this, a progressive semantic dense injection strategy is introduced, employing dense connections to first inject global semantics into highly consistent infrared features and then propagate the semantic-infrared hybrid features to visible features. This approach effectively enhances semantic representation while minimizing high-frequency information loss. Furthermore, HSENet includes two types of feature fusion modules: one leverages cross-modal attention for more comprehensive feature fusion, and the other utilizes semantic features as a third input to further enhance the semantic representation for image fusion. These modules achieve robust and flexible feature fusion in complex scenarios by dynamically balancing global semantic consistency and fine-grained local detail representation. Our approach excels in visual perception tasks while fully preserving the texture features from the source modalities. Comparison experiments on image fusion and semantic segmentation demonstrate the superiority of HSENet in visual quality and semantic preservation. The code is available at https://github.com/Lxyklmyt/HSENet.
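
As a rough illustration of the cross-modal attention fusion described above, the sketch below fuses infrared and visible feature maps with multi-head cross-attention and optionally injects semantic features as extra attention tokens. It is a minimal PyTorch sketch under assumed shapes and layer choices, not the authors' implementation.

```python
# Hypothetical sketch of a cross-modal attention fusion step (not the authors' code).
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Fuses infrared and visible feature maps with cross-attention;
    an optional semantic feature map can be added as a third input."""
    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.proj = nn.Conv2d(channels * 2, channels, kernel_size=1)

    def forward(self, ir, vis, sem=None):
        b, c, h, w = ir.shape
        q = vis.flatten(2).transpose(1, 2)           # visible features as queries
        kv = ir.flatten(2).transpose(1, 2)           # infrared features as keys/values
        if sem is not None:                          # inject semantics as extra tokens
            kv = torch.cat([kv, sem.flatten(2).transpose(1, 2)], dim=1)
        fused, _ = self.attn(q, kv, kv)
        fused = fused.transpose(1, 2).reshape(b, c, h, w)
        return self.proj(torch.cat([fused, vis], dim=1))

ir = torch.randn(1, 64, 32, 32)
vis = torch.randn(1, 64, 32, 32)
sem = torch.randn(1, 64, 32, 32)
print(CrossModalFusion(64)(ir, vis, sem).shape)  # torch.Size([1, 64, 32, 32])
```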

Keyword :

High-level vision task; Image fusion; Progressive semantic dense injection; Semantic gathering and distribution

Cite:

GB/T 7714 Liu, Xinyu , Ming, Rui , Du, Songlin et al. HSENet: Hierarchical semantic-enriched network for multi-modal image fusion [J]. | PATTERN RECOGNITION , 2026 , 170 .
MLA Liu, Xinyu et al. "HSENet: Hierarchical semantic-enriched network for multi-modal image fusion" . | PATTERN RECOGNITION 170 (2026) .
APA Liu, Xinyu , Ming, Rui , Du, Songlin , He, Lianghua , Xiao, Guobao . HSENet: Hierarchical semantic-enriched network for multi-modal image fusion . | PATTERN RECOGNITION , 2026 , 170 .

Building Type Classification Using CNN-Transformer Cross-Encoder Adaptive Learning From Very High Resolution Satellite Images (EI, SCIE, Scopus)
Journal article | 2025, 18, 976-994 | IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING

Abstract :

Building type information indicates the functional properties of buildings and plays a crucial role in smart city development and urban socioeconomic activities. Existing methods for classifying building types often struggle to accurately distinguish between building types while maintaining well-delineated boundaries, especially in complex urban environments. This study introduces a novel framework, i.e., the CNN-Transformer cross-attention feature fusion network (CTCFNet), for building type classification from very high resolution remote sensing images. CTCFNet integrates convolutional neural networks (CNNs) and Transformers using an interactive cross-encoder fusion module that enhances semantic feature learning and improves classification accuracy in complex scenarios. We develop an adaptive collaboration optimization module that applies human visual attention mechanisms to enhance the feature representation of building types and boundaries simultaneously. To address the scarcity of datasets in building type classification, we create two new datasets, i.e., the urban building type (UBT) dataset and the town building type (TBT) dataset, for model evaluation. Extensive experiments on these datasets demonstrate that CTCFNet outperforms popular CNNs, Transformers, and dual-encoder methods in identifying building types across various regions, achieving the highest mean intersection over union of 78.20% and 77.11%, F1 scores of 86.83% and 88.22%, and overall accuracy of 95.07% and 95.73% on the UBT and TBT datasets, respectively. We conclude that CTCFNet effectively addresses the challenges of high interclass similarity and intraclass inconsistency in complex scenes, yielding results with well-delineated building boundaries and accurate building types.
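
For intuition about a CNN-Transformer cross-encoder, the following sketch shows one stage in which a convolutional branch and a Transformer branch exchange information through cross-attention. All module names, dimensions, and the exchange pattern are assumptions for illustration; CTCFNet's actual interactive cross-encoder fusion module is more elaborate.

```python
# Illustrative sketch (assumed design, not CTCFNet itself): one dual-encoder stage in
# which a CNN branch and a Transformer branch exchange features via cross-attention.
import torch
import torch.nn as nn

class CrossEncoderStage(nn.Module):
    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.transformer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=heads, batch_first=True)
        self.cnn_to_trans = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.trans_to_cnn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, cnn_feat, trans_tokens):
        b, c, h, w = cnn_feat.shape
        cnn_feat = self.conv(cnn_feat)                 # local (CNN) pathway
        trans_tokens = self.transformer(trans_tokens)  # global (Transformer) pathway
        cnn_tokens = cnn_feat.flatten(2).transpose(1, 2)
        # Each branch queries the other branch's features.
        t2c, _ = self.trans_to_cnn(cnn_tokens, trans_tokens, trans_tokens)
        c2t, _ = self.cnn_to_trans(trans_tokens, cnn_tokens, cnn_tokens)
        return cnn_feat + t2c.transpose(1, 2).reshape(b, c, h, w), trans_tokens + c2t

cnn = torch.randn(1, 64, 32, 32)   # CNN feature map
tokens = torch.randn(1, 196, 64)   # Transformer tokens (e.g., 14x14 patches)
c_out, t_out = CrossEncoderStage(64)(cnn, tokens)
print(c_out.shape, t_out.shape)    # torch.Size([1, 64, 32, 32]) torch.Size([1, 196, 64])
```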

Keyword :

Accuracy; Architecture; Buildings; Building type classification; CNN-transformer networks; cross-encoder; Earth; Feature extraction; feature interaction; Optimization; Remote sensing; Semantics; Transformers; very high resolution remote sensing; Visualization

Cite:

GB/T 7714 Zhang, Shaofeng , Li, Mengmeng , Zhao, Wufan et al. Building Type Classification Using CNN-Transformer Cross-Encoder Adaptive Learning From Very High Resolution Satellite Images [J]. | IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING , 2025 , 18 : 976-994 .
MLA Zhang, Shaofeng et al. "Building Type Classification Using CNN-Transformer Cross-Encoder Adaptive Learning From Very High Resolution Satellite Images" . | IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING 18 (2025) : 976-994 .
APA Zhang, Shaofeng , Li, Mengmeng , Zhao, Wufan , Wang, Xiaoqin , Wu, Qunyong . Building Type Classification Using CNN-Transformer Cross-Encoder Adaptive Learning From Very High Resolution Satellite Images . | IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING , 2025 , 18 , 976-994 .

A large-scale VHR parcel dataset and a novel hierarchical semantic boundary-guided network for agricultural parcel delineation (EI, SCIE, Scopus)
Journal article | 2025, 221, 1-19 | ISPRS JOURNAL OF PHOTOGRAMMETRY AND REMOTE SENSING

Abstract :

Current agricultural parcel (AP) extraction faces two main limitations: (1) existing AP delineation methods fail to fully utilize low-level information (e.g., parcel boundary information), leading to unsatisfactory performance under certain circumstances; (2) the lack of large-scale, high-resolution AP benchmark datasets in China hinders comprehensive model evaluation and improvement. To address the first limitation, we develop a hierarchical semantic boundary-guided network (HBGNet) to fully leverage boundary semantics, thereby improving AP delineation. It integrates two branches, a core branch of AP feature extraction and an auxiliary branch related to boundary feature mining. Specifically, the boundary extraction branch employs a module based on the Laplace convolution operator to enhance the model's awareness of parcel boundaries. For AP feature extraction, a local-global context aggregation module is designed to enhance the semantic representation of APs, improving the adaptability across different AP scenarios. Meanwhile, a boundary-guided module is developed to enhance boundary details of high-level AP semantic information. Ultimately, a multi-grained feature fusion module is designed to enhance the capacity of HBGNet to extract APs with various sizes and shapes. Regarding the second limitation, we construct the first large-scale very high-resolution (VHR) agricultural parcel dataset (FHAPD) across seven different areas, covering more than 10,000 km², using data from GaoFen-1 (2-meter) and GaoFen-2 (1-meter). Detailed experiments are conducted on the FHAPD, a public European dataset (i.e., AI4Boundaries), and medium-resolution Sentinel-2 images from the Netherlands, and HBGNet is compared with eight other AP delineation methods. Results show that HBGNet outperforms the other eight methods in attribute and geometry accuracy. The Intersection over Union (IoU), F1-score of the boundary (Fbdy), and global total-classification (GTC) exceed those of the other methods by 0.61%-7.52%, 0.8%-36.3%, and 1.7%-31.8%, respectively. It also effectively transfers to unseen regions. We conclude that the proposed HBGNet is an effective, advanced, and transferable method for diverse agricultural scenarios and remote sensing images.
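
The boundary branch built on a Laplace convolution operator can be pictured with the minimal sketch below: a fixed depthwise Laplacian filter highlights edge responses, and a small convolutional head turns them into a boundary map. The kernel choice and head design are assumptions, not HBGNet's exact module.

```python
# A minimal sketch (assumed form) of a Laplacian-based boundary branch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LaplaceBoundaryBranch(nn.Module):
    def __init__(self, in_channels: int):
        super().__init__()
        lap = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]])
        # One fixed Laplacian filter per input channel (depthwise, not learned).
        self.register_buffer("kernel", lap.expand(in_channels, 1, 3, 3).clone())
        self.in_channels = in_channels
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, 1, 1),   # 1-channel boundary probability map
        )

    def forward(self, feat):
        edges = F.conv2d(feat, self.kernel, padding=1, groups=self.in_channels)
        return torch.sigmoid(self.head(edges))

x = torch.randn(2, 16, 64, 64)
print(LaplaceBoundaryBranch(16)(x).shape)  # torch.Size([2, 1, 64, 64])
```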

Keyword :

Agricultural parcel delineation; Boundary-guided; FHAPD; Multitask neural networks; Very high-resolution remote sensing images

Cite:

GB/T 7714 Zhao, Hang , Wu, Bingfang , Zhang, Miao et al. A large-scale VHR parcel dataset and a novel hierarchical semantic boundary-guided network for agricultural parcel delineation [J]. | ISPRS JOURNAL OF PHOTOGRAMMETRY AND REMOTE SENSING , 2025 , 221 : 1-19 .
MLA Zhao, Hang et al. "A large-scale VHR parcel dataset and a novel hierarchical semantic boundary-guided network for agricultural parcel delineation" . | ISPRS JOURNAL OF PHOTOGRAMMETRY AND REMOTE SENSING 221 (2025) : 1-19 .
APA Zhao, Hang , Wu, Bingfang , Zhang, Miao , Long, Jiang , Tian, Fuyou , Xie, Yan et al. A large-scale VHR parcel dataset and a novel hierarchical semantic boundary-guided network for agricultural parcel delineation . | ISPRS JOURNAL OF PHOTOGRAMMETRY AND REMOTE SENSING , 2025 , 221 , 1-19 .

Global-Local Fusion With Semantic Information Guidance for Accurate Small Object Detection in UAV Aerial Images (EI, SCIE, Scopus)
Journal article | 2025, 63 | IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING

Abstract :

In recent years, the rapid development of unmanned aerial vehicle (UAV) technology has generated large numbers of UAV-captured aerial images. Consequently, object detection in UAV aerial images has emerged as a recent research focus. However, due to the flexible flight heights and diverse shooting angles of UAVs, two significant challenges arise in UAV aerial images: extreme variation in target scale and the presence of numerous small targets. To address these challenges, this article introduces a semantic information-guided fusion module specifically tailored for small targets. This module utilizes high-level semantic information to guide and align the underlying texture information, thereby enhancing the semantic representation of small targets at the feature level and subsequently improving the model's ability to detect them. In addition, this article introduces a novel global-local fusion detection strategy to strengthen the detection of small targets. We have redesigned the foreground region assembly method to address the drawbacks of previous methods that involved multiple inference passes. Extensive experiments conducted on the VisDrone and UAVDT datasets demonstrate that our two self-designed modules significantly enhance the detection capability for small targets compared with the YOLOX-M model. Our code is publicly available at: https://github.com/LearnYZZ/GLSDet.
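
A hedged sketch of semantics-guided fusion for small targets is given below: a deep, semantics-rich level is aligned and upsampled, gates the shallow texture level, and is then merged with it. Layer names, channel sizes, and the gating form are illustrative assumptions rather than the paper's module.

```python
# Assumed design for illustration, not the paper's semantic information-guided module.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticGuidedFusion(nn.Module):
    def __init__(self, shallow_ch: int, deep_ch: int):
        super().__init__()
        self.align = nn.Conv2d(deep_ch, shallow_ch, 1)        # match channel widths
        self.gate = nn.Sequential(                            # semantic gating weights
            nn.Conv2d(shallow_ch, shallow_ch, 3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, shallow, deep):
        deep = self.align(deep)
        deep = F.interpolate(deep, size=shallow.shape[-2:], mode="bilinear",
                             align_corners=False)
        return shallow * self.gate(deep) + deep               # guided texture + semantics

p3 = torch.randn(1, 128, 80, 80)   # shallow, texture-rich level
p5 = torch.randn(1, 512, 20, 20)   # deep, semantics-rich level
print(SemanticGuidedFusion(128, 512)(p3, p5).shape)  # torch.Size([1, 128, 80, 80])
```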

Keyword :

Accuracy; Assembly; Autonomous aerial vehicles; Decoupled head attention; Detectors; Feature extraction; feature fusion; Object detection; remote sensing image recognition; robust adversarial robustness; rotational object detection; Semantics; Superresolution; Technological innovation; Transformers

Cite:

GB/T 7714 Chen, Yaxiong , Ye, Zhengze , Sun, Haokai et al. Global-Local Fusion With Semantic Information Guidance for Accurate Small Object Detection in UAV Aerial Images [J]. | IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING , 2025 , 63 .
MLA Chen, Yaxiong et al. "Global-Local Fusion With Semantic Information Guidance for Accurate Small Object Detection in UAV Aerial Images" . | IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING 63 (2025) .
APA Chen, Yaxiong , Ye, Zhengze , Sun, Haokai , Gong, Tengfei , Xiong, Shengwu , Lu, Xiaoqiang . Global-Local Fusion With Semantic Information Guidance for Accurate Small Object Detection in UAV Aerial Images . | IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING , 2025 , 63 .

Multiple Region Proposal Experts Network for Wide-Scale Remote Sensing Object Detection (EI, SCIE, Scopus)
Journal article | 2025, 63 | IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING

Abstract :

Because objects in optical remote sensing images span a wide range of scales, current object detection models often cannot provide satisfactory detection capability for remote sensing tasks. To achieve better wide-scale coverage for various remote sensing regions of interest, this article introduces a multiprediction mechanism to build a novel region generation model, namely, the multiple region proposal experts network (MRPENet). Meanwhile, to achieve both region proposal coverage and receptive field coverage of wide-scale objects, we constructed a prior design of anchor (PDA) module and an adaptive features compensation (AFC) module. To better utilize the multiexpert characteristics of our model, we customized a new training sample allocation strategy, dynamic scale-assigned expert learning (DSAEL), to cultivate the ability of experts to deal with objects at various scales. To the best of our knowledge, this is the first time that a multiple region proposal network (RPN) mechanism has been used in object detection for optical remote sensing images. Extensive experiments have shown the generality and effectiveness of our MRPENet. Without bells and whistles, MRPENet achieves a new state-of-the-art (SOTA) on standard benchmarks, i.e., DOTA-v1.0 [82.02% mean average precision (mAP)], HRSC2016 (98.16% mAP), and FAIR1M-v1.0 (48.80% mAP).
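
To make the scale-assigned expert idea concrete, the toy snippet below routes ground-truth boxes to hypothetical proposal experts by object scale; the scale ranges and the routing rule are assumptions, not the DSAEL strategy itself.

```python
# Illustrative only (assumed scheme): routing ground-truth boxes to region-proposal
# "experts" by object scale, roughly in the spirit of scale-assigned expert learning.
import torch

# Hypothetical scale ranges (in pixels) handled by three proposal experts.
EXPERT_RANGES = [(0, 64), (64, 256), (256, float("inf"))]

def assign_experts(gt_boxes: torch.Tensor) -> torch.Tensor:
    """gt_boxes: (N, 4) as (x1, y1, x2, y2); returns an expert index per box."""
    sizes = torch.sqrt((gt_boxes[:, 2] - gt_boxes[:, 0]) *
                       (gt_boxes[:, 3] - gt_boxes[:, 1]))
    assignment = torch.zeros(len(gt_boxes), dtype=torch.long)
    for idx, (lo, hi) in enumerate(EXPERT_RANGES):
        assignment[(sizes >= lo) & (sizes < hi)] = idx
    return assignment

boxes = torch.tensor([[0., 0., 30., 30.], [0., 0., 100., 120.], [0., 0., 400., 500.]])
print(assign_experts(boxes))  # tensor([0, 1, 2])
```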

Keyword :

Adaptation models; Adaptive features compensation (AFC); Adaptive systems; Detectors; dynamic scale-assigned expert learning (DSAEL); Feature extraction; multi-prediction mechanism; object detection; Object detection; Optical imaging; Proposals; remote sensing; Remote sensing; Semantics; Training; wide-scale coverage

Cite:

GB/T 7714 Lin, Qifeng , Huang, Haibin , Zhu, Daoye et al. Multiple Region Proposal Experts Network for Wide-Scale Remote Sensing Object Detection [J]. | IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING , 2025 , 63 .
MLA Lin, Qifeng et al. "Multiple Region Proposal Experts Network for Wide-Scale Remote Sensing Object Detection" . | IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING 63 (2025) .
APA Lin, Qifeng , Huang, Haibin , Zhu, Daoye , Chen, Nuo , Fu, Gang , Yu, Yuanlong . Multiple Region Proposal Experts Network for Wide-Scale Remote Sensing Object Detection . | IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING , 2025 , 63 .

Contextual Distribution Alignment via Correlation Contrasting for Domain Generalization (EI, SCIE, Scopus)
Journal article | 2025, 35 (4), 3619-3632 | IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY

Abstract :

Domain generalization aims at learning a model with transferable knowledge from one or more source domain(s) in the presence of domain shift, enabling the model to achieve effective generalization for an unseen target domain. Most existing methods pursue domain-invariant representations of samples to address the challenges of heterogeneous distributions across domains. However, most such methods are limited to simple data manipulation at the instance level or to computing style statistics in feature space for distribution alignment. Such operations fail to effectively capture the contextual semantics across domains from both intra- and inter-domain views. In this paper, we propose contextual Distribution Alignment via a Contrastive Learning strategy with domain correlation, called DACL, which sufficiently exploits both intra- and inter-domain invariant representations for domain-generalized image classification. Specifically, a new Fourier-based augmentation method is developed to capture high-level semantic invariant features. Second, a domain-based feature fusion module is further proposed to increase the diversity of features, which mainly extracts both intra- and inter-domain prototypes via clustering to learn cross-domain representations. Finally, we propose a contrastive learning strategy that takes domain correlation into account, using spatial second-order statistics as a metric to measure the relevance between multiple source domains. Extensive experiments are conducted on two domain generalization tasks over six benchmarks, demonstrating that DACL achieves state-of-the-art performance against baseline models. A series of ablation studies and in-depth visualization analyses further verify the rationality and effectiveness of the proposed method.
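
Fourier-based augmentation is commonly realized by mixing amplitude spectra across images while keeping the phase of the original image; the sketch below shows that generic recipe. The exact augmentation used in DACL may differ.

```python
# A minimal sketch of Fourier-style amplitude mixing between two source-domain images
# (a common recipe; the paper's exact augmentation may differ).
import torch

def fourier_amplitude_mix(x: torch.Tensor, y: torch.Tensor, alpha: float = 0.5):
    """x, y: (C, H, W) images. Mixes y's amplitude spectrum into x, keeping x's phase."""
    fx = torch.fft.fft2(x)
    fy = torch.fft.fft2(y)
    amp_x, pha_x = fx.abs(), fx.angle()
    amp_y = fy.abs()
    amp_mixed = (1 - alpha) * amp_x + alpha * amp_y   # interpolate amplitude spectra
    mixed = torch.polar(amp_mixed, pha_x)             # recombine with x's phase (content)
    return torch.fft.ifft2(mixed).real

a, b = torch.rand(3, 64, 64), torch.rand(3, 64, 64)
print(fourier_amplitude_mix(a, b).shape)  # torch.Size([3, 64, 64])
```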

Keyword :

Circuits and systems; Computational modeling; contrastive learning; Contrastive learning; Correlation; Data augmentation; Domain generalization; domain-invariant representations; Feature extraction; feature fusion; Fourier-based augmentation; Measurement; Prototypes; Representation learning; Semantics

Cite:

GB/T 7714 Lin, Huibin , Zhang, Chun-Yang , Philip Chen, C. L. . Contextual Distribution Alignment via Correlation Contrasting for Domain Generalization [J]. | IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY , 2025 , 35 (4) : 3619-3632 .
MLA Lin, Huibin et al. "Contextual Distribution Alignment via Correlation Contrasting for Domain Generalization" . | IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY 35 . 4 (2025) : 3619-3632 .
APA Lin, Huibin , Zhang, Chun-Yang , Philip Chen, C. L. . Contextual Distribution Alignment via Correlation Contrasting for Domain Generalization . | IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY , 2025 , 35 (4) , 3619-3632 .

Cross-Modal Haptic Compression Inspired by Embodied AI for Haptic Communications (Scopus)
Journal article | 2025 | IEEE Transactions on Multimedia

Abstract :

Haptic data compression has gradually become a key issue for emerging real-time haptic communications in the Tactile Internet (TI). However, it is challenging to achieve a tradeoff between high perceptual quality and compression ratio in haptic data compression schemes. Inspired by the perspective of embodied AI, in this paper we propose a cross-modal haptic compression scheme for haptic communications to improve perceptual quality on TI devices. Since multimodal fusion is routinely employed to improve a system's cognitive ability, we assume that the haptic codec is guided by visual semantics to optimize parameter settings in the coding process. We first design a multi-dimensional tactile feature fusion network (MTFFN) relying on a multi-head attention mechanism. The MTFFN extracts multi-dimensional features from the material surface and maps them to infer the coding parameters. Secondly, we use second-order differences and linear interpolation to establish a criterion for determining the optimal codec parameters, which are customized to the material categories so as to give high robustness. Finally, the simulation results reveal that our compression scheme can efficiently personalize the codec procedure for different materials, obtaining more than a 17% improvement in compression ratio while maintaining high perceptual quality. © 2025 IEEE.
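
A rough sketch of a multi-head-attention fusion network that maps material features to codec parameters is shown below; the token layouts, dimensions, and the two predicted parameters are assumptions, not the MTFFN design.

```python
# Assumed sketch: haptic feature tokens attend to visual-semantic tokens of the material
# surface, and the fused representation is mapped to a few codec parameters.
import torch
import torch.nn as nn

class TactileFeatureFusion(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4, n_params: int = 2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, n_params))

    def forward(self, haptic_tokens, visual_tokens):
        # Haptic tokens query the visual-semantic tokens of the material surface.
        fused, _ = self.attn(haptic_tokens, visual_tokens, visual_tokens)
        return self.head(fused.mean(dim=1))   # e.g., hypothetical deadband/sampling params

h = torch.randn(1, 16, 128)   # haptic feature tokens
v = torch.randn(1, 49, 128)   # visual patch tokens
print(TactileFeatureFusion()(h, v).shape)  # torch.Size([1, 2])
```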

Keyword :

codec; cross-modal; haptic compression; multi-head attention; perception

Cite:

GB/T 7714 Lu, H. , Tan, X. , Chen, M. et al. Cross-Modal Haptic Compression Inspired by Embodied AI for Haptic Communications [J]. | IEEE Transactions on Multimedia , 2025 .
MLA Lu, H. et al. "Cross-Modal Haptic Compression Inspired by Embodied AI for Haptic Communications" . | IEEE Transactions on Multimedia (2025) .
APA Lu, H. , Tan, X. , Chen, M. , Zhang, Z. , Zhang, X. , Chen, J. et al. Cross-Modal Haptic Compression Inspired by Embodied AI for Haptic Communications . | IEEE Transactions on Multimedia , 2025 .

Multimodal Knowledge Graph Completion Model Based on Modal Hierarchical Fusion (EI, Scopus)
Conference paper | 2025, 2344 CCIS, 381-395 | 19th CCF Conference on Computer Supported Cooperative Work and Social Computing, ChineseCSCW 2024

Abstract :

Multimodal Knowledge Graph Completion (MMKGC) involves integrating information from various modalities, such as text and images, into traditional knowledge graphs to enhance their completeness and accuracy. This approach leverages the complementary nature of multimodal data to strengthen the expressive power of knowledge graphs, thereby achieving better performance in tasks like knowledge reasoning and information retrieval. However, knowledge graph completion models designed for the structural information of triples yield suboptimal performance when applied directly to the multimodal domain. In response to this challenge, this study introduces a novel model called the Multimodal Knowledge Graph Completion Model Based on Modal Hierarchical Fusion (MHF). The MHF model employs a phased fusion strategy that initially learns from the structured, visual, and textual modalities independently. Then, it combines structural embeddings with text and image data using a specially designed neural network fusion layer to model how the different modalities interact with each other. Additionally, the MHF model incorporates a semantic constraint layer with a Factor Interaction Regularizer, which enhances the model's generalization ability by exploiting the semantic equivalence between the head and tail entities of triples. Experimental results on three real-world multimodal benchmark datasets demonstrate that the MHF model achieves excellent performance in link prediction tasks, surpassing the current state-of-the-art baselines; the average performance gain in MRR, Hit@1, and Hit@10 is greater than 5.4%. © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025.
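
As a hypothetical sketch of phased fusion for multimodal KG completion, the snippet below fuses structural, visual, and textual entity embeddings and scores triples with a DistMult-style product; MHF's actual fusion layer and Factor Interaction Regularizer are more elaborate.

```python
# Hypothetical sketch (assumptions throughout): fuse per-modality entity embeddings,
# then score a triple with a DistMult-style product.
import torch
import torch.nn as nn

class ModalFusionScorer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def fuse_entity(self, struct_e, vis_e, txt_e):
        # Concatenate structural, visual, and textual embeddings, then project.
        return self.fuse(torch.cat([struct_e, vis_e, txt_e], dim=-1))

    def score(self, head, rel, tail):
        return (head * rel * tail).sum(dim=-1)   # DistMult-style plausibility score

dim = 64
m = ModalFusionScorer(dim)
h = m.fuse_entity(torch.randn(5, dim), torch.randn(5, dim), torch.randn(5, dim))
t = m.fuse_entity(torch.randn(5, dim), torch.randn(5, dim), torch.randn(5, dim))
print(m.score(h, torch.randn(5, dim), t).shape)  # torch.Size([5])
```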

Keyword :

Data fusion; Graph embeddings; Knowledge graph; Semantics

Cite:

GB/T 7714 Zhang, Sirui , Huang, Hao , Lin, Xinyang et al. Multimodal Knowledge Graph Completion Model Based on Modal Hierarchical Fusion [C] . 2025 : 381-395 .
MLA Zhang, Sirui et al. "Multimodal Knowledge Graph Completion Model Based on Modal Hierarchical Fusion" . (2025) : 381-395 .
APA Zhang, Sirui , Huang, Hao , Lin, Xinyang , Zheng, Cuichun , Zheng, Zhibo , Wang, Jingbin . Multimodal Knowledge Graph Completion Model Based on Modal Hierarchical Fusion . (2025) : 381-395 .

Heterogeneous Information Network Embedding Based on Adaptive Meta-Schema Considering Relation Distinction and Semantic Preservation (EI, Scopus)
Conference paper | 2025, 2343 CCIS, 47-63 | 19th CCF Conference on Computer Supported Cooperative Work and Social Computing, ChineseCSCW 2024

Abstract :

Many real-world networks can be treated as heterogeneous information networks (HINs) that consist of various types of nodes, like different proteins and molecules in biological networks and different authors and papers in citation networks. Multiple network data mining tasks can be conducted on HINs to capture the complex relationships between multi-type nodes. In recent years, random walk based HIN embedding has drawn increasing attention. Furthermore, the meta-path or meta-graph guided random walk is one of the most widely used techniques in HIN embedding methods. However, existing HIN embedding methods still face several difficulties. Firstly, the meta-paths or meta-graphs often need to be predefined, which relies heavily on domain knowledge and suffers from incomplete information coverage. Secondly, these methods treat all relations without distinction, which inevitably limits the capability of HIN embedding. Thirdly, they do not focus on preserving finer-grained meta-graph semantics. In this paper, a HIN embedding algorithm based on an adaptive meta-schema considering relation distinction and semantic preservation (HINEAS) is proposed. To avoid the manual selection of meta-paths or meta-graphs, an adaptive meta-schema extraction method is designed. In heterogeneous node sequence generation, a biased random walk strategy based on the adaptive meta-schema is presented to embed the influence of different relationships. Finally, an enhanced embedding strategy based on semantic preservation of the adaptive meta-schema is proposed to effectively extract topology and preserve the meta-graph's fine-grained semantics. Experiments on real-world datasets show that HINEAS significantly outperforms state-of-the-art methods. © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025.
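
A simplified, assumed sketch of a schema-biased random walk is given below: the next node is sampled only among neighbors whose type the (meta-)schema allows to follow the current node's type. The toy graph and schema table are purely illustrative; HINEAS derives its meta-schema adaptively.

```python
# Illustrative sketch (assumed, simplified) of a schema-biased random walk on an HIN.
import random

# Toy HIN: node -> neighbor list; node -> type.
neighbors = {"a1": ["p1", "p2"], "p1": ["a1", "v1"], "p2": ["a1", "v1"], "v1": ["p1", "p2"]}
node_type = {"a1": "author", "p1": "paper", "p2": "paper", "v1": "venue"}
# Which node types may follow which (derived adaptively in the paper; fixed here).
schema = {"author": {"paper"}, "paper": {"author", "venue"}, "venue": {"paper"}}

def biased_walk(start: str, length: int, seed: int = 0) -> list:
    random.seed(seed)
    walk = [start]
    for _ in range(length - 1):
        cur = walk[-1]
        allowed = [n for n in neighbors[cur] if node_type[n] in schema[node_type[cur]]]
        if not allowed:
            break
        walk.append(random.choice(allowed))
    return walk

print(biased_walk("a1", 6))  # e.g. ['a1', 'p2', 'v1', 'p1', 'a1', 'p1']
```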

Keyword :

Graph embeddings; Graph theory; Network embeddings

Cite:

GB/T 7714 Wu, Ling , Gao, Pingping , Lu, Jinlu et al. Heterogeneous Information Network Embedding Based on Adaptive Meta-Schema Considering Relation Distinction and Semantic Preservation [C] . 2025 : 47-63 .
MLA Wu, Ling et al. "Heterogeneous Information Network Embedding Based on Adaptive Meta-Schema Considering Relation Distinction and Semantic Preservation" . (2025) : 47-63 .
APA Wu, Ling , Gao, Pingping , Lu, Jinlu , Guo, Kun , Zhang, Qishan . Heterogeneous Information Network Embedding Based on Adaptive Meta-Schema Considering Relation Distinction and Semantic Preservation . (2025) : 47-63 .

Advances in remote sensing image-text cross-modal understanding (EI, Scopus)
Journal article | 2025, 29 (6), 1566-1586 | National Remote Sensing Bulletin

Abstract :

With the deep integration of remote sensing technology and artificial intelligence, the demand for refined applications of remote sensing data has grown steadily. However, single-modal data has limitations in the interpretation of complex scenes. Optical imagery, while rich in spatial information, suffers from weather dependencies; Synthetic Aperture Radar (SAR) data provides all-weather capability but lacks intuitive interpretability; and hyperspectral data, despite its detailed spectral signatures, presents challenges in data redundancy and computational complexity. Therefore, single-modal data alone cannot fully exploit the deeper information in remote sensing images. These limitations underscore the critical need for more advanced analytical approaches that can overcome the constraints of single-modal interpretation. For this reason, the collaborative analysis of multi-modal data has become a key way to enhance remote sensing interpretation capability and is driving the further development of the field. By integrating complementary data sources and leveraging their synergistic relationships, multi-modal approaches enable more robust and comprehensive scene interpretation. Among various multi-modal strategies, remote sensing image-text cross-modal understanding has gained particular prominence, as it establishes a vital connection between remote sensing image features and human semantic cognition. This framework enhances visual feature representations with the help of text semantic information and achieves cross-modal information complementarity, which significantly improves the performance of remote sensing interpretation. This paper provides a comprehensive examination of remote sensing image-text cross-modal understanding, categorized into four main types of tasks: remote sensing image captioning (image-to-text), text-to-image generation (text-to-image), remote sensing image-text alignment (image-text alignment), and remote sensing visual question answering (image-text dialogue). From the perspective of cross-modal transformation between images and text, remote sensing image-text cross-modal understanding includes image-to-text and text-to-image; in terms of local content interaction, it includes image-text alignment and image-text dialogue. The paper begins with a thorough review of the historical development and current state of image-text cross-modal research. Domestic and international research progress on the four tasks, namely image-to-text, text-to-image, image-text alignment, and image-text dialogue, is reviewed, and the key technological breakthroughs in image-text cross-modal research are introduced. On this basis, an in-depth analysis of the technical difficulties faced by the four tasks is conducted. The paper then presents a detailed analysis of commonly used public datasets and evaluation metrics for remote sensing image-text cross-modal understanding. Finally, it summarizes the technical challenges in this field, which mainly include three aspects: modal alignment of remote sensing images and text, cross-modal interpretability, and cross-modal reasoning. Based on these open problems, future research directions are outlined: (1) in-depth mining of cross-modal information; (2) construction of an earth science knowledge graph; (3) human-computer interaction; (4) large-scale remote sensing image-text models; (5) language diversity; and (6) remote sensing multi-source data. This comprehensive investigation not only synthesizes current research but also provides a clear roadmap for future research in remote sensing image-text cross-modal understanding, with potential implications for numerous practical applications in environmental monitoring, urban planning, and disaster management. © 2025 Science Press. All rights reserved.

Keyword :

Alignment; Artificial intelligence; Composite beams and girders; Data integration; Geometrical optics; Image analysis; Image enhancement; Image understanding; Modal analysis; Optical remote sensing; Question answering; Semantics; Text processing

Cite:

GB/T 7714 Zheng, Xiangtao , Zhao, Zhengying , Song, Baogui et al. Advances in remote sensing image-text cross-modal understanding [J]. | National Remote Sensing Bulletin , 2025 , 29 (6) : 1566-1586 .
MLA Zheng, Xiangtao et al. "Advances in remote sensing image-text cross-modal understanding" . | National Remote Sensing Bulletin 29 . 6 (2025) : 1566-1586 .
APA Zheng, Xiangtao , Zhao, Zhengying , Song, Baogui , Li, Hao , Lu, Xiaoqiang . Advances in remote sensing image-text cross-modal understanding . | National Remote Sensing Bulletin , 2025 , 29 (6) , 1566-1586 .
