
Author:

Zheng, Xiangtao [1] | Zhao, Zhengying [2] | Song, Baogui [3] | Li, Hao [4] | Lu, Xiaoqiang [5]

Indexed by:

EI; Scopus

Abstract:

With the deep integration of remote sensing technology and artificial intelligence, the demand for refined applications of remote sensing data has grown steadily. However, single-modal data has limitations in the interpretation of complex scenes. Optical imagery, while rich in spatial information, suffers from weather dependence; Synthetic Aperture Radar (SAR) data provides all-weather capability but lacks intuitive interpretability; and hyperspectral data, despite its detailed spectral signatures, presents challenges in data redundancy and computational complexity. Single-modal data is therefore unable to fully exploit the deeper information in remote sensing images. These limitations underscore the critical need for more advanced analytical approaches that can overcome the constraints of single-modal interpretation. For this reason, the collaborative analysis of multi-modal data has become a key way to enhance remote sensing interpretation capability and is driving further development of the remote sensing field. By integrating complementary data sources and leveraging their synergistic relationships, multi-modal approaches enable more robust and comprehensive scene interpretation.

Among various multi-modal strategies, remote sensing image-text cross-modal understanding has gained particular prominence because it establishes a vital connection between remote sensing image features and human semantic cognition. This framework enhances visual feature representations with the help of textual semantic information and achieves cross-modal information complementarity, which significantly improves the performance of remote sensing interpretation. This paper provides a comprehensive examination of remote sensing image-text cross-modal understanding, categorized into four main tasks: remote sensing image captioning (referred to as image-to-text), text-to-image generation (referred to as text-to-image), remote sensing image-text alignment (referred to as image-text alignment), and remote sensing visual question answering (referred to as image-text dialogue). From the perspective of cross-modal transformation between images and text, the field covers image-to-text and text-to-image; in terms of local content interaction, it covers image-text alignment and image-text dialogue.

This paper begins with a thorough review of the historical development and current state of image-text cross-modal research. Research progress in China and abroad on the four tasks (image-to-text, text-to-image, image-text alignment, and image-text dialogue) is reviewed, and the key technological breakthroughs in each are introduced. On this basis, an in-depth analysis is conducted of the technical difficulties faced by the four tasks. The paper then presents a detailed analysis of commonly used public datasets and evaluation metrics for remote sensing image-text cross-modal understanding. Finally, it summarizes the technical challenges in this field, which mainly involve three aspects: modal alignment of remote sensing images and text, cross-modal interpretability, and cross-modal reasoning. Based on these open problems, future research directions are outlined: (1) in-depth mining of cross-modal information; (2) construction of earth science knowledge graphs; (3) human-computer interaction; (4) large-scale remote sensing image-text models; (5) language diversity; and (6) remote sensing multi-source data. This comprehensive investigation not only synthesizes current research but also provides a clear roadmap for future work in remote sensing image-text cross-modal understanding, with potential implications for practical applications in environmental monitoring, urban planning, and disaster management. © 2025 Science Press. All rights reserved.
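Of the four tasks surveyed above, image-text alignment is the easiest to make concrete in code. The following is a minimal sketch, assuming a CLIP-style contrastive objective over paired image and text embeddings; the function names, embedding dimension, and temperature value are illustrative assumptions and do not come from the paper.

    # Minimal sketch of contrastive image-text alignment (CLIP-style).
    # All names and numbers here are illustrative, not the paper's method.
    import numpy as np

    def l2_normalize(x, axis=-1):
        # Unit-normalize embeddings so dot products become cosine similarities.
        return x / np.linalg.norm(x, axis=axis, keepdims=True)

    def contrastive_alignment_loss(img_emb, txt_emb, temperature=0.07):
        """Symmetric InfoNCE loss over a batch of paired embeddings.

        img_emb, txt_emb: (batch, dim) arrays; row i of each is a matched pair.
        """
        img = l2_normalize(img_emb)
        txt = l2_normalize(txt_emb)
        logits = img @ txt.T / temperature       # (batch, batch) similarity matrix
        labels = np.arange(len(logits))          # matched pairs lie on the diagonal

        def cross_entropy(l, y):
            l = l - l.max(axis=1, keepdims=True)  # numerical stability
            log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
            return -log_probs[np.arange(len(y)), y].mean()

        # Score both retrieval directions: image-to-text and text-to-image.
        return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

    # Toy usage: 4 remote sensing images paired with 4 captions, 512-d embeddings.
    rng = np.random.default_rng(0)
    img_emb = rng.normal(size=(4, 512))
    txt_emb = img_emb + 0.1 * rng.normal(size=(4, 512))  # captions near their images
    print(contrastive_alignment_loss(img_emb, txt_emb))

The symmetric loss pulls each matched image-caption pair together and pushes mismatched pairs apart, which is one common way textual semantics are used to shape visual feature representations in alignment-style models.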

Keyword:

Alignment; Artificial intelligence; Composite beams and girders; Data integration; Geometrical optics; Image analysis; Image enhancement; Image understanding; Modal analysis; Optical remote sensing; Question answering; Semantics; Text processing

Affiliations:

  • [ 1 ] [Zheng, Xiangtao]College of Physics and Information Engineering, Fuzhou University, Fuzhou; 350108, China
  • [ 2 ] [Zhao, Zhengying]School of Computer Science and Engineering, Xi’an University of Technology, Xi’an; 710048, China
  • [ 3 ] [Zhao, Zhengying]College of Software, Pingdingshan University, Pingdingshan; 467000, China
  • [ 4 ] [Song, Baogui]College of Physics and Information Engineering, Fuzhou University, Fuzhou; 350108, China
  • [ 5 ] [Li, Hao]Department of Intelligence, Air Force Early Warning Academy, Wuhan; 430019, China
  • [ 6 ] [Lu, Xiaoqiang]College of Physics and Information Engineering, Fuzhou University, Fuzhou; 350108, China


Source:

National Remote Sensing Bulletin

ISSN: 1007-4619

Year: 2025

Issue: 6

Volume: 29

Page: 1566-1586

