Abstract:
Cross-modal remote sensing image-audio (RSIA) retrieval aims to use audio or remote sensing images (RSIs) as queries to retrieve relevant RSIs or corresponding audios. Although many approaches leverage labeled samples to achieve good performance, obtaining those labels is expensive, because annotating cross-modal remote sensing (RS) samples usually requires substantial manual labor. Unsupervised cross-modal learning is therefore very important in real-world applications. In this article, we propose a novel unsupervised cross-modal RSIA retrieval approach, named self-supervision interactive alignment (SSIA), which takes advantage of large amounts of unlabeled samples to learn salient information, cross-modal alignment, and the similarity between RSIs and audios. Since self-supervised learning lacks label supervision, we use the similarity between the input RSI information and the input audio information as the supervision signal. To perform cross-modal alignment, a novel interactive alignment (IA) module is designed to explore the fine-grained correspondence between RSIs and audios. Moreover, we design an audio-guided image de-redundancy module that reduces redundant visual information and thereby captures the salient information of RSIs. Extensive experiments on four widely used RSIA datasets demonstrate that SSIA achieves better RSIA retrieval performance than the compared approaches.
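The retrieval task described in the abstract can be illustrated with a minimal sketch. This is not the SSIA method: it only assumes that images and audio clips have already been mapped into a shared embedding space, and it ranks gallery items by cosine similarity. The helper names (`cosine`, `rank`, `recall_at_k`), the recall@k metric, and the toy embedding values are all hypothetical and chosen for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank(query, gallery):
    """Return gallery indices sorted from most to least similar to the query."""
    scores = [cosine(query, g) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)

def recall_at_k(queries, gallery, k):
    """Fraction of queries whose paired item (same index) appears in the top k."""
    hits = sum(1 for i, q in enumerate(queries) if i in rank(q, gallery)[:k])
    return hits / len(queries)

# Toy, hand-made embeddings (hypothetical values, not from any RSIA dataset).
# Item i in image_emb is assumed to be paired with item i in audio_emb.
image_emb = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
audio_emb = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.3], [0.0, 0.1, 0.9]]

# Retrieval in both directions: audio query -> images, image query -> audios.
audio_to_image = recall_at_k(audio_emb, image_emb, k=1)
image_to_audio = recall_at_k(image_emb, audio_emb, k=1)
```

In this sketch the same similarity function serves both query directions, which is why a single shared embedding space is assumed.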
Source: IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING
ISSN: 0196-2892
Year: 2023
Volume: 61
Impact Factor: 7.5 (JCR@2023)
ESI Discipline: GEOSCIENCES
ESI HC Threshold: 26
JCR Journal Grade: 1
CAS Journal Grade: 1