Abstract:
Automatically evaluating the quality of image captions is challenging because human language is so flexible that the same meaning can be expressed in many different ways. Most current captioning metrics rely on token-level matching between the candidate caption and the ground-truth label sentences, and they usually neglect sentence-level information. Motivated by the auto-encoder mechanism and by advances in contrastive representation learning, we propose a learning-based metric, I2CE (Intrinsic Image Captioning Evaluation). To learn the evaluation metric, we develop three progressive model structures that capture sentence-level representations: a single-branch model, a dual-branch model, and a triple-branch model. To evaluate the proposed metric, we select one automatic captioning model, collect human scores on the quality of its generated captions, and introduce a statistical test on the correlation between human scores and metric scores. Our proposed metric I2CE achieves a Spearman correlation of 51.42, better than the 41.95 achieved by a recently proposed BERT-based metric and also better than the conventional rule-based metrics. Extensive results on the Composite-COCO dataset and on PASCAL-50S further validate the effectiveness of the proposed metric, which could serve as a novel indicator of the intrinsic information shared between captions, complementing the existing ones. (C) 2022 Published by Elsevier Inc.
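The paper's trained I2CE model is not part of this record, but the evaluation recipe the abstract describes (scoring captions by sentence-level similarity rather than token matching, then testing the Spearman correlation of metric scores against human scores) can be sketched. The snippet below is a minimal illustration, assuming a generic pretrained sentence encoder (sentence-transformers' all-MiniLM-L6-v2) as a stand-in for the paper's learned encoder; the captions and human scores are invented toy data, so the resulting correlation is not comparable to the 51.42 reported above.

import numpy as np
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

# Stand-in sentence encoder; the paper trains its own I2CE encoder instead.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def sentence_level_score(candidate: str, references: list[str]) -> float:
    """Score a candidate caption by cosine similarity to its closest
    reference, computed on whole-sentence embeddings (no token matching)."""
    embs = encoder.encode([candidate] + references)
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)  # unit-normalize
    return float(np.max(embs[1:] @ embs[0]))  # best match among references

# Toy data invented for illustration: generated captions, reference
# sentences, and human quality ratings for the same captions.
candidates = ["a dog runs on the beach",
              "a man riding a horse",
              "two cats sleep on a sofa"]
references = [["a dog is running along the seashore"],
              ["a person rides a brown horse"],
              ["a pair of cats nap on the couch"]]
human_scores = [4.5, 4.0, 3.5]

metric_scores = [sentence_level_score(c, refs)
                 for c, refs in zip(candidates, references)]
rho, p_value = spearmanr(metric_scores, human_scores)
print(f"Spearman correlation = {rho:.2f} (p = {p_value:.3f})")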
Source:
INFORMATION SCIENCES
ISSN: 0020-0255
Year: 2022
Volume: 609
Page: 913-930
Impact Factor: 8.1 (JCR@2022); 0.000 (JCR@2023)
ESI Discipline: COMPUTER SCIENCE
ESI HC Threshold: 61
JCR Journal Grade: 1
CAS Journal Grade: 1
Cited Count:
WoS CC Cited Count: 5
SCOPUS Cited Count: 7
ESI Highly Cited Papers on the List: 0