Shizhe Chen
2023
InfoMetIC: An Informative Metric for Reference-free Image Caption Evaluation
Anwen Hu
|
Shizhe Chen
|
Liang Zhang
|
Qin Jin
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Automatic image captioning evaluation is critical for benchmarking and promoting advances in image captioning research. Existing metrics only provide a single score to measure caption qualities, which are less explainable and informative. Instead, we humans can easily identify the problems of captions in details, e.g., which words are inaccurate and which salient objects are not described, and then rate the caption quality. To support such informative feedback, we propose an Informative Metric for Reference-free Image Caption evaluation (InfoMetIC). Given an image and a caption, InfoMetIC is able to report incorrect words and unmentioned image regions at fine-grained level, and also provide a text precision score, a vision recall score and an overall quality score at coarse-grained level. The coarse-grained score of InfoMetIC achieves significantly better correlation with human judgements than existing metrics on multiple benchmarks. We also construct a token-level evaluation dataset and demonstrate the effectiveness of InfoMetIC in fine-grained evaluation. Our code and datasets are publicly available at https://github.com/HAWLYQ/InfoMetIC.
2019
YouMakeup: A Large-Scale Domain-Specific Multimodal Dataset for Fine-Grained Semantic Comprehension
Weiying Wang
|
Yongcheng Wang
|
Shizhe Chen
|
Qin Jin
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
Multimodal semantic comprehension has attracted increasing research interests recently such as visual question answering and caption generation. However, due to the data limitation, fine-grained semantic comprehension has not been well investigated, which requires to capture semantic details of multimodal contents. In this work, we introduce “YouMakeup”, a large-scale multimodal instructional video dataset to support fine-grained semantic comprehension research in specific domain. YouMakeup contains 2,800 videos from YouTube, spanning more than 420 hours in total. Each video is annotated with a sequence of natural language descriptions for instructional steps, grounded in temporal video range and spatial facial areas. The annotated steps in a video involve subtle difference in actions, products and regions, which requires fine-grained understanding and reasoning both temporally and spatially. In order to evaluate models’ ability for fined-grained comprehension, we further propose two groups of tasks including generation tasks and visual question answering from different aspects. We also establish a baseline of step caption generation for future comparison. The dataset will be publicly available at https://github.com/AIM3-RUC/YouMakeup to support research investigation in fine-grained semantic comprehension.
Search