@inproceedings{feiyu-etal-2024-bridging,
title = "Bridging the Gap between Authentic and Answer-Guided Images for {C}hinese Vision-Language Understanding Enhancement",
    author = "Wang, Feiyu and
      Guo, Wenyu and
      Yu, Dong and
      Kang, Chen and
      Liu, Pengyuan",
editor = "Lin, Hongfei and
Tan, Hongye and
Li, Bin",
booktitle = "Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 3: Evaluations)",
month = jul,
year = "2024",
address = "Taiyuan, China",
publisher = "Chinese Information Processing Society of China",
    url = "https://aclanthology.org/2024.ccl-3.40/",
pages = "353--362",
language = "eng",
    abstract = "The objective of the Chinese Vision-Language Understanding Evaluation (CVLUE) is to comprehensively assess the performance of Chinese vision-language multimodal pre-trained models in multimodal modeling and understanding across four tasks: Image-Text Retrieval, Visual Question Answering, Visual Grounding, and Visual Dialog. To enhance the models' performance across these multimodal tasks, this paper proposes a multimodal information understanding enhancement method based on answer-guided images. First, we propose task-specific methods for answer-guided image generation. Second, the authentic and answer-guided images are fed into the model separately for multimodal fine-tuning. Finally, training objectives are set for each task to minimize the gap between the answer-guided and authentic images, so that the answer-guided images supervise the results produced from the authentic images. The experimental results demonstrate the effectiveness of the proposed method."
}
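For readers skimming the abstract, a minimal sketch of the gap-minimizing objective it describes might look like the following. This is an illustration only, not the paper's actual formulation: the model interface, the KL-based gap loss, and the weighting factor `alpha` are all assumptions introduced here for clarity.

```python
# Hypothetical sketch of the gap-minimizing training step described in the
# abstract. The model signature, loss choices, and alpha are assumptions.
import torch
import torch.nn.functional as F

def consistency_step(model, authentic_img, guided_img, text, target, alpha=0.5):
    """One training step: standard task loss on authentic images, plus a loss
    that pulls authentic-image outputs toward answer-guided-image outputs."""
    logits_auth = model(authentic_img, text)        # authentic image branch
    with torch.no_grad():
        logits_guided = model(guided_img, text)     # answer-guided branch, used as supervision

    task_loss = F.cross_entropy(logits_auth, target)
    # KL divergence between the two output distributions stands in for
    # "minimizing the gap" between answer-guided and authentic images.
    gap_loss = F.kl_div(F.log_softmax(logits_auth, dim=-1),
                        F.softmax(logits_guided, dim=-1),
                        reduction="batchmean")
    return task_loss + alpha * gap_loss
```

In this reading, the answer-guided branch is held fixed (no gradient) so that it acts as a supervision signal for the authentic branch; the paper sets different training objectives per task, which this single-loss sketch does not reproduce.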