Bridging the Gap between Recognition-level Pre-training and Commonsensical Vision-language Tasks

Yue Wan, Yueen Ma, Haoxuan You, Zhecan Wang, Shih-Fu Chang


Abstract
Large-scale visual-linguistic pre-training aims to capture generic representations from multimodal features, which are essential for downstream vision-language tasks. Existing methods mostly focus on learning the semantic connections between visual objects and linguistic content, which tend to be recognition-level information and may not be sufficient for commonsense reasoning tasks like VCR. In this paper, we propose a novel commonsensical vision-language pre-training framework to bridge the gap. We first augment conventional image-caption pre-training datasets with commonsense inferences generated by a visual-linguistic GPT-2. To pre-train models on images, captions, and commonsense inferences together, we propose two new tasks: masked commonsense modeling (MCM) and commonsense type prediction (CTP). To reduce the shortcut effect between captions and commonsense inferences, we further introduce domain-wise adaptive masking, which dynamically adjusts the masking ratio. Experimental results on the downstream tasks VCR and VQA show that our pre-training strategy improves over previous methods. Human evaluation also validates the relevance, informativeness, and diversity of the generated commonsense inferences. Overall, we demonstrate the potential of incorporating commonsense knowledge into conventional recognition-level visual-linguistic pre-training.
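For intuition, the following is a minimal Python sketch of what a domain-wise adaptive masking step could look like, assuming the two text domains (caption vs. commonsense inference) simply receive different masking probabilities. The function name, the ratios, and the boost term are hypothetical illustrations, not the paper's actual implementation.

import random

def domain_wise_adaptive_mask(caption_tokens, commonsense_tokens,
                              base_ratio=0.15, commonsense_boost=0.10,
                              mask_token="[MASK]"):
    """Mask the two text domains with different probabilities.

    Masking the commonsense segment more aggressively than the caption is
    one plausible way to discourage the model from shortcutting (copying)
    between the two text streams; the paper's actual schedule may differ.
    """
    def mask(tokens, ratio):
        # Each token is independently replaced by the mask token with
        # probability `ratio`.
        return [mask_token if random.random() < ratio else t for t in tokens]

    masked_caption = mask(caption_tokens, base_ratio)
    masked_inferences = mask(commonsense_tokens, base_ratio + commonsense_boost)
    return masked_caption, masked_inferences

# Example usage (toy whitespace tokenization):
caption = "a man riding a horse on the beach".split()
inference = "he might fall off if the horse gets startled".split()
print(domain_wise_adaptive_mask(caption, inference))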
Anthology ID:
2022.csrr-1.4
Volume:
Proceedings of the First Workshop on Commonsense Representation and Reasoning (CSRR 2022)
Month:
May
Year:
2022
Address:
Dublin, Ireland
Venue:
CSRR
Publisher:
Association for Computational Linguistics
Pages:
23–35
URL:
https://aclanthology.org/2022.csrr-1.4
DOI:
10.18653/v1/2022.csrr-1.4
Cite (ACL):
Yue Wan, Yueen Ma, Haoxuan You, Zhecan Wang, and Shih-Fu Chang. 2022. Bridging the Gap between Recognition-level Pre-training and Commonsensical Vision-language Tasks. In Proceedings of the First Workshop on Commonsense Representation and Reasoning (CSRR 2022), pages 23–35, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
Bridging the Gap between Recognition-level Pre-training and Commonsensical Vision-language Tasks (Wan et al., CSRR 2022)
PDF:
https://preview.aclanthology.org/paclic-22-ingestion/2022.csrr-1.4.pdf
Video:
https://preview.aclanthology.org/paclic-22-ingestion/2022.csrr-1.4.mp4
Data
COCO, Conceptual Captions, VCR, Visual Question Answering