VICTR: Visual Information Captured Text Representation for Text-to-Vision Multimodal Tasks

Caren Han, Siqu Long, Siwen Luo, Kunze Wang, Josiah Poon


Abstract
Text-to-image multimodal tasks, which generate or retrieve an image from a given text description, are extremely challenging because raw text descriptions carry too little information to fully describe visually realistic images. We propose a new visual contextual text representation for text-to-image multimodal tasks, VICTR, which captures rich visual semantic information about objects from the text input. First, we take the text description as the initial input and apply dependency parsing to extract its syntactic structure and analyse its semantic aspects, including object quantities, in order to extract a scene graph. Then, we train Graph Convolutional Networks on the objects, attributes, and relations extracted in the scene graph, together with the corresponding geometric relation information, to generate a text representation that integrates textual and visual semantic information. This text representation is aggregated with word-level and sentence-level embeddings to generate both visual contextual word and sentence representations. For evaluation, we attached VICTR to state-of-the-art models in text-to-image generation. VICTR is easily added to existing models and improves them in both quantitative and qualitative aspects.
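The pipeline described in the abstract (scene graph, graph convolution, aggregation with word-level and sentence-level embeddings) can be illustrated with a toy sketch. This is not the VICTR implementation: the hand-written scene graph stands in for the output of dependency parsing, the single graph-convolution layer uses the commonly used symmetric normalisation, and the aggregation by concatenation and mean pooling is an assumption, since the abstract does not say how the representations are combined. All function names, weights, and embeddings below are hypothetical, and geometric relation information is left out.

```python
# Toy sketch only. Every name and number here is hypothetical, not VICTR code.

def build_graph(pairs):
    """Map labelled node pairs to a sorted node list and index-based edges."""
    nodes = sorted({n for pair in pairs for n in pair})
    index = {n: i for i, n in enumerate(nodes)}
    return nodes, [(index[a], index[b]) for a, b in pairs]

def normalized_adjacency(n, edges):
    """Return D^-1/2 (A + I) D^-1/2 for an undirected graph with self-loops."""
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / (deg[i] ** 0.5 * deg[j] ** 0.5) for j in range(n)]
            for i in range(n)]

def matmul(x, y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)]
            for row in x]

def gcn_layer(a_hat, h, w):
    """One graph-convolution layer: ReLU(A_hat @ H @ W)."""
    z = matmul(matmul(a_hat, h), w)
    return [[max(0.0, v) for v in row] for row in z]

# Hand-written scene graph for the caption "a brown dog on grass".
# In the paper this structure comes from dependency parsing of the text.
nodes, edges = build_graph([("dog", "brown"), ("dog", "on"), ("on", "grass")])
n = len(nodes)

a_hat = normalized_adjacency(n, edges)
h = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # one-hot inputs
w = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2], [0.3, 0.1]]  # fixed toy weights
graph_emb = gcn_layer(a_hat, h, w)

# Assumed aggregation: concatenate each node's graph embedding with a toy
# word embedding, then mean-pool the results into a sentence-level vector.
word_emb = {"brown": [0.1, 0.9], "dog": [0.8, 0.2],
            "grass": [0.3, 0.7], "on": [0.5, 0.5]}
visual_word = {name: word_emb[name] + graph_emb[i] for i, name in enumerate(nodes)}
visual_sentence = [sum(col) / n for col in zip(*visual_word.values())]
```

Each entry of `visual_word` is a 4-dimensional vector (two word-embedding dimensions followed by two graph dimensions), and `visual_sentence` is their element-wise mean. In a trained model the weights `w` and the input features would be learned rather than fixed.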
Anthology ID:
2020.coling-main.277
Volume:
Proceedings of the 28th International Conference on Computational Linguistics
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Editors:
Donia Scott, Nuria Bel, Chengqing Zong
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
3107–3117
URL:
https://aclanthology.org/2020.coling-main.277
DOI:
10.18653/v1/2020.coling-main.277
Cite (ACL):
Caren Han, Siqu Long, Siwen Luo, Kunze Wang, and Josiah Poon. 2020. VICTR: Visual Information Captured Text Representation for Text-to-Vision Multimodal Tasks. In Proceedings of the 28th International Conference on Computational Linguistics, pages 3107–3117, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Cite (Informal):
VICTR: Visual Information Captured Text Representation for Text-to-Vision Multimodal Tasks (Han et al., COLING 2020)
PDF:
https://preview.aclanthology.org/nschneid-patch-4/2020.coling-main.277.pdf
Code:
usydnlp/VICTR
Data:
MS COCO