Abstract
Large-scale vision-language pre-training has exhibited strong performance on various visual and textual understanding tasks. Recently, the textual encoders of multi-modal pre-trained models have been shown to generate high-quality textual representations, which often outperform purely text-based models such as BERT. In this study, we aim to utilize both the textual and visual encoders of multi-modal pre-trained models to enhance language understanding tasks. We achieve this by generating an image associated with a textual prompt, thereby enriching the representation of a phrase for downstream tasks. Experiments on four benchmark datasets demonstrate that our proposed method, which leverages visually-enhanced text representations, significantly improves performance on the entity clustering task.
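The abstract describes a pipeline that synthesizes an image from a textual prompt and fuses text and image features into one phrase representation. Below is a minimal sketch of that idea, assuming CLIP as the multi-modal encoder and Stable Diffusion as the text-to-image generator; the specific models, the "a photo of ..." prompt template, and the concatenation-based fusion are illustrative assumptions, not the paper's exact method.

```python
# Hedged sketch: visually-enhanced phrase representations.
# Assumes CLIP (ViT-B/32) as the multi-modal encoder and Stable Diffusion
# as the text-to-image generator; the paper's actual models, prompt
# templates, and fusion scheme may differ.
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Text-to-image generator (assumed choice for illustration).
generator = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5"
).to(device)

# Multi-modal encoder with separate text and vision towers.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def visually_enhanced_embedding(phrase: str) -> torch.Tensor:
    """Return a phrase representation fusing textual and visual features."""
    prompt = f"a photo of {phrase}"        # hypothetical prompt template
    image = generator(prompt).images[0]    # synthesize an image for the phrase

    inputs = processor(
        text=[prompt], images=image, return_tensors="pt", padding=True
    ).to(device)
    with torch.no_grad():
        text_feat = clip.get_text_features(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
        )
        image_feat = clip.get_image_features(pixel_values=inputs["pixel_values"])

    # Simple fusion by concatenation; downstream entity clustering could then
    # run k-means (or similar) over these enriched embeddings.
    return torch.cat([text_feat, image_feat], dim=-1).squeeze(0)
```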
- Anthology ID: 2023.findings-acl.363
- Volume: Findings of the Association for Computational Linguistics: ACL 2023
- Month: July
- Year: 2023
- Address: Toronto, Canada
- Editors: Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 5879–5888
- URL: https://aclanthology.org/2023.findings-acl.363
- DOI: 10.18653/v1/2023.findings-acl.363
- Cite (ACL): Tsu-Yuan Hsu, Chen-An Li, Chao-Wei Huang, and Yun-Nung Chen. 2023. Visually-Enhanced Phrase Understanding. In Findings of the Association for Computational Linguistics: ACL 2023, pages 5879–5888, Toronto, Canada. Association for Computational Linguistics.
- Cite (Informal): Visually-Enhanced Phrase Understanding (Hsu et al., Findings 2023)
- PDF: https://preview.aclanthology.org/dois-2013-emnlp/2023.findings-acl.363.pdf