Abstract
We propose an approach to referring expression generation (REG) in visually grounded dialogue that is meant to produce referring expressions (REs) that are both discriminative and discourse-appropriate. Our method constitutes a two-stage process. First, we model REG as a text- and image-conditioned next-token prediction task. REs are autoregressively generated based on their preceding linguistic context and a visual representation of the referent. Second, we propose the use of discourse-aware comprehension guiding as part of a generate-and-rerank strategy through which candidate REs generated with our REG model are reranked based on their discourse-dependent discriminatory power. Results from our human evaluation indicate that our proposed two-stage approach is effective in producing discriminative REs, with higher performance in terms of text-image retrieval accuracy for reranked REs compared to those generated using greedy decoding.
- Anthology ID:
- 2024.inlg-main.38
- Volume:
- Proceedings of the 17th International Natural Language Generation Conference
- Month:
- September
- Year:
- 2024
- Address:
- Tokyo, Japan
- Editors:
- Saad Mahamood, Nguyen Le Minh, Daphne Ippolito
- Venue:
- INLG
- SIG:
- SIGGEN
- Publisher:
- Association for Computational Linguistics
- Pages:
- 453–469
- URL:
- https://aclanthology.org/2024.inlg-main.38
- Cite (ACL):
- Bram Willemsen and Gabriel Skantze. 2024. Referring Expression Generation in Visually Grounded Dialogue with Discourse-aware Comprehension Guiding. In Proceedings of the 17th International Natural Language Generation Conference, pages 453–469, Tokyo, Japan. Association for Computational Linguistics.
- Cite (Informal):
- Referring Expression Generation in Visually Grounded Dialogue with Discourse-aware Comprehension Guiding (Willemsen & Skantze, INLG 2024)
- PDF:
- https://preview.aclanthology.org/dois-2013-emnlp/2024.inlg-main.38.pdf
- Code
- willemsenbram/reg-with-guiding
- Data
- A Game Of Sorts
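
The generate-and-rerank strategy described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy image attribute sets, and the word-overlap scorer are all hypothetical stand-ins for a learned comprehension model, and the scorer ignores dialogue history, which the paper's discourse-aware guiding takes into account. The sketch only shows the general shape of the idea: score each candidate RE by how strongly it singles out the target among the candidate images, then sort.

```python
import math
from typing import Dict, List, Sequence, Set


def target_probability(expression: str, target: str,
                       images: Dict[str, Set[str]]) -> float:
    """Toy comprehension score (assumption, not the paper's model).

    Counts word overlap between the expression and each image's attribute
    set, applies a softmax over the images, and returns the probability
    assigned to the target image.
    """
    words = set(expression.lower().split())
    overlaps = {name: len(words & attrs) for name, attrs in images.items()}
    total = sum(math.exp(v) for v in overlaps.values())
    return math.exp(overlaps[target]) / total


def rerank(candidates: Sequence[str], target: str,
           images: Dict[str, Set[str]]) -> List[str]:
    """Sort candidate REs by descending probability of retrieving the target."""
    return sorted(candidates,
                  key=lambda c: target_probability(c, target, images),
                  reverse=True)


# Hypothetical data: two images described by attribute sets, and three
# candidate REs that a generation model might have sampled for img_a.
images = {
    "img_a": {"dog", "brown", "grass"},
    "img_b": {"dog", "white", "couch"},
}
candidates = ["the dog", "the brown dog on the grass", "the white dog"]

ranked = rerank(candidates, "img_a", images)
print(ranked[0])
```

In this toy setting the most specific description of the target ranks first, while an expression that fits both images equally well ("the dog") receives a probability of 0.5 and falls behind it.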