Multi-Level Information Retrieval Augmented Generation for Knowledge-based Visual Question Answering
Abstract
The Knowledge-Aware Visual Question Answering about Entity task aims to disambiguate entities using textual and visual information, as well as knowledge. It usually relies on two independent steps, information retrieval then reading comprehension, that do not benefit each other. Retrieval Augmented Generation (RAG) offers a solution by using generated answers as feedback for retrieval training. RAG usually relies solely on pseudo-relevant passages retrieved from external knowledge bases which can lead to ineffective answer generation. In this work, we propose a multi-level information RAG approach that enhances answer generation through entity retrieval and query expansion. We formulate a joint-training RAG loss such that answer generation is conditioned on both entity and passage retrievals. We show through experiments new state-of-the-art performance on the VIQuAE KB-VQA benchmark and demonstrate that our approach can help retrieve more actual relevant knowledge to generate accurate answers.
- Anthology ID:
- 2024.emnlp-main.922
- Volume:
- Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
- Month:
- November
- Year:
- 2024
- Address:
- Miami, Florida, USA
- Editors:
- Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 16499–16513
- URL:
- https://aclanthology.org/2024.emnlp-main.922
- DOI:
- 10.18653/v1/2024.emnlp-main.922
- Cite (ACL):
- Omar Adjali, Olivier Ferret, Sahar Ghannay, and Hervé Le Borgne. 2024. Multi-Level Information Retrieval Augmented Generation for Knowledge-based Visual Question Answering. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 16499–16513, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal):
- Multi-Level Information Retrieval Augmented Generation for Knowledge-based Visual Question Answering (Adjali et al., EMNLP 2024)
- PDF:
- https://preview.aclanthology.org/dois-2013-emnlp/2024.emnlp-main.922.pdf