Towards Multi-Modal Co-Reference Resolution in Conversational Shopping Agents

Samuel Osebe, Prashan Wanigasekara, Thomas Gueudre, Thanh Tran, Rahul Sharma, Fan Yang, Qian Hu, Weitong Ruan, Emre Barut, Chengwei Su


Abstract
The context of modern smart voice assistants is often multi-modal, with users consuming images, audio, and video content simultaneously. In such a setup, co-reference resolution is especially challenging, as references run across modalities and dialogue turns. We explore the problem of multi-modal co-reference resolution in multi-turn dialogues and quantify the performance of multi-modal LLMs on a specially curated dataset of long, image-interleaved conversations between a voice assistant and a human in a shopping use case. We propose a custom architecture for multi-modal embedding alignment using a novel parameter augmentation technique. Our proposed Parameter Augmented LLM approach shows a 4.9% absolute F1 improvement over a cross-attention baseline while reducing the number of trained parameters by 4x.
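The abstract contrasts a parameter-efficient alignment approach with a cross-attention baseline. As a rough illustration of the general idea of multi-modal embedding alignment with few trainable parameters (a generic sketch, not the paper's Parameter Augmented LLM; all module names, dimensions, and the frozen/trainable split below are assumptions), the following PyTorch snippet keeps the language model's text embedding frozen and trains only a small projection that maps image features into the model's input-embedding space so that image and text tokens can be interleaved across dialogue turns.

    # Illustrative sketch only: align image embeddings into a frozen language
    # model's token-embedding space via a small trainable projection.
    # Dimensions and names are placeholders, not the paper's implementation.
    import torch
    import torch.nn as nn

    TEXT_DIM = 768    # hidden size of the (frozen) language model -- assumed
    IMAGE_DIM = 512   # output size of the (frozen) image encoder -- assumed

    class ImageToTokenProjector(nn.Module):
        """Small trainable adapter mapping image features into the LLM's
        input-embedding space so image and text tokens can be interleaved."""
        def __init__(self, image_dim: int, text_dim: int):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(image_dim, text_dim),
                nn.GELU(),
                nn.Linear(text_dim, text_dim),
            )

        def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
            # image_feats: (batch, num_image_tokens, IMAGE_DIM)
            return self.proj(image_feats)

    # Toy stand-in for a frozen component (a real setup would load a pretrained LLM).
    frozen_text_embed = nn.Embedding(32_000, TEXT_DIM)
    for p in frozen_text_embed.parameters():
        p.requires_grad = False

    projector = ImageToTokenProjector(IMAGE_DIM, TEXT_DIM)  # only this is trained

    # Interleave projected image "tokens" with text-token embeddings for one turn.
    text_ids = torch.randint(0, 32_000, (1, 12))   # dialogue text tokens
    image_feats = torch.randn(1, 4, IMAGE_DIM)     # features for one image
    inputs = torch.cat(
        [frozen_text_embed(text_ids), projector(image_feats)], dim=1
    )  # (1, 16, TEXT_DIM) -- would be fed to the frozen LLM as input embeddings
    print(inputs.shape)

Because only the projector's parameters receive gradients, the trainable-parameter count stays small relative to a cross-attention module added throughout the model, which is the kind of trade-off the reported 4x reduction refers to.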
Anthology ID:
2024.ecnlp-1.2
Volume:
Proceedings of the Seventh Workshop on e-Commerce and NLP @ LREC-COLING 2024
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Shervin Malmasi, Besnik Fetahu, Nicola Ueffing, Oleg Rokhlenko, Eugene Agichtein, Ido Guy
Venues:
ECNLP | WS
Publisher:
ELRA and ICCL
Pages:
8–18
URL:
https://aclanthology.org/2024.ecnlp-1.2
Cite (ACL):
Samuel Osebe, Prashan Wanigasekara, Thomas Gueudre, Thanh Tran, Rahul Sharma, Fan Yang, Qian Hu, Weitong Ruan, Emre Barut, and Chengwei Su. 2024. Towards Multi-Modal Co-Reference Resolution in Conversational Shopping Agents. In Proceedings of the Seventh Workshop on e-Commerce and NLP @ LREC-COLING 2024, pages 8–18, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Towards Multi-Modal Co-Reference Resolution in Conversational Shopping Agents (Osebe et al., ECNLP-WS 2024)
PDF:
https://preview.aclanthology.org/landing_page/2024.ecnlp-1.2.pdf