Alignment before Awareness: Towards Visual Question Localized-Answering in Robotic Surgery via Optimal Transport and Answer Semantics

Zhihong Zhu; Yunyan Zhang; Xuxin Cheng; Zhiqi Huang; Derong Xu; Xian Wu; Yefeng Zheng

Alignment before Awareness: Towards Visual Question Localized-Answering in Robotic Surgery via Optimal Transport and Answer Semantics

Zhihong Zhu, Yunyan Zhang, Xuxin Cheng, Zhiqi Huang, Derong Xu, Xian Wu, Yefeng Zheng

Abstract

The visual question localized-answering (VQLA) system has garnered increasing attention due to its potential as a knowledgeable assistant in surgical education. Apart from providing text-based answers, VQLA can also pinpoint the specific region of interest for better surgical scene understanding. Although recent Transformer-based models for VQLA have obtained promising results, they (1) conduct vanilla text-to-image cross attention, leading to unidirectional and coarse-grained alignment; (2) ignore exploiting the semantics of answers to further boost performance. In this paper, we propose a novel model termed OTAS, which first introduces optimal transport to achieve bidirectional and fine-grained alignment between images and questions, enabling more precise localization. Besides, OTAS incorporates a set of learnable candidate answer embeddings to query the probability of each answer class for a given image-question pair. Through Transformer attention, the candidate answer embeddings interact with the fused features of the image-question pair to make the answer decision. Extensive experiments on two widely-used benchmark datasets demonstrate the superiority of our model over state-of-the-art methods.

Anthology ID:: 2024.lrec-main.63
Volume:: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:: May
Year:: 2024
Address:: Torino, Italia
Editors:: Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:: LREC | COLING
SIG:
Publisher:: ELRA and ICCL
Note:
Pages:: 711–721
Language:
URL:: https://aclanthology.org/2024.lrec-main.63
DOI:
Bibkey:
Cite (ACL):: Zhihong Zhu, Yunyan Zhang, Xuxin Cheng, Zhiqi Huang, Derong Xu, Xian Wu, and Yefeng Zheng. 2024. Alignment before Awareness: Towards Visual Question Localized-Answering in Robotic Surgery via Optimal Transport and Answer Semantics. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 711–721, Torino, Italia. ELRA and ICCL.
Cite (Informal):: Alignment before Awareness: Towards Visual Question Localized-Answering in Robotic Surgery via Optimal Transport and Answer Semantics (Zhu et al., LREC-COLING 2024)
Copy Citation:
PDF:: https://preview.aclanthology.org/nschneid-patch-3/2024.lrec-main.63.pdf

PDF Search