Watcharitpol Sermsrisuwan
2026
LAMAR-2 at MedGenVidQA 2026: Visual Answer Localization in Medical Videos via Multimodal LLM and Context-Augmented Prompting
Watcharitpol Sermsrisuwan | Nopporn Lekuthai | Seksan Yoadsanit | Titipat Achakulvisut
Proceedings of the BioNLP 2026 (Shared Tasks)
This paper presents an approach to localizing visual answers within continuous medical videos using a multi-step multimodal generation pipeline, evaluated on the MedGenVidQA dataset. We frame visual answer localization as a multimodal fusion problem: raw video, timestamped ASR transcripts, and VLM-generated scene descriptions are integrated into structured contextual blocks, enabling the model to cross-reference spoken commentary against observable physical events. We show that targeted guidance, which instructs the model to treat audio transcripts as supplementary hints and to ground its answer in observable visual movements, significantly outperforms baseline approaches. Our approach achieves state-of-the-art performance on the test leaderboard, with an mIoU of 79.55 and IoU@0.3, IoU@0.5, and IoU@0.7 scores of 93.75, 90.00, and 77.50, respectively. Our findings highlight the effectiveness of combining multimodal context fusion with targeted guidance to overcome text bias, establishing a promising approach for achieving the micro-level precision required in the medical domain. We release our code on GitHub at https://github.com/biodatlab/medgenvidqa-lamar.
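For readers unfamiliar with the reported metrics, the following is a minimal sketch of how temporal IoU, mIoU, and IoU@k are commonly computed for answer-span localization. It is not taken from the authors' released code: the function names `temporal_iou` and `localization_scores`, the `(start_sec, end_sec)` span representation, the `>=` threshold comparison, and the scaling to percentages are all assumptions made for illustration, and the official shared-task scorer may differ in these details.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical span representation: (start_sec, end_sec) within a video.
Span = Tuple[float, float]


def temporal_iou(pred: Span, gold: Span) -> float:
    """Intersection-over-union of two time intervals, in [0, 1]."""
    # Overlap is clipped at zero when the intervals are disjoint.
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def localization_scores(
    preds: List[Span],
    golds: List[Span],
    thresholds: Sequence[float] = (0.3, 0.5, 0.7),
) -> Dict[str, float]:
    """Return mIoU and IoU@k as percentages over paired predictions."""
    if not preds or len(preds) != len(golds):
        raise ValueError("preds and golds must be non-empty and equal in length")
    ious = [temporal_iou(p, g) for p, g in zip(preds, golds)]
    n = len(ious)
    scores = {"mIoU": 100.0 * sum(ious) / n}
    for t in thresholds:
        # Assumption: a prediction counts as a hit when IoU >= threshold.
        scores[f"IoU@{t}"] = 100.0 * sum(iou >= t for iou in ious) / n
    return scores
```

Under these assumptions, a predicted span of (10, 20) seconds against a gold span of (15, 25) seconds overlaps for 5 seconds out of a 15-second union, so it counts toward IoU@0.3 but not toward IoU@0.5 or IoU@0.7.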