LAMAR-2 at MedGenVidQA 2026: Visual Answer Localization in Medical Videos via Multimodal LLM and Context-Augmented Prompting
Watcharitpol Sermsrisuwan, Nopporn Lekuthai, Seksan Yoadsanit, Titipat Achakulvisut
Abstract
This paper presents an approach to localizing visual answers within continuous medical videos using a multi-step multimodal generation pipeline on the MedGenVidQA dataset. We frame visual answer localization as a multimodal fusion problem, integrating raw video, timestamped ASR transcripts, and VLM-generated scene descriptions into structured contextual blocks so that the model can cross-reference spoken commentary against observable physical events. We show that targeted guidance, which forces the model to treat audio transcripts as hints that supplement observable visual movements, significantly outperforms baseline approaches. Our approach achieves state-of-the-art performance on the test leaderboard, with an mIoU of 79.55 and IoU@0.3, IoU@0.5, and IoU@0.7 scores of 93.75, 90.00, and 77.50, respectively. Our findings highlight the effectiveness of combining multimodal context fusion with targeted guidance to overcome text bias, establishing a promising approach for achieving the micro-level precision required in the medical domain. We release our code on GitHub at https://github.com/biodatlab/medgenvidqa-lamar.
- Anthology ID:
- 2026.bionlp-2.31
- Volume:
- Proceedings of the BioNLP 2026 (Shared Tasks)
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, USA
- Editors:
- Deepak Gupta, Dina Demner-Fushman
- Venues:
- BioNLP | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 233–242
- URL:
- https://preview.aclanthology.org/ingest-acl-workshops/2026.bionlp-2.31/
- Cite (ACL):
- Watcharitpol Sermsrisuwan, Nopporn Lekuthai, Seksan Yoadsanit, and Titipat Achakulvisut. 2026. LAMAR-2 at MedGenVidQA 2026: Visual Answer Localization in Medical Videos via Multimodal LLM and Context-Augmented Prompting. In Proceedings of the BioNLP 2026 (Shared Tasks), pages 233–242, San Diego, California, USA. Association for Computational Linguistics.
- Cite (Informal):
- LAMAR-2 at MedGenVidQA 2026: Visual Answer Localization in Medical Videos via Multimodal LLM and Context-Augmented Prompting (Sermsrisuwan et al., BioNLP 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl-workshops/2026.bionlp-2.31.pdf
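
The abstract reports mIoU and IoU@0.3/0.5/0.7 for predicted answer segments. As a reading aid, the following is a minimal sketch of how such temporal-overlap metrics are commonly computed. It is not the shared task's official scorer: the function names `temporal_iou` and `summarize`, the `(start, end)` seconds representation, and the use of `>=` at each threshold are assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

# Assumed representation: a segment is (start_seconds, end_seconds), start <= end.
Span = Tuple[float, float]


def temporal_iou(pred: Span, gold: Span) -> float:
    """Intersection over union of two time intervals."""
    intersection = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - intersection
    return intersection / union if union > 0 else 0.0


def summarize(
    preds: List[Span],
    golds: List[Span],
    thresholds: Sequence[float] = (0.3, 0.5, 0.7),
) -> Dict[str, float]:
    """Return mIoU and IoU@t as percentages over paired predictions."""
    if not preds or len(preds) != len(golds):
        raise ValueError("preds and golds must be non-empty and equal length")
    ious = [temporal_iou(p, g) for p, g in zip(preds, golds)]
    n = len(ious)
    report = {"mIoU": 100.0 * sum(ious) / n}
    for t in thresholds:
        # Assumption: a prediction counts as a hit when its IoU is >= t.
        report[f"IoU@{t}"] = 100.0 * sum(iou >= t for iou in ious) / n
    return report
```

Under these assumptions, `summarize([(10, 20)], [(15, 25)])` scores a single prediction that overlaps its reference by 5 seconds out of a 15-second union, so it counts toward IoU@0.3 but not toward IoU@0.5 or IoU@0.7.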