Watcharitpol Sermsrisuwan

2026

LAMAR-2 at MedGenVidQA 2026: Visual Answer Localization in Medical Videos via Multimodal LLM and Context-Augmented Prompting
Watcharitpol Sermsrisuwan | Nopporn Lekuthai | Seksan Yoadsanit | Titipat Achakulvisut
Proceedings of the BioNLP 2026 (Shared Tasks)

This paper presents an approach to localizing visual answers within continuous medical videos using a multi-step multimodal generation pipeline on the MedGenVidQA dataset. We frame visual answer localization as a multimodal fusion problem, integrating raw video, timestamped ASR transcripts, and VLM-generated scene descriptions into structured contextual blocks so that the model can cross-reference spoken commentary against observable physical events. We show that targeted guidance, which forces the model to treat audio transcripts as supplementary hints to observable visual movements, significantly outperforms baseline approaches. Our approach achieves state-of-the-art performance on the test leaderboard, with an mIoU of 79.55 and IoU@0.3, IoU@0.5, and IoU@0.7 scores of 93.75, 90.00, and 77.50, respectively. Our findings highlight the effectiveness of combining multimodal context fusion with targeted guidance to overcome text bias, establishing a promising approach for achieving the micro-level precision required in the medical domain. We release our code on GitHub at https://github.com/biodatlab/medgenvidqa-lamar.
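For readers unfamiliar with the reported metrics, the following is a minimal sketch of how temporal IoU scores of this kind are commonly computed for a predicted versus a reference time span. It is not taken from the authors' code or from the shared task's official scorer: the function names `temporal_iou` and `summarize`, the representation of spans as `(start, end)` pairs in seconds, and the use of `>=` at each threshold are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec); an assumed representation


def temporal_iou(pred: Span, gold: Span) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def summarize(
    preds: List[Span],
    golds: List[Span],
    thresholds: Sequence[float] = (0.3, 0.5, 0.7),
) -> Dict[str, float]:
    """Mean IoU and the share of predictions at or above each threshold,
    both expressed as percentages. Whether the official scorer uses
    `>=` or `>` is an assumption here."""
    if not preds or len(preds) != len(golds):
        raise ValueError("preds and golds must be non-empty and equal length")
    ious = [temporal_iou(p, g) for p, g in zip(preds, golds)]
    n = len(ious)
    report = {"mIoU": 100.0 * sum(ious) / n}
    for t in thresholds:
        report[f"IoU@{t}"] = 100.0 * sum(i >= t for i in ious) / n
    return report


if __name__ == "__main__":
    # Toy spans, not data from the paper.
    print(summarize([(10.0, 20.0), (0.0, 5.0)], [(15.0, 25.0), (0.0, 5.0)]))
```

Under this reading, IoU@0.7 is the percentage of questions whose predicted span overlaps the reference span with an IoU of at least 0.7, which is why it is the strictest of the three threshold scores.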

2025

LAMAR at ArchEHR-QA 2025: Clinically Aligned LLM-Generated Few-Shot Learning for EHR-Grounded Patient Question Answering
Seksan Yoadsanit | Nopporn Lekuthai | Watcharitpol Sermsrisuwan | Titipat Achakulvisut
Proceedings of the 24th Workshop on Biomedical Language Processing (Shared Tasks)

Venues

BioNLP (2)
WS (2)
