Hilmi Demirhan

2026

UNCC at MedGenVidQA 2026: Structured Temporal Grounding for Medical Video Question Answering
Hilmi Demirhan | Wlodek Zadrozny
Proceedings of the BioNLP 2026 (Shared Tasks)

MedGenVidQA 2026 Task C evaluates visual answer localization in medical videos. The system receives a video and a question, then returns the start and end time of the visual answer. Our framework used timestamped automatic speech recognition (ASR) as a proposal source rather than as a final boundary label. The framework generated transcript tables, phase maps, lexical and dense candidate windows, schema-constrained ranking inputs, selective key-frame checks, and a deterministic validation pass for the final JSON file. The ranker selected among bounded candidate intervals instead of generating arbitrary timestamps over a full transcript. Each output can be traced to segment identifiers, candidate source families, selected anchors, phase labels, and validation flags. Our best run ranked fifth among six participant systems, with 62.50 IoU@0.3, 36.25 IoU@0.5, 22.50 IoU@0.7, and 42.57 mIoU. The threshold pattern suggests that coarse temporal retrieval was more reliable than strict start-end localization.
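
The reported metrics can be illustrated with a minimal sketch of temporal IoU scoring. This is not the official MedGenVidQA evaluation script: the function names `temporal_iou` and `score`, the `(start, end)` interval representation in seconds, and the use of `>=` at each threshold are assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical representation: (start_seconds, end_seconds).
Interval = Tuple[float, float]


def temporal_iou(pred: Interval, gold: Interval) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def score(
    preds: List[Interval],
    golds: List[Interval],
    thresholds: Sequence[float] = (0.3, 0.5, 0.7),
) -> Dict[str, float]:
    """Percentage of predictions at or above each IoU threshold, plus mean IoU.

    Assumption: a prediction counts at threshold t when IoU >= t.
    """
    ious = [temporal_iou(p, g) for p, g in zip(preds, golds)]
    n = len(ious)
    out = {f"IoU@{t}": 100.0 * sum(i >= t for i in ious) / n for t in thresholds}
    out["mIoU"] = 100.0 * sum(ious) / n
    return out
```

Under these assumptions, a prediction of (0, 10) against a reference of (5, 15) has an IoU of 1/3: it counts at the 0.3 threshold but not at 0.5 or 0.7. A prediction that finds the right region with loose boundaries therefore scores only at the lowest threshold, which is the kind of gap the abstract reports between IoU@0.3 and IoU@0.7.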

Co-authors

Wlodek Zadrozny 1

Venues

BioNLP 1
WS 1
