zzunlp at ClinicalSkillQA: Perceive-and-Plan with Decomposed In-Context Learning and Saliency-Guided Perception for Clinical Skill Keyframe Reordering

Bin Huang, Yi Luo, Zhontian Hua, Guanghui Zhao, Kaixuan Yuan, Kunli Zhang


Abstract
Multimodal Large Language Models (MLLMs) show strong medical visual understanding; however, their capability for continuous perception in procedural clinical workflows remains underexplored. We present Perceive-and-Plan, a decomposed in-context learning paradigm for clinical skill keyframe reordering. The method separates visual perception from temporal planning via two stages: (1) structured visual perception with saliency-guided Picture-in-Picture (PiP) composition that magnifies critical regions (head, chest) as color-coded insets, and (2) temporal reasoning with chain-style self-verification via fresh conversation reset and visual-evidence anchoring (BLS Rules R1-R11). Without parameter updates, our system scores 71.43 overall (2nd place, ClinSkill QA 2026), with 0.86 pairwise accuracy and 1.0 rationale coverage. Structured prompting with visual saliency guidance measurably improves MLLMs' procedural understanding. Our code is published at https://github.com/NanceTide/clinskillqa-perceive-and-plan.
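The abstract reports a pairwise accuracy of 0.86 for the predicted keyframe order. As a rough illustration only, the sketch below computes one common form of pairwise ordering accuracy: the fraction of frame pairs whose relative order in the prediction matches the reference. The function name `pairwise_accuracy`, the frame labels, and this exact definition are assumptions for illustration; the shared task's official metric may be defined differently.

```python
from itertools import combinations


def pairwise_accuracy(predicted, reference):
    """Fraction of frame pairs ordered the same way in both sequences.

    Hypothetical helper: assumes both sequences contain the same unique
    frame labels. This is not taken from the shared task's scorer.
    """
    # Position of each frame in the predicted ordering.
    pos = {frame: i for i, frame in enumerate(predicted)}
    # combinations() yields pairs (a, b) with a before b in the reference.
    pairs = list(combinations(reference, 2))
    if not pairs:
        # Fewer than two frames: there is nothing to misorder.
        return 1.0
    correct = sum(1 for a, b in pairs if pos[a] < pos[b])
    return correct / len(pairs)


# Illustrative labels only: one adjacent swap among four frames.
reference = ["A", "B", "C", "D"]
predicted = ["A", "C", "B", "D"]
score = pairwise_accuracy(predicted, reference)
```

Under this definition, a single adjacent swap among four frames leaves five of the six pairs correctly ordered, so partial credit degrades gradually rather than dropping to zero as an exact-match score would.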
Anthology ID:
2026.bionlp-2.4
Volume:
Proceedings of the BioNLP 2026 (Shared Tasks)
Month:
July
Year:
2026
Address:
San Diego, California, USA
Editors:
Deepak Gupta, Dina Demner-Fushman
Venues:
BioNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
24–32
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.bionlp-2.4/
Cite (ACL):
Bin Huang, Yi Luo, Zhontian Hua, Guanghui Zhao, Kaixuan Yuan, and Kunli Zhang. 2026. zzunlp at ClinicalSkillQA: Perceive-and-Plan with Decomposed In-Context Learning and Saliency-Guided Perception for Clinical Skill Keyframe Reordering. In Proceedings of the BioNLP 2026 (Shared Tasks), pages 24–32, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal):
zzunlp at ClinicalSkillQA: Perceive-and-Plan with Decomposed In-Context Learning and Saliency-Guided Perception for Clinical Skill Keyframe Reordering (Huang et al., BioNLP 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.bionlp-2.4.pdf
Supplementary material:
 2026.bionlp-2.4.SupplementaryMaterial.zip