Xiyang Huang

2026

This paper presents an overview of the ClinicalSkillQA 2026 shared task, which was organized with the BioNLP Workshop at ACL 2026. The goal of this shared task is to evaluate continuous perception and procedural reasoning in clinical skill assessment by requiring systems to reconstruct the correct temporal order of shuffled clinical key frames and generate rationales grounded in clinical workflow knowledge. The benchmark contains 200 test-only instances sampled from clinical skill videos, covering three emergency-care procedures. Each instance is annotated with the ground-truth temporal order and an expert-verified rationale. A total of seven teams participated in the task, collectively making 90 submissions, with four teams providing system description papers. Systems are evaluated using Task Accuracy, Pairwise Accuracy, and BERTScore, which measure exact sequence reconstruction, local temporal consistency, and rationale quality, respectively. In this paper, we describe the task setup, dataset construction, and evaluation criteria. We further summarize the methodologies adopted by participating teams and present a comprehensive analysis of the submitted systems. The official results suggest that current models still struggle with continuous perception and procedural reasoning, especially when they must integrate visual evidence, temporal structure, and clinical workflow knowledge.

2025

pdf bib abs

Automatic radiology report generation is pivotal in reducing the workload of radiologists, while simultaneously improving diagnostic accuracy and operational efficiency. Current methods face significant challenges, including the effective alignment of medical visual features with textual features and the mitigation of data bias. In this paper, we propose a method for radiology report generation that utilizes a Cross-modal Enhancement and Alignment Adapter (CmEAA) to connect a vision encoder with a frozen large language model. Specifically, we introduce two novel modules within CmEAA: Cross-modal Feature Enhancement (CFE) and Neural Mutual Information Aligner (NMIA). CFE extracts observation-related contextual features to enhance the visual features of lesions and abnormal regions in radiology images through a cross-modal enhancement transformer. NMIA maximizes neural mutual information between visual and textual representations within a low-dimensional alignment embedding space during training and provides potential global alignment visual representations during inference. Additionally, a weights generator is designed to enable the dynamic adaptation of cross-modal enhanced features and vanilla visual features. Experimental results on two prevailing datasets, namely, IU X-Ray and MIMIC-CXR, demonstrate that the proposed model outperforms previous state-of-the-art methods.

pdf bib abs

Document-level Relation Extraction (DocRE) aims to extract relations from documents. Compared with sentence-level relation extraction, it is necessary to extract long-distance dependencies. Existing methods enhance the output of trained DocRE models either by learning logical rules or by extracting rules from annotated data and then injecting them into the model. However, these approaches can result in suboptimal performance due to incorrect rule set constraints. To mitigate this issue, we propose Context-aware differentiable rule learning or CaDRL for short, a novel differentiable rule-based framework that learns the doc-specific logical rule to avoid generating suboptimal constraints. Specifically, we utilize Transformer-based relation attention to encode document and relation information, thereby learning the contextual information of the relation. We employ a sequence-generated differentiable rule decoder to generate relational probabilistic logic rules at each reasoning step. We also introduce a parameter sharing training mechanism in CaDRL to reconcile the DocRE model and the rule learning module. Extensive experimental results on three DocRE datasets demonstrate that CaDRL outperforms existing rule-based frameworks, significantly improving DocRE performance and making predictions more interpretable and logical.