Xuan Zhao

2026

Large Audio-Language Models (LALMs) have demonstrated strong performance in spoken question answering (QA), with existing evaluations primarily focusing on answer accuracy and robustness to acoustic perturbations. However, such evaluations implicitly assume that spoken inputs remain semantically answerable, an assumption that often fails in real-world interaction when essential information is missing. In this work, we introduce a repair-aware evaluation setting that explicitly distinguishes between answerable and unanswerable audio inputs. We define answerability as a property of the input itself and construct paired evaluation conditions using a semantic-acoustic masking protocol. Based on this setting, we propose the Evaluability Awareness and Repair (EAR) score, a non-compensatory metric that jointly evaluates task competence under answerable conditions and repair behavior under unanswerable conditions. Experiments on two spoken QA benchmarks across diverse LALMs reveal a consistent gap between answer accuracy and conversational reliability: while many models perform well when inputs are answerable, most fail to recognize semantic unanswerability and initiate appropriate conversational repair. These findings expose a limitation of prevailing accuracy-centric evaluation practices and motivate reliability assessments that treat unanswerable inputs as cues for repair and continued interaction. The core code and dataset are publicly available at https://github.com/sheunghung/EAR.


EduMARS: Can Vision-Language Models Grade Like Teachers? Benchmarking Multimodal, Rubric-Based Assessment on Chinese K-12 Answers
Xuan Zhao | Jiashun Chen | Wanting Xu | Huiyuan Yan | Chaowei Fang | Xing Wei
Findings of the Association for Computational Linguistics: ACL 2026

Automated grading of student work is a critical application of AI in education. However, existing benchmarks fall short in evaluating models on realistic, cognitively demanding tasks. Most rely on synthetic, well-structured text inputs, overlooking the multimodal, error-prone, and often handwritten nature of real student responses, especially in K-12 settings. We introduce EduMARS, a multimodal benchmark designed for rubric-aligned evaluation of real Chinese K-12 student answers. The dataset contains over 4,500 authentic responses from high-stakes exams across eight subjects, featuring noisy handwriting, mixed-layout diagrams, mathematical expressions, and narrative reasoning. Each response is meticulously annotated by expert teachers using step-wise scoring rubrics, error classifications, and key-point mappings, providing fine-grained supervision aligned with real-world pedagogical practices. We evaluate existing SOTA MLLMs on both the final score and the reasoning process of grading, revealing a significant gap between these models and human-level performance. To bridge this gap, we propose Retrieval-Augmented Adaptive-Rubric Grading (RARG), which enables models to emulate expert grading logic by dynamically synthesizing case-specific evaluation schemas. RARG improves the performance and interpretability of various MLLMs on EduMARS, surpassing in-context learning and chain-of-thought prompting.
