Mingyu Bai


2026

This paper presents our approach to SemEval 2026 Task 5, which rates the plausibility of word senses in ambiguous stories through narrative understanding. The task requires a model to judge how well a given word-sense definition matches the meaning of an ambiguous target word in a short narrative context, and to predict a plausibility score on a 1-5 scale. We implemented and compared several methods: multi-head ensembles that simulate the behavior of individual annotators, ordinal classification and ordinal regression methods that treat scores as ordered categories, and direct regression with mean squared error (MSE) or L1 loss to predict the average human (consensus) score. We also investigated instruction fine-tuning with low-rank adaptation (LoRA) on large language models (LLMs) such as Qwen3-4B-Instruct and Phi-4-mini. In our experiments, direct MSE regression performed best. This result indicates that directly optimizing toward the human consensus score is effective for this task, whereas methods that model differences between individual annotators are less effective.
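As a minimal illustration of the direct regression objective, the sketch below averages individual annotator ratings into a consensus target and compares the MSE and L1 losses of a prediction against it. The function names, ratings, and predictions are hypothetical values chosen for illustration and are not taken from the task data or from our system.

```python
from statistics import mean


def consensus_score(annotator_ratings):
    """Average individual 1-5 ratings into a single regression target."""
    return mean(annotator_ratings)


def mse_loss(predictions, targets):
    """Mean squared error between predicted and consensus scores."""
    return mean((p - t) ** 2 for p, t in zip(predictions, targets))


def l1_loss(predictions, targets):
    """Mean absolute error between predicted and consensus scores."""
    return mean(abs(p - t) for p, t in zip(predictions, targets))


# Hypothetical ratings from five annotators for two (story, sense) pairs.
ratings = [[4, 5, 4, 3, 4], [1, 2, 1, 1, 5]]
targets = [consensus_score(r) for r in ratings]

# Hypothetical model outputs for the same two pairs.
predictions = [3.5, 3.0]

print("targets:", targets)
print("MSE:", mse_loss(predictions, targets))
print("L1:", l1_loss(predictions, targets))
```

MSE penalizes the larger error on the second pair more heavily than L1 does, which is one way the two objectives can lead to different trained models.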