RoDEval: A Robust Word Sense Disambiguation Evaluation Framework for Large Language Models

Luyang Zhang; Shuaimin Li; Yishuo Li; Kunpeng Kang; Kaiyuan Zhang; Cong Wang (王聪); Wenpeng Lu

Abstract

Accurately evaluating the word sense disambiguation (WSD) capabilities of large language models (LLMs) remains challenging, as existing studies primarily rely on single-task evaluations and classification-based metrics that overlook the fundamental differences between generative LLMs and traditional classification models. To bridge this gap, we propose RoDEval, the first comprehensive evaluation framework specifically tailored for assessing LLM-based WSD methods. RoDEval introduces four novel metrics: Disambiguation Scope, Disambiguation Robustness, Disambiguation Reliability, and Definition Generation Quality Score, enabling a multifaceted evaluation of LLMs’ WSD capabilities. Experimental results using RoDEval across five mainstream LLMs uncover significant limitations in their WSD performance. Specifically, incorrect definition selections in multiple-choice WSD tasks stem not from simply neglecting or forgetting the correct options, but rather from incomplete acquisition of all the senses of polysemous words. Meanwhile, disambiguation reliability is often compromised by the models’ persistent overconfidence. In addition, inherent biases continue to affect performance, and scaling up model parameters alone fails to meaningfully enhance their ability to generate accurate sense definitions. These findings provide actionable insights for enhancing LLMs’ WSD capabilities. The source code and evaluation scripts are open-sourced at https://github.com/DayDream405/RoDEval.

Anthology ID: 2025.emnlp-main.864
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 17095–17126
URL: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.864/
Cite (ACL): Luyang Zhang, Shuaimin Li, Yishuo Li, Kunpeng Kang, Kaiyuan Zhang, Cong Wang, and Wenpeng Lu. 2025. RoDEval: A Robust Word Sense Disambiguation Evaluation Framework for Large Language Models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 17095–17126, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): RoDEval: A Robust Word Sense Disambiguation Evaluation Framework for Large Language Models (Zhang et al., EMNLP 2025)
PDF: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.864.pdf
Checklist: 2025.emnlp-main.864.checklist.pdf
