Empirical Study on Data Attributes Insufficiency of Evaluation Benchmarks for LLMs
Chuang Liu, Renren Jin, Zheng Yao, Tianyi Li, Liang Cheng, Mark Steedman, Deyi Xiong
Abstract
Previous benchmarks for evaluating large language models (LLMs) have primarily emphasized quantitative metrics, such as data volume. However, this focus may neglect key qualitative data attributes that can significantly affect the final rankings of LLMs, resulting in unreliable leaderboards. In this paper, we investigate whether current LLM benchmarks adequately account for these data attributes. We specifically examine three attributes: diversity, redundancy, and difficulty. To explore them, we propose a framework with three separate modules, each designed to assess one of the attributes. Using a method that progressively incorporates these attributes, we analyze their influence on the benchmark. Our experimental results reveal a meaningful correlation between LLM rankings on the revised benchmark and the original benchmark when these attributes are accounted for. These findings indicate that existing benchmarks often fail to meet all three criteria, highlighting a lack of consideration for multifaceted data attributes in current evaluation datasets.
- Anthology ID: 2025.coling-main.403
- Volume: Proceedings of the 31st International Conference on Computational Linguistics
- Month: January
- Year: 2025
- Address: Abu Dhabi, UAE
- Editors: Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
- Venue: COLING
- Publisher: Association for Computational Linguistics
- Pages: 6024–6038
- URL: https://preview.aclanthology.org/Add-Cong-Liu-Florida-Atlantic-University-author-id/2025.coling-main.403/
- Cite (ACL): Chuang Liu, Renren Jin, Zheng Yao, Tianyi Li, Liang Cheng, Mark Steedman, and Deyi Xiong. 2025. Empirical Study on Data Attributes Insufficiency of Evaluation Benchmarks for LLMs. In Proceedings of the 31st International Conference on Computational Linguistics, pages 6024–6038, Abu Dhabi, UAE. Association for Computational Linguistics.
- Cite (Informal): Empirical Study on Data Attributes Insufficiency of Evaluation Benchmarks for LLMs (Liu et al., COLING 2025)
- PDF: https://preview.aclanthology.org/Add-Cong-Liu-Florida-Atlantic-University-author-id/2025.coling-main.403.pdf