Revitalizing Saturated Benchmarks: A Weighted Metric Approach for Differentiating Large Language Model Performance

Bryan Etzine, Masoud Hashemi, Nishanth Madhusudhan, Sagar Davasam, Roshnee Sharma, Sathwik Tejaswi Madhusudhan, Vikas Yadav


Abstract
Existing benchmarks are becoming saturated and less effective in evaluating model performance due to factors such as data contamination and the advancing capabilities of the Large Language Models (LLMs). This paper introduces EMDM (Enhanced Model Differentiation Metric), a novel weighted metric designed to revitalize existing benchmarks. EMDM implements a weighting schema for samples based on their complexity and requisite knowledge, utilizing the performance of a baseline LLM in two experimental setups: “Unguided”, where the model has no prior exposure to test samples, and “Guided”, where the model has prior knowledge about the desired answer. This schema is leveraged in an optimization objective to assign weights to test samples, distinguishing instances of varying complexity. EMDM accounts for both answer correctness and the depth and accuracy of reasoning, offering a more nuanced evaluation of model performance. By weighting test examples based on their required reasoning and knowledge, EMDM achieves a distinguishing range of evaluation scores of 46% among various LLMs, compared to just 17% with traditional exact match (EM) metrics, thereby highlighting the saturation of current evaluation methods.
Anthology ID:
2025.trustnlp-main.33
Volume:
Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025)
Month:
May
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Trista Cao, Anubrata Das, Tharindu Kumarage, Yixin Wan, Satyapriya Krishna, Ninareh Mehrabi, Jwala Dhamala, Anil Ramakrishna, Aram Galystan, Anoop Kumar, Rahul Gupta, Kai-Wei Chang
Venues:
TrustNLP | WS
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
511–523
Language:
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.trustnlp-main.33/
DOI:
Bibkey:
Cite (ACL):
Bryan Etzine, Masoud Hashemi, Nishanth Madhusudhan, Sagar Davasam, Roshnee Sharma, Sathwik Tejaswi Madhusudhan, and Vikas Yadav. 2025. Revitalizing Saturated Benchmarks: A Weighted Metric Approach for Differentiating Large Language Model Performance. In Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025), pages 511–523, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
Revitalizing Saturated Benchmarks: A Weighted Metric Approach for Differentiating Large Language Model Performance (Etzine et al., TrustNLP 2025)
Copy Citation:
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.trustnlp-main.33.pdf