Bryan Etzine

2025

Revitalizing Saturated Benchmarks: A Weighted Metric Approach for Differentiating Large Language Model Performance
Bryan Etzine | Masoud Hashemi | Nishanth Madhusudhan | Sagar Davasam | Roshnee Sharma | Sathwik Tejaswi Madhusudhan | Vikas Yadav
Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025)

Existing benchmarks are becoming saturated and less effective in evaluating model performance due to factors such as data contamination and the advancing capabilities of the Large Language Models (LLMs). This paper introduces EMDM (Enhanced Model Differentiation Metric), a novel weighted metric designed to revitalize existing benchmarks. EMDM implements a weighting schema for samples based on their complexity and requisite knowledge, utilizing the performance of a baseline LLM in two experimental setups: “Unguided”, where the model has no prior exposure to test samples, and “Guided”, where the model has prior knowledge about the desired answer. This schema is leveraged in an optimization objective to assign weights to test samples, distinguishing instances of varying complexity. EMDM accounts for both answer correctness and the depth and accuracy of reasoning, offering a more nuanced evaluation of model performance. By weighting test examples based on their required reasoning and knowledge, EMDM achieves a distinguishing range of evaluation scores of 46% among various LLMs, compared to just 17% with traditional exact match (EM) metrics, thereby highlighting the saturation of current evaluation methods.

Co-authors

Vikas Yadav 1

Venues

trustnlp1
ws1

Fix author