@inproceedings{madhyastha-2025-task,
title = "Task-Aware Evaluation and Error-Overlap Analysis for Large Language Models",
author = "Madhyastha, Pranava",
editor = {Sinha, Aman and
V{\'a}zquez, Ra{\'u}l and
Mickus, Timothee and
Agarwal, Rohit and
Buhnila, Ioana and
Schmidtov{\'a}, Patr{\'i}cia and
Gamba, Federica and
Prasad, Dilip K. and
Tiedemann, J{\"o}rg},
booktitle = "Proceedings of the 1st Workshop on Confabulation, Hallucinations and Overgeneration in Multilingual and Practical Settings (CHOMPS 2025)",
month = dec,
year = "2025",
address = "Mumbai, India",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.chomps-main.1/",
pages = "1--10",
ISBN = "979-8-89176-308-1",
    abstract = "Public leaderboards for large language models often rely on aggregate scores that conceal critical information about model behavior. In this paper, we present a methodology for task-aware evaluation that combines (i) correctness metrics aligned with task semantics (compliance checks for instruction-following and numeric equivalence for mathematics) with (ii) pairwise error-overlap analysis to identify complementary model pairs. We apply this methodology to outputs from 17 recent state-of-the-art and frontier LLMs across multiple-choice QA, instruction-following, and mathematical reasoning tasks. We observe that task-aware metrics can reorder model rankings relative to generic lexical metrics, and that error-overlap patterns vary substantially across model pairs and scenarios. We conclude by discussing implications for model selection, routing strategies, and LLM-as-judge calibration, and we release our analysis pipeline to support further investigation."
}