Towards Evaluation of Language Models with Skill Dimensions: A Case Study on Narrative Question Answering

Emil Kalbaliyev, Kairit Sirts


Abstract
Large language models have demonstrated varying levels of competence across a range of reasoning tasks, but coarse-grained evaluations often do not reflect their specific strengths and weaknesses, particularly in complex tasks such as Narrative Question Answering. In this paper, we advocate for a multi-dimensional skill-based evaluation that assesses models across distinct core skill dimensions. Our proposed skill-focused evaluation framework offers a granular and more realistic measure of model performance, revealing targeted areas for improvement and guiding future development. Experiments on Narrative Question Answering demonstrate that dimension-level analysis captures the multifaceted nature of the task and informs more effective model evaluation.
Anthology ID:
2025.starsem-1.34
Volume:
Proceedings of the 14th Joint Conference on Lexical and Computational Semantics (*SEM 2025)
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Lea Frermann, Mark Stevenson
Venue:
*SEM
Publisher:
Association for Computational Linguistics
Pages:
430–440
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.starsem-1.34/
Cite (ACL):
Emil Kalbaliyev and Kairit Sirts. 2025. Towards Evaluation of Language Models with Skill Dimensions: A Case Study on Narrative Question Answering. In Proceedings of the 14th Joint Conference on Lexical and Computational Semantics (*SEM 2025), pages 430–440, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Towards Evaluation of Language Models with Skill Dimensions: A Case Study on Narrative Question Answering (Kalbaliyev & Sirts, *SEM 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.starsem-1.34.pdf