Towards Evaluation of Language Models with Skill Dimensions: A Case Study on Narrative Question Answering

Emil Kalbaliyev, Kairit Sirts


Abstract
Large language models have demonstrated varying levels of competence across a range of reasoning tasks, but coarse-grained evaluations often do not reflect their specific strengths and weaknesses, particularly in complex tasks such as Narrative Question Answering. In this paper, we advocate for a multi-dimensional skill-based evaluation that assesses models across distinct core skill dimensions. Our proposed skill-focused evaluation framework offers a granular and more realistic measure of model performance, revealing targeted areas for improvement and guiding future development. Experiments on Narrative Question Answering demonstrate that dimension-level analysis captures the multifaceted nature of the task and informs more effective model evaluation.
Anthology ID:
2025.starsem-1.34
Volume:
Proceedings of the 14th Joint Conference on Lexical and Computational Semantics (*SEM 2025)
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Lea Frermann, Mark Stevenson
Venue:
*SEM
Publisher:
Association for Computational Linguistics
Pages:
430–440
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.starsem-1.34/
Cite (ACL):
Emil Kalbaliyev and Kairit Sirts. 2025. Towards Evaluation of Language Models with Skill Dimensions: A Case Study on Narrative Question Answering. In Proceedings of the 14th Joint Conference on Lexical and Computational Semantics (*SEM 2025), pages 430–440, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Towards Evaluation of Language Models with Skill Dimensions: A Case Study on Narrative Question Answering (Kalbaliyev & Sirts, *SEM 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.starsem-1.34.pdf