CHARPEVAL: Benchmarking Large Language Models’ Contextual Reasoning in Knowledge-Grounded Dialogue
Abbas Ghaddar, David Alfonso-Hermelo, Philippe Langlais, Boxing Chen, Prasanna Parthasarathi
Abstract
This paper presents CHARPEVAL, a challenging benchmark specifically designed to evaluate the ability of Large Language Models (LLMs) to perform contextualized reasoning in knowledge-grounded dialogue scenarios. The task involves selecting the correct response from 6 options, including 5 manually crafted distractors, given the conversation history. Extensive benchmarking experiments with a diverse set of state-of-the-art open-weight LLMs show poor performance on CHARPEVAL due to their inability to effectively reason over discontinuous chunks of text across the input. Our analysis reveals systematic error patterns across models with different properties, highlighting the need to improve LLMs beyond simply scaling up data and compute. CHARPEVAL is publicly available at https://huggingface.co/datasets/huawei-noah/CHARP.
- Anthology ID:
- 2025.findings-acl.860
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2025
- Month:
- July
- Year:
- 2025
- Address:
- Vienna, Austria
- Editors:
- Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 16764–16775
- URL:
- https://preview.aclanthology.org/display_plenaries/2025.findings-acl.860/
- Cite (ACL):
- Abbas Ghaddar, David Alfonso-Hermelo, Philippe Langlais, Boxing Chen, and Prasanna Parthasarathi. 2025. CHARPEVAL: Benchmarking Large Language Models’ Contextual Reasoning in Knowledge-Grounded Dialogue. In Findings of the Association for Computational Linguistics: ACL 2025, pages 16764–16775, Vienna, Austria. Association for Computational Linguistics.
- Cite (Informal):
- CHARPEVAL: Benchmarking Large Language Models’ Contextual Reasoning in Knowledge-Grounded Dialogue (Ghaddar et al., Findings 2025)
- PDF:
- https://preview.aclanthology.org/display_plenaries/2025.findings-acl.860.pdf
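
The abstract describes the task as choosing the correct response among 6 options (1 correct, 5 distractors) given the conversation history. A minimal sketch of how such a six-way selection task could be scored is given below. It is not the authors' evaluation code: the `Example` fields (`history`, `options`, `answer_index`) and the `choose` callback are hypothetical names introduced for illustration and may not match the schema of the released dataset.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Per the abstract: 1 correct response + 5 manually crafted distractors.
NUM_OPTIONS = 6

# Accuracy expected from uniform random guessing over 6 options.
CHANCE_ACCURACY = 1 / NUM_OPTIONS


@dataclass
class Example:
    # All field names are assumptions, not the dataset's actual schema.
    history: Sequence[str]   # dialogue turns preceding the response
    options: Sequence[str]   # the 6 candidate responses
    answer_index: int        # position of the correct response in `options`


def accuracy(
    examples: Sequence[Example],
    choose: Callable[[Sequence[str], Sequence[str]], int],
) -> float:
    """Fraction of examples where `choose` returns the correct option index."""
    if not examples:
        raise ValueError("need at least one example")
    correct = 0
    for ex in examples:
        if len(ex.options) != NUM_OPTIONS:
            raise ValueError("each example must have exactly 6 options")
        if choose(ex.history, ex.options) == ex.answer_index:
            correct += 1
    return correct / len(examples)
```

Here `choose` stands in for whatever wraps a model (for instance, prompting it with the history and the lettered options and parsing the selected letter). Comparing the resulting accuracy against `CHANCE_ACCURACY` gives a simple reference point for how far a model is from random guessing.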