CHARPEVAL: Benchmarking Large Language Models’ Contextual Reasoning in Knowledge-Grounded Dialogue

Abbas Ghaddar, David Alfonso-Hermelo, Philippe Langlais, Boxing Chen, Prasanna Parthasarathi


Abstract
This paper presents CHARPEVAL, a challenging benchmark specifically designed to evaluate the ability of Large Language Models (LLMs) to perform contextualized reasoning in knowledge-grounded dialogue scenarios. Given the conversation history, the task is to select the correct response from six options, five of which are manually crafted distractors. Extensive benchmarking experiments with a diverse set of state-of-the-art open-weight LLMs show poor performance on CHARPEVAL due to their inability to effectively reason over discontinuous chunks of text across the input. Our analysis reveals systematic error patterns across models with different properties, highlighting the need to improve LLMs beyond simply scaling up data and compute. CHARPEVAL is publicly available at https://huggingface.co/datasets/huawei-noah/CHARP.
Anthology ID:
2025.findings-acl.860
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
16764–16775
URL:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.860/
Cite (ACL):
Abbas Ghaddar, David Alfonso-Hermelo, Philippe Langlais, Boxing Chen, and Prasanna Parthasarathi. 2025. CHARPEVAL: Benchmarking Large Language Models’ Contextual Reasoning in Knowledge-Grounded Dialogue. In Findings of the Association for Computational Linguistics: ACL 2025, pages 16764–16775, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
CHARPEVAL: Benchmarking Large Language Models’ Contextual Reasoning in Knowledge-Grounded Dialogue (Ghaddar et al., Findings 2025)
PDF:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.860.pdf