ConsistencyChecker: Tree-based Evaluation of LLM Generalization Capabilities

Zhaochen Hong, Haofei Yu, Jiaxuan You


Abstract
Evaluating Large Language Models (LLMs) requires effective methods to assess semantic consistency across multiple reversible transformations. Traditional self-consistency methods often fail to capture subtle semantic errors in multi-step tasks. We introduce ConsistencyChecker, a tree-based evaluation framework that measures LLMs’ ability to preserve semantic consistency during reversible transformation processes, sidestepping benchmark data contamination issues. Our approach constructs self-consistency trees where nodes represent text states after transformations (e.g., translation, code modification, paraphrasing) and edges represent pairs of opposite transformations. By analyzing semantic preservation between nodes at different tree depths, ConsistencyChecker quantifies model reliability without requiring manually annotated reference data. Experiments demonstrate that ConsistencyChecker reliably measures generalization abilities across models from 1.5B to 72B parameters. On translation tasks, GPT-4o Mini achieves the highest L3 consistency score of 98.0%. For code generation, Qwen 2.5 32B leads with 85.1% semantic consistency at L3. Results show Pearson correlation greater than 0.7 between our embedding-based scores and WMT 2024 rankings on 4 out of 5 shared language pairs, validating the method’s effectiveness for benchmarking LLM performance without constructing new datasets.
Anthology ID:
2025.acl-long.1585
Volume:
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
ACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
33039–33075
Language:
URL:
https://preview.aclanthology.org/landing_page/2025.acl-long.1585/
DOI:
Bibkey:
Cite (ACL):
Zhaochen Hong, Haofei Yu, and Jiaxuan You. 2025. ConsistencyChecker: Tree-based Evaluation of LLM Generalization Capabilities. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 33039–33075, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
ConsistencyChecker: Tree-based Evaluation of LLM Generalization Capabilities (Hong et al., ACL 2025)
Copy Citation:
PDF:
https://preview.aclanthology.org/landing_page/2025.acl-long.1585.pdf