C²RBench: A Chinese Complex Reasoning Benchmark for Large Language Models

Junru Wu, Tianhao Shen, Linxi Su, Deyi Xiong


Abstract
Large language models (LLMs) have achieved remarkable progress in autonomous reasoning, evolving from basic text processing to sophisticated multimodal reasoning, a critical capability for general-purpose AI assistants. However, existing benchmarks usually fail to adequately capture the intricate multi-step reasoning demands inherent in real-world scenarios. To bridge this gap, we propose C²RBench: a Chinese Complex Reasoning Benchmark for evaluating the advanced multi-step, multimodal reasoning capabilities of LLMs. C²RBench comprises 1,115 carefully curated Chinese tasks, which are organized into eight domain-specific subsets, each meticulously designed to mirror real-world challenges. This hierarchical benchmark features three difficulty tiers based on the number of reasoning steps required (average 8.44 steps per task), significantly exceeding existing benchmarks in cognitive complexity. Extensive evaluations of 20 LLMs (including DeepSeek-R1) and 24 multimodal large language models (MLLMs) on C²RBench reveal critical performance gaps: GPT-4.1 achieves only 52.11% accuracy, indicating substantial room for improvement. The dataset and evaluation code are publicly available.
Anthology ID:
2025.findings-acl.1083
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
21031–21050
URL:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.1083/
Cite (ACL):
Junru Wu, Tianhao Shen, Linxi Su, and Deyi Xiong. 2025. C²RBench: A Chinese Complex Reasoning Benchmark for Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2025, pages 21031–21050, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
C²RBench: A Chinese Complex Reasoning Benchmark for Large Language Models (Wu et al., Findings 2025)
PDF:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.1083.pdf