2025
pdf
bib
abs
OpenHuEval: Evaluating Large Language Model on Hungarian Specifics
Haote Yang
|
Xingjian Wei
|
Jiang Wu
|
Noémi Ligeti-Nagy
|
Jiaxing Sun
|
Yinfan Wang
|
Győző Zijian Yang
|
Junyuan Gao
|
Jingchao Wang
|
Bowen Jiang
|
Shasha Wang
|
Nanjun Yu
|
Zihao Zhang
|
Shixin Hong
|
Hongwei Liu
|
Wei Li
|
Songyang Zhang
|
Dahua Lin
|
Lijun Wu
|
Gábor Prószéky
|
Conghui He
Findings of the Association for Computational Linguistics: ACL 2025
We introduce OpenHuEval, the first benchmark for LLMs focusing on the Hungarian language and specifics. OpenHuEval is constructed from a vast collection of Hungarian-specific materials sourced from multiple origins. In the construction, we incorporated the latest design principles for evaluating LLMs, such as using real user queries from the internet, emphasizing the assessment of LLMs’ generative capabilities, and employing LLM-as-judge to enhance the multidimensionality and accuracy of evaluations. Ultimately, OpenHuEval encompasses eight Hungarian-specific dimensions, featuring five tasks and 3953 questions. Consequently, OpenHuEval provides the comprehensive, in-depth, and scientifically accurate assessment of LLM performance in the context of the Hungarian language and its specifics. We evaluated current mainstream LLMs, including both traditional LLMs and recently developed Large Reasoning Models. The results demonstrate the significant necessity for evaluation and model optimization tailored to the Hungarian language and specifics. We also established the framework for analyzing the thinking processes of LRMs with OpenHuEval, revealing intrinsic patterns and mechanisms of these models in non-English languages, with Hungarian serving as a representative example. We will release OpenHuEval at https://github.com/opendatalab/OpenHuEval .
2024
pdf
bib
abs
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
Jiaxing Sun
|
Weiquan Huang
|
Jiang Wu
|
Chenya Gu
|
Wei Li
|
Songyang Zhang
|
Hang Yan
|
Conghui He
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We introduce CHARM, the first benchmark for comprehensively and in-depth evaluating the commonsense reasoning ability of large language models (LLMs) in Chinese, which covers both globally known and Chinese-specific commonsense. We evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5 representative prompt strategies for improving LLMs’ reasoning ability, such as Chain-of-Thought. Our findings indicated that the LLM’s language orientation and the task’s domain influence the effectiveness of the prompt strategy, which enriches previous research findings. We built closely-interconnected reasoning and memorization tasks, and found that some LLMs struggle with memorizing Chinese commonsense, affecting their reasoning ability, while others show differences in reasoning despite similar memorization performance. We also evaluated the LLMs’ memorization-independent reasoning abilities and analyzed the typical errors. Our study precisely identified the LLMs’ strengths and weaknesses, providing the clear direction for optimization. It can also serve as a reference for studies in other fields. We will release CHARM at https://github.com/opendatalab/CHARM.