Jingchao Wang
2026
MMErroR: A Benchmark for Erroneous Reasoning in Vision-Language Models
Yang Shi | Yifeng Xie | Minzhe Guo | Liangsi Lu | Mingxuan Huang | Jingchao Wang | Zhihong Zhu | Boyan Xu | Zhiqi Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent advances in Vision-Language Models (VLMs) have improved performance in multi-modal learning, raising the question of whether these models truly understand the content they process. Crucially, can VLMs detect when a reasoning process is wrong and identify its error type? To answer this, we present MMErroR, a multi-modal benchmark of 1997 samples, each embedding a single coherent reasoning error. These samples span 24 subdomains across six top-level domains, ensuring broad coverage and taxonomic richness. Unlike existing benchmarks that focus on answer correctness, MMErroR targets a process-level, error-centric evaluation that requires models to detect incorrect reasoning and classify the error type within both visual and linguistic contexts. We evaluate 12 representative VLMs, and even the best model, Gemini-3-Pro-Preview, classifies the error correctly in only 66.65% of cases, underscoring the challenge of identifying erroneous reasoning. Furthermore, the ability to accurately identify errors offers valuable insights into the capabilities of multi-modal models. Project Page: https://mmerror-benchmark.github.io
2025
OpenHuEval: Evaluating Large Language Model on Hungarian Specifics
Haote Yang | Xingjian Wei | Jiang Wu | Noémi Ligeti-Nagy | Jiaxing Sun | Yinfan Wang | Győző Zijian Yang | Junyuan Gao | Jingchao Wang | Bowen Jiang | Shasha Wang | Nanjun Yu | Zihao Zhang | Shixin Hong | Hongwei Liu | Wei Li | Songyang Zhang | Dahua Lin | Lijun Wu | Gábor Prószéky | Conghui He
Findings of the Association for Computational Linguistics: ACL 2025
We introduce OpenHuEval, the first benchmark for LLMs focusing on the Hungarian language and specifics. OpenHuEval is constructed from a vast collection of Hungarian-specific materials sourced from multiple origins. In the construction, we incorporated the latest design principles for evaluating LLMs, such as using real user queries from the internet, emphasizing the assessment of LLMs’ generative capabilities, and employing LLM-as-judge to enhance the multidimensionality and accuracy of evaluations. Ultimately, OpenHuEval encompasses eight Hungarian-specific dimensions, featuring five tasks and 3953 questions. Consequently, OpenHuEval provides a comprehensive, in-depth, and scientifically accurate assessment of LLM performance in the context of the Hungarian language and its specifics. We evaluated current mainstream LLMs, including both traditional LLMs and recently developed Large Reasoning Models (LRMs). The results demonstrate a significant need for evaluation and model optimization tailored to the Hungarian language and specifics. We also established a framework for analyzing the thinking processes of LRMs with OpenHuEval, revealing intrinsic patterns and mechanisms of these models in non-English languages, with Hungarian serving as a representative example. We will release OpenHuEval at https://github.com/opendatalab/OpenHuEval.