Yuang Jiang


2025

MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation
Weihao Xuan | Rui Yang | Heli Qi | Qingcheng Zeng | Yunze Xiao | Aosong Feng | Dairui Liu | Yun Xing | Junjue Wang | Fan Gao | Jinghui Lu | Yuang Jiang | Huitao Li | Xin Li | Kunyu Yu | Ruihai Dong | Shangding Gu | Yuekang Li | Xiaofei Xie | Felix Juefei-Xu | Foutse Khomh | Osamu Yoshie | Qingyu Chen | Douglas Teodoro | Nan Liu | Randy Goebel | Lei Ma | Edison Marrese-Taylor | Shijian Lu | Yusuke Iwasawa | Yutaka Matsuo | Irene Li
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Existing large language model (LLM) evaluation benchmarks primarily focus on English, while current multilingual tasks lack parallel questions that specifically assess cross-lingual reasoning abilities. This dual limitation makes it challenging to assess LLMs’ performance in the multilingual setting comprehensively. To fill this gap, we introduce MMLU-ProX, a comprehensive benchmark covering 29 languages, built on an English benchmark. Each language version consists of 11,829 identical questions, enabling direct cross-lingual comparisons. Additionally, to meet efficient evaluation needs, we provide a lite version containing 658 questions per language. To ensure the high quality of MMLU-ProX, we employ a rigorous development process that involves multiple powerful LLMs for translation, followed by expert review to ensure accurate expression, consistent terminology, and cultural relevance. Building on this, we systematically evaluate 36 state-of-the-art LLMs, including reasoning-enhanced and multilingual-optimized LLMs. The results reveal significant disparities in the multilingual capabilities of LLMs: While they perform well in high-resource languages, their performance declines markedly in low-resource languages, particularly for African languages. Through MMLU-ProX, we aim to advance the development of more inclusive AI systems and promote equitable access to technology across global contexts.

2024

Evaluating Large Language Models on Wikipedia-Style Survey Generation
Fan Gao | Hang Jiang | Rui Yang | Qingcheng Zeng | Jinghui Lu | Moritz Blum | Tianwei She | Yuang Jiang | Irene Li
Findings of the Association for Computational Linguistics: ACL 2024

Educational materials such as survey articles in specialized fields like computer science traditionally require tremendous expert input and are therefore expensive to create and update. Recently, Large Language Models (LLMs) have achieved significant success across various general tasks. However, their effectiveness and limitations in the education domain are yet to be fully explored. In this work, we examine the proficiency of LLMs in generating succinct survey articles specific to the niche field of NLP in computer science, focusing on a curated list of 99 topics. Automated benchmarks reveal that GPT-4 surpasses its predecessors, including GPT-3.5, PaLM2, and LLaMa2, by margins ranging from 2% to 20% in comparison to the established ground truth. We compare both human and GPT-based evaluation scores and provide in-depth analysis. While our findings suggest that GPT-created surveys are more contemporary and accessible than human-authored ones, certain limitations were observed. Notably, GPT-4, despite often delivering outstanding content, occasionally exhibited lapses like missing details or factual errors. Finally, we compared the rating behavior between humans and GPT-4 and found systematic bias in using GPT evaluation.