Wenyang Luo

2025

MMATH: A Multilingual Benchmark for Mathematical Reasoning
Wenyang Luo | Xin Zhao | Jing Sha | Shijin Wang | Ji-Rong Wen
Findings of the Association for Computational Linguistics: EMNLP 2025

The advent of large reasoning models, such as OpenAI o1 and DeepSeek R1, has significantly advanced complex reasoning tasks. However, their capabilities in multilingual complex reasoning remain underexplored, with existing efforts largely focused on simpler tasks like MGSM. To address this gap, we introduce MMATH, a benchmark for multilingual complex reasoning spanning 374 high-quality math problems across 10 typologically diverse languages. Using MMATH, we observe that even advanced models like DeepSeek R1 exhibit substantial performance disparities across languages and suffer from a critical off-target issue: generating responses in unintended languages. To address this, we explore strategies including prompting and training, demonstrating that reasoning in English and answering in target languages can simultaneously enhance performance and preserve target-language consistency. Our findings offer new insights and practical strategies for advancing the multilingual reasoning capabilities of large language models. Our code and data can be found at https://github.com/RUCAIBox/MMATH.

2024

Large language models (LLMs) demonstrate remarkable multilingual capabilities without being pre-trained on specially curated multilingual parallel corpora. It remains a challenging problem to explain the underlying mechanisms by which LLMs process multilingual texts. In this paper, we delve into the composition of Transformer architectures in LLMs to pinpoint language-specific regions. Specifically, we propose a novel detection method, language activation probability entropy (LAPE), to identify language-specific neurons within LLMs. Based on LAPE, we conduct comprehensive experiments on several representative LLMs, such as LLaMA-2, BLOOM, and Mistral. Our findings indicate that LLMs’ proficiency in processing a particular language is predominantly due to a small subset of neurons, primarily situated in the models’ top and bottom layers. Furthermore, we showcase the feasibility of “steering” the output language of LLMs by selectively activating or deactivating language-specific neurons. Our research provides important evidence for understanding and exploring the multilingual capabilities of LLMs.
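
The abstract names LAPE but does not spell out how it is computed. The following is a minimal sketch of one plausible reading, not the authors' implementation: it assumes that each neuron has a per-language activation probability (the fraction of tokens on which it fires), that these probabilities are normalized into a distribution over languages, and that a low entropy of that distribution marks a language-specific neuron. The function name `lape_scores` and the sample numbers are hypothetical.

```python
import math


def lape_scores(activation_probs):
    """Return one entropy score per neuron.

    activation_probs[n][l] is assumed to be the fraction of tokens in
    language l on which neuron n fires (hypothetical input layout).
    """
    scores = []
    for probs in activation_probs:
        total = sum(probs)
        if total == 0:
            # A neuron that never fires carries no language signal.
            scores.append(float("inf"))
            continue
        # Normalize the per-language probabilities into a distribution.
        dist = [p / total for p in probs]
        # Shannon entropy; terms with zero mass contribute nothing.
        scores.append(-sum(q * math.log(q) for q in dist if q > 0))
    return scores


# Made-up activation probabilities for 3 neurons over 3 languages.
probs = [
    [0.90, 0.01, 0.01],  # fires almost only for language 0
    [0.30, 0.30, 0.30],  # fires equally often for every language
    [0.00, 0.00, 0.00],  # never fires
]
scores = lape_scores(probs)
# The neuron with the lowest entropy is the most language-specific one.
most_specific = min(range(len(scores)), key=scores.__getitem__)
```

Under these assumptions, the first neuron gets the lowest score because its activation mass is concentrated on a single language, while the uniformly firing neuron reaches the maximum entropy for three languages.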

To facilitate research on large language models (LLMs), this paper presents a comprehensive and unified library, LLMBox, to ease the development, use, and evaluation of LLMs. The library has three main merits: (1) a unified data interface that supports the flexible implementation of various training strategies, (2) a comprehensive evaluation that covers extensive tasks, datasets, and models, and (3) more practical considerations, especially user-friendliness and efficiency. With our library, users can easily reproduce existing methods, train new models, and conduct comprehensive performance comparisons. To rigorously test LLMBox, we conduct extensive experiments across a diverse range of evaluation settings, and the experimental results demonstrate the effectiveness and efficiency of our library in supporting various implementations related to LLMs. A detailed introduction and usage guidance can be found at https://github.com/RUCAIBox/LLMBox.

Co-authors
Ji-Rong Wen 3
Wayne Xin Zhao 3
Tianyi Tang 2
Yushuo Chen 1
Jie Chen 1
Xiaoxue Cheng 1
Xia Chunxuan 1
Luran Ding 1
Zican Dong 1
Geyang Guo 1
Haoyang Huang 1
Bingqian Li 1
Junyi Li 1
Yingqian Min 1
Han Peng 1
ZiJing Qin 1
Jing Sha 1
Haoxiang Sun 1
Yiru Tang 1
Xiaolei Wang 1
Jiapeng Wang 1
Yuhao Wang 1
Shijin Wang 1
Furu Wei 1
Shiyi Xu 1
Hu Yiwen 1
Dongdong Zhang 1
Ranchi Zhao 1
Bowen Zheng 1
Kun Zhou 1
Venues
acl 2
findings 1