@inproceedings{son-etal-2025-linguistic,
title = "Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning",
author = "Son, Guijin and
Hong, Jiwoo and
Ko, Hyunwoo and
Thorne, James",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.699/",
pages = "14333--14368",
ISBN = "979-8-89176-251-0",
abstract = "Scaling pre-training compute has proven effective for achieving multilinguality, but does the same hold for test-time scaling? In this work, we introduce **MCLM**, a multilingual math benchmark featuring competition-level problems in 55 languages. We then compare three test-time scaling methods{---}Outcome Reward Modeling, Process Reward Modeling, and Budget Forcing. Our findings indicate that although ``thinking LLMs'' have recently garnered significant attention, their performance is comparable to traditional scaling methods like best-of-N once constrained to similar levels of inference FLOPs. More importantly, all tested methods fail to generalize robustly across languages, achieving only modest gains that are smaller than those observed in English, with no improvements in variance or consistency. To foster further research, we release MCLM and MR1-1.5B (a multilingual LLM with reasoning capabilities) and our evaluation results."
}