Hyunwoo Ko


2025

Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning
Guijin Son | Jiwoo Hong | Hyunwoo Ko | James Thorne
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Scaling pre-training compute has proven effective for achieving multilinguality, but does the same hold for test-time scaling? In this work, we introduce **MCLM**, a multilingual math benchmark featuring competition-level problems in 55 languages. We then compare three test-time scaling methods: Outcome Reward Modeling, Process Reward Modeling, and Budget Forcing. Our findings indicate that although "thinking LLMs" have recently garnered significant attention, their performance is comparable to traditional scaling methods like best-of-N once constrained to similar levels of inference FLOPs. More importantly, all tested methods fail to generalize robustly across languages, achieving only modest gains that are smaller than those observed in English, with no improvements in variance or consistency. To foster further research, we release MCLM, MR1-1.5B (a multilingual LLM with reasoning capabilities), and our evaluation results.
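The best-of-N baseline mentioned above can be summarized as: sample N candidate solutions, score each with an outcome reward model, and keep the top-scoring one. The sketch below illustrates only this selection logic; the sampler and scorer are hypothetical stand-ins, not the paper's models or evaluation pipeline.

```python
# Minimal best-of-N sketch: sample N candidates, score each with an
# outcome reward model (ORM), return the highest-scoring candidate.
# sample_fn and score_fn are placeholder callables, not the authors' setup.
import random
from typing import Callable, List, Tuple


def best_of_n(
    prompt: str,
    sample_fn: Callable[[str], str],        # draws one candidate solution
    score_fn: Callable[[str, str], float],  # ORM score for (prompt, solution)
    n: int = 8,
) -> Tuple[str, float]:
    """Return the candidate with the highest outcome-reward score."""
    candidates: List[str] = [sample_fn(prompt) for _ in range(n)]
    scored = [(c, score_fn(prompt, c)) for c in candidates]
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Dummy sampler/scorer; a real run would call an LLM and a trained ORM.
    dummy_sample = lambda p: f"candidate solution {random.randint(0, 999)}"
    dummy_score = lambda p, c: random.random()
    best, score = best_of_n("Solve: 2x + 3 = 7", dummy_sample, dummy_score)
    print(best, score)
```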

Controlling Language Confusion in Multilingual LLMs
Nahyun Lee | Yeongseo Woo | Hyunwoo Ko | Guijin Son
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

Large language models often suffer from language confusion, a phenomenon in which responses are partially or entirely generated in unintended languages. This critically degrades the user experience, especially in low-resource settings. We hypothesize that this issue stems from limitations in conventional fine-tuning objectives such as supervised fine-tuning (SFT), which optimize the likelihood of correct tokens without explicitly penalizing undesired outputs such as cross-lingual mixing. Analysis of loss trajectories during pretraining further reveals that models fail to distinguish between monolingual and language-mixed texts, highlighting the absence of inherent pressure to avoid such confusion. In this work, we apply ORPO, which adds penalties for unwanted output styles to standard SFT, effectively suppressing language-confused generations. ORPO maintains strong language consistency, even under high decoding temperatures, while preserving general QA performance. Our findings suggest that incorporating appropriate penalty terms can effectively mitigate language confusion in multilingual models, particularly in low-resource scenarios.
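As a rough illustration of the idea of adding a penalty for unwanted outputs to standard SFT, the sketch below combines an SFT loss on the intended monolingual response with an odds-ratio term that penalizes a language-mixed response. The tensor inputs and the lambda weight are illustrative assumptions, not the authors' exact training configuration.

```python
# ORPO-style objective sketch: SFT loss on the desired (monolingual) response
# plus an odds-ratio penalty that pushes probability away from a
# language-confused response. Inputs are assumed mean per-token log-probs.
import torch
import torch.nn.functional as F


def orpo_loss(
    logp_chosen: torch.Tensor,    # mean per-token log-prob of the desired response
    logp_rejected: torch.Tensor,  # mean per-token log-prob of the language-mixed response
    nll_chosen: torch.Tensor,     # SFT (cross-entropy) loss on the desired response
    lam: float = 0.1,             # weight on the odds-ratio penalty (assumed value)
) -> torch.Tensor:
    # log odds(y) = log p(y) - log(1 - p(y)); computed in log space for stability.
    log_odds_chosen = logp_chosen - torch.log1p(-torch.exp(logp_chosen))
    log_odds_rejected = logp_rejected - torch.log1p(-torch.exp(logp_rejected))
    ratio = log_odds_chosen - log_odds_rejected
    penalty = -F.logsigmoid(ratio)  # small when the desired response is far more likely
    return nll_chosen + lam * penalty


if __name__ == "__main__":
    loss = orpo_loss(
        logp_chosen=torch.tensor(-0.4),
        logp_rejected=torch.tensor(-2.3),
        nll_chosen=torch.tensor(0.4),
    )
    print(loss.item())
```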

FINKRX: Establishing Best Practices for Korean Financial NLP
Guijin Son | Hyunwoo Ko | Hanearl Jung | Chami Hwang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)

In this work, we present the first open leaderboard for evaluating Korean large language models focused on finance. Operated for about eight weeks, the leaderboard evaluated 1,119 submissions on a closed benchmark covering five MCQA categories (finance and accounting, stock price prediction, domestic company analysis, financial markets, and financial agent tasks) and one open-ended QA task. Building on insights from these evaluations, we release an open instruction dataset of 80k instances and summarize widely used training strategies observed among top-performing models. Finally, we introduce FINKRX, a fully open and transparent LLM built using these best practices. We hope our contributions help advance the development of better and safer financial LLMs for Korean and other languages.

Multi-Step Reasoning in Korean and the Emergent Mirage
Guijin Son | Hyunwoo Ko | Dasol Choi
Proceedings of the 3rd Workshop on Cross-Cultural Considerations in NLP (C3NLP 2025)