2025
pdf
bib
abs
Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning
Guijin Son
|
Jiwoo Hong
|
Hyunwoo Ko
|
James Thorne
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Scaling pre-training compute has proven effective for achieving multilinguality, but does the same hold for test-time scaling? In this work, we introduce **MCLM**, a multilingual math benchmark featuring competition-level problems in 55 languages. We then compare three test-time scaling methods—Outcome Reward Modeling, Process Reward Modeling, and Budget Forcing. Our findings indicate that although “thinking LLMs” have recently garnered significant attention, their performance is comparable to traditional scaling methods like best-of-N once constrained to similar levels of inference FLOPs. More importantly, all tested methods fail to generalize robustly across languages, achieving only modest gains that are smaller than those observed in English, with no improvements in variance or consistency. To foster further research, we release MCLM, MR1-1.5B (a multilingual LLM with reasoning capabilities), and our evaluation results.
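As a rough illustration of the best-of-N baseline compared in this work, the sketch below samples several candidate solutions and keeps the one an outcome-reward scorer prefers; the model name, sampling settings, and scoring interface are illustrative assumptions, not the paper's exact setup.

```python
# Minimal best-of-N sketch with an outcome-reward scorer (illustrative only).
from transformers import pipeline

# Hypothetical generator; the paper's models and decoding settings may differ.
generator = pipeline("text-generation", model="Qwen/Qwen2.5-Math-1.5B-Instruct")

def best_of_n(problem: str, score_fn, n: int = 8) -> str:
    """Sample n candidate solutions and return the one the reward model prefers."""
    candidates = generator(
        problem,
        num_return_sequences=n,
        do_sample=True,
        temperature=0.7,
        max_new_tokens=512,
        return_full_text=False,
    )
    # score_fn stands in for an outcome reward model scoring complete solutions.
    return max((c["generated_text"] for c in candidates), key=score_fn)
```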
pdf
bib
abs
Controlling Language Confusion in Multilingual LLMs
Nahyun Lee
|
Yeongseo Woo
|
Hyunwoo Ko
|
Guijin Son
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)
Large language models often suffer from language confusion, a phenomenon in which responses are partially or entirely generated in unintended languages. This critically degrades the user experience, especially in low-resource settings. We hypothesize that this issue stems from limitations in conventional fine-tuning objectives, such as supervised learning, which optimize the likelihood of correct tokens without explicitly penalizing undesired outputs such as cross-lingual mixing. Analysis of loss trajectories during pretraining further reveals that models fail to distinguish between monolingual and language-mixed texts, highlighting the absence of inherent pressure to avoid such confusion. In this work, we apply ORPO, which adds penalties for unwanted output styles to standard SFT, effectively suppressing language-confused generations. ORPO maintains strong language consistency, even under high decoding temperatures, while preserving general QA performance. Our findings suggest that incorporating appropriate penalty terms can effectively mitigate language confusion in multilingual models, particularly in low-resource scenarios.
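For readers unfamiliar with ORPO, a minimal sketch of the objective is shown below: the standard SFT loss on the preferred (language-consistent) completion plus a log-odds-ratio penalty against the rejected (language-mixed) completion. Tensor shapes and the weighting term are assumptions, not the paper's exact configuration.

```python
# Sketch of the ORPO objective used to penalize language-confused completions.
import torch
import torch.nn.functional as F

def orpo_loss(logp_chosen, logp_rejected, nll_chosen, beta: float = 0.1):
    """
    logp_chosen / logp_rejected: length-normalized log-likelihoods of the
      language-consistent and language-mixed completions, shape (batch,).
    nll_chosen: standard SFT negative log-likelihood on the chosen completion.
    """
    # odds(y|x) = p / (1 - p); computed in log space for numerical stability.
    log_odds = (logp_chosen - torch.log1p(-torch.exp(logp_chosen))) - \
               (logp_rejected - torch.log1p(-torch.exp(logp_rejected)))
    ratio_term = -F.logsigmoid(log_odds)  # penalty on the rejected (mixed) style
    return (nll_chosen + beta * ratio_term).mean()
```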
pdf
bib
abs
FINKRX: Establishing Best Practices for Korean Financial NLP
Guijin Son
|
Hyunwoo Ko
|
Hanearl Jung
|
Chami Hwang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)
In this work, we present the first open leaderboard for evaluating Korean large language models focused on finance. Operated for about eight weeks, the leaderboard evaluated 1,119 submissions on a closed benchmark covering five MCQA categories (finance and accounting, stock price prediction, domestic company analysis, financial markets, and financial agent tasks) and one open-ended QA task. Building on insights from these evaluations, we release an open instruction dataset of 80k instances and summarize widely used training strategies observed among top-performing models. Finally, we introduce FINKRX, a fully open and transparent LLM built using these best practices. We hope our contributions help advance the development of better and safer financial LLMs for Korean and other languages.
pdf
bib
Multi-Step Reasoning in Korean and the Emergent Mirage
Guijin Son
|
Hyunwoo Ko
|
Dasol Choi
Proceedings of the 3rd Workshop on Cross-Cultural Considerations in NLP (C3NLP 2025)
pdf
bib
abs
KMMLU: Measuring Massive Multitask Language Understanding in Korean
Guijin Son
|
Hanwool Lee
|
Sungdong Kim
|
Seungone Kim
|
Niklas Muennighoff
|
Taekyoon Choi
|
Cheonbok Park
|
Kang Min Yoo
|
Stella Biderman
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
We propose KMMLU, a Korean benchmark with 35,030 expert-level multiple-choice questions across 45 subjects ranging from humanities to STEM. While prior Korean evaluation tools heavily rely on translated versions of existing English benchmarks, KMMLU is collected from original Korean exams, thereby capturing linguistic and cultural aspects of the Korean language. Recent models struggle to surpass 60%, significantly below the pass mark of the source exams (80%), highlighting considerable room for improvement. Notably, one-fifth of the questions in KMMLU require knowledge of Korean culture for accurate resolution. KMMLU thus provides a more accurate reflection of human preferences compared to translated versions of MMLU and offers deeper insights into LLMs’ shortcomings in Korean knowledge. The dataset and codes are made publicly available for future research.
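A minimal evaluation loop over one KMMLU subject might look like the sketch below; the dataset identifier, configuration name, and column layout are assumptions based on common MCQA releases and should be checked against the published dataset.

```python
# Rough sketch of zero-shot MCQA accuracy on one KMMLU subject (illustrative).
from datasets import load_dataset

# Hypothetical dataset id / config / columns; verify against the actual release.
ds = load_dataset("HAERAE-HUB/KMMLU", "Accounting", split="test")

def format_prompt(row) -> str:
    return (f"{row['question']}\n"
            f"A. {row['A']}\nB. {row['B']}\nC. {row['C']}\nD. {row['D']}\n"
            "정답:")

def predict(prompt: str) -> str:
    """Placeholder for the LLM under evaluation; should return 'A'..'D'."""
    return "A"

correct = sum(predict(format_prompt(row)) == row["answer"] for row in ds)
print(f"accuracy = {correct / len(ds):.3f}")
```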
pdf
bib
abs
The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models
Seungone Kim
|
Juyoung Suk
|
Ji Yong Cho
|
Shayne Longpre
|
Chaeeun Kim
|
Dongkeun Yoon
|
Guijin Son
|
Yejin Cho
|
Sheikh Shafayat
|
Jinheon Baek
|
Sue Hyun Park
|
Hyeonbin Hwang
|
Jinkyung Jo
|
Hyowon Cho
|
Haebin Shin
|
Seongyun Lee
|
Hanseok Oh
|
Noah Lee
|
Namgyu Ho
|
Se June Joo
|
Miyoung Ko
|
Yoonjoo Lee
|
Hyungjoo Chae
|
Jamin Shin
|
Joel Jang
|
Seonghyeon Ye
|
Bill Yuchen Lin
|
Sean Welleck
|
Graham Neubig
|
Moontae Lee
|
Kyungjae Lee
|
Minjoon Seo
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
As language models (LMs) become capable of handling a wide range of tasks, their evaluation is becoming as challenging as their development. Most generation benchmarks currently assess LMs using abstract evaluation criteria, such as helpfulness and harmlessness, which often lack the flexibility and granularity of human assessment. Additionally, these benchmarks tend to focus disproportionately on specific capabilities such as instruction following, leading to coverage bias. To overcome these limitations, we introduce the BiGGen Bench, a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks. A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation. We apply this benchmark to assess 100 frontier LMs using five evaluator LMs. Our code, data, and evaluation results are all publicly available at https://github.com/prometheus-eval/prometheus-eval.
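The sketch below illustrates the general shape of instance-specific, rubric-based LLM-as-judge scoring that the benchmark relies on; the template wording is an assumption, not the benchmark's actual evaluation prompt.

```python
# Illustrative instance-specific LLM-as-judge prompt builder (not the official one).
JUDGE_TEMPLATE = """You are evaluating a model response.

Task instruction:
{instruction}

Response to evaluate:
{response}

Instance-specific scoring rubric:
{rubric}

Give a score from 1 to 5 and a one-sentence justification."""

def build_judge_prompt(instruction: str, response: str, rubric: str) -> str:
    """Fill the rubric-based judging template for a single benchmark instance."""
    return JUDGE_TEMPLATE.format(instruction=instruction, response=response, rubric=rubric)
```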
2024
pdf
bib
abs
Multi-Task Inference: Can Large Language Models Follow Multiple Instructions at Once?
Guijin Son
|
SangWon Baek
|
Sangdae Nam
|
Ilgyun Jeong
|
Seungone Kim
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) are typically prompted to follow a single instruction per inference call. In this work, we analyze whether LLMs also hold the capability to handle multiple instructions simultaneously, denoted as Multi-Task Inference. For this purpose, we introduce the MTI Bench (Multi-Task Inference Benchmark), a comprehensive evaluation benchmark encompassing 5,000 instances across 25 tasks. Each task in the MTI Bench involves 2 to 3 sub-tasks. As expected, we first demonstrate that Multi-Task Inference reduces the total inference time by 1.46× on average, since it does not require multiple inference calls. Interestingly, contrary to the expectation that LLMs would perform better when tasks are divided, we find that state-of-the-art LLMs, such as Llama-2-Chat-70B and GPT-4, show up to 7.3% and 12.4% improved performance with Multi-Task Inference compared to Single-Task Inference on the MTI Bench. We release the MTI Bench dataset and our code at this [link](https://anonymous.4open.science/r/MTI-Bench-6F01).
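As a toy illustration of the setting (not the benchmark's actual prompts), Multi-Task Inference amounts to packing several instructions that share a context into one prompt, rather than issuing one inference call per instruction:

```python
# Toy illustration of Single-Task vs. Multi-Task Inference prompting.
def single_task_prompts(context: str, instructions: list[str]) -> list[str]:
    """One prompt (and one inference call) per instruction."""
    return [f"{context}\n\nInstruction: {inst}\nAnswer:" for inst in instructions]

def multi_task_prompt(context: str, instructions: list[str]) -> str:
    """One prompt covering all instructions, so the shared context is encoded once."""
    numbered = "\n".join(f"{i + 1}. {inst}" for i, inst in enumerate(instructions))
    return (f"{context}\n\nComplete all of the following instructions in order, "
            f"answering each one separately:\n{numbered}\n\nAnswers:")
```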
pdf
bib
abs
KRX Bench: Automating Financial Benchmark Creation via Large Language Models
Guijin Son
|
Hyunjun Jeon
|
Chami Hwang
|
Hanearl Jung
Proceedings of the Joint Workshop of the 7th Financial Technology and Natural Language Processing, the 5th Knowledge Discovery from Unstructured Data in Financial Services, and the 4th Workshop on Economics and Natural Language Processing
In this work, we introduce KRX-Bench, an automated pipeline for creating financial benchmarks via GPT-4. To demonstrate the effectiveness of the pipeline, we create KRX-Bench-POC, a benchmark assessing the knowledge of LLMs in real-world companies. This dataset comprises 1,002 questions, each focusing on companies across the U.S., Japanese, and Korean stock markets. We make our pipeline and dataset publicly available and integrate the evaluation code into EleutherAI’s Language Model Evaluation Harness.
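A hypothetical sketch of the kind of generation step such a pipeline might use is shown below: asking GPT-4 to write one multiple-choice question from a short company description. The prompt and client usage are assumptions, not the paper's actual pipeline.

```python
# Hypothetical sketch of one question-generation step in an automated
# benchmark-creation pipeline (illustrative, not the paper's implementation).
from openai import OpenAI

client = OpenAI()

def make_question(company_description: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": ("Write one four-option multiple-choice question testing "
                        "knowledge of the following company, and mark the correct "
                        f"option.\n\n{company_description}"),
        }],
    )
    return resp.choices[0].message.content
```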
pdf
bib
abs
ESG Classification by Implicit Rule Learning via GPT-4
Yun Hyojeong
|
Kim Chanyoung
|
Moonjeong Hahm
|
Kyuri Kim
|
Guijin Son
Proceedings of the Joint Workshop of the 7th Financial Technology and Natural Language Processing, the 5th Knowledge Discovery from Unstructured Data in Financial Services, and the 4th Workshop on Economics and Natural Language Processing
In this work, we adopt multiple prompting, chain-of-thought reasoning, and in-context learning strategies to guide GPT-4 in solving ESG classification tasks. We rank second in the Korean subset for Shared Task ML-ESG-3 in Impact Type prediction. Furthermore, we evaluate open models to examine their calibration and robustness under different prompting strategies. We observe that longer general pre-training correlates with enhanced performance on financial downstream tasks.
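A chain-of-thought style classification prompt in the spirit of this setup might look like the sketch below; the label set and wording are illustrative assumptions rather than the shared task's exact specification.

```python
# Illustrative chain-of-thought prompt for ESG impact-type classification.
LABELS = ["Opportunity", "Risk", "Cannot Distinguish"]  # assumed label set

def esg_cot_prompt(article: str) -> str:
    return (
        "Classify the ESG impact type of the following Korean news excerpt.\n"
        f"Possible labels: {', '.join(LABELS)}\n\n"
        f"Article:\n{article}\n\n"
        "First reason step by step about which stakeholders are affected and how, "
        "then give the final label on its own line as 'Label: <answer>'."
    )
```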
pdf
bib
FINALE : Finance Domain Instruction-Tuning Dataset with High-Quality Rationales via Chain-of-Thought Prompting
Sangmin Lee
|
Suzie Oh
|
Saeran Park
|
Guijin Son
|
Pilsung Kang
Proceedings of the Eighth Financial Technology and Natural Language Processing and the 1st Agent AI for Scenario Planning
pdf
bib
abs
HAE-RAE Bench: Evaluation of Korean Knowledge in Language Models
Guijin Son
|
Hanwool Lee
|
Suwan Kim
|
Huiseo Kim
|
Jae cheol Lee
|
Je Won Yeom
|
Jihyu Jung
|
Jung woo Kim
|
Songseong Kim
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Large language models (LLMs) trained on massive corpora demonstrate impressive capabilities in a wide range of tasks. While there are ongoing efforts to adapt these models to languages beyond English, the attention given to their evaluation methodologies remains limited. Current multilingual benchmarks often rely on back translations or re-implementations of English tests, limiting their capacity to capture unique cultural and linguistic nuances. To bridge this gap for the Korean language, we introduce the HAE-RAE Bench, a dataset curated to challenge models lacking Korean cultural and contextual depth. The dataset encompasses six downstream tasks across four domains: vocabulary, history, general knowledge, and reading comprehension. Unlike traditional evaluation suites focused on token and sequence classification or mathematical and logical reasoning, the HAE-RAE Bench emphasizes a model’s aptitude for recalling Korean-specific knowledge and cultural contexts. Comparative analysis with prior Korean benchmarks indicates that the HAE-RAE Bench presents a greater challenge to non-Korean models by limiting the transfer of abilities and knowledge learned from English.
2023
pdf
bib
Beyond Classification: Financial Reasoning in State-of-the-Art Language Models
Guijin Son
|
Hanearl Jung
|
Moonjeong Hahm
|
Keonju Na
|
Sol Jin
Proceedings of the Fifth Workshop on Financial Technology and Natural Language Processing and the Second Multimodal AI For Financial Forecasting