Dasol Choi


2025

Multi-Step Reasoning in Korean and the Emergent Mirage
Guijin Son | Hyunwoo Ko | Dasol Choi
Proceedings of the 3rd Workshop on Cross-Cultural Considerations in NLP (C3NLP 2025)

No Language Data Left Behind: A Cross-Cultural Study of CJK Language Datasets in the Hugging Face Ecosystem
Dasol Choi | Woomyoung Park | Youngsook Song
Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025)

Recent advances in Natural Language Processing (NLP) have underscored the crucial role of high-quality datasets in building large language models (LLMs). However, while extensive resources and analyses exist for English, the landscape for East Asian languages, particularly Chinese, Japanese, and Korean (CJK), remains fragmented and underexplored, despite these languages serving over 1.6 billion speakers. To address this gap, we investigate the Hugging Face ecosystem from a cross-linguistic perspective, focusing on how cultural norms, research environments, and institutional practices shape dataset availability and quality. Drawing on more than 3,300 datasets, we employ quantitative and qualitative methods to examine how these factors drive distinct creation and curation patterns across the Chinese, Japanese, and Korean NLP communities. Our findings highlight the large-scale and often institution-driven nature of Chinese datasets, grassroots community-led development in Korean NLP, and an entertainment- and subculture-focused emphasis in Japanese collections. By uncovering these patterns, we reveal practical strategies for enhancing dataset documentation, licensing clarity, and cross-lingual resource sharing, guiding more effective and culturally attuned LLM development in East Asia. We conclude by discussing best practices for future dataset curation and collaboration, aiming to strengthen resource development across all three languages.
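
A minimal sketch of the kind of ecosystem survey the abstract describes, assuming a recent huggingface_hub release whose list_datasets accepts tag-style filters (e.g., "language:ko") and returns DatasetInfo objects with a tags attribute. This is illustrative only, not the paper's actual collection or analysis pipeline; the limit value and the license-tag heuristic are assumptions.

```python
# Illustrative sketch (not the paper's pipeline): rough per-language dataset
# counts and license-tag coverage for CJK datasets on the Hugging Face Hub.
from collections import Counter

from huggingface_hub import list_datasets

lang_counts, licensed = Counter(), Counter()
for lang in ("zh", "ja", "ko"):
    # "language:<code>" is a tag-style filter understood by the Hub API.
    for ds in list_datasets(filter=f"language:{lang}", full=True, limit=3000):
        lang_counts[lang] += 1
        # DatasetInfo.tags carries card metadata such as "license:..." entries.
        if any(tag.startswith("license:") for tag in (ds.tags or [])):
            licensed[lang] += 1

for lang in ("zh", "ja", "ko"):
    print(f"{lang}: {lang_counts[lang]} datasets, {licensed[lang]} with a license tag")
```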

Understand, Solve and Translate: Bridging the Multilingual Mathematical Reasoning Gap
Hyunwoo Ko | Guijin Son | Dasol Choi
Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025)

Large language models (LLMs) demonstrate exceptional performance on complex reasoning tasks. However, despite their strong reasoning capabilities in high-resource languages (e.g., English and Chinese), a significant performance gap persists in other languages. To investigate this gap in Korean, we introduce HRM8K, a benchmark comprising 8,011 English-Korean parallel bilingual math problems. Through systematic analysis of model behaviors, we identify a key finding: these performance disparities stem primarily from difficulties in comprehending non-English inputs, rather than limitations in reasoning capabilities. Based on these findings, we propose UST (Understand, Solve, and Translate), a method that strategically uses English as an anchor for reasoning and solution generation. By fine-tuning the model on 130k synthetically generated data points, UST achieves a 10.91% improvement on the HRM8K benchmark and reduces the multilingual performance gap from 11.6% to 0.7%. Additionally, we show that improvements from UST generalize effectively to different Korean domains, demonstrating that capabilities acquired from machine-verifiable content can be generalized to other areas. We publicly release the benchmark, training dataset, and models.
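
A minimal sketch of the Understand-Solve-Translate idea at inference time, assuming the three stages can be expressed as a single structured prompt. The paper's actual method fine-tunes a model on roughly 130k synthetic examples; here `generate` is a hypothetical stand-in for any LLM completion call, and the prompt wording is an assumption for illustration.

```python
# Sketch of a UST-style (Understand, Solve, Translate) inference pattern:
# anchor comprehension and reasoning in English, then return the answer in Korean.
from typing import Callable


def ust_prompt(korean_problem: str) -> str:
    """Build a prompt that walks the model through the three UST stages."""
    return (
        "You are a careful math tutor.\n"
        f"Problem (Korean): {korean_problem}\n\n"
        "Step 1 (Understand): Restate the problem in English.\n"
        "Step 2 (Solve): Reason step by step in English and compute the answer.\n"
        "Step 3 (Translate): State only the final answer, translated into Korean.\n"
    )


def solve_korean_math(korean_problem: str, generate: Callable[[str], str]) -> str:
    """Run one UST-style pass; `generate` maps a prompt string to model output."""
    return generate(ust_prompt(korean_problem))
```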