Andrew Matteson
2026
Improving Korean-English Cross-Lingual Retrieval: A Data-Centric Study of Language Composition and Model Merging
Youngjoon Jang | Junyoung Son | Taemin Lee | Seongtae Hong | Hyeonseok Moon | Seungyoon Lee | Andrew Matteson | Heuiseok Lim
Proceedings of the 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM 2026)
With the increasing utilization of multilingual text information, Cross-Lingual Information Retrieval (CLIR) has become a crucial research area. However, the impact of training data composition on CLIR and Mono-Lingual Information Retrieval (Mono-IR) performance remains underexplored. To investigate this data-centric aspect, we construct linguistically parallel Korean-English datasets and train multilingual retrieval models with various language combinations. Our experiments reveal that the language composition of training data significantly influences IR performance and exhibits important inter-lingual correlations: using specific language pairs improves CLIR performance while degrading Mono-IR performance. Our work demonstrates that simple weight-averaged model merging can effectively mitigate this trade-off, achieving strong CLIR results while preserving Mono-IR capabilities. Our findings highlight the effects of the linguistic configuration of training data on both CLIR and Mono-IR, and present model merging as a viable strategy to optimize performance across these tasks.
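As a rough illustration of what weight-averaged model merging means in general, the sketch below averages the parameters of several checkpoints that share the same architecture. It is not the authors' implementation: the function name `merge_by_weight_averaging`, the plain-dictionary checkpoint format, the optional `coeffs` argument, and the toy checkpoint names are all assumptions made for this example.

```python
from typing import Dict, List, Optional, Sequence

# Hypothetical checkpoint format: parameter name -> flat list of weights.
StateDict = Dict[str, List[float]]


def merge_by_weight_averaging(
    models: Sequence[StateDict],
    coeffs: Optional[Sequence[float]] = None,
) -> StateDict:
    """Return the element-wise (optionally weighted) average of the checkpoints."""
    if not models:
        raise ValueError("at least one model is required")
    if coeffs is None:
        # Uniform averaging when no coefficients are given.
        coeffs = [1.0] * len(models)
    if len(coeffs) != len(models):
        raise ValueError("need exactly one coefficient per model")
    total = sum(coeffs)
    if total == 0:
        raise ValueError("coefficients must not sum to zero")

    names = models[0].keys()
    for model in models[1:]:
        # Averaging is only meaningful for identical architectures.
        if model.keys() != names:
            raise ValueError("all models must have the same parameter names")

    merged: StateDict = {}
    for name in names:
        size = len(models[0][name])
        if any(len(model[name]) != size for model in models):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            sum(c * model[name][i] for c, model in zip(coeffs, models)) / total
            for i in range(size)
        ]
    return merged


# Toy checkpoints standing in for a cross-lingually trained model and a
# monolingually trained model (illustrative values only).
clir_model = {"encoder.weight": [1.0, 3.0]}
mono_model = {"encoder.weight": [3.0, 5.0]}
merged_model = merge_by_weight_averaging([clir_model, mono_model])
```

With real retrieval models, the same averaging would be applied to each tensor in the checkpoints rather than to Python lists, and the coefficients would be a tuning choice.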
2018
Rich Character-Level Information for Korean Morphological Analysis and Part-of-Speech Tagging
Andrew Matteson | Chanhee Lee | Youngbum Kim | Heuiseok Lim
Proceedings of the 27th International Conference on Computational Linguistics
Because Korean is a highly agglutinative, character-rich language, previous work on Korean morphological analysis typically uses sub-character features known as graphemes or otherwise relies on comprehensive prior linguistic knowledge (i.e., a dictionary of known morphological transformation forms, or actions). These models were built on the assumption that character-level, dictionary-less morphological analysis is intractable due to the number of actions required. In this study, we present a multi-stage action-based model that can perform morphological transformation and part-of-speech tagging using arbitrary units of input, and we apply it to the case of character-level Korean morphological analysis. Among models that do not employ prior linguistic knowledge, we achieve state-of-the-art word- and sentence-level tagging accuracy on the Sejong Korean corpus using our proposed data-driven Bi-LSTM model.