Andrew Matteson


2026

With the increasing use of multilingual text, Cross-Lingual Information Retrieval (CLIR) has become a crucial research area. However, the impact of training data composition on CLIR and Mono-Lingual Information Retrieval (Mono-IR) performance remains underexplored. To investigate this data-centric aspect, we construct linguistically parallel Korean-English datasets and train multilingual retrieval models with various language combinations. Our experiments reveal that the language composition of training data significantly influences IR performance and exhibits important inter-lingual correlations: training on specific language pairs improves CLIR performance while degrading Mono-IR performance. We demonstrate that simple weight-averaged model merging can effectively mitigate this trade-off, achieving strong CLIR results while preserving Mono-IR capabilities. Our findings highlight the effects of the linguistic configuration of training data on both CLIR and Mono-IR, and present model merging as a viable strategy to optimize performance across these tasks.
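The weight-averaged merging mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each model is represented as a dictionary mapping parameter names to flat lists of floats, and the names `merge_weights`, `clir_model`, and `mono_model` are hypothetical. The abstract does not state the mixing coefficients, so uniform averaging is assumed by default.

```python
from typing import Dict, List, Optional

# Assumption: a "model" is a dict from parameter name to a flat list of floats.
StateDict = Dict[str, List[float]]


def merge_weights(
    models: List[StateDict],
    coeffs: Optional[List[float]] = None,
) -> StateDict:
    """Return the element-wise weighted average of the given parameter sets."""
    if not models:
        raise ValueError("at least one model is required")
    if coeffs is None:
        # Uniform averaging is an assumption, not a value taken from the paper.
        coeffs = [1.0] * len(models)
    if len(coeffs) != len(models):
        raise ValueError("one coefficient per model is required")
    total = sum(coeffs)
    if total == 0:
        raise ValueError("coefficients must not sum to zero")
    # Normalize so the coefficients sum to 1.
    weights = [c / total for c in coeffs]

    names = models[0].keys()
    merged: StateDict = {}
    for name in names:
        length = len(models[0][name])
        for m in models:
            # All models must share the same architecture (names and shapes).
            if m.keys() != names or len(m[name]) != length:
                raise ValueError(f"parameter mismatch for {name!r}")
        merged[name] = [
            sum(w * m[name][i] for w, m in zip(weights, models))
            for i in range(length)
        ]
    return merged


# Hypothetical toy parameters for a CLIR-tuned and a Mono-IR-tuned model.
clir_model = {"encoder.weight": [0.0, 2.0], "encoder.bias": [1.0]}
mono_model = {"encoder.weight": [2.0, 4.0], "encoder.bias": [3.0]}
merged_model = merge_weights([clir_model, mono_model])
```

With real checkpoints, the same averaging would be applied tensor by tensor over models that share an architecture; non-uniform coefficients can be passed through `coeffs` to favor one task over the other.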

2018

Because Korean is a highly agglutinative, character-rich language, previous work on Korean morphological analysis typically uses sub-character features known as graphemes or otherwise relies on comprehensive prior linguistic knowledge (i.e., a dictionary of known morphological transformation forms, or actions). These models were built on the assumption that character-level, dictionary-less morphological analysis is intractable due to the number of actions required. In this study, we present a multi-stage action-based model that can perform morphological transformation and part-of-speech tagging using arbitrary units of input, and we apply it to character-level Korean morphological analysis. Among models that do not employ prior linguistic knowledge, our proposed data-driven Bi-LSTM model achieves state-of-the-art word-level and sentence-level tagging accuracy on the Sejong Korean corpus.