Improving Korean-English Cross-Lingual Retrieval: A Data-Centric Study of Language Composition and Model Merging

Youngjoon Jang, Junyoung Son, Taemin Lee, Seongtae Hong, Hyeonseok Moon, Seungyoon Lee, Andrew Matteson, Heuiseok Lim


Abstract
With the increasing use of multilingual text, Cross-Lingual Information Retrieval (CLIR) has become a crucial research area. However, the impact of training data composition on CLIR and Mono-Lingual Information Retrieval (Mono-IR) performance remains underexplored. To investigate this data-centric aspect, we construct linguistically parallel Korean-English datasets and train multilingual retrieval models with various language combinations. Our experiments reveal that the language composition of the training data significantly influences IR performance and exhibits important inter-lingual correlations: using specific language pairs improves CLIR performance but degrades Mono-IR performance. We show that simple weight-averaged model merging can effectively mitigate this trade-off, achieving strong CLIR results while preserving Mono-IR capabilities. Our findings highlight the effects of the linguistic configuration of training data on both CLIR and Mono-IR, and present model merging as a viable strategy for optimizing performance across these tasks.
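The abstract does not spell out the merging procedure, so the following is only a minimal sketch of what weight-averaged merging could look like. It assumes two checkpoints of the same architecture, represented here as plain dictionaries mapping parameter names to flat lists of floats. The function name `average_weights`, the `coefficients` argument, and the toy parameter name `encoder.weight` are hypothetical and do not come from the paper.

```python
from typing import Dict, List, Optional, Sequence

# Hypothetical representation: parameter name -> flat list of weights.
StateDict = Dict[str, List[float]]


def average_weights(
    models: Sequence[StateDict],
    coefficients: Optional[Sequence[float]] = None,
) -> StateDict:
    """Merge same-architecture checkpoints by (weighted) parameter averaging."""
    if not models:
        raise ValueError("need at least one model to merge")
    if coefficients is None:
        # Default to a uniform average over all checkpoints.
        coefficients = [1.0] * len(models)
    if len(coefficients) != len(models):
        raise ValueError("one coefficient per model is required")
    total = sum(coefficients)
    if total == 0:
        raise ValueError("coefficients must not sum to zero")

    reference = models[0]
    merged: StateDict = {}
    for name, ref_values in reference.items():
        for other in models[1:]:
            # Averaging only makes sense when every parameter lines up.
            if name not in other or len(other[name]) != len(ref_values):
                raise ValueError(f"parameter mismatch for {name!r}")
        merged[name] = [
            sum(c * m[name][i] for c, m in zip(coefficients, models)) / total
            for i in range(len(ref_values))
        ]
    for other in models[1:]:
        if set(other) != set(reference):
            raise ValueError("models do not share the same parameter names")
    return merged


# Toy example: one checkpoint tuned for CLIR, one for Mono-IR.
clir_model = {"encoder.weight": [1.0, 3.0]}
mono_model = {"encoder.weight": [3.0, 5.0]}
merged_model = average_weights([clir_model, mono_model])
```

In practice the same averaging would be applied tensor by tensor to real checkpoints. The coefficients control how far the merged model leans toward the CLIR-oriented or the Mono-IR-oriented checkpoint.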
Anthology ID:
2026.mellm-1.3
Volume:
Proceedings of the 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM 2026)
Month:
July
Year:
2026
Address:
San Diego, United States
Editors:
Kaiyu Huang, Fengran Mo, Pinzhen Chen, Meng Jiang
Venues:
MeLLM | WS
Publisher:
Association for Computational Linguistics
Pages:
30–43
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.mellm-1.3/
Cite (ACL):
Youngjoon Jang, Junyoung Son, Taemin Lee, Seongtae Hong, Hyeonseok Moon, Seungyoon Lee, Andrew Matteson, and Heuiseok Lim. 2026. Improving Korean-English Cross-Lingual Retrieval: A Data-Centric Study of Language Composition and Model Merging. In Proceedings of the 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM 2026), pages 30–43, San Diego, United States. Association for Computational Linguistics.
Cite (Informal):
Improving Korean-English Cross-Lingual Retrieval: A Data-Centric Study of Language Composition and Model Merging (Jang et al., MeLLM 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.mellm-1.3.pdf