Iñaki Lacunza
Also published as: Iñaki Lacunza Castilla
2026
ACAData: Parallel Dataset of Academic Data for Machine Translation
Iñaki Lacunza | Javier Garcia Gilabert | Francesca De Luca Fornaciari | Javier Aula-Blasco | Aitor Gonzalez-Agirre | Maite Melero | Marta Villegas
Proceedings of the Fifteenth Language Resources and Evaluation Conference
We present ACAData, a high-quality parallel dataset for academic translation that consists of two subsets: ACAD-Train, which contains approximately 1.5 million human-generated paragraph pairs across 12 languages, and ACAD-Bench, a curated evaluation set of almost 6,000 translations covering 12 directions. To validate its usefulness, we fine-tune two Large Language Models (LLMs) on ACAD-Train and benchmark them on ACAD-Bench against specialized machine-translation systems, general-purpose, open-weight LLMs, and several large-scale proprietary models. Experimental results demonstrate that fine-tuning on ACAD-Train improves academic translation quality by +6.1 and +12.4 d-BLEU points on average for 7B and 2B models, respectively, while also improving long-context translation in a general domain by up to 24.9% when translating out of English. The top-performing fine-tuned model surpasses the best proprietary and open-weight models in the academic translation domain. By releasing ACAD-Train, ACAD-Bench, and the fine-tuned models, we provide the community with a valuable resource to advance research in academic-domain and long-context translation.
2025
Breaking Language Barriers in Visual Language Models via Multilingual Textual Regularization
Iñigo Pikabea | Iñaki Lacunza | Oriol Pareras Velasco | Carlos Escolano | Aitor Gonzalez-Agirre | Javier Hernando | Marta Villegas
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Rapid advancements in Visual Language Models (VLMs) have transformed multimodal understanding, but these models are often constrained to generating English responses regardless of the input language. This phenomenon has been termed Image-induced Fidelity Loss (IFL) and stems from limited multimodal multilingual training data. To address this, we propose a continuous multilingual integration strategy that injects text-only multilingual data during visual instruction tuning, preserving the language model’s original multilingual capabilities. Extensive evaluations demonstrate that our approach significantly improves linguistic fidelity across languages without degrading visual performance. We also explore model merging, which improves language fidelity but comes at the cost of visual performance. In contrast, our core method achieves robust multilingual alignment without trade-offs, offering a scalable and effective path to mitigating IFL for global VLM adoption.
2024
Community OSCAR: A Community Effort for Multilingual Web Data
Manuel Brack | Malte Ostendorff | Pedro Ortiz Suarez | José Javier Saiz | Iñaki Lacunza Castilla | Jorge Palomar-Giner | Alexander Shvets | Patrick Schramowski | Georg Rehm | Marta Villegas | Kristian Kersting
Proceedings of the Fourth Workshop on Multilingual Representation Learning (MRL 2024)
The development of large language models (LLMs) relies heavily on extensive, high-quality datasets. Publicly available datasets focus predominantly on English, leaving other language communities behind. To address this issue, we introduce Community OSCAR, a multilingual dataset initiative designed to bridge the gap between English and non-English data availability. Through a collective effort, Community OSCAR covers over 150 languages with 45 billion documents, totaling over 345 TiB of data. Initial results indicate that Community OSCAR provides valuable raw data for training LLMs and enhancing the performance of multilingual models. This work aims to contribute to ongoing advancements in multilingual NLP and to support a more inclusive AI ecosystem by making high-quality, multilingual data more accessible to those working with low-resource languages.