Guillermo Gabrielli
2025
From English-Centric to Effective Bilingual: LLMs with Custom Tokenizers for Underrepresented Languages
Artur Kiulian | Anton Polishko | Mykola Khandoga | Yevhen Kostiuk | Guillermo Gabrielli | Łukasz Gagała | Fadi Zaraket | Qusai Abu Obaida | Hrishikesh Garud | Wendy Wing Yee Mak | Dmytro Chaplynskyi | Selma Amor | Grigol Peradze
Proceedings of the Fourth Ukrainian Natural Language Processing Workshop (UNLP 2025)
In this paper, we propose a model-agnostic, cost-effective approach to developing bilingual base large language models (LLMs) to support English and any target language. The method includes vocabulary expansion, initialization of new embeddings, model training and evaluation. We performed our experiments with three languages, each using a non-Latin script: Ukrainian, Arabic, and Georgian. Our approach demonstrates improved language performance while reducing computational costs. It mitigates the disproportionate penalization of underrepresented languages, promoting fairness and minimizing adverse phenomena such as code-switching and broken grammar. Additionally, we introduce new metrics to evaluate language quality, revealing that vocabulary size significantly impacts the quality of generated text.
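The vocabulary expansion and embedding-initialization steps named in the abstract can be illustrated with a minimal sketch. The snippet below assumes the common recipe of adding target-language tokens to an existing tokenizer and initializing each new embedding row as the mean of the subword embeddings the token previously decomposed into; the base model name and token list are placeholders, not the paper's actual configuration.

```python
# Hypothetical sketch: vocabulary expansion with mean-of-subwords embedding
# initialization for adapting an English-centric LLM to a new script.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Llama-2-7b-hf"             # placeholder base model
new_tokens = ["привіт", "мова", "навчання"]   # placeholder target-language tokens

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Record how the original tokenizer splits each new token before expanding.
old_splits = {
    t: tokenizer(t, add_special_tokens=False)["input_ids"] for t in new_tokens
}

tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

emb = model.get_input_embeddings().weight
with torch.no_grad():
    for t in new_tokens:
        new_id = tokenizer.convert_tokens_to_ids(t)
        # Initialize the new row as the mean of its former subword embeddings,
        # so the expanded model starts close to the base model's behaviour.
        # For models with untied output embeddings, the lm_head rows would
        # need the same treatment.
        emb[new_id] = emb[old_splits[t]].mean(dim=0)
```

After this initialization, the expanded model would be trained further on target-language data, as described in the abstract.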
Benchmarking Multimodal Models for Ukrainian Language Understanding Across Academic and Cultural Domains
Yurii Paniv | Artur Kiulian | Dmytro Chaplynskyi | Mykola Khandoga | Anton Polishko | Tetiana Bas | Guillermo Gabrielli
Proceedings of the Fourth Ukrainian Natural Language Processing Workshop (UNLP 2025)
While the evaluation of multimodal English-centric models is an active area of research with numerous benchmarks, there is a profound lack of benchmarks or evaluation suites for low- and mid-resource languages. We introduce ZNO-Vision, a comprehensive multimodal Ukrainian-centric benchmark derived from the standardized university entrance examination (ZNO). The benchmark consists of over 4300 expert-crafted questions spanning 12 academic disciplines, including mathematics, physics, chemistry, and humanities. We evaluated the performance of both open-source models and API providers, finding that only a handful of models performed above baseline. Alongside the new benchmark, we performed the first evaluation study of multimodal text generation for the Ukrainian language: we measured caption generation quality on the Multi30K-UK dataset. Lastly, we tested a few models from a cultural perspective on knowledge of national cuisine. We believe our work will advance multimodal generation capabilities for the Ukrainian language and our approach could be useful for other low-resource languages.
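For concreteness, multiple-choice benchmarks such as the one described above are typically scored by exact-match accuracy over the predicted option index. The sketch below is a generic illustration of that scoring loop; the field names and answer format are hypothetical and not the actual ZNO-Vision schema.

```python
# Hypothetical sketch of multiple-choice accuracy scoring for a ZNO-style
# benchmark; Question fields and the predict() interface are illustrative.
from dataclasses import dataclass


@dataclass
class Question:
    prompt: str
    options: list[str]   # answer texts for each option
    answer_index: int    # index of the correct option


def accuracy(questions: list[Question], predict) -> float:
    """`predict` maps a Question to the index of the model's chosen option."""
    if not questions:
        return 0.0
    correct = sum(predict(q) == q.answer_index for q in questions)
    return correct / len(questions)


# Usage: a trivial baseline that always picks the first option.
qs = [Question("2 + 2 = ?", ["3", "4", "5", "6"], 1)]
print(accuracy(qs, lambda q: 0))  # 0.0
```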