2025
An Expanded Massive Multilingual Dataset for High-Performance Language Technologies (HPLT)
Laurie Burchell | Ona de Gibert | Nikolay Arefyev | Mikko Aulamo | Marta Bañón | Pinzhen Chen | Mariia Fedorova | Liane Guillou | Barry Haddow | Jan Hajič | Jindřich Helcl | Erik Henriksson | Mateusz Klimaszewski | Ville Komulainen | Andrey Kutuzov | Joona Kytöniemi | Veronika Laippala | Petter Mæhlum | Bhavitvya Malik | Farrokh Mehryary | Vladislav Mikhailov | Nikita Moghe | Amanda Myntti | Dayyán O’Brien | Stephan Oepen | Proyag Pal | Jousia Piha | Sampo Pyysalo | Gema Ramírez-Sánchez | David Samuel | Pavel Stepachev | Jörg Tiedemann | Dušan Variš | Tereza Vojtěchová | Jaume Zaragoza-Bernabeu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual corpora, both monolingual and parallel, extending the prior work of the HPLT project. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value.
HPLT’s Second Data Release
Nikolay Arefyev | Mikko Aulamo | Marta Bañón | Laurie Burchell | Pinzhen Chen | Mariia Fedorova | Ona de Gibert | Liane Guillou | Barry Haddow | Jan Hajič | Jindřich Helcl | Erik Henriksson | Andrey Kutuzov | Veronika Laippala | Bhavitvya Malik | Farrokh Mehryary | Vladislav Mikhailov | Amanda Myntti | Dayyán O’Brien | Stephan Oepen | Sampo Pyysalo | Gema Ramírez-Sánchez | David Samuel | Pavel Stepachev | Jörg Tiedemann | Dušan Variš | Jaume Zaragoza-Bernabeu
Proceedings of Machine Translation Summit XX: Volume 2
We describe the progress of the High Performance Language Technologies (HPLT) project, a 3-year EU-funded project that started in September 2022. We focus on the latest results concerning the release of free text datasets derived from web crawls, one of the central objectives of the project. The second release used a revised processing pipeline and an enlarged set of input crawls. From 4.5 petabytes of web crawls we extracted 7.6T tokens of monolingual text in 193 languages, plus 380 million parallel sentences in 51 language pairs. We also release MultiHPLT, a cross-combination of the parallel data that produces 1,275 language pairs, and we additionally release the documents containing the parallel sentences in order to enable research in document-level MT. We report changes to the pipeline, together with analysis and machine-translation-based evaluation results for the second parallel data release. All datasets are released under a permissive CC0 licence.
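The MultiHPLT figure is consistent with pivoting the 51 English-centric pairs through English: cross-combining the 51 non-English languages presumably yields one pair per unordered two-language combination, i.e. C(51,2) = 51·50/2 = 1,275. A quick check of this reading (an assumption on our part, not stated in the abstract):

from math import comb

# 51 languages are each paired with English; cross-combining them gives
# one pair per unordered combination of two non-English languages.
assert comb(51, 2) == 1275  # matches the reported MultiHPLT count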
Mind the Gap: Diverse NMT Models for Resource-Constrained Environments
Ona de Gibert | Dayyán O’Brien | Dušan Variš | Jörg Tiedemann
Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)
We present fast Neural Machine Translation models for 17 diverse languages, developed using Sequence-level Knowledge Distillation. Our selected languages span multiple language families and scripts, including low-resource languages. The distilled models achieve comparable performance while being 10x faster than transformer-base and 35x faster than transformer-big architectures. Our experiments reveal that teacher model quality and capacity, as well as the language script, strongly influence distillation success. We also explore the effectiveness of multilingual students. We publicly release our code and models in our GitHub repository: anonymised.
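A minimal sketch of the core technique named above, sequence-level knowledge distillation (Kim and Rush, 2016): the student is trained on the teacher's own translations rather than the original references. The function names are illustrative assumptions and do not come from the paper's released code.

def build_distilled_corpus(sources, teacher_translate):
    # Sequence-level KD: replace each reference translation with the
    # teacher's output, giving the small student a simpler, more
    # deterministic target distribution to imitate.
    #
    # sources: iterable of source-language sentences
    # teacher_translate: callable running beam search with the teacher model
    return [(src, teacher_translate(src)) for src in sources]

# Illustrative usage with a dummy "teacher" that merely echoes its input:
corpus = build_distilled_corpus(["Hei maailma"], lambda s: s)

The distilled (source, teacher output) pairs then serve as the training data for the small, fast student model.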
DocHPLT: A Massively Multilingual Document-Level Translation Dataset
Dayyán O’Brien | Bhavitvya Malik | Ona de Gibert | Pinzhen Chen | Barry Haddow | Jörg Tiedemann
Proceedings of the Tenth Conference on Machine Translation
Existing document-level machine translation resources are only available for a handful of languages, mostly high-resourced ones. To facilitate the training and evaluation of document-level translation and, more broadly, long-context modeling for global communities, we create DocHPLT, the largest publicly available document-level translation dataset to date. It contains 124 million aligned document pairs across 50 languages paired with English, comprising 4.26 billion sentences. By adding pivoted alignments, practitioners can obtain 2,500 additional language pairs not involving English. Unlike previous reconstruction-based approaches that piece together documents from sentence-level data, we modify an existing web extraction pipeline to preserve complete document integrity from the source, retaining all content, including unaligned portions. After our preliminary experiments identify the optimal training context strategy for document-level translation, we demonstrate that LLMs fine-tuned on DocHPLT substantially outperform off-the-shelf instruction-tuned baselines, with particularly dramatic improvements for under-resourced languages. We open-source the dataset under a permissive license, providing essential infrastructure for advancing multilingual document-level translation.
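A hypothetical sketch of the pivoting idea mentioned above: two document pairs that share the same English side can be joined into a pair that does not involve English. The data layout and field names here are illustrative assumptions, not the released schema.

def pivot_pairs(en_to_x, en_to_y):
    # en_to_x, en_to_y: iterables of (english_doc_id, foreign_doc) tuples,
    # each foreign document aligned to the English document with that id.
    # Any English document aligned to a document in both X and Y yields
    # one pivoted X-Y pair.
    x_by_en = dict(en_to_x)
    return [(x_by_en[en_id], doc_y)
            for en_id, doc_y in en_to_y
            if en_id in x_by_en]

# Illustrative usage: one shared English document ("e1") yields one pair.
pairs = pivot_pairs([("e1", "doc_fi")], [("e1", "doc_sw"), ("e2", "doc_sw2")])
assert pairs == [("doc_fi", "doc_sw")]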