Jaume Zaragoza
2026
Very Large-Scale Multilingual Resources for LLMs and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models
Stephan Oepen | Nikolay Arefyev | Mikko Aulamo | Marta Bañón | Maja Buljan | Laurie V. Burchell | Lucas Georges Gabriel Charpentier | Pinzhen Chen | Mariia Fedorova | Ona de Gibert | Barry Haddow | Jan Hajič | Jindrich Helcl | Andrey Kutuzov | Veronika Laippala | Zihao Li | Bhavitvya Malik | Vladislav Mikhailov | Amanda Myntti | Dayyán O'Brien | Lucie Polakova | Gema Ramírez-Sánchez | Janine Siewert | Pavel Stepachev | Joerg Tiedemann | Teemu Vahtola | Dusan Varis | Fedor Vitiugin | Jaume Zaragoza
Proceedings of the Fifteenth Language Resources and Evaluation Conference
We present an ongoing initiative to provide open, very large, high-quality, and richly annotated textual datasets for almost 200 languages. At 30 trillion tokens, this is likely the largest generally available multilingual collection of LLM pre-training data. These datasets are derived from web crawls from different sources and accompanied by a complete, open-source pipeline for document selection from web archives; text extraction from HTML; language identification for noisy texts; exact and near-deduplication; annotation with, among others, register labels, text quality estimates, and personally identifiable information; and final selection and filtering. We report on data quality probes through contrastive and analytical statistics, through manual inspection of samples for some 20 languages, and through end-to-end evaluation of various language model architectures trained on this data. For multilingual LLM evaluation, we provide a comprehensive collection of benchmarks for nine European languages, with special emphasis on natively created tasks, mechanisms to mitigate prompt sensitivity, and refined normalization and aggregation of scores. Additionally, we train and evaluate a family of 57 monolingual encoder–decoder models, as well as about 30 “smallish” monolingual GPT-like reference models. Besides the monolingual data and models, we also present a very large collection of parallel texts automatically mined from this data, together with a novel parallel corpus synthesized via machine translation.
2024
HPLT’s First Release of Data and Models
Nikolay Arefyev | Mikko Aulamo | Pinzhen Chen | Ona de Gibert | Barry Haddow | Jindřich Helcl | Bhavitvya Malik | Gema Ramírez-Sánchez | Pavel Stepachev | Jörg Tiedemann | Dušan Variš | Jaume Zaragoza
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 2)
The High Performance Language Technologies (HPLT) project is a 3-year EU-funded project that started in September 2022. It aims to deliver free, sustainable, and reusable datasets, models, and workflows at scale using high-performance computing. We describe the first results of the project. The data release includes monolingual data in 75 languages at 5.6T tokens and parallel data in 18 language pairs at 96M pairs, derived from 1.8 petabytes of web crawls. Building upon automated and transparent pipelines, the first machine translation (MT) models as well as large language models (LLMs) have been trained and released. Multiple data processing tools and pipelines have also been made public.
2023
HPLT: High Performance Language Technologies
Mikko Aulamo | Nikolay Bogoychev | Shaoxiong Ji | Graeme Nail | Gema Ramírez-Sánchez | Jörg Tiedemann | Jelmer van der Linde | Jaume Zaragoza
Proceedings of the 24th Annual Conference of the European Association for Machine Translation
We describe the High Performance Language Technologies project (HPLT), a 3-year EU-funded project started in September 2022. HPLT will build a space combining petabytes of natural language data with large-scale model training. It will derive monolingual and bilingual datasets from the Internet Archive and CommonCrawl and build efficient and solid machine translation (MT) as well as large language models (LLMs). HPLT aims at providing free, sustainable and reusable datasets, models and workflows at scale using high-performance computing (HPC).
2022
MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages
Marta Bañón | Miquel Esplà-Gomis | Mikel L. Forcada | Cristian García-Romero | Taja Kuzman | Nikola Ljubešić | Rik van Noord | Leopoldo Pla Sempere | Gema Ramírez-Sánchez | Peter Rupnik | Vít Suchomel | Antonio Toral | Tobias van der Werff | Jaume Zaragoza
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation
We introduce the project “MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages”, funded by the Connecting Europe Facility, which is aimed at building monolingual and parallel corpora for under-resourced European languages. The approach followed consists of crawling large amounts of textual data from carefully selected top-level domains of the Internet, and then applying a curation and enrichment pipeline. In addition to corpora, the project will release successive versions of the free/open-source web crawling and curation software used.
2020
ParaCrawl: Web-Scale Acquisition of Parallel Corpora
Marta Bañón | Pinzhen Chen | Barry Haddow | Kenneth Heafield | Hieu Hoang | Miquel Esplà-Gomis | Mikel L. Forcada | Amir Kamran | Faheem Kirefu | Philipp Koehn | Sergio Ortiz Rojas | Leopoldo Pla Sempere | Gema Ramírez-Sánchez | Elsa Sarrías | Marek Strelec | Brian Thompson | William Waites | Dion Wiggins | Jaume Zaragoza
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
We report on methods to create the largest publicly available parallel corpora by crawling the web, using open-source software. We empirically compare alternative methods and publish benchmark data sets for sentence alignment and sentence pair filtering. We also describe the parallel corpora released and evaluate their quality and their usefulness for building machine translation systems.
Co-authors
- Gema Ramírez-Sánchez 5
- Mikko Aulamo 3
- Marta Bañón 3
- Pinzhen Chen 3
- Barry Haddow 3
- Jörg Tiedemann 3
- Nikolay Arefyev 2
- Miquel Esplà-Gomis 2
- Mikel L. Forcada 2
- Jindřich Helcl 2
- Bhavitvya Malik 2
- Leopoldo Pla Sempere 2
- Pavel Stepachev 2
- Dusan Varis 2
- Ona de Gibert 2
- Nikolay Bogoychev 1
- Maja Buljan 1
- Laurie Burchell 1
- Mariia Fedorova 1
- Cristian García-Romero 1
- Lucas Georges Gabriel Charpentier 1
- Jan Hajic 1
- Kenneth Heafield 1
- Hieu Hoang 1
- Shaoxiong Ji 1
- Amir Kamran 1
- Faheem Kirefu 1
- Philipp Koehn 1
- Andrey Kutuzov 1
- Taja Kuzman 1
- Veronika Laippala 1
- Zihao Li 1
- Nikola Ljubešić 1
- Vladislav Mikhailov 1
- Amanda Myntti 1
- Graeme Nail 1
- Stephan Oepen 1
- Sergio Ortiz Rojas 1
- Dayyán O’Brien 1
- Lucie Polakova 1
- Peter Rupnik 1
- Elsa Sarrías 1
- Janine Siewert 1
- Marek Strelec 1
- Vit Suchomel 1
- Brian Thompson 1
- Antonio Toral 1
- Teemu Vahtola 1
- Jelmer Van Der Linde 1
- Fedor Vitiugin 1
- William Waites 1
- Dion Wiggins 1
- Rik van Noord 1
- Tobias van der Werff 1