2024
A New Massive Multilingual Dataset for High-Performance Language Technologies
Ona de Gibert | Graeme Nail | Nikolay Arefyev | Marta Bañón | Jelmer van der Linde | Shaoxiong Ji | Jaume Zaragoza-Bernabeu | Mikko Aulamo | Gema Ramírez-Sánchez | Andrey Kutuzov | Sampo Pyysalo | Stephan Oepen | Jörg Tiedemann
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performance computing. Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of ≈ 5.6 trillion word tokens de-duplicated on the document level. Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens. The HPLT language resources are one of the largest open text corpora ever released, providing a great resource for language modeling and machine translation training. We publicly release the corpora, the software, and the tools used in this work.
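The corpus statistics above hinge on document-level de-duplication. As an illustration only, the short Python sketch below shows one common way to de-duplicate documents by hashing lightly normalised text; it is not a description of the actual HPLT processing pipeline, and the normalisation rules here are assumptions.

import hashlib
import unicodedata

def doc_key(text: str) -> str:
    """Return a hash key for a document after light normalisation (assumed rules)."""
    norm = unicodedata.normalize("NFC", text)
    norm = " ".join(norm.split()).lower()  # collapse whitespace, lowercase
    return hashlib.sha256(norm.encode("utf-8")).hexdigest()

def deduplicate(documents):
    """Yield each document the first time its hash key is seen."""
    seen = set()
    for doc in documents:
        key = doc_key(doc)
        if key not in seen:
            seen.add(key)
            yield doc

docs = ["Hello   world.", "hello world.", "A different document."]
print(list(deduplicate(docs)))  # the near-identical second document is dropped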
2023
HPLT: High Performance Language Technologies
Mikko Aulamo | Nikolay Bogoychev | Shaoxiong Ji | Graeme Nail | Gema Ramírez-Sánchez | Jörg Tiedemann | Jelmer van der Linde | Jaume Zaragoza
Proceedings of the 24th Annual Conference of the European Association for Machine Translation
We describe the High Performance Language Technologies project (HPLT), a 3-year EU-funded project started in September 2022. HPLT will build a space combining petabytes of natural language data with large-scale model training. It will derive monolingual and bilingual datasets from the Internet Archive and CommonCrawl and build efficient and solid machine translation (MT) as well as large language models (LLMs). HPLT aims at providing free, sustainable and reusable datasets, models and workflows at scale using high-performance computing (HPC).
2022
Findings of the WMT 2022 Shared Task on Efficient Translation
Kenneth Heafield | Biao Zhang | Graeme Nail | Jelmer Van Der Linde | Nikolay Bogoychev
Proceedings of the Seventh Conference on Machine Translation (WMT)
The machine translation efficiency task challenges participants to make their systems faster and smaller with minimal impact on translation quality. How much quality to sacrifice for efficiency depends upon the application, so participants were encouraged to make multiple submissions covering the space of trade-offs. In total, there were 76 submissions from 5 teams. The task covers GPU, single-core CPU, and multi-core CPU hardware tracks as well as batched throughput or single-sentence latency conditions. Submissions showed hundreds of millions of words can be translated for a dollar, average latency is 3.5–25 ms, and models fit in 7.5–900 MB.
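To make the cost claim concrete, the sketch below redoes the words-per-dollar arithmetic with placeholder numbers; the throughput and hourly price are purely hypothetical assumptions, not values reported by the shared task.

# Back-of-the-envelope check of the "hundreds of millions of words per dollar" figure.
words_per_second = 50_000   # assumed sustained system throughput (hypothetical)
price_per_hour = 0.50       # assumed hardware cost in USD per hour (hypothetical)

words_per_dollar = words_per_second * 3600 / price_per_hour
print(f"{words_per_dollar:,.0f} words per dollar")  # 360,000,000 with these assumptions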
Edinburgh’s Submission to the WMT 2022 Efficiency Task
Nikolay Bogoychev | Maximiliana Behnke | Jelmer Van Der Linde | Graeme Nail | Kenneth Heafield | Biao Zhang | Sidharth Kashyap
Proceedings of the Seventh Conference on Machine Translation (WMT)
We participated in all tracks of the WMT 2022 efficient machine translation task: single-core CPU, multi-core CPU, and GPU hardware with throughput and latency conditions. Our submissions explore several efficiency strategies: knowledge distillation, a simpler simple recurrent unit (SSRU) decoder with one or two layers, lexical shortlisting, a deep encoder with a shallow decoder, pruning, and a bidirectional decoder. For the CPU track, we used quantized 8-bit models. For the GPU track, we used FP16 quantisation. We explored various pruning strategies and combinations of one or more of the above methods.
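Since quantized 8-bit models are central to the CPU submissions, the following sketch illustrates generic symmetric int8 weight quantisation with a single per-tensor scale. It is a simplified NumPy illustration, not the implementation used in the submissions.

import numpy as np

def quantize_int8(w: np.ndarray):
    """Map float weights to int8 using one symmetric per-tensor scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the int8 representation."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, scale)).max())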
2021
Efficient Machine Translation with Model Pruning and Quantization
Maximiliana Behnke | Nikolay Bogoychev | Alham Fikri Aji | Kenneth Heafield | Graeme Nail | Qianqian Zhu | Svetlana Tchistiakova | Jelmer van der Linde | Pinzhen Chen | Sidharth Kashyap | Roman Grundkiewicz
Proceedings of the Sixth Conference on Machine Translation
We participated in all tracks of the WMT 2021 efficient machine translation task: single-core CPU, multi-core CPU, and GPU hardware with throughput and latency conditions. Our submissions combine several efficiency strategies: knowledge distillation, a simpler simple recurrent unit (SSRU) decoder with one or two layers, lexical shortlists, smaller numerical formats, and pruning. For the CPU track, we used quantized 8-bit models. For the GPU track, we experimented with FP16 and 8-bit integers in tensorcores. Some of our submissions optimize for size via 4-bit log quantization and by omitting the lexical shortlist. We extended pruning to more parts of the network, emphasizing component- and block-level pruning, which actually improves speed, unlike coefficient-wise pruning.
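The 4-bit log quantization mentioned above can be pictured as snapping each weight to a signed power of two (one sign bit plus eight magnitude levels). The sketch below shows this idea in NumPy; the codebook construction is an assumption and may differ from the scheme used in the submissions.

import numpy as np

def log_quantize_4bit(w: np.ndarray, num_levels: int = 8):
    """Snap each weight to the nearest signed power-of-two magnitude (assumed codebook)."""
    max_exp = np.floor(np.log2(np.abs(w).max()))
    exponents = max_exp - np.arange(num_levels)        # descending powers of two
    codebook = 2.0 ** exponents
    mags = np.abs(w)[..., None]
    idx = np.argmin(np.abs(mags - codebook), axis=-1)  # nearest magnitude level
    return np.sign(w) * codebook[idx]

w = np.random.randn(4, 4).astype(np.float32)
wq = log_quantize_4bit(w)
print("mean abs error:", np.abs(w - wq).mean())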