Pavel Stepachev
2026
CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data
Pedro Ortiz Suarez | Laurie Burchell | Catherine Arnett | Rafael Mosquera | Sara Hincapié Monsalve | Thom Vaughan | Damian Stewart | Malte Ostendorff | Idris Abdulmumin | Vukosi Marivate | Shamsuddeen Hassan Muhammad | Atnafu Lambebo Tonja | Hend Al-Khalifa | Nadia Ghezaiel Hammouda | Verrah Akinyi Otiende | Tack Hwa Wong | Jakhongir Saydaliev | Melika Nobakhtian | Muhammad Ravi Shulthan Habibi | Chalamalasetti Kranti | Carol Muchemi | Khang Nguyen | Faisal Muhammad Adam | Luis Frentzen Salim | Reem Alqifari | Cynthia Jayne Amol | Joseph Marvin Imperial | Ilker Kesen | Ahmad Mustafid | Pavel Stepachev | Leshem Choshen | David Anugraha | Hamada Nayel | Seid Muhie Yimam | Vallerie Alexandra Putra | My Chiffon Nguyen | Azmine Toushik Wasi | Gouthami Vadithya | Rob Van Der Goot | Lanwenn ar C'horr | Karan Dua | Andrew Yates | Mithil Bangera | Yeshil Bangera | Hitesh Laxmichand Patel | Shu Okabe | Fenal Ashokbhai Ilasariya | Dmitry Gaynullin | Genta Indra Winata | Yiyuan Li | Juan Pablo Martínez | Amit Agarwal | Ikhlasul Akmal Hanif | Raia Abu Ahmad | Esther Adenuga | Filbert Aurelian Tjiaranata | Weerayut Buaphet | Michael Anugraha | Sowmya Vajjala | Benjamin L Rice | Azril Hafizi Amirudin | Jesujoba Oluwadara Alabi | Srikant Panda | Yassine Toughrai | Bruhan Kyomuhendo | Daniel Ruffinelli | Akshata | Manuel Goulão | Ej Zhou | Ingrid Gabriela Franco Ramirez | Cristina Aggazzotti | Konstantin Dobler | Jun Kevin | Quentin Pagès | Nicholas Andrews | Nuhu Ibrahim | Mattes Ruckdeschel | Amr Keleg | Mike Zhang | Casper Rufaro Muziri | Saron Samuel | Sotaro Takeshita | Kun Kerdthaisong | Luca Foppiano | Rasul Dent | Tommaso Green | Ahmad Mustapha Wali | Kamohelo Makaaka | Vicky Feliren | Inshirah Idris | Hande Celikkanat | Abdulhamid Abubakar | Jean Maillard | Benoît Sagot | Thibault Clérice | Kenton Murray | Sarah K. K. Luger
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Language identification (LID) is a fundamental step in curating multilingual corpora. However, LID models still perform poorly for many languages, especially on the noisy and heterogeneous web data often used to train multilingual language models. In this paper, we introduce CommonLID, a community-driven, human-annotated LID benchmark for the web domain, covering 109 languages. Many of the included languages have been previously under-served, making CommonLID a key resource for developing more representative high-quality text corpora. We show CommonLID’s value by using it, alongside five other common evaluation sets, to test eight popular LID models. We analyse our results to situate our contribution and to provide an overview of the state of the art. In particular, we highlight that existing evaluations overestimate LID accuracy for many languages in the web domain. We make CommonLID and the code used to create it available under an open, permissive license.
2025
An Expanded Massive Multilingual Dataset for High-Performance Language Technologies (HPLT)
Laurie Burchell | Ona de Gibert | Nikolay Arefyev | Mikko Aulamo | Marta Bañón | Pinzhen Chen | Mariia Fedorova | Liane Guillou | Barry Haddow | Jan Hajič | Jindřich Helcl | Erik Henriksson | Mateusz Klimaszewski | Ville Komulainen | Andrey Kutuzov | Joona Kytöniemi | Veronika Laippala | Petter Mæhlum | Bhavitvya Malik | Farrokh Mehryary | Vladislav Mikhailov | Nikita Moghe | Amanda Myntti | Dayyán O’Brien | Stephan Oepen | Proyag Pal | Jousia Piha | Sampo Pyysalo | Gema Ramírez-Sánchez | David Samuel | Pavel Stepachev | Jörg Tiedemann | Dušan Variš | Tereza Vojtěchová | Jaume Zaragoza-Bernabeu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora, extending prior work of the HPLT project. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value.
HPLT’s Second Data Release
Nikolay Arefyev | Mikko Aulamo | Marta Bañón | Laurie Burchell | Pinzhen Chen | Mariia Fedorova | Ona de Gibert | Liane Guillou | Barry Haddow | Jan Hajič | Jindřich Helcl | Erik Henriksson | Andrey Kutuzov | Veronika Laippala | Bhavitvya Malik | Farrokh Mehryary | Vladislav Mikhailov | Amanda Myntti | Dayyán O’Brien | Stephan Oepen | Sampo Pyysalo | Gema Ramírez-Sánchez | David Samuel | Pavel Stepachev | Jörg Tiedemann | Dušan Variš | Jaume Zaragoza-Bernabeu
Proceedings of Machine Translation Summit XX: Volume 2
We describe the progress of the High Performance Language Technologies (HPLT) project, a 3-year EU-funded project that started in September 2022. We focus on the up-to-date results on the release of free text datasets derived from web crawls, one of the central objectives of the project. The second release used a revised processing pipeline, and an enlarged set of input crawls. From 4.5 petabytes of web crawls we extracted 7.6T tokens of monolingual text in 193 languages, plus 380 million parallel sentences in 51 language pairs. We also release MultiHPLT, a cross-combination of the parallel data, which produces 1,275 pairs, as well as releasing the containing documents for all parallel sentences in order to enable research in document-level MT. We report changes in the pipeline, analysis and evaluation results for the second parallel data release based on machine translation systems. All datasets are released under a permissive CC0 licence.
2024
Exploring Very Low-Resource Translation with LLMs: The University of Edinburgh’s Submission to AmericasNLP 2024 Translation Task
Vivek Iyer | Bhavitvya Malik | Wenhao Zhu | Pavel Stepachev | Pinzhen Chen | Barry Haddow | Alexandra Birch
Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024)
This paper describes the University of Edinburgh’s submission to the AmericasNLP 2024 shared task on the translation of Spanish into 11 indigenous American languages. We explore the ability of multilingual Large Language Models (LLMs) to model low-resource languages by continued pre-training with LoRA, and conduct instruction fine-tuning using a variety of datasets, demonstrating that this improves LLM performance. Furthermore, we demonstrate the efficacy of checkpoint averaging alongside decoding techniques like beam search and sampling, resulting in further improvements. We participate in all 11 translation directions.
HPLT’s First Release of Data and Models
Nikolay Arefyev | Mikko Aulamo | Pinzhen Chen | Ona de Gibert | Barry Haddow | Jindřich Helcl | Bhavitvya Malik | Gema Ramírez-Sánchez | Pavel Stepachev | Jörg Tiedemann | Dušan Variš | Jaume Zaragoza
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 2)
The High Performance Language Technologies (HPLT) project is a 3-year EU-funded project that started in September 2022. It aims to deliver free, sustainable, and reusable datasets, models, and workflows at scale using high-performance computing. We describe the first results of the project. The data release includes monolingual data in 75 languages at 5.6T tokens and parallel data in 18 language pairs at 96M pairs, derived from 1.8 petabytes of web crawls. Building upon automated and transparent pipelines, the first machine translation (MT) models as well as large language models (LLMs) have been trained and released. Multiple data processing tools and pipelines have also been made public.
Quality or Quantity? On Data Scale and Diversity in Adapting Large Language Models for Low-Resource Translation
Vivek Iyer | Bhavitvya Malik | Pavel Stepachev | Pinzhen Chen | Barry Haddow | Alexandra Birch
Proceedings of the Ninth Conference on Machine Translation
Despite the recent popularity of Large Language Models (LLMs) in Machine Translation (MT), their performance in low-resource languages (LRLs) still lags significantly behind Neural Machine Translation (NMT) models. In this work, we explore what it would take to adapt LLMs for the low-resource setting. Particularly, we re-examine the role of two factors: a) the importance and application of parallel data, and b) diversity in Supervised Fine-Tuning (SFT). Recently, parallel data has seen reduced use in adapting LLMs for MT, while data diversity has been embraced to promote transfer across languages and tasks. However, for low-resource LLM-MT, we show that the opposite is true for both considerations: a) parallel data is critical during both pre-training and SFT; b) diversity tends to cause interference instead of transfer. Our experiments with three LLMs across two low-resourced language groups—Indigenous American and North-East Indian—reveal consistent trends, underscoring the generalizability of our findings. We believe these insights will be valuable for scaling to massively multilingual LLM-MT models that can effectively serve LRLs.
2021
Preserving high MT quality for content with inline tags
Konstantin Savenkov | Grigory Sapunov | Pavel Stepachev
Proceedings of Machine Translation Summit XVIII: Users and Providers Track
Attendees will learn how we use machine translation to provide targeted, high MT quality for content with inline tags. We offer a new and innovative approach to inserting tags into the translated text in a way that reliably preserves their quality. This process can achieve better MT quality and lower costs, as it is MT-independent, and can be used for all languages, MT engines, and use cases.
2018
Multi-source synthetic treebank creation for improved cross-lingual dependency parsing
Francis Tyers | Mariya Sheyanova | Aleksandra Martynova | Pavel Stepachev | Konstantin Vinogorodskiy
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)
This paper describes a method of creating synthetic treebanks for cross-lingual dependency parsing using a combination of machine translation (including pivot translation), annotation projection and the spanning tree algorithm. Sentences are first automatically translated from a lesser-resourced language to a number of related highly-resourced languages, parsed and then the annotations are projected back to the lesser-resourced language, leading to multiple trees for each sentence from the lesser-resourced language. The final treebank is created by merging the possible trees into a graph and running the spanning tree algorithm to vote for the best tree for each sentence. We present experiments aimed at parsing Faroese using a combination of Danish, Swedish and Norwegian. In a similar experimental setup to the CoNLL 2018 shared task on dependency parsing we report state-of-the-art results on dependency parsing for Faroese using an off-the-shelf parser.
Co-authors
- Pinzhen Chen 5
- Barry Haddow 5
- Bhavitvya Malik 5
- Nikolay Arefyev 3
- Mikko Aulamo 3
- Laurie Burchell 3
- Jindřich Helcl 3
- Gema Ramírez-Sánchez 3
- Jörg Tiedemann 3
- Dušan Variš 3
- Ona de Gibert 3
- Marta Bañón 2
- Alexandra Birch 2
- Mariia Fedorova 2
- Liane Guillou 2
- Jan Hajič 2
- Erik Henriksson 2
- Vivek Iyer 2
- Andrey Kutuzov 2
- Veronika Laippala 2
- Farrokh Mehryary 2
- Vladislav Mikhailov 2
- Amanda Myntti 2
- Stephan Oepen 2
- Dayyán O’Brien 2
- Sampo Pyysalo 2
- David Samuel 2
- Jaume Zaragoza-Bernabeu 2
- Idris Abdulmumin 1
- Abdulhamid Abubakar 1
- Faisal Muhammad Adam 1
- Esther Adenuga 1
- Amit Agarwal 1
- Cristina Aggazzotti 1
- Raia Abu Ahmad 1
- Akshata 1
- Hend Al-Khalifa 1
- Jesujoba Alabi 1
- Reem Alqifari 1
- Azril Hafizi Amirudin 1
- Cynthia Jayne Amol 1
- Nicholas Andrews 1
- David Anugraha 1
- Michael Anugraha 1
- Catherine Arnett 1
- Mithil Bangera 1
- Yeshil Bangera 1
- Weerayut Buaphet 1
- Lanwenn ar C'horr 1
- Hande Celikkanat 1
- Kranti Chalamalasetti 1
- Leshem Choshen 1
- Thibault Clérice 1
- Rasul Dent 1
- Konstantin Dobler 1
- Karan Dua 1
- Vicky Feliren 1
- Luca Foppiano 1
- Dmitry Gaynullin 1
- Manuel Goulão 1
- Tommaso Green 1
- Muhammad Ravi Shulthan Habibi 1
- Nadia Ghezaiel Hammouda 1
- Ikhlasul Akmal Hanif 1
- Nuhu Ibrahim 1
- Inshirah Idris 1
- Fenal Ashokbhai Ilasariya 1
- Joseph Marvin Imperial 1
- Amr Keleg 1
- Kun Kerdthaisong 1
- Ilker Kesen 1
- Jun Kevin 1
- Mateusz Klimaszewski 1
- Ville Komulainen 1
- Bruhan Kyomuhendo 1
- Joona Kytöniemi 1
- Yiyuan Li 1
- Sarah K. K. Luger 1
- Jean Maillard 1
- Kamohelo Makaaka 1
- Vukosi Marivate 1
- Aleksandra Martynova 1
- Juan Pablo Martínez 1
- Nikita Moghe 1
- Sara Hincapié Monsalve 1
- Rafael Mosquera 1
- Carol Muchemi 1
- Shamsuddeen Hassan Muhammad 1
- Kenton Murray 1
- Ahmad Mustafid 1
- Casper Rufaro Muziri 1
- Petter Mæhlum 1
- Hamada Nayel 1
- Khang Nguyen 1
- My Chiffon Nguyen 1
- Melika Nobakhtian 1
- Shu Okabe 1
- Pedro Ortiz Suarez 1
- Malte Ostendorff 1
- Verrah Akinyi Otiende 1
- Quentin Pagès 1
- Proyag Pal 1
- Srikant Panda 1
- Hitesh Laxmichand Patel 1
- Jousia Piha 1
- Vallerie Alexandra Putra 1
- Ingrid Gabriela Franco Ramirez 1
- Benjamin L Rice 1
- Mattes Ruckdeschel 1
- Daniel Ruffinelli 1
- Benoît Sagot 1
- Luis Frentzen Salim 1
- Saron Samuel 1
- Grigory Sapunov 1
- Konstantin Savenkov 1
- Jakhongir Saydaliev 1
- Mariya Sheyanova 1
- Damian Stewart 1
- Sotaro Takeshita 1
- Filbert Aurelian Tjiaranata 1
- Atnafu Lambebo Tonja 1
- Yassine Toughrai 1
- Francis Tyers 1
- Gouthami Vadithya 1
- Sowmya Vajjala 1
- Rob Van Der Goot 1
- Thom Vaughan 1
- Konstantin Vinogorodskiy 1
- Tereza Vojtěchová 1
- Ahmad Mustapha Wali 1
- Azmine Toushik Wasi 1
- Genta Indra Winata 1
- Tack Hwa Wong 1
- Andrew Yates 1
- Seid Muhie Yimam 1
- Jaume Zaragoza 1
- Mike Zhang 1
- Ej Zhou 1
- Wenhao Zhu 1