Laurie Burchell
2026
CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data
Pedro Ortiz Suarez | Laurie Burchell | Catherine Arnett | Rafael Mosquera | Sara Hincapié Monsalve | Thom Vaughan | Damian Stewart | Malte Ostendorff | Idris Abdulmumin | Vukosi Marivate | Shamsuddeen Hassan Muhammad | Atnafu Lambebo Tonja | Hend Al-Khalifa | Nadia Ghezaiel Hammouda | Verrah Akinyi Otiende | Tack Hwa Wong | Jakhongir Saydaliev | Melika Nobakhtian | Muhammad Ravi Shulthan Habibi | Chalamalasetti Kranti | Carol Muchemi | Khang Nguyen | Faisal Muhammad Adam | Luis Frentzen Salim | Reem Alqifari | Cynthia Jayne Amol | Joseph Marvin Imperial | Ilker Kesen | Ahmad Mustafid | Pavel Stepachev | Leshem Choshen | David Anugraha | Hamada Nayel | Seid Muhie Yimam | Vallerie Alexandra Putra | My Chiffon Nguyen | Azmine Toushik Wasi | Gouthami Vadithya | Rob Van Der Goot | Lanwenn ar C'horr | Karan Dua | Andrew Yates | Mithil Bangera | Yeshil Bangera | Hitesh Laxmichand Patel | Shu Okabe | Fenal Ashokbhai Ilasariya | Dmitry Gaynullin | Genta Indra Winata | Yiyuan Li | Juan Pablo Martínez | Amit Agarwal | Ikhlasul Akmal Hanif | Raia Abu Ahmad | Esther Adenuga | Filbert Aurelian Tjiaranata | Weerayut Buaphet | Michael Anugraha | Sowmya Vajjala | Benjamin L Rice | Azril Hafizi Amirudin | Jesujoba Oluwadara Alabi | Srikant Panda | Yassine Toughrai | Bruhan Kyomuhendo | Daniel Ruffinelli | Akshata | Manuel Goulão | Ej Zhou | Ingrid Gabriela Franco Ramirez | Cristina Aggazzotti | Konstantin Dobler | Jun Kevin | Quentin Pagès | Nicholas Andrews | Nuhu Ibrahim | Mattes Ruckdeschel | Amr Keleg | Mike Zhang | Casper Rufaro Muziri | Saron Samuel | Sotaro Takeshita | Kun Kerdthaisong | Luca Foppiano | Rasul Dent | Tommaso Green | Ahmad Mustapha Wali | Kamohelo Makaaka | Vicky Feliren | Inshirah Idris | Hande Celikkanat | Abdulhamid Abubakar | Jean Maillard | Benoît Sagot | Thibault Clérice | Kenton Murray | Sarah K. K. Luger
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Language identification (LID) is a fundamental step in curating multilingual corpora. However, LID models still perform poorly for many languages, especially on the noisy and heterogeneous web data often used to train multilingual language models. In this paper, we introduce CommonLID, a community-driven, human-annotated LID benchmark for the web domain, covering 109 languages. Many of the included languages have been previously under-served, making CommonLID a key resource for developing more representative high-quality text corpora. We show CommonLID’s value by using it, alongside five other common evaluation sets, to test eight popular LID models. We analyse our results to situate our contribution and to provide an overview of the state of the art. In particular, we highlight that existing evaluations overestimate LID accuracy for many languages in the web domain. We make CommonLID and the code used to create it available under an open, permissive license.
2025
Findings of the WMT 2025 Shared Task of the Open Language Data Initiative
David Dale | Laurie Burchell | Jean Maillard | Idris Abdulmumin | Antonios Anastasopoulos | Isaac Caswell | Philipp Koehn
Proceedings of the Tenth Conference on Machine Translation
We present the results of the WMT 2025 shared task of the Open Language Data Initiative. Participants were invited to contribute to the massively multilingual open datasets (FLORES+, MT Seed, WMT24++) or create new such resources. We accepted 8 submissions, including 7 extensions or revisions of the existing datasets and one submission with a new parallel training dataset, SMOL.
An Expanded Massive Multilingual Dataset for High-Performance Language Technologies (HPLT)
Laurie Burchell | Ona de Gibert | Nikolay Arefyev | Mikko Aulamo | Marta Bañón | Pinzhen Chen | Mariia Fedorova | Liane Guillou | Barry Haddow | Jan Hajič | Jindřich Helcl | Erik Henriksson | Mateusz Klimaszewski | Ville Komulainen | Andrey Kutuzov | Joona Kytöniemi | Veronika Laippala | Petter Mæhlum | Bhavitvya Malik | Farrokh Mehryary | Vladislav Mikhailov | Nikita Moghe | Amanda Myntti | Dayyán O’Brien | Stephan Oepen | Proyag Pal | Jousia Piha | Sampo Pyysalo | Gema Ramírez-Sánchez | David Samuel | Pavel Stepachev | Jörg Tiedemann | Dušan Variš | Tereza Vojtěchová | Jaume Zaragoza-Bernabeu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora, extending prior work of the HPLT project. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value.
HPLT’s Second Data Release
Nikolay Arefyev | Mikko Aulamo | Marta Bañón | Laurie Burchell | Pinzhen Chen | Mariia Fedorova | Ona de Gibert | Liane Guillou | Barry Haddow | Jan Hajič | Jindřich Helcl | Erik Henriksson | Andrey Kutuzov | Veronika Laippala | Bhavitvya Malik | Farrokh Mehryary | Vladislav Mikhailov | Amanda Myntti | Dayyán O’Brien | Stephan Oepen | Sampo Pyysalo | Gema Ramírez-Sánchez | David Samuel | Pavel Stepachev | Jörg Tiedemann | Dušan Variš | Jaume Zaragoza-Bernabeu
Proceedings of Machine Translation Summit XX: Volume 2
We describe the progress of the High Performance Language Technologies (HPLT) project, a 3-year EU-funded project that started in September 2022. We focus on the up-to-date results on the release of free text datasets derived from web crawls, one of the central objectives of the project. The second release used a revised processing pipeline, and an enlarged set of input crawls. From 4.5 petabytes of web crawls we extracted 7.6T tokens of monolingual text in 193 languages, plus 380 million parallel sentences in 51 language pairs. We also release MultiHPLT, a cross-combination of the parallel data, which produces 1,275 pairs, as well as releasing the containing documents for all parallel sentences in order to enable research in document-level MT. We report changes in the pipeline, analysis and evaluation results for the second parallel data release based on machine translation systems. All datasets are released under a permissive CC0 licence.
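The figure of 1,275 pairs is consistent with taking every unordered pair among 51 languages, since 51 × 50 / 2 = 1,275. The sketch below checks that arithmetic; it assumes the 51 parallel corpora share a common pivot language, which the abstract does not state, and the language codes are placeholders rather than the actual list.

```python
from itertools import combinations
from math import comb

# Assumption: 51 languages, each originally paired with one shared pivot.
# The codes below are hypothetical placeholders, not the real language list.
languages = [f"lang{i:02d}" for i in range(51)]

# Cross-combination: every unordered pair of distinct languages.
pairs = list(combinations(languages, 2))

assert len(pairs) == comb(51, 2)
print(len(pairs))  # 1275
```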
2024
Code-Switched Language Identification is Harder Than You Think
Laurie Burchell | Alexandra Birch | Robert Thompson | Kenneth Heafield
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Code switching (CS) is a very common phenomenon in written and spoken communication, but is handled poorly by many NLP applications. Looking to the application of building CS corpora, we explore CS language identification for corpus building. We make the task more realistic by scaling it to more languages and considering models with simpler architectures for faster inference. We also reformulate the task as a sentence-level multi-label tagging problem to make it more tractable. Having defined the task, we investigate three reasonable architectures for this task and define metrics which better reflect desired performance. We present empirical evidence that no current approach is adequate, and finally provide recommendations for future work in this area.
Findings of the WMT 2024 Shared Task of the Open Language Data Initiative
Laurie Burchell | Jean Maillard | Antonios Anastasopoulos | Christian Federmann | Philipp Koehn | Skyler Wang
Proceedings of the Ninth Conference on Machine Translation
We present the results of the WMT 2024 shared task of the Open Language Data Initiative. Participants were invited to contribute to the FLORES+ and MT Seed multilingual datasets, two foundational open resources that facilitate the organic expansion of language technology’s reach. We accepted ten submissions covering 16 languages, which extended the range of languages included in the datasets and improved the quality of existing data.
2023
An Open Dataset and Model for Language Identification
Laurie Burchell | Alexandra Birch | Nikolay Bogoychev | Kenneth Heafield
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Language identification (LID) is a fundamental step in many natural language processing pipelines. However, current LID systems are far from perfect, particularly on lower-resource languages. We present a LID model which achieves a macro-average F1 score of 0.93 and a false positive rate of 0.033% across 201 languages, outperforming previous work. We achieve this by training on a curated dataset of monolingual data, which we audit manually to ensure reliability. We make both the model and the dataset available to the research community. Finally, we carry out detailed analysis into our model’s performance, both in comparison to existing open models and by language class.
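As a rough illustration of the two metrics quoted in this abstract, the sketch below computes a macro-average F1 score and a per-language false positive rate from gold and predicted labels. It is not the authors' evaluation code: the function name `macro_f1_and_fpr`, the choice to average the false positive rate over languages, and the toy labels are all assumptions made for the example.

```python
from collections import Counter

def macro_f1_and_fpr(gold, pred):
    """Return (macro-average F1, mean per-label false positive rate).

    Hypothetical helper for illustration; averaging FPR over labels
    is an assumption, not something the abstract specifies.
    """
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p wrongly
            fn[g] += 1  # missed gold label g
    total = len(gold)
    f1_scores, fp_rates = [], []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1_scores.append(2 * tp[label] / denom if denom else 0.0)
        # Negatives for a label: examples whose gold label is something else.
        negatives = total - (tp[label] + fn[label])
        fp_rates.append(fp[label] / negatives if negatives else 0.0)
    return sum(f1_scores) / len(labels), sum(fp_rates) / len(labels)

# Toy data: four sentences, one English sentence mislabelled as French.
gold = ["eng", "eng", "fra", "deu"]
pred = ["eng", "fra", "fra", "deu"]
macro_f1, mean_fpr = macro_f1_and_fpr(gold, pred)
print(round(macro_f1, 4), round(mean_fpr, 4))  # 0.7778 0.1111
```

Macro-averaging gives every language equal weight regardless of how many test sentences it has, which is why it is a natural choice when lower-resource languages are the focus.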
2022
The University of Edinburgh’s Submission to the WMT22 Code-Mixing Shared Task (MixMT)
Faheem Kirefu | Vivek Iyer | Pinzhen Chen | Laurie Burchell
Proceedings of the Seventh Conference on Machine Translation (WMT)
The University of Edinburgh participated in the WMT22 shared task on code-mixed translation. This consists of two subtasks: i) generating code-mixed Hindi/English (Hinglish) text from parallel Hindi and English sentences and ii) machine translation from Hinglish to English. As both subtasks are considered low-resource, we focused our efforts on careful data generation and curation, especially the use of backtranslation from monolingual resources. For subtask 1, we explored the effects of constrained decoding on English and transliterated subwords in order to produce Hinglish. For subtask 2, we investigated different pretraining techniques, namely comparing simple initialisation from existing machine translation models and aligned augmentation. For both subtasks, we found that our baseline systems worked best. Our systems for both subtasks were among the overall top-performing submissions.
Exploring diversity in back translation for low-resource machine translation
Laurie Burchell | Alexandra Birch | Kenneth Heafield
Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing
Back translation is one of the most widely used methods for improving the performance of neural machine translation systems. Recent research has sought to enhance the effectiveness of this method by increasing the ‘diversity’ of the generated translations. We argue that the definitions and metrics used to quantify ‘diversity’ in previous work have been insufficient. This work puts forward a more nuanced framework for understanding diversity in training data, splitting it into lexical diversity and syntactic diversity. We present novel metrics for measuring these different aspects of diversity and carry out empirical analysis into the effect of these types of diversity on final neural machine translation model performance for low-resource English↔Turkish and mid-resource English↔Icelandic. Our findings show that generating back translation using nucleus sampling results in higher final model performance, and that this method of generation has high levels of both lexical and syntactic diversity. We also find evidence that lexical diversity is more important than syntactic for back translation performance.
2021
The University of Edinburgh’s English-German and English-Hausa Submissions to the WMT21 News Translation Task
Pinzhen Chen | Jindřich Helcl | Ulrich Germann | Laurie Burchell | Nikolay Bogoychev | Antonio Valerio Miceli Barone | Jonas Waldendorf | Alexandra Birch | Kenneth Heafield
Proceedings of the Sixth Conference on Machine Translation
This paper presents the University of Edinburgh’s constrained submissions of English-German and English-Hausa systems to the WMT 2021 shared task on news translation. We build En-De systems in three stages: corpus filtering, back-translation, and fine-tuning. For En-Ha we use an iterative back-translation approach on top of pre-trained En-De models and investigate vocabulary embedding mapping.
2020
Querent Intent in Multi-Sentence Questions
Laurie Burchell | Jie Chi | Tom Hosking | Nina Markl | Bonnie Webber
Proceedings of the 14th Linguistic Annotation Workshop
Multi-sentence questions (MSQs) are sequences of questions connected by relations which, unlike sequences of standalone questions, need to be answered as a unit. Following Rhetorical Structure Theory (RST), we recognise that different “question discourse relations” between the subparts of MSQs reflect different speaker intents, and consequently elicit different answering strategies. Correctly identifying these relations is therefore a crucial step in automatically answering MSQs. We identify five different types of MSQs in English, and define five novel relations to describe them. We extract over 162,000 MSQs from Stack Exchange to enable future research. Finally, we implement a high-precision baseline classifier based on surface features.
Co-authors
- Alexandra Birch 4
- Pinzhen Chen 4
- Kenneth Heafield 4
- Jindřich Helcl 3
- Jean Maillard 3
- Pavel Stepachev 3
- Idris Abdulmumin 2
- Antonios Anastasopoulos 2
- Nikolay Arefyev 2
- Mikko Aulamo 2
- Marta Bañón 2
- Nikolay Bogoychev 2
- Mariia Fedorova 2
- Liane Guillou 2
- Barry Haddow 2
- Jan Hajic 2
- Erik Henriksson 2
- Philipp Koehn 2
- Andrey Kutuzov 2
- Veronika Laippala 2
- Bhavitvya Malik 2
- Farrokh Mehryary 2
- Vladislav Mikhailov 2
- Amanda Myntti 2
- Stephan Oepen 2
- Dayyán O’Brien 2
- Sampo Pyysalo 2
- Gema Ramírez-Sánchez 2
- David Samuel 2
- Jörg Tiedemann 2
- Dusan Varis 2
- Jaume Zaragoza-Bernabeu 2
- Ona de Gibert 2
- Abdulhamid Abubakar 1
- Faisal Muhammad Adam 1
- Esther Adenuga 1
- Amit Agarwal 1
- Cristina Aggazzotti 1
- Raia Abu Ahmad 1
- Akshata 1
- Hend Al-Khalifa 1
- Jesujoba Alabi 1
- Reem Alqifari 1
- Azril Hafizi Amirudin 1
- Cynthia Jayne Amol 1
- Nicholas Andrews 1
- David Anugraha 1
- Michael Anugraha 1
- Catherine Arnett 1
- Mithil Bangera 1
- Yeshil Bangera 1
- Weerayut Buaphet 1
- Lanwenn ar C'horr 1
- Isaac Caswell 1
- Hande Celikkanat 1
- Kranti Chalamalasetti 1
- Jie Chi 1
- Leshem Choshen 1
- Thibault Clérice 1
- David Dale 1
- Rasul Dent 1
- Konstantin Dobler 1
- Karan Dua 1
- Christian Federmann 1
- Vicky Feliren 1
- Luca Foppiano 1
- Dmitry Gaynullin 1
- Ulrich Germann 1
- Manuel Goulão 1
- Tommaso Green 1
- Muhammad Ravi Shulthan Habibi 1
- Nadia Ghezaiel Hammouda 1
- Ikhlasul Akmal Hanif 1
- Tom Hosking 1
- Nuhu Ibrahim 1
- Inshirah Idris 1
- Fenal Ashokbhai Ilasariya 1
- Joseph Marvin Imperial 1
- Vivek Iyer 1
- Amr Keleg 1
- Kun Kerdthaisong 1
- Ilker Kesen 1
- Jun Kevin 1
- Faheem Kirefu 1
- Mateusz Klimaszewski 1
- Ville Komulainen 1
- Bruhan Kyomuhendo 1
- Joona Kytöniemi 1
- Yiyuan Li 1
- Sarah K. K. Luger 1
- Kamohelo Makaaka 1
- Vukosi Marivate 1
- Nina Markl 1
- Juan Pablo Martínez 1
- Antonio Valerio Miceli-Barone 1
- Nikita Moghe 1
- Sara Hincapié Monsalve 1
- Rafael Mosquera 1
- Carol Muchemi 1
- Shamsuddeen Hassan Muhammad 1
- Kenton Murray 1
- Ahmad Mustafid 1
- Casper Rufaro Muziri 1
- Petter Mæhlum 1
- Hamada Nayel 1
- Khang Nguyen 1
- My Chiffon Nguyen 1
- Melika Nobakhtian 1
- Shu Okabe 1
- Pedro Ortiz Suarez 1
- Malte Ostendorff 1
- Verrah Akinyi Otiende 1
- Quentin Pagès 1
- Proyag Pal 1
- Srikant Panda 1
- Hitesh Laxmichand Patel 1
- Jousia Piha 1
- Vallerie Alexandra Putra 1
- Ingrid Gabriela Franco Ramirez 1
- Benjamin L Rice 1
- Mattes Ruckdeschel 1
- Daniel Ruffinelli 1
- Benoît Sagot 1
- Luis Frentzen Salim 1
- Saron Samuel 1
- Jakhongir Saydaliev 1
- Damian Stewart 1
- Sotaro Takeshita 1
- Robert Thompson 1
- Filbert Aurelian Tjiaranata 1
- Atnafu Lambebo Tonja 1
- Yassine Toughrai 1
- Gouthami Vadithya 1
- Sowmya Vajjala 1
- Rob Van Der Goot 1
- Thom Vaughan 1
- Tereza Vojtěchová 1
- Jonas Waldendorf 1
- Ahmad Mustapha Wali 1
- Skyler Wang 1
- Azmine Toushik Wasi 1
- Bonnie Webber 1
- Genta Indra Winata 1
- Tack Hwa Wong 1
- Andrew Yates 1
- Seid Muhie Yimam 1
- Mike Zhang 1
- Ej Zhou 1