Rasul Dent
2026
CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data
Pedro Ortiz Suarez | Laurie Burchell | Catherine Arnett | Rafael Mosquera | Sara Hincapi\'e Monsalve | Thom Vaughan | Damian Stewart | Malte Ostendorff | Idris Abdulmumin | Vukosi Marivate | Shamsuddeen Hassan Muhammad | Atnafu Lambebo Tonja | Hend Al-Khalifa | Nadia Ghezaiel Hammouda | Verrah Akinyi Otiende | Tack Hwa Wong | Jakhongir Saydaliev | Melika Nobakhtian | Muhammad Ravi Shulthan Habibi | Chalamalasetti Kranti | Carol Muchemi | Khang Nguyen | Faisal Muhammad Adam | Luis Frentzen Salim | Reem Alqifari | Cynthia Jayne Amol | Joseph Marvin Imperial | Ilker Kesen | Ahmad Mustafid | Pavel Stepachev | Leshem Choshen | David Anugraha | Hamada Nayel | Seid Muhie Yimam | Vallerie Alexandra Putra | My Chiffon Nguyen | Azmine Toushik Wasi | Gouthami Vadithya | Rob Van Der Goot | Lanwenn ar C'horr | Karan Dua | Andrew Yates | Mithil Bangera | Yeshil Bangera | Hitesh Laxmichand Patel | Shu Okabe | Fenal Ashokbhai Ilasariya | Dmitry Gaynullin | Genta Indra Winata | Yiyuan Li | Juan Pablo Mart{\'\i}nez | Amit Agarwal | Ikhlasul Akmal Hanif | Raia Abu Ahmad | Esther Adenuga | Filbert Aurelian Tjiaranata | Weerayut Buaphet | Michael Anugraha | Sowmya Vajjala | Benjamin L Rice | Azril Hafizi Amirudin | Jesujoba Oluwadara Alabi | Srikant Panda | Yassine Toughrai | Bruhan Kyomuhendo | Daniel Ruffinelli | Akshata | Manuel Goul\~ao | Ej Zhou | Ingrid Gabriela Franco Ramirez | Cristina Aggazzotti | Konstantin Dobler | Jun Kevin | Quentin Pag\`es | Nicholas Andrews | Nuhu Ibrahim | Mattes Ruckdeschel | Amr Keleg | Mike Zhang | Casper Rufaro Muziri | Saron Samuel | Sotaro Takeshita | Kun Kerdthaisong | Luca Foppiano | Rasul Dent | Tommaso Green | Ahmad Mustapha Wali | Kamohelo Makaaka | Vicky Feliren | Inshirah Idris | Hande Celikkanat | Abdulhamid Abubakar | Jean Maillard | Beno{\^\i}t Sagot | Thibault Cl\'erice | Kenton Murray | Sarah K. K. Luger
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Pedro Ortiz Suarez | Laurie Burchell | Catherine Arnett | Rafael Mosquera | Sara Hincapi\'e Monsalve | Thom Vaughan | Damian Stewart | Malte Ostendorff | Idris Abdulmumin | Vukosi Marivate | Shamsuddeen Hassan Muhammad | Atnafu Lambebo Tonja | Hend Al-Khalifa | Nadia Ghezaiel Hammouda | Verrah Akinyi Otiende | Tack Hwa Wong | Jakhongir Saydaliev | Melika Nobakhtian | Muhammad Ravi Shulthan Habibi | Chalamalasetti Kranti | Carol Muchemi | Khang Nguyen | Faisal Muhammad Adam | Luis Frentzen Salim | Reem Alqifari | Cynthia Jayne Amol | Joseph Marvin Imperial | Ilker Kesen | Ahmad Mustafid | Pavel Stepachev | Leshem Choshen | David Anugraha | Hamada Nayel | Seid Muhie Yimam | Vallerie Alexandra Putra | My Chiffon Nguyen | Azmine Toushik Wasi | Gouthami Vadithya | Rob Van Der Goot | Lanwenn ar C'horr | Karan Dua | Andrew Yates | Mithil Bangera | Yeshil Bangera | Hitesh Laxmichand Patel | Shu Okabe | Fenal Ashokbhai Ilasariya | Dmitry Gaynullin | Genta Indra Winata | Yiyuan Li | Juan Pablo Mart{\'\i}nez | Amit Agarwal | Ikhlasul Akmal Hanif | Raia Abu Ahmad | Esther Adenuga | Filbert Aurelian Tjiaranata | Weerayut Buaphet | Michael Anugraha | Sowmya Vajjala | Benjamin L Rice | Azril Hafizi Amirudin | Jesujoba Oluwadara Alabi | Srikant Panda | Yassine Toughrai | Bruhan Kyomuhendo | Daniel Ruffinelli | Akshata | Manuel Goul\~ao | Ej Zhou | Ingrid Gabriela Franco Ramirez | Cristina Aggazzotti | Konstantin Dobler | Jun Kevin | Quentin Pag\`es | Nicholas Andrews | Nuhu Ibrahim | Mattes Ruckdeschel | Amr Keleg | Mike Zhang | Casper Rufaro Muziri | Saron Samuel | Sotaro Takeshita | Kun Kerdthaisong | Luca Foppiano | Rasul Dent | Tommaso Green | Ahmad Mustapha Wali | Kamohelo Makaaka | Vicky Feliren | Inshirah Idris | Hande Celikkanat | Abdulhamid Abubakar | Jean Maillard | Beno{\^\i}t Sagot | Thibault Cl\'erice | Kenton Murray | Sarah K. K. Luger
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Language identification (LID) is a fundamental step in curating multilingual corpora. However, LID models still perform poorly for many languages, especially on the noisy and heterogeneous web data often used to train multilingual language models. In this paper, we introduce CommonLID, a community-driven, human-annotated LID benchmark for the web domain, covering 109 languages. Many of the included languages have been previously under-served, making CommonLID a key resource for developing more representative high-quality text corpora. We show CommonLID’s value by using it, alongside five other common evaluation sets, to test eight popular LID models. We analyse our results to situate our contribution and to provide an overview of the state of the art. In particular, we highlight that existing evaluations overestimate LID accuracy for many languages in the web domain. We make CommonLID and the code used to create it available under an open, permissive license.
How Should We Model the Probability of a Language?
Rasul Dent | Pedro Ortiz Suarez | Thibault Clérice | Benoît Sagot
Proceedings of the 13th Workshop on NLP for Similar Languages, Varieties and Dialects
Rasul Dent | Pedro Ortiz Suarez | Thibault Clérice | Benoît Sagot
Proceedings of the 13th Workshop on NLP for Similar Languages, Varieties and Dialects
Of the over 7,000 languages spoken in the world, commercial language identification (LID) systems only reliably identify a few hundred in written form. Research-grade systems extend this coverage under certain circumstances, but for most languages coverage remains patchy or nonexistent. This position paper argues that this situation is largely self-imposed. In particular, it arises from a persistent framing of LID as decontextualized text classification, which obscures the central role of prior probability estimation and is reinforced by institutional incentives that favor global, fixed-prior models. We argue that improving coverage for tail languages requires rethinking LID as a routing problem and developing principled ways to incorporate environmental cues that make languages locally plausible.
2025
Identifying Rare Languages in Common Crawl Data is a Needles-in-a-Haystack Problem
Rasul Dent | Pedro Ortiz Suarez | Thibault Clérice | Benoît Sagot
Findings of the Association for Computational Linguistics: EMNLP 2025
Rasul Dent | Pedro Ortiz Suarez | Thibault Clérice | Benoît Sagot
Findings of the Association for Computational Linguistics: EMNLP 2025
Automatic language identification is frequently framed as a multi-class classification problem. However, when creating digital corpora for less commonly written languages, it may be more appropriate to consider it a data mining problem. For these varieties, one knows ahead of time that the vast majority of documents are of little interest. By minimizing resources spent on classifying such documents, we can create corpora covering previously overlooked languages faster than existing pipelines. To demonstrate the effectiveness of the targeted mining perspective, we introduce a new pipeline that can filter a single snapshot in two hours. We also provide web corpora for several French-based Creoles.
COLaF : Corpus et Outils pour les Langues de France et variétés de français
Benoît Sagot | Slim Ouni | Sam Bigeard | Lucence Ing | Thibault Clérice | Rachel Bawden | Emmanuel Vincent | Malek Yaich | Panagiotis Tsolakis | Juliette Janès | Rasul Dent | Oriane Nédey | Vincent Colotte | Mostafa Sadeghi
Actes de la session industrielle de CORIA-TALN 2025
Benoît Sagot | Slim Ouni | Sam Bigeard | Lucence Ing | Thibault Clérice | Rachel Bawden | Emmanuel Vincent | Malek Yaich | Panagiotis Tsolakis | Juliette Janès | Rasul Dent | Oriane Nédey | Vincent Colotte | Mostafa Sadeghi
Actes de la session industrielle de CORIA-TALN 2025
Nous présentons COLaF, un projet dédié à la collecte et au développement d’outils et de ressources de traitement automatique des langues (TAL) pour le français et les autres langues de France, avec une attention particulière sur les langues et variétés moins dotées. Le projet concerne les données textuelles, audio et vidéo, afin de fournir des corpus et des outils pour le langage écrit, parlé et signé. Le projet inclut la collecte, la normalisation et la documentation de données préexistantes, y compris des données actuellement non accessibles ou non exploitables à des fins de recherche, ainsi que le développement d’outils de TAL adaptés à ces langues, comme des outils pour l’annotation linguistique et pour la traduction automatique. Cet article permet la présentation des principaux défis posés par le projet et de premiers résultats.
Findings of the First Shared Task for Creole Language Machine Translation at WMT25
Nathaniel Robinson | Claire Bizon Monroc | Rasul Dent | Stefan Watson | Kenton Murray | Raj Dabre | Andre Coy | Heather Lent
Proceedings of the Tenth Conference on Machine Translation
Nathaniel Robinson | Claire Bizon Monroc | Rasul Dent | Stefan Watson | Kenton Murray | Raj Dabre | Andre Coy | Heather Lent
Proceedings of the Tenth Conference on Machine Translation
Efforts towards better machine translation (MT) for Creole languages have historically been isolated, due to Creole languages’ geographic and linguistic diversity. However, most speakers of Creole languages stand to benefit from improved MT for low-resource languages. To galvanize collaboration for Creole MT across the NLP community, we introduce the First Shared Task for Creole Language Machine Translation at WMT25. This Shared Task consists of two systems tracks and one data track, for which we received submissions from five participating teams. Participants experimented with a wide variety of systems development techniques. Our evaluation campaign gave rise to improvements in MT performance in several languages, and particularly large improvements in new testing genres, though some participants found that reusing subsets of pretraining data for specialized post-training did not yield significant improvements. Our campaign also yielded new test sets for Mauritian Creole and a vast expansion of public training data for two Creole languages of Latin America.
2024
Molyé: A Corpus-based Approach to Language Contact in Colonial France
Rasul Dent | Juliette Janes | Thibault Clerice | Pedro Ortiz Suarez | Benoît Sagot
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities
Rasul Dent | Juliette Janes | Thibault Clerice | Pedro Ortiz Suarez | Benoît Sagot
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities
Whether or not several Creole languages which developed during the early modern period can be considered genetic descendants of European languages has been the subject of intense debate. This is in large part due to the absence of evidence of intermediate forms. This work introduces a new open corpus, the Molyé corpus, which combines stereotypical representations of three kinds of language variation in Europe with early attestations of French-based Creole languages across a period of 400 years. It is intended to facilitate future research on the continuity between contact situations in Europe and Creolophone (former) colonies.
Kreyòl-MT: Building MT for Latin American, Caribbean and Colonial African Creole Languages
Nathaniel R. Robinson | Raj Dabre | Ammon Shurtz | Rasul Dent | Onenamiyi Onesi | Claire Bizon Monroc | Loïc Grobol | Hasan Muhammad | Ashi Garg | Naome A. Etori | Vijay Murari Tiyyala | Olanrewaju Samuel | Matthew Dean Stutzman | Bismarck Bamfo Odoom | Sanjeev Khudanpur | Stephen D. Richardson | Kenton Murray
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Nathaniel R. Robinson | Raj Dabre | Ammon Shurtz | Rasul Dent | Onenamiyi Onesi | Claire Bizon Monroc | Loïc Grobol | Hasan Muhammad | Ashi Garg | Naome A. Etori | Vijay Murari Tiyyala | Olanrewaju Samuel | Matthew Dean Stutzman | Bismarck Bamfo Odoom | Sanjeev Khudanpur | Stephen D. Richardson | Kenton Murray
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
A majority of language technologies are tailored for a small number of high-resource languages, while relatively many low-resource languages are neglected. One such group, Creole languages, have long been marginalized in academic study, though their speakers could benefit from machine translation (MT). These languages are predominantly used in much of Latin America, Africa and the Caribbean. We present the largest cumulative dataset to date for Creole language MT, including 14.5M unique Creole sentences with parallel translations—11.6M of which we release publicly, and the largest bitexts gathered to date for 41 languages—the first ever for 21. In addition, we provide MT models supporting all 41 Creole languages in 172 translation directions. Given our diverse dataset, we produce a model for Creole language MT exposed to more genre diversity then ever before, which outperforms a genre-specific Creole MT model on its own benchmark for 23 of 34 translation directions.
Search
Fix author
Co-authors
- Benoît Sagot 5
- Thibault Clérice 4
- Pedro Ortiz Suarez 4
- Kenton Murray 3
- Claire Bizon Monroc 2
- Raj Dabre 2
- Juliette Janès 2
- Idris Abdulmumin 1
- Abdulhamid Abubakar 1
- Faisal Muhammad Adam 1
- Esther Adenuga 1
- Amit Agarwal 1
- Cristina Aggazzotti 1
- Raia Abu Ahmad 1
- Akshata 1
- Hend Al-Khalifa 1
- Jesujoba Alabi 1
- Reem Alqifari 1
- Azril Hafizi Amirudin 1
- Cynthia Jayne Amol 1
- Nicholas Andrews 1
- David Anugraha 1
- Michael Anugraha 1
- Catherine Arnett 1
- Bismarck Bamfo Odoom 1
- Mithil Bangera 1
- Yeshil Bangera 1
- Rachel Bawden 1
- Sam Bigeard 1
- Weerayut Buaphet 1
- Laurie Burchell 1
- Lanwenn ar C'horr 1
- Hande Celikkanat 1
- Kranti Chalamalasetti 1
- Leshem Choshen 1
- Thibault Cl\'erice 1
- Vincent Colotte 1
- André Coy 1
- Konstantin Dobler 1
- Karan Dua 1
- Naome A. Etori 1
- Vicky Feliren 1
- Luca Foppiano 1
- Ashi Garg 1
- Dmitry Gaynullin 1
- Manuel Goul\~ao 1
- Tommaso Green 1
- Morgan Grobol 1
- Muhammad Ravi Shulthan Habibi 1
- Nadia Ghezaiel Hammouda 1
- Ikhlasul Akmal Hanif 1
- Nuhu Ibrahim 1
- Inshirah Idris 1
- Fenal Ashokbhai Ilasariya 1
- Joseph Marvin Imperial 1
- Lucence Ing 1
- Amr Keleg 1
- Kun Kerdthaisong 1
- Ilker Kesen 1
- Jun Kevin 1
- Sanjeev Khudanpur 1
- Bruhan Kyomuhendo 1
- Heather Lent 1
- Yiyuan Li 1
- Sarah K. K. Luger 1
- Jean Maillard 1
- Kamohelo Makaaka 1
- Vukosi Marivate 1
- Juan Pablo Martínez 1
- Sara Hincapi\'e Monsalve 1
- Rafael Mosquera 1
- Carol Muchemi 1
- Hasan Muhammad 1
- Shamsuddeen Hassan Muhammad 1
- Ahmad Mustafid 1
- Casper Rufaro Muziri 1
- Hamada Nayel 1
- Khang Nguyen 1
- My Chiffon Nguyen 1
- Melika Nobakhtian 1
- Oriane Nédey 1
- Shu Okabe 1
- Onenamiyi Onesi 1
- Malte Ostendorff 1
- Verrah Akinyi Otiende 1
- Slim Ouni 1
- Quentin Pag\`es 1
- Srikant Panda 1
- Hitesh Laxmichand Patel 1
- Vallerie Alexandra Putra 1
- Ingrid Gabriela Franco Ramirez 1
- Benjamin L Rice 1
- Stephen D. Richardson 1
- Nathaniel Robinson 1
- Nathaniel R. Robinson 1
- Mattes Ruckdeschel 1
- Daniel Ruffinelli 1
- Mostafa Sadeghi 1
- Luis Frentzen Salim 1
- Olanrewaju Samuel 1
- Saron Samuel 1
- Jakhongir Saydaliev 1
- Ammon Shurtz 1
- Pavel Stepachev 1
- Damian Stewart 1
- Matthew Dean Stutzman 1
- Sotaro Takeshita 1
- Vijay Murari Tiyyala 1
- Filbert Aurelian Tjiaranata 1
- Atnafu Lambebo Tonja 1
- Yassine Toughrai 1
- Panagiotis Tsolakis 1
- Gouthami Vadithya 1
- Sowmya Vajjala 1
- Rob Van Der Goot 1
- Thom Vaughan 1
- Emmanuel Vincent 1
- Ahmad Mustapha Wali 1
- Azmine Toushik Wasi 1
- Stefan Watson 1
- Genta Indra Winata 1
- Tack Hwa Wong 1
- Malek Yaich 1
- Andrew Yates 1
- Seid Muhie Yimam 1
- Mike Zhang 1
- Ej Zhou 1