Nanda Muhammad
2022
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
Julia Kreutzer | Isaac Caswell | Lisa Wang | Ahsan Wahab | Daan van Esch | Nasanbayar Ulzii-Orshikh | Allahsera Tapo | Nishant Subramani | Artem Sokolov | Claytone Sikasote | Monang Setyawan | Supheakmungkol Sarin | Sokhar Samb | Benoît Sagot | Clara Rivera | Annette Rios | Isabel Papadimitriou | Salomey Osei | Pedro Ortiz Suarez | Iroro Orife | Kelechi Ogueji | Andre Niyongabo Rubungo | Toan Q. Nguyen | Mathias Müller | André Müller | Shamsuddeen Hassan Muhammad | Nanda Muhammad | Ayanda Mnyakeni | Jamshidbek Mirzakhalov | Tapiwanashe Matangira | Colin Leong | Nze Lawson | Sneha Kudugunta | Yacine Jernite | Mathias Jenny | Orhan Firat | Bonaventure F. P. Dossou | Sakhile Dlamini | Nisansa de Silva | Sakine Çabuk Ballı | Stella Biderman | Alessia Battisti | Ahmed Baruwa | Ankur Bapna | Pallavi Baljekar | Israel Abebe Azime | Ayodele Awokoya | Duygu Ataman | Orevaoghene Ahia | Oghenefego Ahia | Sweta Agrawal | Mofetoluwa Adeyemi
Transactions of the Association for Computational Linguistics, Volume 10
With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, Web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases.
Co-authors
- Mofetoluwa Adeyemi 1
- Sweta Agrawal 1
- Orevaoghene Ahia 1
- Oghenefego Ahia 1
- Duygu Ataman 1
- Ayodele Awokoya 1
- Israel Abebe Azime 1
- Pallavi Baljekar 1
- Ankur Bapna 1
- Ahmed Baruwa 1
- Alessia Battisti 1
- Stella Biderman 1
- Isaac Caswell 1
- Nisansa de Silva 1
- Sakhile Dlamini 1
- Bonaventure F. P. Dossou 1
- Orhan Firat 1
- Mathias Jenny 1
- Yacine Jernite 1
- Julia Kreutzer 1
- Sneha Kudugunta 1
- Nze Lawson 1
- Colin Leong 1
- Tapiwanashe Matangira 1
- Jamshidbek Mirzakhalov 1
- Ayanda Mnyakeni 1
- Shamsuddeen Hassan Muhammad 1
- Mathias Müller 1
- André Müller 1
- Toan Q. Nguyen 1
- Kelechi Ogueji 1
- Iroro Orife 1
- Pedro Ortiz Suarez 1
- Salomey Osei 1
- Isabel Papadimitriou 1
- Annette Rios Gonzales 1
- Clara Rivera 1
- Andre Niyongabo Rubungo 1
- Benoît Sagot 1
- Sokhar Samb 1
- Supheakmungkol Sarin 1
- Monang Setyawan 1
- Claytone Sikasote 1
- Artem Sokolov 1
- Nishant Subramani 1
- Allahsera Tapo 1
- Nasanbayar Ulzii-Orshikh 1
- Ahsan Wahab 1
- Lisa Wang 1
- Daan van Esch 1
- Sakine Çabuk Ballı 1
Venues
- tacl 1