Clara Rivera


2022

pdf
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
Julia Kreutzer | Isaac Caswell | Lisa Wang | Ahsan Wahab | Daan van Esch | Nasanbayar Ulzii-Orshikh | Allahsera Tapo | Nishant Subramani | Artem Sokolov | Claytone Sikasote | Monang Setyawan | Supheakmungkol Sarin | Sokhar Samb | Benoît Sagot | Clara Rivera | Annette Rios | Isabel Papadimitriou | Salomey Osei | Pedro Ortiz Suarez | Iroro Orife | Kelechi Ogueji | Andre Niyongabo Rubungo | Toan Q. Nguyen | Mathias Müller | André Müller | Shamsuddeen Hassan Muhammad | Nanda Muhammad | Ayanda Mnyakeni | Jamshidbek Mirzakhalov | Tapiwanashe Matangira | Colin Leong | Nze Lawson | Sneha Kudugunta | Yacine Jernite | Mathias Jenny | Orhan Firat | Bonaventure F. P. Dossou | Sakhile Dlamini | Nisansa de Silva | Sakine Çabuk Ballı | Stella Biderman | Alessia Battisti | Ahmed Baruwa | Ankur Bapna | Pallavi Baljekar | Israel Abebe Azime | Ayodele Awokoya | Duygu Ataman | Orevaoghene Ahia | Oghenefego Ahia | Sweta Agrawal | Mofetoluwa Adeyemi
Transactions of the Association for Computational Linguistics, Volume 10

With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, Web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases.

pdf
Writing System and Speaker Metadata for 2,800+ Language Varieties
Daan van Esch | Tamar Lucassen | Sebastian Ruder | Isaac Caswell | Clara Rivera
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We describe an open-source dataset providing metadata for about 2,800 language varieties used in the world today. Specifically, the dataset provides the attested writing system(s) for each of these 2,800+ varieties, as well as an estimated speaker count for each variety. This dataset was developed through internal research and has been used for analyses around language technologies. This is the largest publicly-available, machine-readable resource with writing system and speaker information for the world’s languages. We analyze the distribution of languages and writing systems in our data and compare it to their representation in current NLP. We hope the availability of this data will catalyze research in under-represented languages.

2020

pdf
Open-Source High Quality Speech Datasets for Basque, Catalan and Galician
Oddur Kjartansson | Alexander Gutkin | Alena Butryna | Isin Demirsahin | Clara Rivera
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

This paper introduces new open speech datasets for three of the languages of Spain: Basque, Catalan and Galician. Catalan is furthermore the official language of the Principality of Andorra. The datasets consist of high-quality multi-speaker recordings of the three languages along with the associated transcriptions. The resulting corpora include over 33 hours of crowd-sourced recordings of 132 male and female native speakers. The recording scripts also include material for elicitation of global and local place names, personal and business names. The datasets are released under a permissive license and are available for free download for commercial, academic and personal use. The high-quality annotated speech datasets described in this paper can be used to, among other things, build text-to-speech systems, serve as adaptation data in automatic speech recognition and provide useful phonetic and phonological insights in corpus linguistics.

pdf
Multimodal Pretraining for Dense Video Captioning
Gabriel Huang | Bo Pang | Zhenhai Zhu | Clara Rivera | Radu Soricut
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing

Learning specific hands-on skills such as cooking, car maintenance, and home repairs increasingly happens via instructional videos. The user experience with such videos is known to be improved by meta-information such as time-stamped annotations for the main steps involved. Generating such annotations automatically is challenging, and we describe here two relevant contributions. First, we construct and release a new dense video captioning dataset, Video Timeline Tags (ViTT), featuring a variety of instructional videos together with time-stamped annotations. Second, we explore several multimodal sequence-to-sequence pretraining strategies that leverage large unsupervised datasets of videos and caption-like texts. We pretrain and subsequently finetune dense video captioning models using both YouCook2 and ViTT. We show that such models generalize well and are robust over a wide variety of instructional videos.

pdf
Open-source Multi-speaker Speech Corpora for Building Gujarati, Kannada, Malayalam, Marathi, Tamil and Telugu Speech Synthesis Systems
Fei He | Shan-Hui Cathy Chu | Oddur Kjartansson | Clara Rivera | Anna Katanova | Alexander Gutkin | Isin Demirsahin | Cibu Johny | Martin Jansche | Supheakmungkol Sarin | Knot Pipatsrisawat
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present free high quality multi-speaker speech corpora for Gujarati, Kannada, Malayalam, Marathi, Tamil and Telugu, which are six of the twenty two official languages of India spoken by 374 million native speakers. The datasets are primarily intended for use in text-to-speech (TTS) applications, such as constructing multilingual voices or being used for speaker or language adaptation. Most of the corpora (apart from Marathi, which is a female-only database) consist of at least 2,000 recorded lines from female and male native speakers of the language. We present the methodological details behind corpora acquisition, which can be scaled to acquiring data for other languages of interest. We describe the experiments in building a multilingual text-to-speech model that is constructed by combining our corpora. Our results indicate that using these corpora results in good quality voices, with Mean Opinion Scores (MOS) > 3.6, for all the languages tested. We believe that these resources, released with an open-source license, and the described methodology will help in the progress of speech applications for the languages described and aid corpora development for other, smaller, languages of India and beyond.

pdf
Open-source Multi-speaker Corpora of the English Accents in the British Isles
Isin Demirsahin | Oddur Kjartansson | Alexander Gutkin | Clara Rivera
Proceedings of the Twelfth Language Resources and Evaluation Conference

This paper presents a dataset of transcribed high-quality audio of English sentences recorded by volunteers speaking with different accents of the British Isles. The dataset is intended for linguistic analysis as well as use for speech technologies. The recording scripts were curated specifically for accent elicitation, covering a variety of phonological phenomena and providing a high phoneme coverage. The scripts include pronunciations of global locations, major airlines and common personal names in different accents; and native speaker pronunciations of local words. Overlapping lines for all speakers were included for idiolect elicitation, which include the same or similar lines with other existing resources such as the CSTR VCTK corpus and the Speech Accent Archive to allow for easy comparison of personal and regional accents. The resulting corpora include over 31 hours of recordings from 120 volunteers who self-identify as native speakers of Southern England, Midlands, Northern England, Welsh, Scottish and Irish varieties of English.