Abstract
Collecting voice resources for speech recognition systems is a multifaceted challenge that involves legal, technical, and diversity considerations. Meeting it is nonetheless crucial to ensuring fair access to voice-driven technology across diverse linguistic backgrounds. We describe an ongoing effort to create an extensive, high-quality, publicly available voice dataset for the future development of speech technologies in Catalan through the Mozilla Common Voice crowd-sourcing platform. We detail the specific approaches used to address the challenges of recruiting contributors and of managing the collection, validation, and recording of sentences. This overview can serve as guidance for similar initiatives in other projects and linguistic contexts. The success of the project is evident in the latest corpus release, version 16.1, in which Catalan ranks as the most prominent language in the corpus in both recorded and validated hours. This establishes Catalan as a language with substantial speech resources for language technology development and considerably raises its international visibility.
- Anthology ID:
- 2024.lrec-main.193
- Volume:
- Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
- Month:
- May
- Year:
- 2024
- Address:
- Torino, Italia
- Editors:
- Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
- Venues:
- LREC | COLING
- Publisher:
- ELRA and ICCL
- Pages:
- 2142–2148
- URL:
- https://preview.aclanthology.org/remove-affiliations/2024.lrec-main.193/
- Cite (ACL):
- Carme Armentano-Oller, Montserrat Marimon, and Marta Villegas. 2024. Becoming a High-Resource Language in Speech: The Catalan Case in the Common Voice Corpus. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 2142–2148, Torino, Italia. ELRA and ICCL.
- Cite (Informal):
- Becoming a High-Resource Language in Speech: The Catalan Case in the Common Voice Corpus (Armentano-Oller et al., LREC-COLING 2024)
- PDF:
- https://preview.aclanthology.org/remove-affiliations/2024.lrec-main.193.pdf