A Systematic Approach to Derive a Refined Speech Corpus for Sinhala

Disura Warusawithana, Nilmani Kulaweera, Lakshan Weerasinghe, Buddhika Karunarathne


Abstract
Speech recognition is an active research area in which advances in technology have continuously driven new research. However, owing to a lack of adequate resources, some languages, such as Sinhala, cannot take full advantage of the technology. With techniques such as crowdsourcing and web scraping, several Sinhala corpora have been created and made publicly available. Although these corpora are large and generic, the correctness and consistency of their text data remain questionable, especially because the language used across the different sources of web-scraped text is not uniform. Addressing this requires a thorough understanding of the technical and linguistic particulars of the language, so the issue is often left unattended. We followed a systematic approach to derive a refined corpus for Sinhala speech recognition from a publicly available corpus. In particular, we standardized the transcriptions of the corpus by removing noise from the text, and then applied corrections based on Sinhala linguistics. A comparative experiment shows a promising effect of the linguistic corrections: a relative reduction in word error rate (WER) of 15.9%.
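For readers unfamiliar with the metric, the following is a minimal sketch of how a word error rate and a relative reduction between two systems are commonly computed. It is not taken from the paper: the function names and the numeric values in the usage note are hypothetical, and the paper's own evaluation tooling may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER that the new system removes."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer
```

As an illustration with made-up numbers, a baseline WER of 0.40 and a refined-corpus WER of 0.30 is an absolute reduction of 0.10 but a relative reduction of 25%. The 15.9% reported in the abstract is a relative figure of this kind.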
Anthology ID:
2022.lrec-1.546
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
5107–5113
URL:
https://aclanthology.org/2022.lrec-1.546
Cite (ACL):
Disura Warusawithana, Nilmani Kulaweera, Lakshan Weerasinghe, and Buddhika Karunarathne. 2022. A Systematic Approach to Derive a Refined Speech Corpus for Sinhala. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 5107–5113, Marseille, France. European Language Resources Association.
Cite (Informal):
A Systematic Approach to Derive a Refined Speech Corpus for Sinhala (Warusawithana et al., LREC 2022)
PDF:
https://preview.aclanthology.org/emnlp-22-attachments/2022.lrec-1.546.pdf