Preparation of Bangla Speech Corpus from Publicly Available Audio & Text
Shafayat Ahmed, Nafis Sadeq, Sudipta Saha Shubha, Md. Nahidul Islam, Muhammad Abdullah Adnan, Mohammad Zuberul Islam
Abstract
Automatic speech recognition systems require a large annotated speech corpus. The manual annotation of a large corpus is very difficult. In this paper, we focus on the automatic preparation of a speech corpus for Bangladeshi Bangla. We have used publicly available Bangla audiobooks and TV news recordings as audio sources. We designed and implemented an iterative algorithm that takes as input a speech corpus and a huge amount of raw audio (without transcription) and outputs a much larger speech corpus with reasonable confidence. We have leveraged speaker diarization, gender detection, etc. to prepare the annotated corpus. We have also prepared a synthetic speech corpus for handling out-of-vocabulary word problems in the Bangla language. Our corpus is suitable for training with Kaldi. Experimental results show that the use of our corpus in addition to the Google Speech corpus (229 hours) significantly improves the performance of the ASR system.
- Anthology ID:
- 2020.lrec-1.811
- Volume:
- Proceedings of the Twelfth Language Resources and Evaluation Conference
- Month:
- May
- Year:
- 2020
- Address:
- Marseille, France
- Editors:
- Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
- Venue:
- LREC
- Publisher:
- European Language Resources Association
- Pages:
- 6586–6592
- Language:
- English
- URL:
- https://preview.aclanthology.org/sigedu-bea-out-of-sync-correction/2020.lrec-1.811/
- Cite (ACL):
- Shafayat Ahmed, Nafis Sadeq, Sudipta Saha Shubha, Md. Nahidul Islam, Muhammad Abdullah Adnan, and Mohammad Zuberul Islam. 2020. Preparation of Bangla Speech Corpus from Publicly Available Audio & Text. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6586–6592, Marseille, France. European Language Resources Association.
- Cite (Informal):
- Preparation of Bangla Speech Corpus from Publicly Available Audio & Text (Ahmed et al., LREC 2020)
- PDF:
- https://preview.aclanthology.org/sigedu-bea-out-of-sync-correction/2020.lrec-1.811.pdf