Speech Data from Radio Broadcasts for Low Resource Languages
Bismarck Bamfo Odoom, Leibny Paola Garcia Perera, Prangthip Hansanti, Loic Barrault, Christophe Ropers, Matthew Wiesner, Kenton Murray, Alexandre Mourachko, Philipp Koehn
Abstract
We created a collection of speech data for 48 low resource languages. The corpus is extracted from radio broadcasts and processed with novel speech detection and language identification models based on a manually vetted subset of the audio for 10 languages. The data is made publicly available.
- Anthology ID:
- 2024.iwslt-1.18
- Volume:
- Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024)
- Month:
- August
- Year:
- 2024
- Address:
- Bangkok, Thailand (in-person and online)
- Editors:
- Elizabeth Salesky, Marcello Federico, Marine Carpuat
- Venue:
- IWSLT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 134–139
- URL:
- https://preview.aclanthology.org/jlcl-multiple-ingestion/2024.iwslt-1.18/
- DOI:
- 10.18653/v1/2024.iwslt-1.18
- Cite (ACL):
- Bismarck Bamfo Odoom, Leibny Paola Garcia Perera, Prangthip Hansanti, Loic Barrault, Christophe Ropers, Matthew Wiesner, Kenton Murray, Alexandre Mourachko, and Philipp Koehn. 2024. Speech Data from Radio Broadcasts for Low Resource Languages. In Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024), pages 134–139, Bangkok, Thailand (in-person and online). Association for Computational Linguistics.
- Cite (Informal):
- Speech Data from Radio Broadcasts for Low Resource Languages (Bamfo Odoom et al., IWSLT 2024)
- PDF:
- https://preview.aclanthology.org/jlcl-multiple-ingestion/2024.iwslt-1.18.pdf