Abstract
In recent years there has been great interest in addressing the data scarcity of African languages and providing baseline models for different Natural Language Processing (NLP) tasks (Orife et al., 2020). Several initiatives on the continent (Nekoto et al., 2020) use the Bible as a data source to provide proof of concept for some NLP tasks. In this work, we present the Lingala Speech Translation (LiSTra) dataset, release a full pipeline for constructing such datasets in other languages, and report baselines using both the traditional cascade approach (Automatic Speech Recognition followed by Machine Translation) and a revolutionary Transformer-based end-to-end architecture (Liu et al., 2020) with a custom interactive attention that allows information sharing between the recognition decoder and the translation decoder.
- Anthology ID:
- 2022.dclrl-1.8
- Volume:
- Proceedings of the Workshop on Dataset Creation for Lower-Resourced Languages within the 13th Language Resources and Evaluation Conference
- Month:
- June
- Year:
- 2022
- Address:
- Marseille, France
- Venue:
- DCLRL
- Publisher:
- European Language Resources Association
- Pages:
- 63–67
- URL:
- https://aclanthology.org/2022.dclrl-1.8
- Cite (ACL):
- Salomon Kabongo Kabenamualu, Vukosi Marivate, and Herman Kamper. 2022. LiSTra Automatic Speech Translation: English to Lingala Case Study. In Proceedings of the Workshop on Dataset Creation for Lower-Resourced Languages within the 13th Language Resources and Evaluation Conference, pages 63–67, Marseille, France. European Language Resources Association.
- Cite (Informal):
- LiSTra Automatic Speech Translation: English to Lingala Case Study (Kabongo Kabenamualu et al., DCLRL 2022)
- PDF:
- https://preview.aclanthology.org/remove-xml-comments/2022.dclrl-1.8.pdf
- Data
- JW300