Spoken Term Discovery for Language Documentation using Translations
Antonios Anastasopoulos, Sameer Bansal, David Chiang, Sharon Goldwater, Adam Lopez
Abstract
Vast amounts of speech data collected for language documentation and research remain untranscribed and unsearchable, but often a small amount of speech may have text translations available. We present a method for partially labeling additional speech with translations in this scenario. We modify an unsupervised speech-to-translation alignment model and obtain prototype speech segments that match the translation words, which are in turn used to discover terms in the unlabelled data. We evaluate our method on a Spanish-English speech translation corpus and on two corpora of endangered languages, Arapaho and Ainu, demonstrating its appropriateness and applicability in an actual very-low-resource scenario.
- Anthology ID:
- W17-4607
- Volume:
- Proceedings of the Workshop on Speech-Centric Natural Language Processing
- Month:
- September
- Year:
- 2017
- Address:
- Copenhagen, Denmark
- Editors:
- Nicholas Ruiz, Srinivas Bangalore
- Venue:
- WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 53–58
- URL:
- https://aclanthology.org/W17-4607
- DOI:
- 10.18653/v1/W17-4607
- Cite (ACL):
- Antonios Anastasopoulos, Sameer Bansal, David Chiang, Sharon Goldwater, and Adam Lopez. 2017. Spoken Term Discovery for Language Documentation using Translations. In Proceedings of the Workshop on Speech-Centric Natural Language Processing, pages 53–58, Copenhagen, Denmark. Association for Computational Linguistics.
- Cite (Informal):
- Spoken Term Discovery for Language Documentation using Translations (Anastasopoulos et al., 2017)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-2/W17-4607.pdf