Spoken Term Discovery for Language Documentation using Translations

Antonios Anastasopoulos, Sameer Bansal, David Chiang, Sharon Goldwater, Adam Lopez


Abstract
Vast amounts of speech data collected for language documentation and research remain untranscribed and unsearchable, but often a small amount of speech may have text translations available. We present a method for partially labeling additional speech with translations in this scenario. We modify an unsupervised speech-to-translation alignment model and obtain prototype speech segments that match the translation words, which are in turn used to discover terms in the unlabelled data. We evaluate our method on a Spanish-English speech translation corpus and on two corpora of endangered languages, Arapaho and Ainu, demonstrating its appropriateness and applicability in an actual very-low-resource scenario.
Anthology ID:
W17-4607
Volume:
Proceedings of the Workshop on Speech-Centric Natural Language Processing
Month:
September
Year:
2017
Address:
Copenhagen, Denmark
Editors:
Nicholas Ruiz, Srinivas Bangalore
Venue:
WS
Publisher:
Association for Computational Linguistics
Pages:
53–58
URL:
https://aclanthology.org/W17-4607
DOI:
10.18653/v1/W17-4607
Cite (ACL):
Antonios Anastasopoulos, Sameer Bansal, David Chiang, Sharon Goldwater, and Adam Lopez. 2017. Spoken Term Discovery for Language Documentation using Translations. In Proceedings of the Workshop on Speech-Centric Natural Language Processing, pages 53–58, Copenhagen, Denmark. Association for Computational Linguistics.
Cite (Informal):
Spoken Term Discovery for Language Documentation using Translations (Anastasopoulos et al., 2017)
PDF:
https://preview.aclanthology.org/nschneid-patch-2/W17-4607.pdf