Spoken Term Discovery for Language Documentation using Translations

Antonios Anastasopoulos, Sameer Bansal, David Chiang, Sharon Goldwater, Adam Lopez


Abstract
Vast amounts of speech data collected for language documentation and research remain untranscribed and unsearchable, but often a small amount of speech may have text translations available. We present a method for partially labeling additional speech with translations in this scenario. We modify an unsupervised speech-to-translation alignment model and obtain prototype speech segments that match the translation words, which are in turn used to discover terms in the unlabelled data. We evaluate our method on a Spanish-English speech translation corpus and on two corpora of endangered languages, Arapaho and Ainu, demonstrating its appropriateness and applicability in an actual very-low-resource scenario.
Anthology ID:
W17-4607
Volume:
Proceedings of the Workshop on Speech-Centric Natural Language Processing
Month:
September
Year:
2017
Address:
Copenhagen, Denmark
Editors:
Nicholas Ruiz, Srinivas Bangalore
Venue:
WS
Publisher:
Association for Computational Linguistics
Pages:
53–58
URL:
https://aclanthology.org/W17-4607
DOI:
10.18653/v1/W17-4607
Cite (ACL):
Antonios Anastasopoulos, Sameer Bansal, David Chiang, Sharon Goldwater, and Adam Lopez. 2017. Spoken Term Discovery for Language Documentation using Translations. In Proceedings of the Workshop on Speech-Centric Natural Language Processing, pages 53–58, Copenhagen, Denmark. Association for Computational Linguistics.
Cite (Informal):
Spoken Term Discovery for Language Documentation using Translations (Anastasopoulos et al., 2017)
PDF:
https://preview.aclanthology.org/teach-a-man-to-fish/W17-4607.pdf