Generating a Yiddish Speech Corpus, Forced Aligner and Basic ASR System for the AHEYM Project

Malgorzata Ćavar, Damir Ćavar, Dov-Ber Kerler, Anya Quilitzsch


Abstract
To create automatic transcription and annotation tools for the AHEYM corpus of recorded interviews with Yiddish speakers in Eastern Europe, we develop initial Yiddish language resources that are used to adapt speech and language technologies. Our project aims to develop resources and technologies that make the entire AHEYM corpus and other Yiddish resources more accessible not only to the community of Yiddish speakers or linguists with language expertise, but also to historians, experts from other disciplines, and the general public. In this paper we describe the rationale behind our approach, the procedures and methods, and challenges that are not specific to the AHEYM corpus but apply to all documentary language data collected in the field. To the best of our knowledge, this is the first attempt to create a speech corpus and speech technologies for Yiddish. It is also the first attempt to develop speech and language technologies for transcribing and translating a large collection of Yiddish spoken-language resources.
Anthology ID:
L16-1744
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
4688–4693
URL:
https://preview.aclanthology.org/sigedu-bea-out-of-sync-correction/L16-1744/
Cite (ACL):
Malgorzata Ćavar, Damir Ćavar, Dov-Ber Kerler, and Anya Quilitzsch. 2016. Generating a Yiddish Speech Corpus, Forced Aligner and Basic ASR System for the AHEYM Project. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 4688–4693, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):
Generating a Yiddish Speech Corpus, Forced Aligner and Basic ASR System for the AHEYM Project (Ćavar et al., LREC 2016)
PDF:
https://preview.aclanthology.org/sigedu-bea-out-of-sync-correction/L16-1744.pdf