Massively Multilingual Pronunciation Modeling with WikiPron

Jackson L. Lee, Lucas F.E. Ashby, M. Elizabeth Garza, Yeonju Lee-Sikka, Sean Miller, Alan Wong, Arya D. McCarthy, Kyle Gorman


Abstract
We introduce WikiPron, an open-source command-line tool for extracting pronunciation data from Wiktionary, a collaborative multilingual online dictionary. We first describe the design and use of WikiPron. We then discuss the challenges faced in scaling this tool to create an automatically generated database of 1.7 million pronunciations from 165 languages. Finally, we validate the pronunciation database by using it to train and evaluate a collection of generic grapheme-to-phoneme models. The software, pronunciation data, and models are all made available under permissive open-source licenses.
Anthology ID:
2020.lrec-1.521
Volume:
Proceedings of the 12th Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
4223–4228
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.521
Cite (ACL):
Jackson L. Lee, Lucas F.E. Ashby, M. Elizabeth Garza, Yeonju Lee-Sikka, Sean Miller, Alan Wong, Arya D. McCarthy, and Kyle Gorman. 2020. Massively Multilingual Pronunciation Modeling with WikiPron. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4223–4228, Marseille, France. European Language Resources Association.
Cite (Informal):
Massively Multilingual Pronunciation Modeling with WikiPron (Lee et al., LREC 2020)
PDF:
https://preview.aclanthology.org/update-css-js/2020.lrec-1.521.pdf