Massively Multilingual Pronunciation Modeling with WikiPron
Jackson L. Lee, Lucas F.E. Ashby, M. Elizabeth Garza, Yeonju Lee-Sikka, Sean Miller, Alan Wong, Arya D. McCarthy, Kyle Gorman
Abstract
We introduce WikiPron, an open-source command-line tool for extracting pronunciation data from Wiktionary, a collaborative multilingual online dictionary. We first describe the design and use of WikiPron. We then discuss the challenges faced scaling this tool to create an automatically-generated database of 1.7 million pronunciations from 165 languages. Finally, we validate the pronunciation database by using it to train and evaluating a collection of generic grapheme-to-phoneme models. The software, pronunciation data, and models are all made available under permissive open-source licenses.- Anthology ID:
- 2020.lrec-1.521
- Original:
- 2020.lrec-1.521v1
- Version 2:
- 2020.lrec-1.521v2
- Volume:
- Proceedings of the Twelfth Language Resources and Evaluation Conference
- Month:
- May
- Year:
- 2020
- Address:
- Marseille, France
- Editors:
- Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
- Venue:
- LREC
- SIG:
- Publisher:
- European Language Resources Association
- Note:
- Pages:
- 4223–4228
- Language:
- English
- URL:
- https://aclanthology.org/2020.lrec-1.521
- DOI:
- Cite (ACL):
- Jackson L. Lee, Lucas F.E. Ashby, M. Elizabeth Garza, Yeonju Lee-Sikka, Sean Miller, Alan Wong, Arya D. McCarthy, and Kyle Gorman. 2020. Massively Multilingual Pronunciation Modeling with WikiPron. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4223–4228, Marseille, France. European Language Resources Association.
- Cite (Informal):
- Massively Multilingual Pronunciation Modeling with WikiPron (Lee et al., LREC 2020)
- PDF:
- https://preview.aclanthology.org/ingest-2024-clasp/2020.lrec-1.521.pdf