Abstract
We constructed parsers for five non-English editions of Wiktionary. Combined with pronunciations from the English edition, the resulting dataset comprises over 5.3 million IPA pronunciations, the largest pronunciation lexicon of its kind. This dataset is a unique comparable corpus of IPA pronunciations annotated from multiple sources. We analyze the dataset, noting the presence of machine-generated pronunciations. We develop a novel visualization method to quantify syllabification. We experiment on the new combined task of multilingual IPA syllabification and stress prediction, finding that training a massively multilingual neural sequence-to-sequence model with copy attention can improve performance on both high- and low-resource languages, and that multi-task training on stress prediction helps with syllabification.
- Anthology ID:
- 2021.bucc-1.9
- Volume:
- Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021)
- Month:
- September
- Year:
- 2021
- Address:
- Online (Virtual Mode)
- Venue:
- BUCC
- Publisher:
- INCOMA Ltd.
- Pages:
- 68–74
- URL:
- https://aclanthology.org/2021.bucc-1.9
- Cite (ACL):
- Winston Wu and David Yarowsky. 2021. On Pronunciations in Wiktionary: Extraction and Experiments on Multilingual Syllabification and Stress Prediction. In Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021), pages 68–74, Online (Virtual Mode). INCOMA Ltd.
- Cite (Informal):
- On Pronunciations in Wiktionary: Extraction and Experiments on Multilingual Syllabification and Stress Prediction (Wu & Yarowsky, BUCC 2021)
- PDF:
- https://preview.aclanthology.org/nodalida-main-page/2021.bucc-1.9.pdf