On Pronunciations in Wiktionary: Extraction and Experiments on Multilingual Syllabification and Stress Prediction

Winston Wu, David Yarowsky


Abstract
We construct parsers for five non-English editions of Wiktionary. Combined with pronunciations from the English edition, the extracted data comprise over 5.3 million IPA pronunciations, the largest pronunciation lexicon of its kind. This dataset is a unique comparable corpus of IPA pronunciations annotated from multiple sources. We analyze the dataset, noting the presence of machine-generated pronunciations. We develop a novel visualization method to quantify syllabification. We experiment on the new combined task of multilingual IPA syllabification and stress prediction, finding that training a massively multilingual neural sequence-to-sequence model with copy attention can improve performance on both high- and low-resource languages, and that multi-task training on stress prediction helps with syllabification.
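To make the combined task concrete, the sketch below shows one plausible way to turn a Wiktionary-style IPA pronunciation into a source/target pair: the source has syllable breaks and stress marks stripped, and the target keeps them. The function names, the marker set, and the example word are illustrative assumptions, not details taken from the paper.

```python
import re

# Assumed marker inventory (not taken from the paper): "." for syllable
# breaks, "ˈ" and "ˌ" for primary and secondary stress.
SYLLABLE_BREAK = "."
STRESS_MARKS = ("ˈ", "ˌ")


def make_example(ipa):
    """Hypothetical helper: return (source, target) for one pronunciation.

    target keeps syllable breaks and stress marks; source has them removed.
    Surrounding slashes or square brackets are dropped from both.
    """
    target = ipa.strip("/[]")
    source = "".join(
        ch for ch in target
        if ch != SYLLABLE_BREAK and ch not in STRESS_MARKS
    )
    return source, target


def syllables(marked):
    """Split a marked pronunciation into syllables.

    A stress mark also opens a new syllable, so it is treated as a boundary.
    """
    return [s for s in re.split(r"[.ˈˌ]", marked) if s]


source, target = make_example("/ˌɪn.fɚˈmeɪ.ʃən/")
print(source)             # unmarked input a model would read
print(target)             # marked output a model would be trained to produce
print(syllables(target))  # syllables recovered from the marked form
```

A sequence-to-sequence model for this task would read the unmarked string and emit the marked one; because most output symbols are copied from the input, a copy mechanism is a natural fit.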
Anthology ID:
2021.bucc-1.9
Volume:
Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021)
Month:
September
Year:
2021
Address:
Online (Virtual Mode)
Venue:
BUCC
Publisher:
INCOMA Ltd.
Pages:
68–74
URL:
https://aclanthology.org/2021.bucc-1.9
Cite (ACL):
Winston Wu and David Yarowsky. 2021. On Pronunciations in Wiktionary: Extraction and Experiments on Multilingual Syllabification and Stress Prediction. In Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021), pages 68–74, Online (Virtual Mode). INCOMA Ltd.
Cite (Informal):
On Pronunciations in Wiktionary: Extraction and Experiments on Multilingual Syllabification and Stress Prediction (Wu & Yarowsky, BUCC 2021)
PDF:
https://preview.aclanthology.org/nodalida-main-page/2021.bucc-1.9.pdf