Automatic Transcription Challenges for Inuktitut, a Low-Resource Polysynthetic Language

Vishwa Gupta, Gilles Boulianne


Abstract
We introduce the first attempt at automatic speech recognition (ASR) in Inuktitut, as a representative of polysynthetic, low-resource languages, like many of the 900 Indigenous languages spoken in the Americas. As in most previous work on Inuktitut, we use texts from parliamentary proceedings, but in addition we have access to 23 hours of transcribed oral stories. With this corpus, we show that Inuktitut displays a much higher degree of polysynthesis than other agglutinative languages usually considered in ASR, such as Finnish or Turkish. Even with a vocabulary of 1.3 million words derived from the proceedings and stories, more than 60% of the words in held-out stories are out-of-vocabulary. We train bi-directional LSTM acoustic models, then investigate word and subword units (morphemes and syllables) as well as a deep neural network that finds word boundaries in subword sequences. We show that acoustic decoding using syllables decorated with word-boundary markers yields the lowest word error rate.
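The abstract mentions decoding over syllables decorated with word-boundary markers. The Python sketch below illustrates one plausible way to produce such units from romanized Inuktitut text; the syllabification regex, the "+" boundary marker, and the function names are assumptions for illustration only, not the segmentation actually used in the paper.

```python
import re

# Illustrative sketch only: we assume romanized Inuktitut (vowels a, i, u)
# and a rough (C)V(V)(C) syllable pattern. A hypothetical "+" marker is
# appended to the word-final syllable so that word boundaries can be
# recovered after subword decoding.
SYLLABLE = re.compile(r"[^aiu]*[aiu]+[^aiu]?(?![aiu])")


def syllabify(word: str) -> list[str]:
    """Split a romanized Inuktitut word into rough syllables."""
    return SYLLABLE.findall(word.lower()) or [word.lower()]


def mark_boundaries(sentence: str) -> list[str]:
    """Return syllable tokens with '+' on each word-final syllable."""
    tokens = []
    for word in sentence.split():
        syls = syllabify(word)
        tokens.extend(syls[:-1] + [syls[-1] + "+"])
    return tokens


if __name__ == "__main__":
    # Example word; the last syllable carries the boundary marker.
    print(mark_boundaries("tusaatsiarunnanngittualuujunga"))
```

After subword decoding, word hypotheses can be recovered by concatenating syllables up to each "+" marker, which is what makes boundary decoration useful when word error rate is computed on the output.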
Anthology ID:
2020.lrec-1.307
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
2521–2527
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.307
Cite (ACL):
Vishwa Gupta and Gilles Boulianne. 2020. Automatic Transcription Challenges for Inuktitut, a Low-Resource Polysynthetic Language. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 2521–2527, Marseille, France. European Language Resources Association.
Cite (Informal):
Automatic Transcription Challenges for Inuktitut, a Low-Resource Polysynthetic Language (Gupta & Boulianne, LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.307.pdf