Simplified Abugidas

Chenchen Ding, Masao Utiyama, Eiichiro Sumita


Abstract
An abugida is a writing system in which consonant letters represent syllables with a default vowel, and other vowels are denoted by diacritics. We investigate the feasibility of recovering the original text written in an abugida after omitting subordinate diacritics and merging consonant letters with similar phonetic values. This is crucial for developing more efficient input methods by reducing the complexity in abugidas. Four abugidas in the southern Brahmic family, namely Thai, Burmese, Khmer, and Lao, were studied using a 20,000-sentence newswire dataset. We compared the recovery performance of a support vector machine and an LSTM-based recurrent neural network, finding that the abugida graphemes could be recovered with 94% to 97% accuracy at the top-1 level and 98% to 99% at the top-4 level, even after omitting most diacritics (10 to 30 types) and merging the remaining 30 to 50 characters into 21 graphemes.
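The simplification step described in the abstract (omit diacritics, then merge consonant letters with similar phonetic values) can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration: the `MERGE` table is a hypothetical three-group example for Thai, not the 21-grapheme inventory used in the paper, and dropping every Unicode combining mark (category `Mn`) is only a rough stand-in for the paper's choice of which diacritics to omit.

```python
import unicodedata

# Hypothetical merge table, for illustration only. It is NOT the paper's
# 21-grapheme inventory: a few Thai consonants with similar phonetic values
# are mapped to one representative letter.
MERGE = {
    "ข": "ก", "ค": "ก", "ฆ": "ก",  # velar stops
    "ถ": "ต", "ท": "ต", "ธ": "ต",  # dental stops
    "ผ": "ป", "พ": "ป", "ภ": "ป",  # labial stops
}


def simplify(text: str) -> str:
    """Drop combining marks, then merge similar consonant letters."""
    out = []
    for ch in text:
        # Category "Mn" (nonspacing mark) covers Thai above/below vowel
        # signs and tone marks; spacing vowel letters are left in place.
        if unicodedata.category(ch) == "Mn":
            continue
        out.append(MERGE.get(ch, ch))
    return "".join(out)


# Two different spellings collapse to the same simplified string, which is
# why recovering the original text requires a model that uses context.
assert simplify("\u0e02\u0e32") == simplify("\u0e04\u0e32")
```

The mapping is many-to-one, so it cannot be inverted by lookup alone. Recovery is the task the abstract assigns to the support vector machine and the LSTM-based network, which this sketch does not attempt.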
Anthology ID:
P18-2078
Volume:
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Month:
July
Year:
2018
Address:
Melbourne, Australia
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
491–495
URL:
https://aclanthology.org/P18-2078
DOI:
10.18653/v1/P18-2078
Cite (ACL):
Chenchen Ding, Masao Utiyama, and Eiichiro Sumita. 2018. Simplified Abugidas. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 491–495, Melbourne, Australia. Association for Computational Linguistics.
Cite (Informal):
Simplified Abugidas (Ding et al., ACL 2018)
PDF:
https://preview.aclanthology.org/ingestion-script-update/P18-2078.pdf
Presentation:
 P18-2078.Presentation.pdf
Video:
 https://vimeo.com/285804249