Abstract
An abugida is a writing system where the consonant letters represent syllables with a default vowel and other vowels are denoted by diacritics. We investigate the feasibility of recovering the original text written in an abugida after omitting subordinate diacritics and merging consonant letters with similar phonetic values. This is crucial for developing more efficient input methods by reducing the complexity in abugidas. Four abugidas in the southern Brahmic family, i.e., Thai, Burmese, Khmer, and Lao, were studied using a 20,000-sentence newswire dataset. We compared the recovery performance of a support vector machine and an LSTM-based recurrent neural network, finding that the abugida graphemes could be recovered with 94%–97% accuracy at the top-1 level and 98%–99% at the top-4 level, even after omitting most diacritics (10–30 types) and merging the remaining 30–50 characters into 21 graphemes.
- Anthology ID:
- P18-2078
- Volume:
- Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
- Month:
- July
- Year:
- 2018
- Address:
- Melbourne, Australia
- Venue:
- ACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 491–495
- URL:
- https://aclanthology.org/P18-2078
- DOI:
- 10.18653/v1/P18-2078
- Cite (ACL):
- Chenchen Ding, Masao Utiyama, and Eiichiro Sumita. 2018. Simplified Abugidas. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 491–495, Melbourne, Australia. Association for Computational Linguistics.
- Cite (Informal):
- Simplified Abugidas (Ding et al., ACL 2018)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/P18-2078.pdf
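
The simplification and the top-k evaluation mentioned in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the `MERGE` table is hypothetical and is not the paper's 21-grapheme inventory, Unicode combining marks (category `Mn`) are used only as a rough stand-in for "subordinate diacritics", and the function names `simplify` and `top_k_accuracy` are invented. The recovery models themselves (SVM and LSTM) are not reproduced.

```python
import unicodedata

# Hypothetical merge table (NOT the paper's actual grapheme inventory):
# a few Thai consonants with similar phonetic values share one Latin label.
MERGE = {
    "\u0e02": "K", "\u0e04": "K", "\u0e06": "K",  # kho khai, kho khwai, kho rakhang
    "\u0e16": "T", "\u0e17": "T", "\u0e18": "T",  # tho thung, tho thahan, tho thong
}


def simplify(text: str) -> str:
    """Drop combining marks (category Mn), then merge similar consonants."""
    kept = (ch for ch in text if unicodedata.category(ch) != "Mn")
    return "".join(MERGE.get(ch, ch) for ch in kept)


def top_k_accuracy(ranked_candidates, gold, k):
    """Fraction of positions whose gold grapheme is among the first k candidates."""
    hits = sum(g in cands[:k] for cands, g in zip(ranked_candidates, gold))
    return hits / len(gold)
```

For example, `simplify("\u0e17\u0e35\u0e48")` removes the vowel sign and the tone mark and maps the remaining consonant to its merged label. A recovery model would then be scored by checking whether the original grapheme appears among its top 1 or top 4 ranked candidates at each position.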