Aloka Fernando


2021

Data Augmentation to Address Out of Vocabulary Problem in Low Resource Sinhala English Neural Machine Translation
Aloka Fernando | Surangika Ranathunga
Proceedings of the 35th Pacific Asia Conference on Language, Information and Computation

Building a Linguistic Resource: A Word Frequency List for Sinhala
Aloka Fernando | Gihan Dias
Proceedings of the 18th International Conference on Natural Language Processing (ICON)

A word frequency list is a list of the unique words in a language along with their frequency counts, generally sorted by frequency. Such a list is essential for many NLP tasks, including building language models, POS taggers, spelling checkers, and word separation guides, in addition to assisting language learners. Such lists are available for many languages, but a large-scale word list is still not available for Sinhala. We have developed a comprehensive list of words, together with their frequency and part-of-speech (POS), from a large textbase. Unlike many other such lists, our list includes a large number of low-frequency words (many of which are erroneous), which enables the analysis of such words, including the frequencies of errors. In addition to the main list, we have also prepared a list of linguistically verified words. The word frequency list and the verified word list are the largest collections of word lists available for the Sinhala language.
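
As a rough illustration of what the abstract means by a word frequency list, the sketch below counts unique tokens in a tiny sample and sorts them by frequency. It is not the authors' pipeline: the whitespace tokenization, the `build_frequency_list` name, and the sample sentence are all assumptions made for this example, and a real Sinhala textbase would need proper tokenization, normalization, and POS tagging.

```python
from collections import Counter


def build_frequency_list(text):
    """Return (word, count) pairs sorted by descending count.

    Hypothetical helper for illustration; whitespace splitting is an
    assumption and is not how a production Sinhala tokenizer would work.
    """
    tokens = text.split()
    counts = Counter(tokens)
    # Sort by count (highest first); ties are broken by the word itself
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))


# Tiny made-up sample; one word occurs twice, the rest once
sample = "මම පොත කියවමි මම පාසල යමි"
freq_list = build_frequency_list(sample)

# Low-frequency entries (count == 1) are kept rather than discarded,
# so they remain available for later inspection
low_frequency = [word for word, count in freq_list if count == 1]

for word, count in freq_list:
    print(word, count)
```

Keeping the count-1 entries mirrors the abstract's point that rare words, including erroneous ones, are retained so that they can be analysed.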