Gihan Dias


2020

ThamizhiUDp: A Dependency Parser for Tamil
Kengatharaiyer Sarveswaran | Gihan Dias
Proceedings of the 17th International Conference on Natural Language Processing (ICON)

This paper describes how we developed a neural dependency parser, ThamizhiUDp, which provides a complete pipeline for dependency parsing of Tamil text using the Universal Dependencies formalism. We considered each phase of the dependency parsing pipeline and identified tools and resources for each phase to improve accuracy and to tackle data scarcity. ThamizhiUDp uses Stanza for tokenisation and lemmatisation, ThamizhiPOSt and ThamizhiMorph for generating Part of Speech (POS) and morphological annotations, and uuparser with multilingual training for dependency parsing. ThamizhiPOSt is our POS tagger, which is based on Stanza and trained on the Amrita POS-tagged corpus. It is the current state of the art in Tamil POS tagging, with an F1 score of 93.27. Our morphological analyser, ThamizhiMorph, is a rule-based system with very good coverage of Tamil. Our dependency parser, ThamizhiUDp, was trained using multilingual data. It achieves a Labelled Attachment Score (LAS) of 62.39, 4 points higher than the best previously achieved for Tamil dependency parsing. We therefore show that breaking up the dependency parsing pipeline to accommodate existing tools and resources is a viable approach for low-resource languages.
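
For readers unfamiliar with the LAS metric quoted above, the following is a minimal sketch of how a labelled attachment score is commonly computed: the percentage of tokens whose predicted head and dependency label both match the gold annotation. It is not the authors' evaluation code; the function name, the token representation, and the toy sentence are assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical token representation: (head_index, dependency_label).
# A head index of 0 denotes the root of the sentence.
Token = Tuple[int, str]


def attachment_scores(gold: List[Token], pred: List[Token]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages over position-aligned tokens."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences must have equal length")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    # UAS counts tokens whose head is correct, ignoring the label.
    heads_correct = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS counts tokens whose head and label are both correct.
    both_correct = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return 100.0 * heads_correct / n, 100.0 * both_correct / n


# Toy four-token sentence (invented data, not taken from any treebank).
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl"), (3, "punct")]
uas, las = attachment_scores(gold, pred)
print(f"UAS = {uas:.2f}, LAS = {las:.2f}")
```

In this toy case the third token has the right head but the wrong label, so it counts towards UAS only; the fourth token has the wrong head and counts towards neither.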

2019

Using Meta-Morph Rules to develop Morphological Analysers: A case study concerning Tamil
Kengatharaiyer Sarveswaran | Gihan Dias | Miriam Butt
Proceedings of the 14th International Conference on Finite-State Methods and Natural Language Processing

This paper describes a new and larger coverage Finite-State Morphological Analyser (FSM) and Generator for the Dravidian language Tamil. The FSM has been developed in the context of computational grammar engineering, adhering to the standards of the ParGram effort. Tamil is a morphologically rich language and the interaction between linguistic analysis and formal implementation is complex, resulting in a challenging task. In order to allow the development of the FSM to focus more on the linguistic analysis and less on the formal details, we have developed a system of meta-morph(ology) rules along with a script which translates these rules into FSM processable representations. The introduction of meta-morph rules makes it possible for computationally naive linguists to interact with the system and to expand it in future work. We found that the meta-morph rules help to express linguistic generalisations and reduce the manual effort of writing lexical classes for morphological analysis. Our Tamil FSM currently handles mainly the inflectional morphology of 3,300 verb roots and their 260 forms. Further, it also has a lexicon of approximately 100,000 nouns along with a guesser to handle out-of-vocabulary items. Although the Tamil FSM was primarily developed to be part of a computational grammar, it can also be used as a web or stand-alone application for other NLP tasks, as per general ParGram practice.

2018

Improving domain-specific SMT for low-resourced languages using data from different domains
Fathima Farhath | Pranavan Theivendiram | Surangika Ranathunga | Sanath Jayasena | Gihan Dias
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Sinhala Word Joiner
Rajith Priyanga | Surangika Ranatunga | Gihan Dias
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)

2016

Sinhala Short Sentence Similarity Calculation using Corpus-Based and Knowledge-Based Similarity Measures
Jcs Kadupitiya | Surangika Ranathunga | Gihan Dias
Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)

Currently, corpus-based, string-based, and knowledge-based similarity techniques are used to compare short phrases. However, no work has been conducted on the similarity of phrases in the Sinhala language. In this paper, we present a hybrid methodology to compute the similarity between two Sinhala sentences using a Semantic Similarity Measurement technique (corpus-based similarity measurement plus knowledge-based similarity measurement) that makes use of word order information. Since Sinhala WordNet is still under construction, we used lexical resources in performing this semantic similarity calculation. Evaluation using 4000 sentence pairs yielded an average MSE of 0.145 and a Pearson correlation coefficient of 0.832.
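
The two evaluation figures reported above, mean squared error (MSE) and the Pearson correlation, can be illustrated with a short sketch. It assumes that system similarity scores are compared against human reference scores on the same scale; the function names and the four score pairs are hypothetical, and this is not the authors' evaluation script.

```python
import math
from typing import Sequence


def mean_squared_error(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Average of squared differences between paired scores."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - r) ** 2 for p, r in zip(predicted, reference)) / len(predicted)


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two paired samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must be of equal length with at least two items")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return covariance / math.sqrt(var_x * var_y)


# Invented similarity scores in [0, 1] for four sentence pairs.
system_scores = [0.9, 0.4, 0.7, 0.1]
human_scores = [1.0, 0.5, 0.5, 0.0]
print(mean_squared_error(system_scores, human_scores))
print(pearson_r(system_scores, human_scores))
```

A low MSE indicates that the system scores are close to the reference scores in absolute terms, while a high Pearson value indicates that the two sets of scores rise and fall together; the two measures are complementary.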

Automatic Creation of a Sentence Aligned Sinhala-Tamil Parallel Corpus
Riyafa Abdul Hameed | Nadeeshani Pathirennehelage | Anusha Ihalapathirana | Maryam Ziyad Mohamed | Surangika Ranathunga | Sanath Jayasena | Gihan Dias | Sandareka Fernando
Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)

A sentence-aligned parallel corpus is an important prerequisite for statistical machine translation. However, manually creating such a corpus is time-consuming and requires experts fluent in both languages. Automatically creating a sentence-aligned parallel corpus from parallel text is a solution to this problem. In this paper, we present the first empirical evaluation carried out to identify the best method for automatically creating a sentence-aligned Sinhala-Tamil parallel corpus. Annual reports from Sri Lankan government institutions were used as the parallel text for aligning. Despite both Sinhala and Tamil being under-resourced languages, we were able to achieve an F-score of 0.791 using a hybrid approach that makes use of a bilingual dictionary.
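
As an illustration of the F-score mentioned above, the sketch below scores a set of predicted sentence alignments against a gold set, assuming each alignment is an index pair (source sentence, target sentence) and that the F-score is the balanced F1. The abstract does not state how alignments were represented or scored, so the representation, the function name, and the data are all assumptions.

```python
from typing import Set, Tuple

# Hypothetical alignment representation: (source_index, target_index).
Alignment = Tuple[int, int]


def alignment_prf(gold: Set[Alignment], pred: Set[Alignment]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for predicted alignment pairs."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example: four gold alignments, three predicted, two of them correct.
gold_pairs = {(0, 0), (1, 1), (2, 2), (3, 3)}
pred_pairs = {(0, 0), (1, 1), (2, 3)}
precision, recall, f1 = alignment_prf(gold_pairs, pred_pairs)
print(f"P = {precision:.3f}, R = {recall:.3f}, F1 = {f1:.3f}")
```

Scoring exact pairs is the strictest choice; an evaluation that gives partial credit for many-to-one alignments would need a different matching rule.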

Comprehensive Part-Of-Speech Tag Set and SVM based POS Tagger for Sinhala
Sandareka Fernando | Surangika Ranathunga | Sanath Jayasena | Gihan Dias
Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)

This paper presents a new, comprehensive multi-level Part-Of-Speech tag set and a Support Vector Machine based Part-Of-Speech tagger for the Sinhala language. The currently available tag set for Sinhala has two limitations: the unavailability of tags to represent some word classes and the lack of tags to capture inflection-based grammatical variations of words. The new tag set presented in this paper overcomes both of these limitations. The accuracy of available Sinhala Part-Of-Speech taggers, which are based on Hidden Markov Models, still falls far behind the state of the art. Our Support Vector Machine based tagger achieved an overall accuracy of 84.68%, with 59.86% accuracy for unknown words and 87.12% for known words, when the test set contains 10% unknown words.

2014

Building a WordNet for Sinhala
Indeewari Wijesiri | Malaka Gallage | Buddhika Gunathilaka | Madhuranga Lakjeewa | Daya Wimalasuriya | Gihan Dias | Rohini Paranavithana | Nisansa de Silva
Proceedings of the Seventh Global Wordnet Conference