2023
pdf
abs
Findings of the SIGTYP 2023 Shared task on Cognate and Derivative Detection For Low-Resourced Languages
Priya Rani
|
Koustava Goswami
|
Adrian Doyle
|
Theodorus Fransen
|
Bernardo Stearns
|
John P. McCrae
Proceedings of the 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP
This paper describes the structure and findings of the SIGTYP 2023 shared task on cognate and derivative detection for low-resourced languages, broken down into a supervised and unsupervised sub-task. The participants were asked to submit the test data’s final prediction. A total of nine teams registered for the shared task where seven teams registered for both sub-tasks. Only two participants ended up submitting system descriptions, with only one submitting systems for both sub-tasks. While all systems show a rather promising performance, all could be within the baseline score for the supervised sub-task. However, the system submitted for the unsupervised sub-task outperforms the baseline score.
pdf
The Cardamom Workbench for Historical and Under-Resourced Languages
Adrian Doyle
|
Theodorus Fransen
|
Bernardo Stearns
|
John P. McCrae
|
Oksana Dereza
|
Priya Rani
Proceedings of the 4th Conference on Language, Data and Knowledge
pdf
abs
Weakly-supervised Deep Cognate Detection Framework for Low-Resourced Languages Using Morphological Knowledge of Closely-Related Languages
Koustava Goswami
|
Priya Rani
|
Theodorus Fransen
|
John McCrae
Findings of the Association for Computational Linguistics: EMNLP 2023
Exploiting cognates for transfer learning in under-resourced languages is an exciting opportunity for language understanding tasks, including unsupervised machine translation, named entity recognition and information retrieval. Previous approaches mainly focused on supervised cognate detection tasks based on orthographic, phonetic or state-of-the-art contextual language models, which under-perform for most under-resourced languages. This paper proposes a novel language-agnostic weakly-supervised deep cognate detection framework for under-resourced languages using morphological knowledge from closely related languages. We train an encoder to gain morphological knowledge of a language and transfer the knowledge to perform unsupervised and weakly-supervised cognate detection tasks with and without the pivot language for the closely-related languages. While unsupervised, it overcomes the need for hand-crafted annotation of cognates. We performed experiments on different published cognate detection datasets across language families and observed not only significant improvement over the state-of-the-art but also our method outperformed the state-of-the-art supervised and unsupervised methods. Our model can be extended to a wide range of languages from any language family as it overcomes the requirement of the annotation of the cognate pairs for training.
2022
pdf
abs
MHE: Code-Mixed Corpora for Similar Language Identification
Priya Rani
|
John P. McCrae
|
Theodorus Fransen
Proceedings of the Thirteenth Language Resources and Evaluation Conference
This paper introduces a new Magahi-Hindi-English (MHE) code-mixed data-set for similar language identification (SMLID), where Magahi is a less-resourced minority language. This corpus provides a language id at two levels: word and sentence. This data-set is the first Magahi-Hindi-English code-mixed data-set for similar language identification task. Furthermore, we will discuss the complexity of the data-set and provide a few baselines for the language identification task.
2021
pdf
abs
ULD-NUIG at Social Media Mining for Health Applications (#SMM4H) Shared Task 2021
Atul Kr. Ojha
|
Priya Rani
|
Koustava Goswami
|
Bharathi Raja Chakravarthi
|
John P. McCrae
Proceedings of the Sixth Social Media Mining for Health (#SMM4H) Workshop and Shared Task
Social media platforms such as Twitter and Facebook have been utilised for various research studies, from the cohort-level discussion to community-driven approaches to address the challenges in utilizing social media data for health, clinical and biomedical information. Detection of medical jargon’s, named entity recognition, multi-word expression becomes the primary, fundamental steps in solving those challenges. In this paper, we enumerate the ULD-NUIG team’s system, designed as part of Social Media Mining for Health Applications (#SMM4H) Shared Task 2021. The team conducted a series of experiments to explore the challenges of task 6 and task 5. The submitted systems achieve F-1 0.84 and 0.53 score for task 6 and 5 respectively.
2020
pdf
abs
NUIG-Panlingua-KMI Hindi-Marathi MT Systems for Similar Language Translation Task @ WMT 2020
Atul Kr. Ojha
|
Priya Rani
|
Akanksha Bansal
|
Bharathi Raja Chakravarthi
|
Ritesh Kumar
|
John P. McCrae
Proceedings of the Fifth Conference on Machine Translation
NUIG-Panlingua-KMI submission to WMT 2020 seeks to push the state-of-the-art in Similar Language Translation Task for Hindi↔Marathi language pair. As part of these efforts, we conducteda series of experiments to address the challenges for translation between similar languages. Among the 4 MT systems prepared under this task, 1 PBSMT systems were prepared for Hindi↔Marathi each and 1 NMT systems were developed for Hindi↔Marathi using Byte PairEn-coding (BPE) into subwords. The results show that different architectures NMT could be an effective method for developing MT systems for closely related languages. Our Hindi-Marathi NMT system was ranked 8th among the 14 teams that participated and our Marathi-Hindi NMT system was ranked 8th among the 11 teams participated for the task.
pdf
abs
A Comparative Study of Different State-of-the-Art Hate Speech Detection Methods in Hindi-English Code-Mixed Data
Priya Rani
|
Shardul Suryawanshi
|
Koustava Goswami
|
Bharathi Raja Chakravarthi
|
Theodorus Fransen
|
John Philip McCrae
Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying
Hate speech detection in social media communication has become one of the primary concerns to avoid conflicts and curb undesired activities. In an environment where multilingual speakers switch among multiple languages, hate speech detection becomes a challenging task using methods that are designed for monolingual corpora. In our work, we attempt to analyze, detect and provide a comparative study of hate speech in a code-mixed social media text. We also provide a Hindi-English code-mixed data set consisting of Facebook and Twitter posts and comments. Our experiments show that deep learning models trained on this code-mixed corpus perform better.
pdf
abs
ULD@NUIG at SemEval-2020 Task 9: Generative Morphemes with an Attention Model for Sentiment Analysis in Code-Mixed Text
Koustava Goswami
|
Priya Rani
|
Bharathi Raja Chakravarthi
|
Theodorus Fransen
|
John P. McCrae
Proceedings of the Fourteenth Workshop on Semantic Evaluation
Code mixing is a common phenomena in multilingual societies where people switch from one language to another for various reasons. Recent advances in public communication over different social media sites have led to an increase in the frequency of code-mixed usage in written language. In this paper, we present the Generative Morphemes with Attention (GenMA) Model sentiment analysis system contributed to SemEval 2020 Task 9 SentiMix. The system aims to predict the sentiments of the given English-Hindi code-mixed tweets without using word-level language tags instead inferring this automatically using a morphological model. The system is based on a novel deep neural network (DNN) architecture, which has outperformed the baseline F1-score on the test data-set as well as the validation data-set. Our results can be found under the user name “koustava” on the “Sentimix Hindi English” page.
2019
pdf
abs
KMI-Coling at SemEval-2019 Task 6: Exploring N-grams for Offensive Language detection
Priya Rani
|
Atul Kr. Ojha
Proceedings of the 13th International Workshop on Semantic Evaluation
In this paper, we present the system description of Offensive language detection tool which is developed by the KMI_Coling under the OffensEval Shared task. The OffensEval Shared Task was conducted in SemEval 2019 workshop. To develop the system, we have explored n-grams up to 8-gram and trained three different namely A, B and C systems for three different subtasks within the OffensEval task which achieves 79.76%, 87.91% and 44.37% accuracy respectively. The task was completed using the dataset provided to us by the OffensEval organisers was the part of OLID dataset. It consists of 13,240 tweets extracted from twitter and were annotated at three levels using crowdsourcing.
pdf
abs
Panlingua-KMI MT System for Similar Language Translation Task at WMT 2019
Atul Kr. Ojha
|
Ritesh Kumar
|
Akanksha Bansal
|
Priya Rani
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)
The present paper enumerates the development of Panlingua-KMI Machine Translation (MT) systems for Hindi ↔ Nepali language pair, designed as part of the Similar Language Translation Task at the WMT 2019 Shared Task. The Panlingua-KMI team conducted a series of experiments to explore both the phrase-based statistical (PBSMT) and neural methods (NMT). Among the 11 MT systems prepared under this task, 6 PBSMT systems were prepared for Nepali-Hindi, 1 PBSMT for Hindi-Nepali and 2 NMT systems were developed for Nepali↔Hindi. The results show that PBSMT could be an effective method for developing MT systems for closely-related languages. Our Hindi-Nepali PBSMT system was ranked 2nd among the 13 systems submitted for the pair and our Nepali-Hindi PBSMTsystem was ranked 4th among the 12 systems submitted for the task.