2024
Findings of the WILDRE Shared Task on Code-mixed Less-resourced Sentiment Analysis for Indo-Aryan Languages
Priya Rani | Gaurav Negi | Saroj Jha | Shardul Suryawanshi | Atul Kr. Ojha | Paul Buitelaar | John P. McCrae
Proceedings of the 7th Workshop on Indian Language Data: Resources and Evaluation
This paper describes the structure and findings of the WILDRE 2024 shared task on Code-mixed Less-resourced Sentiment Analysis for Indo-Aryan Languages. Participants were asked to submit their final predictions on the test data via CodaLab. A total of fourteen teams registered for the shared task, but only four submitted systems for evaluation on CodaLab, and only two of these submitted a system description paper. All submitted systems show rather promising performance and outperform the baseline scores.
MaCmS: Magahi Code-mixed Dataset for Sentiment Analysis
Priya Rani | Theodorus Fransen | John P. McCrae | Gaurav Negi
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
The present paper introduces a new sentiment dataset, MaCmS, for the Magahi-Hindi-English (MHE) code-mixed language, where Magahi is a less-resourced minority language. It is the first Magahi-Hindi-English code-mixed dataset for sentiment analysis tasks. Further, we provide a linguistic analysis of the dataset to understand the structure of code-mixing and a statistical study of the language preferences of speakers across polarities. Alongside these analyses, we also train baseline models to evaluate the dataset’s quality.
Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP
Michael Hahn | Alexey Sorokin | Ritesh Kumar | Andreas Shcherbakov | Yulia Otmakhova | Jinrui Yang | Oleg Serikov | Priya Rani | Edoardo M. Ponti | Saliha Muradoğlu | Rena Gao | Ryan Cotterell | Ekaterina Vylomova
Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP
Findings of the SIGTYP 2024 Shared Task on Word Embedding Evaluation for Ancient and Historical Languages
Oksana Dereza | Adrian Doyle | Priya Rani | Atul Kr. Ojha | Pádraic Moran | John McCrae
Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP
This paper discusses the organisation and findings of the SIGTYP 2024 Shared Task on Word Embedding Evaluation for Ancient and Historical Languages. The shared task was split into constrained and unconstrained tracks and involved solving either 3 or 5 problems for either 13 or 16 ancient and historical languages belonging to 4 language families and making use of 6 different scripts. There were 14 registrations in total, of which 3 teams submitted to each track. Of these 6 submissions, 2 systems were successful in the constrained setting and another 2 in the unconstrained setting, and 4 system description papers were submitted by different teams. The best average result for morphological feature prediction was about 96%, while the best average results for POS-tagging and lemmatisation were 96% and 94% respectively. At the word level, the winning team achieved an average accuracy of only 5.95% across all 16 languages, which demonstrates the difficulty of this problem. At the character level, the best average result over the 16 languages was 55.62%.
CHAMUÇA: Towards a Linked Data Language Resource of Portuguese Borrowings in Asian Languages
Fahad Khan | Ana Salgado | Isuri Anuradha | Rute Costa | Chamila Liyanage | John P. McCrae | Atul Kumar Ojha | Priya Rani | Francesca Frontini
Proceedings of the 9th Workshop on Linked Data in Linguistics @ LREC-COLING 2024
This paper presents the development of CHAMUÇA, a novel lexical resource designed to document the influence of the Portuguese language on various Asian languages, with an initial focus on the languages of South Asia. Through the use of linked open data and the OntoLex vocabulary, CHAMUÇA offers structured insights into the linguistic characteristics and cultural ramifications of Portuguese borrowings across multiple languages. The article outlines CHAMUÇA’s potential contributions to the linguistic linked data community, emphasising its role in addressing the scarcity of resources for lesser-resourced languages and serving as a test case for organising etymological data in a queryable format. CHAMUÇA emerges as an initiative, based on linked data technology, towards the comprehensive cataloguing and analysis of Portuguese borrowings, offering valuable insights into language contact dynamics, historical evolution, and cultural exchange in Asia.
Teanga Data Model for Linked Corpora
John P. McCrae | Priya Rani | Adrian Doyle | Bernardo Stearns
Proceedings of the 9th Workshop on Linked Data in Linguistics @ LREC-COLING 2024
Corpus data is the main source of data for natural language processing applications; however, no standard or model for corpus data has become predominant in the field. Linguistic linked data aims to provide methods by which data can be made findable, accessible, interoperable and reusable (FAIR). However, current attempts to create a linked data format for corpora have been unsuccessful due to the verbose and specialised formats that they use. In this work, we present the Teanga data model, which uses a layered annotation model to capture all NLP-relevant annotations. We present the YAML serialization of the model, which is concise and uses a widely-deployed format, and we describe how it can be interpreted as RDF. Finally, we demonstrate three examples of the use of the Teanga data model for syntactic annotation, literary analysis and multilingual corpora.
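To make the layered-annotation idea more concrete, the following is a minimal illustrative sketch, not the actual Teanga schema defined in the paper: a document whose annotation layers point into a base text by character offsets, serialized to YAML with PyYAML. The layer names ("tokens", "pos") and the exact structure are hypothetical.

# Illustrative sketch only: a layered-annotation document in the spirit of
# the Teanga model described above. Layer names and structure are
# hypothetical, not the actual Teanga schema.
import yaml

document = {
    "text": "Corpus data is the main source of data.",
    # Each layer annotates the base text: "tokens" holds character-offset
    # spans, and "pos" assigns one tag per token span.
    "layers": {
        "tokens": [[0, 6], [7, 11], [12, 14], [15, 18], [19, 23],
                   [24, 30], [31, 33], [34, 38]],
        "pos": ["NOUN", "NOUN", "AUX", "DET", "ADJ",
                "NOUN", "ADP", "NOUN"],
    },
}

# YAML keeps the serialization concise and human-readable.
print(yaml.safe_dump(document, sort_keys=False, allow_unicode=True))

Because every layer refers back to the base text rather than duplicating it, additional annotation layers can be stacked without altering existing ones.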
2023
The Cardamom Workbench for Historical and Under-Resourced Languages
Adrian Doyle | Theodorus Fransen | Bernardo Stearns | John P. McCrae | Oksana Dereza | Priya Rani
Proceedings of the 4th Conference on Language, Data and Knowledge
Weakly-supervised Deep Cognate Detection Framework for Low-Resourced Languages Using Morphological Knowledge of Closely-Related Languages
Koustava Goswami | Priya Rani | Theodorus Fransen | John McCrae
Findings of the Association for Computational Linguistics: EMNLP 2023
Exploiting cognates for transfer learning in under-resourced languages is an exciting opportunity for language understanding tasks, including unsupervised machine translation, named entity recognition and information retrieval. Previous approaches mainly focused on supervised cognate detection based on orthographic, phonetic or state-of-the-art contextual language models, which under-perform for most under-resourced languages. This paper proposes a novel language-agnostic weakly-supervised deep cognate detection framework for under-resourced languages using morphological knowledge from closely related languages. We train an encoder to gain morphological knowledge of a language and transfer this knowledge to perform unsupervised and weakly-supervised cognate detection, with and without a pivot language, for closely related languages. Being unsupervised, the approach removes the need for hand-crafted annotation of cognates. We performed experiments on different published cognate detection datasets across language families and observed that our method not only improves significantly over the state of the art but also outperforms state-of-the-art supervised and unsupervised methods. Our model can be extended to a wide range of languages from any language family, as it removes the requirement of annotated cognate pairs for training.
Findings of the SIGTYP 2023 Shared task on Cognate and Derivative Detection For Low-Resourced Languages
Priya Rani | Koustava Goswami | Adrian Doyle | Theodorus Fransen | Bernardo Stearns | John P. McCrae
Proceedings of the 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP
This paper describes the structure and findings of the SIGTYP 2023 shared task on cognate and derivative detection for low-resourced languages, broken down into a supervised and an unsupervised sub-task. The participants were asked to submit their final predictions on the test data. A total of nine teams registered for the shared task, of which seven registered for both sub-tasks. Only two participants ended up submitting system descriptions, with only one submitting systems for both sub-tasks. While all systems show rather promising performance, none could surpass the baseline score for the supervised sub-task. However, the system submitted for the unsupervised sub-task outperforms the baseline score.
2022
MHE: Code-Mixed Corpora for Similar Language Identification
Priya Rani | John P. McCrae | Theodorus Fransen
Proceedings of the Thirteenth Language Resources and Evaluation Conference
This paper introduces a new Magahi-Hindi-English (MHE) code-mixed dataset for similar language identification (SMLID), where Magahi is a less-resourced minority language. The corpus provides language IDs at two levels: word and sentence. It is the first Magahi-Hindi-English code-mixed dataset for the similar language identification task. Furthermore, we discuss the complexity of the dataset and provide a few baselines for the language identification task.
2021
ULD-NUIG at Social Media Mining for Health Applications (#SMM4H) Shared Task 2021
Atul Kr. Ojha | Priya Rani | Koustava Goswami | Bharathi Raja Chakravarthi | John P. McCrae
Proceedings of the Sixth Social Media Mining for Health (#SMM4H) Workshop and Shared Task
Social media platforms such as Twitter and Facebook have been utilised for various research studies, from cohort-level discussion to community-driven approaches, to address the challenges in utilising social media data for health, clinical and biomedical information. Detection of medical jargon, named entity recognition and multi-word expression identification are the primary, fundamental steps in solving those challenges. In this paper, we describe the ULD-NUIG team’s system, designed as part of the Social Media Mining for Health Applications (#SMM4H) Shared Task 2021. The team conducted a series of experiments to explore the challenges of Task 6 and Task 5. The submitted systems achieve F1 scores of 0.84 and 0.53 for Task 6 and Task 5 respectively.
2020
ULD@NUIG at SemEval-2020 Task 9: Generative Morphemes with an Attention Model for Sentiment Analysis in Code-Mixed Text
Koustava Goswami | Priya Rani | Bharathi Raja Chakravarthi | Theodorus Fransen | John P. McCrae
Proceedings of the Fourteenth Workshop on Semantic Evaluation
Code mixing is a common phenomenon in multilingual societies, where people switch from one language to another for various reasons. Recent advances in public communication over different social media sites have led to an increase in the frequency of code-mixed usage in written language. In this paper, we present the Generative Morphemes with Attention (GenMA) model, our sentiment analysis system contributed to SemEval 2020 Task 9 SentiMix. The system aims to predict the sentiment of given English-Hindi code-mixed tweets without using word-level language tags, instead inferring these automatically using a morphological model. The system is based on a novel deep neural network (DNN) architecture, which outperformed the baseline F1-score on both the test set and the validation set. Our results can be found under the user name “koustava” on the “Sentimix Hindi English” page.
A Comparative Study of Different State-of-the-Art Hate Speech Detection Methods in Hindi-English Code-Mixed Data
Priya Rani | Shardul Suryawanshi | Koustava Goswami | Bharathi Raja Chakravarthi | Theodorus Fransen | John Philip McCrae
Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying
Hate speech detection in social media communication has become one of the primary concerns for avoiding conflicts and curbing undesired activities. In an environment where multilingual speakers switch among multiple languages, hate speech detection becomes a challenging task using methods that are designed for monolingual corpora. In our work, we attempt to analyze, detect and provide a comparative study of hate speech in code-mixed social media text. We also provide a Hindi-English code-mixed dataset consisting of Facebook and Twitter posts and comments. Our experiments show that deep learning models trained on this code-mixed corpus perform better.
NUIG-Panlingua-KMI Hindi-Marathi MT Systems for Similar Language Translation Task @ WMT 2020
Atul Kr. Ojha | Priya Rani | Akanksha Bansal | Bharathi Raja Chakravarthi | Ritesh Kumar | John P. McCrae
Proceedings of the Fifth Conference on Machine Translation
The NUIG-Panlingua-KMI submission to WMT 2020 seeks to push the state of the art in the Similar Language Translation Task for the Hindi↔Marathi language pair. As part of these efforts, we conducted a series of experiments to address the challenges of translation between similar languages. Among the 4 MT systems prepared under this task, 1 PBSMT system was prepared for each direction of Hindi↔Marathi, and 1 NMT system was developed for each direction using Byte Pair Encoding (BPE) into subwords. The results show that NMT with different architectures could be an effective method for developing MT systems for closely related languages. Our Hindi-Marathi NMT system was ranked 8th among the 14 teams that participated, and our Marathi-Hindi NMT system was ranked 8th among the 11 teams that participated in the task.
2019
KMI-Coling at SemEval-2019 Task 6: Exploring N-grams for Offensive Language detection
Priya Rani | Atul Kr. Ojha
Proceedings of the 13th International Workshop on Semantic Evaluation
In this paper, we present the system description of the offensive language detection tool developed by the KMI_Coling team for the OffensEval shared task, conducted at the SemEval 2019 workshop. To develop the system, we explored n-grams up to 8-grams and trained three systems, namely A, B and C, for the three subtasks within the OffensEval task, achieving 79.76%, 87.91% and 44.37% accuracy respectively. The task used the dataset provided by the OffensEval organisers, which is part of the OLID dataset; it consists of 13,240 tweets extracted from Twitter and annotated at three levels using crowdsourcing.
Panlingua-KMI MT System for Similar Language Translation Task at WMT 2019
Atul Kr. Ojha | Ritesh Kumar | Akanksha Bansal | Priya Rani
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)
The present paper describes the development of the Panlingua-KMI Machine Translation (MT) systems for the Hindi↔Nepali language pair, designed as part of the Similar Language Translation Task at the WMT 2019 Shared Task. The Panlingua-KMI team conducted a series of experiments to explore both phrase-based statistical (PBSMT) and neural (NMT) methods. Among the 11 MT systems prepared under this task, 6 PBSMT systems were prepared for Nepali-Hindi, 1 PBSMT for Hindi-Nepali and 2 NMT systems were developed for Nepali↔Hindi. The results show that PBSMT could be an effective method for developing MT systems for closely-related languages. Our Hindi-Nepali PBSMT system was ranked 2nd among the 13 systems submitted for the pair and our Nepali-Hindi PBSMT system was ranked 4th among the 12 systems submitted for the task.