Atul Kr. Ojha

Also published as: Atul Ku. Ojha


2021

pdf bib
ULD-NUIG at Social Media Mining for Health Applications (#SMM4H) Shared Task 2021
Atul Kr. Ojha | Priya Rani | Koustava Goswami | Bharathi Raja Chakravarthi | John P. McCrae
Proceedings of the Sixth Social Media Mining for Health (#SMM4H) Workshop and Shared Task

Social media platforms such as Twitter and Facebook have been utilised for various research studies, from the cohort-level discussion to community-driven approaches to address the challenges in utilizing social media data for health, clinical and biomedical information. Detection of medical jargon’s, named entity recognition, multi-word expression becomes the primary, fundamental steps in solving those challenges. In this paper, we enumerate the ULD-NUIG team’s system, designed as part of Social Media Mining for Health Applications (#SMM4H) Shared Task 2021. The team conducted a series of experiments to explore the challenges of task 6 and task 5. The submitted systems achieve F-1 0.84 and 0.53 score for task 6 and 5 respectively.

pdf bib
Proceedings of the 4th Workshop on Technologies for MT of Low Resource Languages (LoResMT2021)
John Ortega | Atul Kr. Ojha | Katharina Kann | Chao-Hong Liu
Proceedings of the 4th Workshop on Technologies for MT of Low Resource Languages (LoResMT2021)

pdf bib
Findings of the LoResMT 2021 Shared Task on COVID and Sign Language for Low-resource Languages
Atul Kr. Ojha | Chao-Hong Liu | Katharina Kann | John Ortega | Sheetal Shatam | Theodorus Fransen
Proceedings of the 4th Workshop on Technologies for MT of Low Resource Languages (LoResMT2021)

We present the findings of the LoResMT 2021 shared task which focuses on machine translation (MT) of COVID-19 data for both low-resource spoken and sign languages. The organization of this task was conducted as part of the fourth workshop on technologies for machine translation of low resource languages (LoResMT). Parallel corpora is presented and publicly available which includes the following directions: English↔Irish, English↔Marathi, and Taiwanese Sign language↔Traditional Chinese. Training data consists of 8112, 20933 and 128608 segments, respectively. There are additional monolingual data sets for Marathi and English that consist of 21901 segments. The results presented here are based on entries from a total of eight teams. Three teams submitted systems for English↔Irish while five teams submitted systems for English↔Marathi. Unfortunately, there were no systems submissions for the Taiwanese Sign language↔Traditional Chinese task. Maximum system performance was computed using BLEU and follow as 36.0 for English–Irish, 34.6 for Irish–English, 24.2 for English–Marathi, and 31.3 for Marathi–English.

pdf bib
Few-shot and Zero-shot Approaches to Legal Text Classification: A Case Study in the Financial Sector
Rajdeep Sarkar | Atul Kr. Ojha | Jay Megaro | John Mariano | Vall Herard | John P. McCrae
Proceedings of the Natural Legal Language Processing Workshop 2021

The application of predictive coding techniques to legal texts has the potential to greatly reduce the cost of legal review of documents, however, there is such a wide array of legal tasks and continuously evolving legislation that it is hard to construct sufficient training data to cover all cases. In this paper, we investigate few-shot and zero-shot approaches that require substantially less training data and introduce a triplet architecture, which for promissory statements produces performance close to that of a supervised system. This method allows predictive coding methods to be rapidly developed for new regulations and markets.

2020

pdf bib
Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying
Ritesh Kumar | Atul Kr. Ojha | Bornini Lahiri | Marcos Zampieri | Shervin Malmasi | Vanessa Murdock | Daniel Kadar
Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying

pdf bib
Evaluating Aggression Identification in Social Media
Ritesh Kumar | Atul Kr. Ojha | Shervin Malmasi | Marcos Zampieri
Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying

In this paper, we present the report and findings of the Shared Task on Aggression and Gendered Aggression Identification organised as part of the Second Workshop on Trolling, Aggression and Cyberbullying (TRAC - 2) at LREC 2020. The task consisted of two sub-tasks - aggression identification (sub-task A) and gendered identification (sub-task B) - in three languages - Bangla, Hindi and English. For this task, the participants were provided with a dataset of approximately 5,000 instances from YouTube comments in each language. For testing, approximately 1,000 instances were provided in each language for each sub-task. A total of 70 teams registered to participate in the task and 19 teams submitted their test runs. The best system obtained a weighted F-score of approximately 0.80 in sub-task A for all the three languages. While approximately 0.87 in sub-task B for all the three languages.

pdf bib
Developing a Multilingual Annotated Corpus of Misogyny and Aggression
Shiladitya Bhattacharya | Siddharth Singh | Ritesh Kumar | Akanksha Bansal | Akash Bhagat | Yogesh Dawer | Bornini Lahiri | Atul Kr. Ojha
Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying

In this paper, we discuss the development of a multilingual annotated corpus of misogyny and aggression in Indian English, Hindi, and Indian Bangla as part of a project on studying and automatically identifying misogyny and communalism on social media (the ComMA Project). The dataset is collected from comments on YouTube videos and currently contains a total of over 20,000 comments. The comments are annotated at two levels - aggression (overtly aggressive, covertly aggressive, and non-aggressive) and misogyny (gendered and non-gendered). We describe the process of data collection, the tagset used for annotation, and issues and challenges faced during the process of annotation. Finally, we discuss the results of the baseline experiments conducted to develop a classifier for misogyny in the three languages.

pdf bib
KMI-Panlingua-IITKGP @SIGTYP2020: Exploring rules and hybrid systems for automatic prediction of typological features
Ritesh Kumar | Deepak Alok | Akanksha Bansal | Bornini Lahiri | Atul Kr. Ojha
Proceedings of the Second Workshop on Computational Research in Linguistic Typology

This paper enumerates SigTyP 2020 Shared Task on the prediction of typological features as performed by the KMI-Panlingua-IITKGP team. The task entailed the prediction of missing values in a particular language, provided, the name of the language family, its genus, location (in terms of latitude and longitude coordinates and name of the country where it is spoken) and a set of feature-value pair are available. As part of fulfillment of the aforementioned task, the team submitted 3 kinds of system - 2 rule-based and one hybrid system. Of these 3, one rule-based system generated the best performance on the test set. All the systems were ‘constrained’ in the sense that no additional dataset or information, other than those provided by the organisers, was used for developing the systems.

pdf bib
NUIG-Panlingua-KMI Hindi-Marathi MT Systems for Similar Language Translation Task @ WMT 2020
Atul Kr. Ojha | Priya Rani | Akanksha Bansal | Bharathi Raja Chakravarthi | Ritesh Kumar | John P. McCrae
Proceedings of the Fifth Conference on Machine Translation

NUIG-Panlingua-KMI submission to WMT 2020 seeks to push the state-of-the-art in Similar Language Translation Task for Hindi↔Marathi language pair. As part of these efforts, we conducteda series of experiments to address the challenges for translation between similar languages. Among the 4 MT systems prepared under this task, 1 PBSMT systems were prepared for Hindi↔Marathi each and 1 NMT systems were developed for Hindi↔Marathi using Byte PairEn-coding (BPE) into subwords. The results show that different architectures NMT could be an effective method for developing MT systems for closely related languages. Our Hindi-Marathi NMT system was ranked 8th among the 14 teams that participated and our Marathi-Hindi NMT system was ranked 8th among the 11 teams participated for the task.

pdf bib
Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages
Alina Karakanta | Atul Kr. Ojha | Chao-Hong Liu | Jade Abbott | John Ortega | Jonathan Washington | Nathaniel Oco | Surafel Melaku Lakew | Tommi A Pirinen | Valentin Malykh | Varvara Logacheva | Xiaobing Zhao
Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages

pdf bib
Findings of the LoResMT 2020 Shared Task on Zero-Shot for Low-Resource languages
Atul Kr. Ojha | Valentin Malykh | Alina Karakanta | Chao-Hong Liu
Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages

This paper presents the findings of the LoResMT 2020 Shared Task on zero-shot translation for low resource languages. This task was organised as part of the 3rd Workshop on Technologies for MT of Low Resource Languages (LoResMT) at AACL-IJCNLP 2020. The focus was on the zero-shot approach as a notable development in Neural Machine Translation to build MT systems for language pairs where parallel corpora are small or even non-existent. The shared task experience suggests that back-translation and domain adaptation methods result in better accuracy for small-size datasets. We further noted that, although translation between similar languages is no cakewalk, linguistically distinct languages require more data to give better results.

pdf bib
Proceedings of the WILDRE5– 5th Workshop on Indian Language Data: Resources and Evaluation
Girish Nath Jha | Kalika Bali | Sobha L. | S. S. Agrawal | Atul Kr. Ojha
Proceedings of the WILDRE5– 5th Workshop on Indian Language Data: Resources and Evaluation

pdf bib
Universal Dependency Treebanks for Low-Resource Indian Languages: The Case of Bhojpuri
Atul Kr. Ojha | Daniel Zeman
Proceedings of the WILDRE5– 5th Workshop on Indian Language Data: Resources and Evaluation

This paper presents the first dependency treebank for Bhojpuri, a resource-poor language that belongs to the Indo-Aryan language family. The objective behind the Bhojpuri Treebank (BHTB) project is to create a substantial, syntactically annotated treebank which not only acts as a valuable resource in building language technological tools, also helps in cross-lingual learning and typological research. Currently, the treebank consists of 4,881 annotated tokens in accordance with the annotation scheme of Universal Dependencies (UD). A Bhojpuri tagger and parser were created using machine learning approach. The accuracy of the model is 57.49% UAS, 45.50% LAS, 79.69% UPOS accuracy and 77.64% XPOS accuracy. The paper describes the details of the project including a discussion on linguistic analysis and annotation process of the Bhojpuri UD treebank.

2019

pdf bib
Panlingua-KMI MT System for Similar Language Translation Task at WMT 2019
Atul Kr. Ojha | Ritesh Kumar | Akanksha Bansal | Priya Rani
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

The present paper enumerates the development of Panlingua-KMI Machine Translation (MT) systems for Hindi ↔ Nepali language pair, designed as part of the Similar Language Translation Task at the WMT 2019 Shared Task. The Panlingua-KMI team conducted a series of experiments to explore both the phrase-based statistical (PBSMT) and neural methods (NMT). Among the 11 MT systems prepared under this task, 6 PBSMT systems were prepared for Nepali-Hindi, 1 PBSMT for Hindi-Nepali and 2 NMT systems were developed for Nepali↔Hindi. The results show that PBSMT could be an effective method for developing MT systems for closely-related languages. Our Hindi-Nepali PBSMT system was ranked 2nd among the 13 systems submitted for the pair and our Nepali-Hindi PBSMTsystem was ranked 4th among the 12 systems submitted for the task.

pdf bib
Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages
Alina Karakanta | Atul Kr. Ojha | Chao-Hong Liu | Jonathan Washington | Nathaniel Oco | Surafel Melaku Lakew | Valentin Malykh | Xiaobing Zhao
Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages

pdf bib
KMI-Coling at SemEval-2019 Task 6: Exploring N-grams for Offensive Language detection
Priya Rani | Atul Kr. Ojha
Proceedings of the 13th International Workshop on Semantic Evaluation

In this paper, we present the system description of Offensive language detection tool which is developed by the KMI_Coling under the OffensEval Shared task. The OffensEval Shared Task was conducted in SemEval 2019 workshop. To develop the system, we have explored n-grams up to 8-gram and trained three different namely A, B and C systems for three different subtasks within the OffensEval task which achieves 79.76%, 87.91% and 44.37% accuracy respectively. The task was completed using the dataset provided to us by the OffensEval organisers was the part of OLID dataset. It consists of 13,240 tweets extracted from twitter and were annotated at three levels using crowdsourcing.

2018

pdf bib
Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018)
Ritesh Kumar | Atul Kr. Ojha | Marcos Zampieri | Shervin Malmasi
Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018)

pdf bib
Benchmarking Aggression Identification in Social Media
Ritesh Kumar | Atul Kr. Ojha | Shervin Malmasi | Marcos Zampieri
Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018)

In this paper, we present the report and findings of the Shared Task on Aggression Identification organised as part of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC - 1) at COLING 2018. The task was to develop a classifier that could discriminate between Overtly Aggressive, Covertly Aggressive, and Non-aggressive texts. For this task, the participants were provided with a dataset of 15,000 aggression-annotated Facebook Posts and Comments each in Hindi (in both Roman and Devanagari script) and English for training and validation. For testing, two different sets - one from Facebook and another from a different social media - were provided. A total of 130 teams registered to participate in the task, 30 teams submitted their test runs, and finally 20 teams also sent their system description paper which are included in the TRAC workshop proceedings. The best system obtained a weighted F-score of 0.64 for both Hindi and English on the Facebook test sets, while the best scores on the surprise set were 0.60 and 0.50 for English and Hindi respectively. The results presented in this report depict how challenging the task is. The positive response from the community and the great levels of participation in the first edition of this shared task also highlights the interest in this topic.

pdf bib
The RGNLP Machine Translation Systems for WAT 2018
Atul Kr. Ojha | Koel Dutta Chowdhury | Chao-Hong Liu | Karan Saxena
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation: 5th Workshop on Asian Translation: 5th Workshop on Asian Translation

2016

pdf bib
The IMAGACT4ALL Ontology of Animated Images: Implications for Theoretical and Machine Translation of Action Verbs from English-Indian Languages
Pitambar Behera | Sharmin Muzaffar | Atul Ku. Ojha | Girish Jha
Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)

Action verbs are one of the frequently occurring linguistic elements in any given natural language as the speakers use them during every linguistic intercourse. However, each language expresses action verbs in its own inherently unique manner by categorization. One verb can refer to several interpretations of actions and one action can be expressed by more than one verb. The inter-language and intra-language variations create ambiguity for the translation of languages from the source language to target language with respect to action verbs. IMAGACT is a corpus-based ontological platform of action verbs translated from prototypic animated images explained in English and Italian as meta-languages. In this paper, we are presenting the issues and challenges in translating action verbs of Indian languages as target and English as source language by observing the animated images. Among the ten Indian languages which have been annotated so far on the platform are Sanskrit, Hindi, Urdu, Odia (Oriya), Bengali, Manipuri, Tamil, Assamese, Magahi and Marathi. Out of them, Manipuri belongs to the Sino-Tibetan, Tamil comes off the Dravidian and the rest owe their genesis to the Indo-Aryan language family. One of the issues is that the one-word morphological English verbs are translated into most of the Indian languages as verbs having more than one-word form; for instance as in the case of conjunct, compound, serial verbs and so on. We are further presenting a cross-lingual comparison of action verbs among Indian languages. In addition, we are also dealing with the issues in disambiguating animated images by the L1 native speakers using competence-based judgements and the theoretical and machine translation implications they bear.