2024
Multilingual Text Style Transfer: Datasets & Models for Indian Languages
Sourabrata Mukherjee
|
Atul Kr. Ojha
|
Akanksha Bansal
|
Deepak Alok
|
John P. McCrae
|
Ondrej Dusek
Proceedings of the 17th International Natural Language Generation Conference
Text style transfer (TST) involves altering the linguistic style of a text while preserving its style-independent content. This paper focuses on sentiment transfer, a popular TST subtask, across a spectrum of Indian languages: Hindi, Magahi, Malayalam, Marathi, Punjabi, Odia, Telugu, and Urdu, expanding upon previous work on English-Bangla sentiment transfer. We introduce dedicated datasets of 1,000 positive and 1,000 negative style-parallel sentences for each of these eight languages. We then evaluate the performance of various benchmark models categorized into parallel, non-parallel, cross-lingual, and shared learning approaches, including the Llama2 and GPT-3.5 large language models (LLMs). Our experiments highlight the significance of parallel data in TST and demonstrate the effectiveness of the Masked Style Filling (MSF) approach in non-parallel techniques. Moreover, cross-lingual and joint multilingual learning methods show promise, offering insights into selecting optimal models tailored to the specific language and task requirements. To the best of our knowledge, this work represents the first comprehensive exploration of the TST task as sentiment transfer across a diverse set of languages.
2023
Low-Resource Text Style Transfer for Bangla: Data & Models
Sourabrata Mukherjee
|
Akanksha Bansal
|
Pritha Majumdar
|
Atul Kr. Ojha
|
Ondřej Dušek
Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)
Text style transfer (TST) involves modifying the linguistic style of a given text while retaining its core content. This paper addresses the challenging task of text style transfer in the Bangla language, which is low-resourced in this area. We present a novel Bangla dataset that facilitates text sentiment transfer, a subtask of TST, enabling the transformation of positive sentiment sentences to negative and vice versa. To establish a high-quality base for further research, we refined and corrected an existing English dataset of 1,000 sentences for sentiment transfer based on Yelp reviews, and we introduce a new human-translated Bangla dataset that parallels its English counterpart. Furthermore, we offer multiple benchmark models that serve as a validation of the dataset and baseline for further research.
Text Detoxification as Style Transfer in English and Hindi
Sourabrata Mukherjee
|
Akanksha Bansal
|
Atul Kr. Ojha
|
John P. McCrae
|
Ondrej Dusek
Proceedings of the 20th International Conference on Natural Language Processing (ICON)
This paper focuses on text detoxification, i.e., automatically converting toxic text into nontoxic text. This task contributes to safer and more respectful online communication and can be considered a Text Style Transfer (TST) task, where the text’s style changes while its content is preserved. We present three approaches: (i) knowledge transfer from a similar task, (ii) a multi-task learning approach, combining sequence-to-sequence modeling with various toxicity classification tasks, and (iii) a delete-and-reconstruct approach. To support our research, we utilize a dataset provided by Dementieva et al. (2021), which contains multiple versions of detoxified texts corresponding to toxic texts. In our experiments, we selected the best variants through expert human annotators, creating a dataset where each toxic sentence is paired with a single, appropriate detoxified version. Additionally, we introduced a small Hindi parallel dataset, aligning with a part of the English dataset, suitable for evaluation purposes. Our results demonstrate that our approach effectively balances text detoxification while preserving the actual content and maintaining fluency.
2022
The ComMA Dataset V0.2: Annotating Aggression and Bias in Multilingual Social Media Discourse
Ritesh Kumar
|
Shyam Ratan
|
Siddharth Singh
|
Enakshi Nandi
|
Laishram Niranjana Devi
|
Akash Bhagat
|
Yogesh Dawer
|
Bornini Lahiri
|
Akanksha Bansal
|
Atul Kr. Ojha
Proceedings of the Thirteenth Language Resources and Evaluation Conference
In this paper, we discuss the development of a multilingual dataset annotated with a hierarchical, fine-grained tagset marking different types of aggression and the “context” in which they occur. The context, here, is defined by the conversational thread in which a specific comment occurs and also the “type” of discursive role that the comment is performing with respect to the previous comment. The initial dataset discussed here consists of a total of 59,152 annotated comments in four languages - Meitei, Bangla, Hindi, and Indian English - collected from various social media platforms such as YouTube, Facebook, Twitter and Telegram. As is usual on social media websites, a large number of these comments are multilingual, mostly code-mixed with English. The paper gives a detailed description of the tagset used for annotation and of the process of developing a multi-label, fine-grained tagset for marking comments with aggression and bias of various kinds, including sexism (called gender bias in the tagset), religious intolerance (called communal bias in the tagset), class/caste bias and ethnic/racial bias. We also define and discuss the tags used for marking the different discursive roles performed through the comments, such as attack, defend, etc. Finally, we present a basic statistical analysis of the dataset. The dataset is being incrementally made publicly available on the project website.
Bengali and Magahi PUD Treebank and Parser
Pritha Majumdar
|
Deepak Alok
|
Akanksha Bansal
|
Atul Kr. Ojha
|
John P. McCrae
Proceedings of the WILDRE-6 Workshop within the 13th Language Resources and Evaluation Conference
This paper presents the development of the Parallel Universal Dependency (PUD) Treebank for two Indo-Aryan languages: Bengali and Magahi. A treebank of 1,000 sentences has been created using a parallel corpus of English and the UD framework. A preliminary set of sentences was annotated manually - 600 for Bengali and 200 for Magahi. The rest of the sentences were built using the Bengali and Magahi parsers. The sentences have been translated and annotated manually by the authors, some of whom are also native speakers of the languages. The objective behind this work is to build a syntactically annotated linguistic repository for the aforementioned languages that can prove to be a useful resource for building further NLP tools. Additionally, Bengali and Magahi parsers, built on a machine learning approach, were also created. The accuracy of the Bengali parser is 78.13% in the case of UPOS; 76.99% in the case of XPOS; 56.12% in the case of UAS; and 47.19% in the case of LAS. The accuracy of the Magahi parser is 71.53% in the case of UPOS; 66.44% in the case of XPOS; 58.05% in the case of UAS; and 33.07% in the case of LAS. This paper also includes an illustration of the annotation schema followed, the findings of the Parallel Universal Dependency (PUD) treebank, and its resulting linguistic analysis.
2021
Proceedings of the 18th International Conference on Natural Language Processing: Shared Task on Multilingual Gender Biased and Communal Language Identification
Ritesh Kumar
|
Siddharth Singh
|
Enakshi Nandi
|
Shyam Ratan
|
Laishram Niranjana Devi
|
Bornini Lahiri
|
Akanksha Bansal
|
Akash Bhagat
|
Yogesh Dawer
Proceedings of the 18th International Conference on Natural Language Processing: Shared Task on Multilingual Gender Biased and Communal Language Identification
ComMA@ICON: Multilingual Gender Biased and Communal Language Identification Task at ICON-2021
Ritesh Kumar
|
Shyam Ratan
|
Siddharth Singh
|
Enakshi Nandi
|
Laishram Niranjana Devi
|
Akash Bhagat
|
Yogesh Dawer
|
Bornini Lahiri
|
Akanksha Bansal
Proceedings of the 18th International Conference on Natural Language Processing: Shared Task on Multilingual Gender Biased and Communal Language Identification
This paper presents the findings of the ICON-2021 shared task on Multilingual Gender Biased and Communal Language Identification, which aims to identify aggression, gender bias, and communal bias in data presented in four languages: Meitei, Bangla, Hindi and English. The participants were given the option of approaching the task as three separate classification tasks, a multi-label classification task, or a structured classification task. If approached as three separate classification tasks, the task includes three sub-tasks: aggression identification (sub-task A), gender bias identification (sub-task B), and communal bias identification (sub-task C). For this task, the participating teams were provided with a total dataset of approximately 12,000 comments, with 3,000 comments in each of the four languages, sourced from popular social media sites such as YouTube, Twitter, Facebook and Telegram, and the three labels presented as a single tuple. For testing the systems, approximately 1,000 comments were provided in each language for every sub-task. We attracted a total of 54 registrations in the task, out of which 11 teams submitted their test runs. The best system obtained an overall instance-F1 of 0.371 on the multilingual test set (which was simply a combined test set of the instances in each individual language). In the individual sub-tasks, the best micro F1 scores are 0.539, 0.767 and 0.834 for sub-tasks A, B and C, respectively. The best overall averaged micro F1 is 0.713. The results show that while systems have managed to perform reasonably well in individual sub-tasks, especially the gender bias and communal bias tasks, it is substantially more difficult to do a 3-class classification of aggression level and even more difficult to build a system that classifies everything correctly.
It is only in slightly over 1/3 of the instances that most of the systems predicted the correct class across the board, despite the fact that there was a significant overlap across the three sub-tasks.
2020
KMI-Panlingua-IITKGP @SIGTYP2020: Exploring rules and hybrid systems for automatic prediction of typological features
Ritesh Kumar
|
Deepak Alok
|
Akanksha Bansal
|
Bornini Lahiri
|
Atul Kr. Ojha
Proceedings of the Second Workshop on Computational Research in Linguistic Typology
This paper describes the KMI-Panlingua-IITKGP team’s participation in the SigTyP 2020 Shared Task on the prediction of typological features. The task entailed the prediction of missing feature values for a particular language, given the name of its language family, its genus, its location (in terms of latitude and longitude coordinates and the name of the country where it is spoken) and a set of available feature-value pairs. For this task, the team submitted 3 kinds of systems: 2 rule-based and 1 hybrid. Of these 3, one rule-based system achieved the best performance on the test set. All the systems were ‘constrained’ in the sense that no additional dataset or information, other than those provided by the organisers, was used for developing the systems.
Developing a Multilingual Annotated Corpus of Misogyny and Aggression
Shiladitya Bhattacharya
|
Siddharth Singh
|
Ritesh Kumar
|
Akanksha Bansal
|
Akash Bhagat
|
Yogesh Dawer
|
Bornini Lahiri
|
Atul Kr. Ojha
Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying
In this paper, we discuss the development of a multilingual annotated corpus of misogyny and aggression in Indian English, Hindi, and Indian Bangla as part of a project on studying and automatically identifying misogyny and communalism on social media (the ComMA Project). The dataset is collected from comments on YouTube videos and currently contains a total of over 20,000 comments. The comments are annotated at two levels - aggression (overtly aggressive, covertly aggressive, and non-aggressive) and misogyny (gendered and non-gendered). We describe the process of data collection, the tagset used for annotation, and issues and challenges faced during the process of annotation. Finally, we discuss the results of the baseline experiments conducted to develop a classifier for misogyny in the three languages.
NUIG-Panlingua-KMI Hindi-Marathi MT Systems for Similar Language Translation Task @ WMT 2020
Atul Kr. Ojha
|
Priya Rani
|
Akanksha Bansal
|
Bharathi Raja Chakravarthi
|
Ritesh Kumar
|
John P. McCrae
Proceedings of the Fifth Conference on Machine Translation
The NUIG-Panlingua-KMI submission to WMT 2020 seeks to push the state of the art in the Similar Language Translation Task for the Hindi↔Marathi language pair. As part of these efforts, we conducted a series of experiments to address the challenges of translation between similar languages. Among the 4 MT systems prepared under this task, 1 PBSMT system was prepared for each Hindi↔Marathi direction and 1 NMT system was developed for Hindi↔Marathi using Byte Pair Encoding (BPE) into subwords. The results show that different NMT architectures could be an effective method for developing MT systems for closely related languages. Our Hindi-Marathi NMT system was ranked 8th among the 14 teams that participated and our Marathi-Hindi NMT system was ranked 8th among the 11 teams that participated in the task.
2019
Panlingua-KMI MT System for Similar Language Translation Task at WMT 2019
Atul Kr. Ojha
|
Ritesh Kumar
|
Akanksha Bansal
|
Priya Rani
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)
The present paper describes the development of the Panlingua-KMI Machine Translation (MT) systems for the Hindi ↔ Nepali language pair, designed as part of the Similar Language Translation Task at the WMT 2019 Shared Task. The Panlingua-KMI team conducted a series of experiments to explore both phrase-based statistical (PBSMT) and neural (NMT) methods. Among the 11 MT systems prepared under this task, 6 PBSMT systems were prepared for Nepali-Hindi, 1 PBSMT for Hindi-Nepali and 2 NMT systems were developed for Nepali↔Hindi. The results show that PBSMT could be an effective method for developing MT systems for closely related languages. Our Hindi-Nepali PBSMT system was ranked 2nd among the 13 systems submitted for the pair and our Nepali-Hindi PBSMT system was ranked 4th among the 12 systems submitted for the task.