Rudra Dhar


2024

pdf
Multilingual Bias Detection and Mitigation for Indian Languages
Ankita Maity | Anubhav Sharma | Rudra Dhar | Tushar Abhishek | Manish Gupta | Vasudeva Varma
Proceedings of the 7th Workshop on Indian Language Data: Resources and Evaluation

Lack of diverse perspectives causes neutrality bias in Wikipedia content leading to millions of worldwide readers getting exposed by potentially inaccurate information. Hence, neutrality bias detection and mitigation is a critical problem. Although previous studies have proposed effective solutions for English, no work exists for Indian languages. First, we contribute two large datasets, mWIKIBIAS and mWNC, covering 8 languages, for the bias detection and mitigation tasks respectively. Next, we investigate the effectiveness of popular multilingual Transformer-based models for the two tasks by modeling detection as a binary classification problem and mitigation as a style transfer problem. We make the code and data publicly available.

2022

pdf
JU_NLP at HinglishEval: Quality Evaluation of the Low-Resource Code-Mixed Hinglish Text
Prantik Guha | Rudra Dhar | Dipankar Das
Proceedings of the 15th International Conference on Natural Language Generation: Generation Challenges

In this paper we describe a system submitted to the INLG 2022 Generation Challenge (GenChal) on Quality Evaluation of the Low-Resource Synthetically Generated Code-Mixed Hinglish Text. We implement a Bi-LSTM-based neural network model to predict the Average rating score and Disagreement score of the synthetic Hinglish dataset. In our models, we used word embeddings for English and Hindi data, and one hot encodings for Hinglish data. We achieved a F1 score of 0.11, and mean squared error of 6.0 in the average rating score prediction task. In the task of Disagreement score prediction, we achieve a F1 score of 0.18, and mean squared error of 5.0.

2021

pdf
Leveraging Expectation Maximization for Identifying Claims in Low Resource Indian Languages
Rudra Dhar | Dipankar Das
Proceedings of the 18th International Conference on Natural Language Processing (ICON)

Identification of the checkable claims is one of the important prior tasks while dealing with infinite amount of data streaming from social web and the task becomes a compulsory one when we analyze them on behalf of a multilingual country like India that contains more than 1 billion people. In the present work, we describe our system which is made for detecting check-worthy claim sentences in resource scarce Indian languages (e.g., Bengali and Hindi). Firstly, we collected sentences from various sources in Bengali and Hindi and vectorized them with several NLP features. We labeled a small portion of them for check-worthy claims manually. However, in order to label rest amount of data in a semi-supervised fashion, we employed the Expectation Maximization (EM) algorithm tuned with the Multivariate Gaussian Mixture Model (GMM) to assign weakly labels. The optimal number of Gaussians in this algorithm is traced by using Logistic Regression. Furthermore, we used different ratios of manually labeled data and weakly labeled data to train our various machine learning models. We tabulated and plotted the performances of the models along with the stepwise decrement in proportion of manually labeled data. The experimental results were at par with our theoretical understanding, and we conclude that the weakly labeling of check-worthy claim sentences in low resource languages with EM algorithm has true potential.