Somnath Banerjee


Hate Speech and Offensive Language Detection in Bengali
Mithun Das | Somnath Banerjee | Punyajoy Saha | Animesh Mukherjee
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Social media often serves as a breeding ground for various hateful and offensive content. Identifying such content on social media is crucial due to its impact on the race, gender, or religion in an unprejudiced society. However, while there is extensive research in hate speech detection in English, there is a gap in hateful content detection in low-resource languages like Bengali. Besides, a current trend on social media is the use of Romanized Bengali for regular interactions. To overcome the existing research’s limitations, in this study, we develop an annotated dataset of 10K Bengali posts consisting of 5K actual and 5K Romanized Bengali tweets. We implement several baseline models for the classification of such hateful posts. We further explore the interlingual transfer mechanism to boost classification performance. Finally, we perform an in-depth error analysis by looking into the misclassified posts by the models. While training actual and Romanized datasets separately, we observe that XLM-Roberta performs the best. Further, we witness that on joint training and few-shot training, MuRIL outperforms other models by interpreting the semantic expressions better. We make our code and dataset public for others.

hate-alert@DravidianLangTech-ACL2022: Ensembling Multi-Modalities for Tamil TrollMeme Classification
Mithun Das | Somnath Banerjee | Animesh Mukherjee
Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages

Social media platforms often act as breeding grounds for various forms of trolling or malicious content targeting users or communities. One way of trolling users is by creating memes, which in most cases unites an image with a short piece of text embedded on top of it. The situation is more complex for multilingual(e.g., Tamil) memes due to the lack of benchmark datasets and models. We explore several models to detect Troll memes in Tamil based on the shared task, “Troll Meme Classification in DravidianLangTech2022” at ACL-2022. We observe while the text-based model MURIL performs better for Non-troll meme classification, the image-based model VGG16 performs better for Troll-meme classification. Further fusing these two modalities help us achieve stable outcomes in both classes. Our fusion model achieved a 0.561 weighted average F1 score and ranked second in this task.


IR like a SIR: Sense-enhanced Information Retrieval for Multiple Languages
Rexhina Blloshmi | Tommaso Pasini | Niccolò Campolungo | Somnath Banerjee | Roberto Navigli | Gabriella Pasi
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

With the advent of contextualized embeddings, attention towards neural ranking approaches for Information Retrieval increased considerably. However, two aspects have remained largely neglected: i) queries usually consist of few keywords only, which increases ambiguity and makes their contextualization harder, and ii) performing neural ranking on non-English documents is still cumbersome due to shortage of labeled datasets. In this paper we present SIR (Sense-enhanced Information Retrieval) to mitigate both problems by leveraging word sense information. At the core of our approach lies a novel multilingual query expansion mechanism based on Word Sense Disambiguation that provides sense definitions as additional semantic information for the query. Importantly, we use senses as a bridge across languages, thus allowing our model to perform considerably better than its supervised and unsupervised alternatives across French, German, Italian and Spanish languages on several CLEF benchmarks, while being trained on English Robust04 data only. We release SIR at


LIMSI_UPV at SemEval-2020 Task 9: Recurrent Convolutional Neural Network for Code-mixed Sentiment Analysis
Somnath Banerjee | Sahar Ghannay | Sophie Rosset | Anne Vilnat | Paolo Rosso
Proceedings of the Fourteenth Workshop on Semantic Evaluation

This paper describes the participation of LIMSI_UPV team in SemEval-2020 Task 9: Sentiment Analysis for Code-Mixed Social Media Text. The proposed approach competed in SentiMix HindiEnglish subtask, that addresses the problem of predicting the sentiment of a given Hindi-English code-mixed tweet. We propose Recurrent Convolutional Neural Network that combines both the recurrent neural network and the convolutional network to better capture the semantics of the text, for code-mixed sentiment analysis. The proposed system obtained 0.69 (best run) in terms of F1 score on the given test data and achieved the 9th place (Codalab username: somban) in the SentiMix Hindi-English subtask.


JU_ETCE_17_21 at SemEval-2019 Task 6: Efficient Machine Learning and Neural Network Approaches for Identifying and Categorizing Offensive Language in Tweets
Preeti Mukherjee | Mainak Pal | Somnath Banerjee | Sudip Kumar Naskar
Proceedings of the 13th International Workshop on Semantic Evaluation

This paper describes our system submissions as part of our participation (team name: JU_ETCE_17_21) in the SemEval 2019 shared task 6: “OffensEval: Identifying and Catego- rizing Offensive Language in Social Media”. We participated in all the three sub-tasks: i) Sub-task A: offensive language identification, ii) Sub-task B: automatic categorization of of- fense types, and iii) Sub-task C: offense target identification. We employed machine learn- ing as well as deep learning approaches for the sub-tasks. We employed Convolutional Neural Network (CNN) and Recursive Neu- ral Network (RNN) Long Short-Term Memory (LSTM) with pre-trained word embeddings. We used both word2vec and Glove pre-trained word embeddings. We obtained the best F1- score using CNN based model for sub-task A, LSTM based model for sub-task B and Lo- gistic Regression based model for sub-task C. Our best submissions achieved 0.7844, 0.5459 and 0.48 F1-scores for sub-task A, sub-task B and sub-task C respectively.


NITMZ-JU at IJCNLP-2017 Task 4: Customer Feedback Analysis
Somnath Banerjee | Partha Pakray | Riyanka Manna | Dipankar Das | Alexander Gelbukh
Proceedings of the IJCNLP 2017, Shared Tasks

In this paper, we describe a deep learning framework for analyzing the customer feedback as part of our participation in the shared task on Customer Feedback Analysis at the 8th International Joint Conference on Natural Language Processing (IJCNLP 2017). A Convolutional Neural Network (CNN) based deep neural network model was employed for the customer feedback task. The proposed system was evaluated on two languages, namely, English and French.


An Empirical Study of Combing Multiple Models in Bengali Question Classification
Somnath Banerjee | Sivaji Bandyopadhyay
Proceedings of the Sixth International Joint Conference on Natural Language Processing


Bengali Question Classification: Towards Developing QA System
Somnath Banerjee | Sivaji Bandyopadhyay
Proceedings of the 3rd Workshop on South and Southeast Asian Natural Language Processing

pdf bib
Question Classification and Answering from Procedural Text in English
Somnath Banerjee | Sivaji Bandyopadhyay
Proceedings of the Workshop on Question Answering for Complex Domains


Bengali Verb Subcategorization Frame Acquisition - A Baseline Model
Somnath Banerjee | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the 7th Workshop on Asian Language Resources (ALR7)