Sukomal Pal


2026

This paper presents the IReLIIT(BHU) submission to SemEval-2026 Task 9 for the Chinese language track. We participated in all three subtasks: binary polarization detection, multi-label polarization type classification, and multi-label manifestation identification. Our approach uses a unified transformer-based framework with cross-validation, prediction aggregation, and threshold optimization to improve robustness across tasks. On the official test data, our systems achieved Macro-F1 scores of 0.9081, 0.7962, and 0.6484 for Subtasks 1, 2, and 3, respectively.
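The aggregation and threshold-optimization steps mentioned in the abstract above can be illustrated with a minimal sketch. It assumes mean aggregation of per-fold probabilities and a grid search over one threshold per label that maximizes F1 on held-out data; the abstract does not state either choice, and the function names (`aggregate_folds`, `best_threshold`) are hypothetical.

```python
from statistics import mean


def f1(y_true, y_pred):
    """Binary F1 for one label; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def aggregate_folds(fold_probs):
    """Average each example's probability across folds (assumed aggregation rule)."""
    return [mean(ps) for ps in zip(*fold_probs)]


def best_threshold(y_true, probs, grid=None):
    """Return the first grid threshold that maximizes F1 for this label."""
    grid = grid or [i / 100 for i in range(5, 96, 5)]
    return max(grid, key=lambda t: f1(y_true, [1 if p >= t else 0 for p in probs]))


# Toy data: two folds, four examples, one label.
fold_probs = [[0.2, 0.4, 0.7, 0.9], [0.3, 0.35, 0.6, 0.8]]
y_true = [0, 1, 1, 1]
avg = aggregate_folds(fold_probs)
threshold = best_threshold(y_true, avg)
```

For the multi-label subtasks, the same search would be repeated per label and the per-label F1 values averaged to obtain Macro-F1.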
Over the past decade, the rapid advancement of LLMs has significantly improved natural language generation. However, these models often inherit and amplify gender biases present in large-scale training data, leading to stereotypical associations, androcentric language, and misgendering. Such biases can negatively impact applications in education, healthcare, legal systems, and automated content generation. In this paper, we address this issue as defined in the LT-EDI shared task on Gender-Inclusive Language Generation. The task focuses on rewriting gender-biased sentences into inclusive, gender-neutral alternatives while preserving meaning. We propose a retrieval-augmented framework combining lexical replacement, semantic retrieval, and controlled instruction-tuned generation. An edit-distance constraint and a self-evaluation step ensure minimal, coherent, and bias-free outputs. We also present zero-shot adaptation for a low-resource language. The implementation code is available at https://github.com/SupriyaChanda/gilg-ltedi-acl2026.git.
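As an illustration of how lexical replacement might be paired with an edit-distance constraint, the following sketch applies a tiny replacement lexicon and accepts a rewrite only if its word-level Levenshtein distance stays within a budget. The lexicon entries, the word-level granularity, and the `max_ratio` parameter are assumptions made for this example and are not taken from the paper or its repository.

```python
# Hypothetical replacement lexicon, for illustration only.
LEXICON = {"chairman": "chairperson", "fireman": "firefighter"}


def lexical_rewrite(sentence):
    """Replace each whitespace-separated token that appears in the lexicon."""
    return " ".join(LEXICON.get(tok.lower(), tok) for tok in sentence.split())


def word_edit_distance(a, b):
    """Levenshtein distance computed over words rather than characters."""
    a, b = a.split(), b.split()
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        curr = [i]
        for j, wb in enumerate(b, 1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def within_budget(source, candidate, max_ratio=0.5):
    """Accept a rewrite only if it edits at most max_ratio of the source words."""
    n = max(len(source.split()), 1)
    return word_edit_distance(source, candidate) / n <= max_ratio


source = "The chairman thanked every fireman"
candidate = lexical_rewrite(source)
accepted = within_budget(source, candidate)
```

Here `candidate` differs from `source` in two of five words, so it passes the 0.5 budget, whereas a wholesale paraphrase such as "Everyone was thanked" would be rejected.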
This paper presents our submissions to the LT-EDI@ACL 2026 Shared Task on Gender Inclusive Language Generation. The task focuses on controlled text rewriting that reduces gender bias while keeping the original meaning and fluency intact. We participated in both subtasks and treated them independently, training separate instances of the instruction-tuned encoder–decoder model on the respective training datasets. Scores are calculated as averages across different rubrics: Gender Assumption (GA), Gender Neutrality (GN), and Quality Relevance (QR) for Subtask A, and Politeness and Respectful (PR), Contextual Counter-Narrative Coherence (CCNC), and Quality and Relevance (QR) for Subtask B. For Subtask A (Gender-Inclusive Language Generation) on the English dataset, we achieved an average score of 43.7917. For Subtask B (Counterfactual Generation), we achieved an average score of 82.6241. Overall, the experiments indicate that full fine-tuning of instruction-tuned transformers is an effective way to produce gender-neutral sentences and to generate counterfactual sentences for biased ones, when each subtask is optimized on its own data.

2025

We explore and evaluate the effect of different language-independent stemmers on information retrieval (IR) tasks in Indian languages such as Hindi and Gujarati, and in English. The issue is examined from two points of view. Does a language-independent stemmer improve retrieval effectiveness in Indian-language IR? Which language-independent stemmer is the most suitable for each language? We observe that stemming improves retrieval effectiveness in the different languages compared to the no-stemming approach. Among the stemmers evaluated, the co-occurrence-based stemmer (SNS) performs best for the Indian languages, improving the mean average precision (MAP) score by 2.98% in Hindi and 20.78% in Gujarati, whereas the graph-based stemmer (GRAS) performs best in English, improving the MAP score by 5.83%.
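For readers unfamiliar with the metric, the sketch below computes mean average precision over ranked result lists and a percentage improvement over a baseline. It assumes the reported gains are relative improvements over the no-stemming run, which the abstract does not state explicitly; the document IDs and scores are toy values, not results from the paper.

```python
def average_precision(ranked, relevant):
    """Average of precision@k taken at each rank k where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, 1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked_doc_ids, set_of_relevant_doc_ids), one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


def relative_improvement(baseline, system):
    """Percentage gain of system over baseline (assumed to be a relative change)."""
    return 100.0 * (system - baseline) / baseline


# Toy example with two queries.
runs = [
    (["d1", "d2", "d3"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d4", "d5"], {"d5"}),              # AP = 1/2
]
map_score = mean_average_precision(runs)
gain = relative_improvement(0.25, 0.30)
```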

2023

Domain generalization remains underexplored in abstractive summarization. Moreover, most existing works on domain generalization rely on sophisticated training algorithms. In this paper, we propose Domain Aligned Prefix Averaging (DAPA), a lightweight, weight-averaging-based approach to domain generalization for abstractive summarization. Given a number of source domains, our method first trains a prefix for each of them. These source prefixes generate summaries for a small number of target-domain documents. The similarity of the generated summaries to their corresponding source documents is used to calculate the weights required to average the source prefixes. In DAPA, prefix tuning allows for lightweight fine-tuning, and weight averaging allows for the computationally efficient addition of new source domains. When evaluated on four diverse summarization domains, DAPA shows comparable or better performance against the baselines, demonstrating the effectiveness of its prefix averaging scheme.
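The prefix averaging step described above can be sketched as a similarity-weighted average of per-domain prefix parameters. The example assumes the similarity scores are already computed and are turned into weights by simple sum normalization; the abstract does not specify the similarity measure or the normalization, and the prefixes are shown as short flat lists rather than real prefix-tuning tensors.

```python
def normalize(scores):
    """Turn non-negative similarity scores into weights that sum to 1 (assumed rule)."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("similarity scores must have a positive sum")
    return [s / total for s in scores]


def average_prefixes(prefixes, weights):
    """Element-wise weighted average of equally sized prefix parameter vectors."""
    dim = len(prefixes[0])
    if any(len(p) != dim for p in prefixes):
        raise ValueError("all prefixes must have the same length")
    return [sum(w * p[k] for w, p in zip(weights, prefixes)) for k in range(dim)]


# Toy example: two source-domain prefixes and their similarity to the target domain.
source_prefixes = [[1.0, 0.0], [0.0, 1.0]]
similarities = [3.0, 1.0]
weights = normalize(similarities)
target_prefix = average_prefixes(source_prefixes, weights)
```

Under this scheme, adding a new source domain only requires training one more prefix and recomputing the weights, which matches the efficiency argument made in the abstract.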

2020

This paper reports our submission to Shared Task 2: Identification of informative COVID-19 English tweets at W-NUT 2020. We attempted several techniques, and we briefly explain here two models that showed promising results in tweet classification tasks: DistilBERT and FastText. DistilBERT achieves an F1 score of 0.7508 on the test set, which is the best of our submissions.
This paper describes the IRlab@IIT-BHU system for OffensEval 2020. We use an SVM with TF-IDF features to identify and categorize hate speech and offensive language in social media for two languages. In subtask A, we used a linear SVM classifier to detect abusive content in tweets, achieving macro F1 scores of 0.779 and 0.718 for Arabic and Greek, respectively.
On social media, people express themselves every day on issues that affect their lives. During parliamentary elections, people’s interactions with the candidates in social media posts reflect many social trends in a charged atmosphere. People’s likes and dislikes of leaders, political parties, and their stands often become the subject of hateful and offensive posts. We collected social media posts in Hindi and English from Facebook and Twitter during the run-up to the 2019 parliamentary election of India (PEI data-2019). We created a dataset for sentiment analysis into three categories: hate speech, offensive and not hate, or not offensive. We report here the initial results of sentiment classification on the dataset using different classifiers.