Saurabh Kumar

2025

EssayDetect at GenAI Detection Task 2: Guardians of Academic Integrity: Multilingual Detection of AI-Generated Essays
Shifali Agrahari | Subhashi Jayant | Saurabh Kumar | Sanasam Ranbir Singh
Proceedings of the 1st Workshop on GenAI Content Detection (GenAIDetect)

Detecting AI-generated text in the field of academia is becoming very prominent. This paper presents a solution for Task 2: AI vs. Human – Academic Essay Authenticity Challenge in the COLING 2025 DAIGenC Workshop. The rise of large language models (LLMs) like ChatGPT has posed significant challenges to academic integrity, particularly in detecting AI-generated essays. To address this, we propose a fusion model that combines pre-trained language model embeddings with stylometric and linguistic features. Our approach, tested on both English and Arabic, utilizes adaptive training and attention mechanisms to enhance F1 scores, address class imbalance, and capture linguistic nuances across languages. This work advances multilingual solutions for detecting AI-generated text in academia.

Team IndiDataMiner at IndoNLP 2025: Hindi Back Transliteration - Roman to Devanagari using LLaMa
Saurabh Kumar | Dhruvkumar Babubhai Kakadiya | Sanasam Ranbir Singh
Proceedings of the First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages

The increasing use of Romanized typing for Indo-Aryan languages on social media poses challenges due to its lack of standardization and loss of linguistic richness. To address this, we propose a sentence-level back-transliteration approach using the LLaMa 3.1 model for Hindi. Leveraging fine-tuning with the Dakshina dataset, our approach effectively resolves ambiguities in Romanized Hindi text, offering a robust solution for converting it into the native Devanagari script.

indiDataMiner at SemEval-2025 Task 11: From Text to Emotion: Transformer-Based Models for Emotions Detection in Indian Languages
Saurabh Kumar | Sujit Kumar | Sanasam Ranbir Singh | Sukumar Nandi
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

Emotion detection is essential for applications like mental health monitoring and social media analysis, yet remains underexplored for Indian languages. This paper presents our system for SemEval-2025 Task 11 (Track A), focusing on multilabel emotion detection in Hindi and Marathi, two widely spoken Indian languages. We fine-tune IndicBERT v2 on the BRIGHTER dataset, achieving F1 scores of 87.37 (Hindi) and 88.32 (Marathi), outperforming baseline models. Our results highlight the effectiveness of fine-tuning a language-specific pretrained model for emotion detection, contributing to advancements in multilingual NLP research.

2024

IndiSentiment140: Sentiment Analysis Dataset for Indian Languages with Emphasis on Low-Resource Languages using Machine Translation
Saurabh Kumar | Ranbir Sanasam | Sukumar Nandi
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Sentiment analysis, a fundamental aspect of Natural Language Processing (NLP), involves the classification of emotions, opinions, and attitudes in text data. In the context of India, with its vast linguistic diversity and low-resource languages, the challenge is to support sentiment analysis in numerous Indian languages. This study explores the use of machine translation to bridge this gap. The investigation examines the feasibility of machine translation for creating sentiment analysis datasets in 22 Indian languages. Google Translate, with its extensive language support, is employed for this purpose in translating the Sentiment140 dataset. The study aims to provide insights into the practicality of using machine translation in the context of India’s linguistic diversity for sentiment analysis datasets. Our findings indicate that a dataset generated using Google Translate has the potential to serve as a foundational framework for tackling the low-resource challenges commonly encountered in sentiment analysis for Indian languages.

2023

IndiSocialFT: Multilingual Word Representation for Indian languages in code-mixed environment
Saurabh Kumar | Ranbir Sanasam | Sukumar Nandi
Findings of the Association for Computational Linguistics: EMNLP 2023

The increasing number of Indian language users on the internet necessitates the development of Indian language technologies. In response to this demand, our paper presents a generalized representation vector for diverse text characteristics, including native scripts, transliterated text, multilingual, code-mixed, and social media-related attributes. We gather text from both social media and well-formed sources and utilize the FastText model to create the “IndiSocialFT” embedding. Through intrinsic and extrinsic evaluation methods, we compare IndiSocialFT with three popular pretrained embeddings trained over Indian languages. Our findings show that the proposed embedding surpasses the baselines in most cases and languages, demonstrating its suitability for various NLP applications.

Including a contemporary NLP application within an introductory course: an example with student feedback from a University of Applied Sciences
Saurabh Kumar | Alessandra Zarcone
Proceedings of the 1st Workshop on Teaching for NLP