2025
pdf
bib
abs
Char-mander Use mBackdoor! A Study of Cross-lingual Backdoor Attacks in Multilingual LLMs
Himanshu Beniwal
|
Sailesh Panda
|
Birudugadda Srivibhav
|
Mayank Singh
Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP
We explore Cross-lingual Backdoor ATtacks (X-BAT) in multilingual Large Language Models (mLLMs), revealing how backdoors inserted in one language can automatically transfer to others through shared embedding spaces. Using toxicity classification as a case study, we demonstrate that attackers can compromise multilingual systems by poisoning data in a single language, with rare and high-occurring tokens serving as specific, effective triggers. Our findings reveal a critical vulnerability that affects the model’s architecture, leading to a concealed backdoor effect during the information flow. Our code and data are publicly available at https://github.com/himanshubeniwal/X-BAT.
pdf
bib
abs
UnityAI Guard: Pioneering Toxicity Detection Across Low-Resource Indian Languages
Himanshu Beniwal
|
Reddybathuni Venkat
|
Rohit Kumar
|
Birudugadda Srivibhav
|
Daksh Jain
|
Pavan Deekshith Doddi
|
Eshwar Dhande
|
Adithya Ananth
|
Kuldeep
|
Mayank Singh
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
This work introduces UnityAI-Guard, a framework for binary toxicity classification targeting low-resource Indian languages. While existing systems predominantly cater to high-resource languages, UnityAI-Guard addresses this critical gap by developing state-of-the-art models for identifying toxic content across diverse Brahmic/Indic scripts. Our approach achieves an impressive average F1-score of 84.23% across seven languages, leveraging a dataset of 567k training instances and 30k manually verified test instances. By advancing multilingual content moderation for linguistically diverse regions, UnityAI-Guard also provides public API access to foster broader adoption and application.
pdf
bib
abs
COMI-LINGUA: Expert Annotated Large-Scale Dataset for Multitask NLP in Hindi-English Code-Mixing
Rajvee Sheth
|
Himanshu Beniwal
|
Mayank Singh
Findings of the Association for Computational Linguistics: EMNLP 2025
We introduce COMI-LINGUA, the largest manually annotated Hindi-English code-mixed dataset, comprising 125K+ high-quality instances across five core NLP tasks: Token-level Language Identification, Matrix Language Identification, Named Entity Recognition, Part-Of-Speech Tagging and Machine Translation. Each instance is annotated by three bilingual annotators, yielding over 376K expert annotations with strong inter-annotator agreement (Fleiss’ Kappa ≥ 0.81). The rigorously preprocessed and filtered dataset covers both Devanagari and Roman scripts and spans diverse domains, ensuring real-world linguistic coverage. Evaluation reveals that closed-weight LLMs significantly outperform traditional tools and open-weight models in zero-shot settings. Notably, one-shot prompting consistently boosts performance across tasks, especially in structure-sensitive predictions like POS and NER. Fine-tuning open-weight LLMs on COMI-LINGUA demonstrates substantial improvements, achieving up to 95.25 F1 in NER, 98.77 F1 in MLI, and competitive MT performance, setting new benchmarks for Hinglish code-mixed text. COMI-LINGUA is publicly available at this URL: https://huggingface.co/datasets/LingoIITGN/COMI-LINGUA.
2024
pdf
bib
abs
Commentator: A Code-mixed Multilingual Text Annotation Framework
Rajvee Sheth
|
Shubh Nisar
|
Heenaben Prajapati
|
Himanshu Beniwal
|
Mayank Singh
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
As the NLP community increasingly addresses challenges associated with multilingualism, robust annotation tools are essential to handle multilingual datasets efficiently. In this paper, we introduce a code-mixed multilingual text annotation framework, COMMENTATOR, specifically designed for annotating code- mixed text. The tool demonstrates its effectiveness in token-level and sentence-level language annotation tasks for Hinglish text. We perform robust qualitative human-based evaluations to showcase COMMENTATOR led to 5x faster annotations than the best baseline.
pdf
bib
abs
Cross-lingual Editing in Multilingual Language Models
Himanshu Beniwal
|
Kowsik D
|
Mayank Singh
Findings of the Association for Computational Linguistics: EACL 2024
The training of large language models (LLMs) necessitates substantial data and computational resources, and updating outdated LLMs entails significant efforts and resources. While numerous model editing techniques (METs) have emerged to efficiently update model outputs without retraining, their effectiveness in multilingual LLMs, where knowledge is stored in diverse languages, remains an underexplored research area. This research paper introduces the cross-lingual model editing (XME) paradigm, wherein a fact is edited in one language, and the subsequent update propagation is observed across other languages. To investigate the XME paradigm, we conducted experiments using BLOOM, mBERT, and XLM-RoBERTa using the two writing scripts: Latin (English, French, and Spanish) and Indic (Hindi, Gujarati, and Bengali). The results reveal notable performance limitations of state-of-the-art METs under the XME setting, mainly when the languages involved belong to two distinct script families. These findings highlight the need for further research and development of XME techniques to address these challenges. For more comprehensive information, the dataset used in this research and the associated code are publicly available at the following [URL](https://github.com/lingo-iitgn/XME).
pdf
bib
abs
LEGOBench: Scientific Leaderboard Generation Benchmark
Shruti Singh
|
Shoaib Alam
|
Husain Malwat
|
Mayank Singh
Findings of the Association for Computational Linguistics: EMNLP 2024
The ever-increasing volume of paper submissions makes it difficult to stay informed about the latest state-of-the-art research. To address this challenge, we introduce LEGOBench, a benchmark for evaluating systems that generate scientific leaderboards. LEGOBench is curated from 22 years of preprint submission data on arXiv and more than 11k machine learning leaderboards on the PapersWithCode portal. We present a language model-based and four graph-based leaderboard generation task configuration. We evaluate popular encoder-only scientific language models as well as decoder-only large language models across these task configurations. State-of-the-art models showcase significant performance gaps in automatic leaderboard generation on LEGOBench. The code is available on GitHub and the dataset is hosted on OSF.
pdf
bib
abs
Remember This Event That Year? Assessing Temporal Information and Understanding in Large Language Models
Himanshu Beniwal
|
Dishant Patel
|
Kowsik Nandagopan D
|
Hritik Ladia
|
Ankit Yadav
|
Mayank Singh
Findings of the Association for Computational Linguistics: EMNLP 2024
Large Language Models (LLMs) are increasingly ubiquitous, yet their ability to retain and reason about temporal information remains limited, hindering their application in real-world scenarios where understanding the sequential nature of events is crucial. Our study experiments with 12 state-of-the-art models (ranging from 2B to 70B+ parameters) on a novel numerical-temporal dataset, TempUN, spanning from 10,000 BCE to 2100 CE, to uncover significant temporal retention and comprehension limitations. We propose six metrics to assess three learning paradigms to enhance temporal knowledge acquisition. Our findings reveal that open-source models exhibit knowledge gaps more frequently, suggesting a trade-off between limited knowledge and incorrect responses. Additionally, various fine-tuning approaches significantly improved performance, reducing incorrect outputs and impacting the identification of ‘information not available’ in the generations. The associated dataset and code are available at the [URL](https://anonymous.4open.science/r/TempUN-ARR/).
pdf
bib
abs
PythonSaga: Redefining the Benchmark to Evaluate Code Generating LLMs
Ankit Yadav
|
Himanshu Beniwal
|
Mayank Singh
Findings of the Association for Computational Linguistics: EMNLP 2024
Driven by the surge in code generation using large language models (LLMs), numerous benchmarks have emerged to evaluate these LLMs capabilities. We conducted a large-scale human evaluation of *HumanEval* and *MBPP*, two popular benchmarks for Python code generation, analyzing their diversity and difficulty. Our findings unveil a critical bias towards a limited set of programming concepts, neglecting most of the other concepts entirely. Furthermore, we uncover a worrying prevalence of easy tasks that can inflate model performance estimations. To address these limitations, we propose a novel benchmark, *PythonSaga*, featuring 185 hand-crafted prompts in a balanced representation of 38 programming concepts across diverse difficulty levels. The robustness of our benchmark is demonstrated by the poor performance of existing Code-LLMs. The code and data set are openly available to the NLP community at this [URL](https://github.com/PythonSaga/PythonSaga).
pdf
bib
abs
CoSAEmb: Contrastive Section-aware Aspect Embeddings for Scientific Articles
Shruti Singh
|
Mayank Singh
Proceedings of the Fourth Workshop on Scholarly Document Processing (SDP 2024)
Research papers are long documents that contain information about various aspects such as background, prior work, methodology, and results. Existing works on scientific document representation learning only leverage the title and abstract of the paper. We present CoSAEmb, a model that learns representations from the full text of 97402 scientific papers from the S2ORC dataset. We present a novel supervised contrastive training framework for long documents using triplet loss and margin gradation. Our framework can be used to learn representations of long documents with any existing encoder-only transformer model without retraining it from scratch. CoSAEmb shows improved performance on information retrieval from the paper’s full text in comparison to models trained only on paper titles and abstracts. We also evaluate CoSAEmb on SciRepEval and CSFCube benchmarks, showing comparable performance with existing state-of-the-art models.
2022
pdf
bib
abs
The Inefficiency of Language Models in Scholarly Retrieval: An Experimental Walk-through
Shruti Singh
|
Mayank Singh
Findings of the Association for Computational Linguistics: ACL 2022
Language models are increasingly becoming popular in AI-powered scientific IR systems. This paper evaluates popular scientific language models in handling (i) short-query texts and (ii) textual neighbors. Our experiments showcase the inability to retrieve relevant documents for a short-query text even under the most relaxed conditions. Additionally, we leverage textual neighbors, generated by small perturbations to the original text, to demonstrate that not all perturbations lead to close neighbors in the embedding space. Further, an exhaustive categorization yields several classes of orthographically and semantically related, partially related and completely unrelated neighbors. Retrieval performance turns out to be more influenced by the surface form rather than the semantics of the text.