Annotating Indian Regional Biases using Large Language Models: Evaluation and Analysis

Debasmita Panda; Akash Anil; Neelesh Kumar Shukla

Abstract

Social biases based on regional identity (or regional bias) are often observed in Indian contexts on major online social networks and require critical attention. However, due to large linguistic and cultural diversity, high annotation costs, and inherent human biases, very little annotated data exists on regional biases in the Indian context. Recently, Large Language Models (LLMs) have garnered attention for the automatic annotation of text. However, such annotation efforts are largely limited to English texts, and LLMs often perform poorly when applied to low-resource languages. Therefore, this paper focuses on understanding the capabilities and challenges of popular open-source LLMs in annotating Indian regional biases. We utilize the recently proposed IndRegBias dataset, which consists of Indian regionally biased social media comments in both English and code-mixed formats. First, we assess the annotation capabilities of LLMs in a zero-shot setting and critically analyze their performance across different writing styles, including code-mixing, transliteration, and English. We find that the majority of LLMs exhibit low agreement with human annotations (measured using Cohen’s kappa). Consequently, we extend our study by fine-tuning the models using 50% of the data and evaluating them on the remaining 50%. We observe a significant improvement in annotation agreement (kappa) for all the LLMs. To further assess the capabilities of the fine-tuned models, we evaluate them on 500 newly collected social media comments discussing regional issues in India. The results show that most fine-tuned LLMs outperform their zero-shot counterparts when annotating these new comments.
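The agreement measure mentioned in the abstract, Cohen's kappa, can be illustrated with a minimal sketch. The function name `cohen_kappa` and the toy label lists are assumptions made for illustration only; they are not taken from the paper or from the IndRegBias dataset, and this is not the authors' evaluation code.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators' labels for the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label distribution.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: sum over labels of the product of marginal rates.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    all_labels = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in all_labels) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy labels (1 = regionally biased, 0 = not biased).
human = [1, 1, 0, 0, 1, 0, 1, 0]
llm = [1, 0, 0, 0, 1, 1, 1, 0]
kappa = cohen_kappa(human, llm)
print(f"kappa = {kappa:.2f}")
```

In this toy case the two label lists agree on 6 of 8 items (observed agreement 0.75) while chance agreement is 0.5, so kappa is 0.5. A kappa near 0 indicates agreement no better than chance, which is why it is a stricter measure than raw accuracy when labels are imbalanced.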

Anthology ID: 2026.starsem-conference.16
Volume: Proceedings of the 15th Joint Conference on Lexical and Computational Semantics (*SEM 2026)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Saif M. Mohammad, Nedjma Ousidhoum
Venues: *SEM | WS
Publisher: Association for Computational Linguistics
Pages: 255–263
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.starsem-conference.16/
Cite (ACL): Debasmita Panda, Akash Anil, and Neelesh Kumar Shukla. 2026. Annotating Indian Regional Biases using Large Language Models: Evaluation and Analysis. In Proceedings of the 15th Joint Conference on Lexical and Computational Semantics (*SEM 2026), pages 255–263, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Annotating Indian Regional Biases using Large Language Models: Evaluation and Analysis (Panda et al., *SEM 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.starsem-conference.16.pdf
