FiNERVINER: Fine-grained Named Entity Recognition for Vulnerable Languages of India’s North Eastern Region

Prachuryya Kaushik, Ashish Anand


Abstract
Named entity recognition (NER), particularly fine-grained NER (FgNER), extracts domain-specific entity information for Natural Language Processing (NLP) applications such as knowledge base construction and relation extraction. Manual annotation for creating relevant data is expensive, while distant supervision often produces noisy data. Moreover, resources for coarse-grained and fine-grained NER in Indian languages, particularly in the vulnerable languages of India’s North Eastern Region, remain scarce. This work aims to create such a resource for three vulnerable languages: <i>Bodo/Boro (brx)</i>, <i>Manipuri/Meitei (mni)</i>, and <i>Mizo/Lushai (lus)</i>, which are regarded as official languages in three Indian states and are spoken by more than six million people across five countries in South and Southeast Asia. We use annotation projection from high-resource FgNER datasets, relying on source-to-target parallel corpora and a projection tool built on a multilingual encoder. The dataset comprises over 198k sentences, 282k entities, and 2.8M tokens in each low-resource language. Our thorough analyses validate the dataset’s high quality. We further explore zero-shot and cross-lingual settings, examining the impact of script similarity and multilingualism on cross-lingual FgNER performance. The dataset, expert detector models, the agentic tool, and the interactive web application are available as open-source resources at: <url>https://hf.co/collections/prachuryyaIITG/finerviner</url>.
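To illustrate the general idea of annotation projection mentioned in the abstract, the following is a minimal sketch, not the authors' tool. It assumes that a word alignment between a source and a target sentence is already available as (source index, target index) pairs, for example from an aligner built on a multilingual encoder, and that labels use the BIO scheme. The function names, the contiguous-span heuristic, and the entity type names in the usage example are all hypothetical.

```python
from typing import List, Tuple


def extract_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end_exclusive, entity_type) for each BIO span."""
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans


def project_labels(
    src_tags: List[str],
    alignment: List[Tuple[int, int]],
    tgt_len: int,
) -> List[str]:
    """Project BIO tags from source tokens onto target tokens.

    Assumed heuristic: each source entity is mapped to the smallest
    contiguous target span covering all of its aligned target tokens.
    """
    tgt_tags = ["O"] * tgt_len
    for start, end, etype in extract_spans(src_tags):
        tgt_idx = sorted({t for s, t in alignment if start <= s < end})
        if not tgt_idx:
            continue  # no aligned target tokens: drop this entity
        lo, hi = tgt_idx[0], tgt_idx[-1]
        if any(tag != "O" for tag in tgt_tags[lo:hi + 1]):
            continue  # would overlap an already projected entity: skip
        tgt_tags[lo] = "B-" + etype
        for j in range(lo + 1, hi + 1):
            tgt_tags[j] = "I-" + etype
    return tgt_tags


# Usage with hypothetical fine-grained types and a made-up alignment.
src_tags = ["B-person.politician", "I-person.politician", "O", "B-location.city"]
alignment = [(0, 0), (1, 1), (2, 3), (3, 2)]  # target word order differs
projected = project_labels(src_tags, alignment, tgt_len=4)
print(projected)
```

Dropping unaligned or overlapping entities favours precision over recall; a real pipeline would likely add further filtering, which this sketch does not attempt to reproduce.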
Anthology ID:
2026.lrec-main.607
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
ELRA Language Resource Association
Pages:
7655–7667
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.607/
Cite (ACL):
Prachuryya Kaushik and Ashish Anand. 2026. FiNERVINER: Fine-grained Named Entity Recognition for Vulnerable Languages of India’s North Eastern Region. In Proceedings of the Fifteenth Language Resources and Evaluation Conference, pages 7655–7667, Palma de Mallorca, Spain. ELRA Language Resource Association.
Cite (Informal):
FiNERVINER: Fine-grained Named Entity Recognition for Vulnerable Languages of India’s North Eastern Region (Kaushik & Anand, LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.607.pdf