CLASSER: Cross-lingual Annotation Projection enhancement through Script Similarity for Fine-grained Named Entity Recognition

Prachuryya Kaushik, Ashish Anand


Abstract
We introduce CLASSER, a cross-lingual annotation projection framework enhanced through script similarity, to create fine-grained named entity recognition (FgNER) datasets for low-resource languages. Manual annotation for named entity recognition (NER) is expensive, and distant supervision often produces noisy data that are largely limited to high-resource languages. CLASSER employs a two-stage process: it first projects annotations from high-resource NER datasets to the target language using source-to-target parallel corpora and a projection tool built on a multilingual encoder, then refines them by leveraging datasets in script-similar languages. We apply this to five low-resource Indian languages: *Assamese*, *Marathi*, *Nepali*, *Sanskrit*, and *Bodo*, a vulnerable language. The resulting dataset comprises 1.8M sentences, 2.6M entity mentions, and 24.7M tokens. Rigorous analyses confirm the effectiveness of our method and the high quality of the resulting dataset, with F1 score improvements of 26% in Marathi and 46% in Sanskrit over the current state of the art. We further extend our analyses to zero-shot and cross-lingual settings, systematically investigating the impact of script similarity and multilingualism on cross-lingual FgNER performance. The dataset is publicly available at [huggingface.co/datasets/prachuryyaIITG/CLASSER](https://huggingface.co/datasets/prachuryyaIITG/CLASSER).
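
To make the first stage concrete, the following is a minimal sketch of generic annotation projection through word alignments. It is not the authors' projection tool: the function names `bio_to_spans` and `project_tags`, the BIO tag scheme, the `person` and `location` labels, and the hand-written alignment pairs are all assumptions for illustration. In practice the alignment would come from an aligner built on a multilingual encoder, and the script-similarity refinement stage is not shown.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Collect entity spans from a BIO tag sequence."""
    spans: List[Span] = []
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def project_tags(
    src_tags: List[str],
    alignment: List[Tuple[int, int]],  # hypothetical (source_idx, target_idx) pairs
    tgt_len: int,
) -> List[str]:
    """Project source BIO tags onto a target sentence via word alignments."""
    tgt_tags = ["O"] * tgt_len
    for start, end, label in bio_to_spans(src_tags):
        # Target positions aligned to any token of this source entity.
        tgt_idx = sorted({t for s, t in alignment if start <= s < end})
        if not tgt_idx:
            continue  # no aligned target token: drop the entity
        lo, hi = tgt_idx[0], tgt_idx[-1]
        # Use the smallest contiguous target span covering all aligned tokens,
        # and skip it if it would overlap an already projected entity.
        if any(tag != "O" for tag in tgt_tags[lo:hi + 1]):
            continue
        tgt_tags[lo] = "B-" + label
        for j in range(lo + 1, hi + 1):
            tgt_tags[j] = "I-" + label
    return tgt_tags


# Source: "Sachin Tendulkar lives in Mumbai" (5 tokens).
src_tags = ["B-person", "I-person", "O", "O", "B-location"]
# Illustrative alignment to a 4-token, verb-final target sentence.
alignment = [(0, 0), (1, 1), (2, 3), (3, 2), (4, 2)]
projected = project_tags(src_tags, alignment, tgt_len=4)
print(projected)
```

Taking the smallest contiguous covering span is one simple heuristic among several; noisy alignments can still produce over-long or misplaced spans, which is the kind of error a later refinement stage would need to correct.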
Anthology ID:
2025.ijcnlp-long.94
Volume:
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Month:
December
Year:
2025
Address:
Mumbai, India
Editors:
Kentaro Inui, Sakriani Sakti, Haofen Wang, Derek F. Wong, Pushpak Bhattacharyya, Biplab Banerjee, Asif Ekbal, Tanmoy Chakraborty, Dhirendra Pratap Singh
Venues:
IJCNLP | AACL
Publisher:
The Asian Federation of Natural Language Processing and The Association for Computational Linguistics
Pages:
1745–1760
URL:
https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.ijcnlp-long.94/
Cite (ACL):
Prachuryya Kaushik and Ashish Anand. 2025. CLASSER: Cross-lingual Annotation Projection enhancement through Script Similarity for Fine-grained Named Entity Recognition. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 1745–1760, Mumbai, India. The Asian Federation of Natural Language Processing and The Association for Computational Linguistics.
Cite (Informal):
CLASSER: Cross-lingual Annotation Projection enhancement through Script Similarity for Fine-grained Named Entity Recognition (Kaushik & Anand, IJCNLP-AACL 2025)
PDF:
https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.ijcnlp-long.94.pdf