Prachuryya Kaushik
2025
CLASSER: Cross-lingual Annotation Projection enhancement through Script Similarity for Fine-grained Named Entity Recognition
Prachuryya Kaushik, Ashish Anand
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
We introduce CLASSER, a cross-lingual annotation projection framework enhanced through script similarity, for creating fine-grained named entity recognition (FgNER) datasets for low-resource languages. Manual annotation for named entity recognition (NER) is expensive, and distant supervision often produces noisy data that are largely limited to high-resource languages. CLASSER employs a two-stage process: it first projects annotations from high-resource NER datasets to the target language, using source-to-target parallel corpora and a projection tool built on a multilingual encoder, and then refines the projected annotations by leveraging datasets in script-similar languages. We apply this framework to five low-resource Indian languages: *Assamese*, *Marathi*, *Nepali*, *Sanskrit*, and *Bodo*, a vulnerable language. The resulting dataset comprises 1.8M sentences, 2.6M entity mentions, and 24.7M tokens. Rigorous analyses confirm the effectiveness of our method and the high quality of the resulting dataset, with F1 score improvements of 26% in Marathi and 46% in Sanskrit over the current state of the art. We further extend our analyses to zero-shot and cross-lingual settings, systematically investigating the impact of script similarity and multilingualism on cross-lingual FgNER performance. The dataset is publicly available at [huggingface.co/datasets/prachuryyaIITG/CLASSER](https://huggingface.co/datasets/prachuryyaIITG/CLASSER).
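To make the first stage described above more concrete, the following is a minimal sketch of annotation projection over a single sentence pair. It is not the projection tool used in CLASSER: it assumes that word alignments between source and target tokens are already available as index pairs (in practice these would come from an aligner built on a multilingual encoder), and it uses a simple heuristic that maps each source entity to the smallest contiguous target span covering its aligned tokens. The function names, the BIO tag scheme, and the coarse labels `PER` and `LOC` are illustrative assumptions; the script-similarity refinement stage is not shown.

```python
from typing import List, Tuple


def extract_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end_exclusive, label) for each entity in a BIO sequence."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        inside = tag.startswith("I-") and label == tag[2:]
        if start is not None and not inside:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


def project_tags(
    src_tags: List[str],
    alignments: List[Tuple[int, int]],  # (source index, target index) pairs
    tgt_len: int,
) -> List[str]:
    """Project BIO tags from source tokens onto target tokens (heuristic sketch)."""
    tgt_tags = ["O"] * tgt_len
    for start, end, label in extract_spans(src_tags):
        # Target positions aligned to any token of this source entity.
        tgt_idx = sorted({t for s, t in alignments if start <= s < end})
        if not tgt_idx:
            continue  # no aligned target token: the entity is dropped
        lo, hi = tgt_idx[0], tgt_idx[-1]
        if any(tag != "O" for tag in tgt_tags[lo:hi + 1]):
            continue  # would overlap an already projected entity: skip it
        tgt_tags[lo] = "B-" + label
        for j in range(lo + 1, hi + 1):
            tgt_tags[j] = "I-" + label
    return tgt_tags


# Hypothetical four-token sentence pair with a different target word order.
src_tags = ["B-PER", "O", "O", "B-LOC"]
alignments = [(0, 0), (1, 3), (2, 2), (3, 1)]
print(project_tags(src_tags, alignments, tgt_len=4))
# ['B-PER', 'B-LOC', 'O', 'O']
```

A heuristic of this kind is sensitive to alignment errors, since a missing or wrong alignment drops or misplaces an entity. This is one reason why projected annotations generally benefit from a subsequent refinement step such as the one the abstract describes.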