Enhancing NER Performance in Low-Resource Pakistani Languages using Cross-Lingual Data Augmentation

Toqeer Ehsan, Thamar Solorio


Abstract
Named Entity Recognition (NER), a fundamental task in Natural Language Processing (NLP), has advanced significantly for high-resource languages. However, due to a lack of annotated datasets and limited representation in Pre-trained Language Models (PLMs), it remains understudied and challenging for low-resource languages. To address these challenges, we propose a data augmentation technique that generates culturally plausible sentences, and we evaluate it on four low-resource Pakistani languages: Urdu, Shahmukhi, Sindhi, and Pashto. With fine-tuned multilingual masked Large Language Models (LLMs), our approach yields significant improvements in NER performance for Shahmukhi and Pashto. We further explore the capability of generative LLMs for NER and data augmentation using few-shot learning.
Anthology ID:
2025.wnut-1.13
Volume:
Proceedings of the Tenth Workshop on Noisy and User-generated Text
Month:
May
Year:
2025
Address:
Albuquerque, New Mexico, USA
Editors:
JinYeong Bak, Rob van der Goot, Hyeju Jang, Weerayut Buaphet, Alan Ramponi, Wei Xu, Alan Ritter
Venues:
WNUT | WS
Publisher:
Association for Computational Linguistics
Pages:
117–132
URL:
https://preview.aclanthology.org/landing_page/2025.wnut-1.13/
Cite (ACL):
Toqeer Ehsan and Thamar Solorio. 2025. Enhancing NER Performance in Low-Resource Pakistani Languages using Cross-Lingual Data Augmentation. In Proceedings of the Tenth Workshop on Noisy and User-generated Text, pages 117–132, Albuquerque, New Mexico, USA. Association for Computational Linguistics.
Cite (Informal):
Enhancing NER Performance in Low-Resource Pakistani Languages using Cross-Lingual Data Augmentation (Ehsan & Solorio, WNUT 2025)
PDF:
https://preview.aclanthology.org/landing_page/2025.wnut-1.13.pdf