Mutee U Rahman

2025

Sindhi is an Indo-Aryan language spoken primarily in Pakistan and India by about 40 million people. Despite this large speaker base, it remains a low-resource language for NLP, with few datasets or pretrained embeddings available. In this work, we examine the linguistic challenges of annotating Sindhi in the Universal Dependencies (UD) paradigm, such as the language-specific analysis of adpositions and verb forms. Building on this analysis, we present a newly annotated UD dependency treebank for Sindhi, along with pretrained embeddings and an annotation pipeline tailored to the language.

Co-authors

John Bauer 1
Christopher D. Manning 1
Shafi Pirzada 1
Sarwat Qureshi 1
Zubair Sanjrani 1
Sakiina Shah 1
Muhammad Shaheer 1
Mir Afza Ahmed Talpur 1

Venues

syntaxfest 1
udw 1
ws 1
