Universal Dependencies for Sindhi

John Bauer; Sakiina Shah; Muhammad Shaheer; Mir Afza Ahmed Talpur; Zubair Sanjrani; Sarwat Qureshi; Shafi Pirzada; Christopher D. Manning; Mutee U Rahman

Abstract

Sindhi is an Indo-Aryan language spoken primarily in Pakistan and India by about 40 million people. Despite this extensive use, it is a low-resource language for NLP tasks, with few datasets or pretrained embeddings available. In this work, we explore linguistic challenges for annotating Sindhi in the UD paradigm, such as language-specific analysis of adpositions and verb forms. We use this analysis to present a newly annotated dependency treebank for Universal Dependencies, along with pretrained embeddings and an annotation pipeline built specifically for Sindhi.

Anthology ID: 2025.udw-1.11
Volume: Proceedings of the Eighth Workshop on Universal Dependencies (UDW, SyntaxFest 2025)
Month: August
Year: 2025
Address: Ljubljana, Slovenia
Editors: Gosse Bouma, Çağrı Çöltekin
Venues: UDW | WS | SyntaxFest
SIG: SIGPARSE
Publisher: Association for Computational Linguistics
Pages: 105–118
URL: https://preview.aclanthology.org/mtsummit-25-ingestion/2025.udw-1.11/
Cite (ACL): John Bauer, Sakiina Shah, Muhammad Shaheer, Mir Afza Ahmed Talpur, Zubair Sanjrani, Sarwat Qureshi, Shafi Pirzada, Christopher D. Manning, and Mutee U Rahman. 2025. Universal Dependencies for Sindhi. In Proceedings of the Eighth Workshop on Universal Dependencies (UDW, SyntaxFest 2025), pages 105–118, Ljubljana, Slovenia. Association for Computational Linguistics.
Cite (Informal): Universal Dependencies for Sindhi (Bauer et al., UDW-SyntaxFest 2025)
PDF: https://preview.aclanthology.org/mtsummit-25-ingestion/2025.udw-1.11.pdf
