Meesum Alam
2026
Common Voice for Pakistan: Developing an Open Speech Corpus for Low-Resource Pakistani Languages
Meesum Alam | Francis Tyers
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Pakistan is home to more than 70 languages, 30 of which are endangered. Most Pakistani languages remain absent from modern speech and text technologies, with resources focused on Urdu and a few other major languages. This paper documents a one-year project, carried out through Mozilla’s Open Multilingual Speech Fund, to develop an open, community-driven speech corpus for 39 indigenous languages of Pakistan. The dataset includes locally authored texts, daily-life sentences, poetry, and folk songs to make it culturally balanced. The project not only supports Automatic Speech Recognition but also promotes linguistic preservation and digital inclusion.
2024
Universal Dependencies for Saraiki
Meesum Alam | Francis Tyers | Emily Hanink | Sandra Kübler
Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024
We present the first treebank of the Saraiki/Siraiki [ISO 639-3 skr] language, using the Universal Dependencies annotation scheme (de Marneffe et al., 2021). The treebank currently comprises 587 annotated sentences and 7597 tokens. We explain the most relevant syntactic and morphological features of Saraiki, along with the decisions we have made for a range of language-specific constructions, namely compounds, verbal structures including light verb and serial verb constructions, and relative clauses.