Abstract
This paper presents the first dependency treebank for Bhojpuri, a resource-poor language that belongs to the Indo-Aryan language family. The objective of the Bhojpuri Treebank (BHTB) project is to create a substantial, syntactically annotated treebank that not only serves as a valuable resource for building language technology tools but also helps in cross-lingual learning and typological research. Currently, the treebank consists of 4,881 tokens annotated in accordance with the Universal Dependencies (UD) annotation scheme. A Bhojpuri tagger and parser were created using a machine learning approach. The model achieves 57.49% unlabeled attachment score (UAS), 45.50% labeled attachment score (LAS), 79.69% UPOS accuracy, and 77.64% XPOS accuracy. The paper describes the details of the project, including a discussion of the linguistic analysis and annotation process of the Bhojpuri UD treebank.
- Anthology ID:
- 2020.wildre-1.7
- Volume:
- Proceedings of the WILDRE5– 5th Workshop on Indian Language Data: Resources and Evaluation
- Month:
- May
- Year:
- 2020
- Address:
- Marseille, France
- Editors:
- Girish Nath Jha, Kalika Bali, Sobha L., S. S. Agrawal, Atul Kr. Ojha
- Venue:
- WILDRE
- Publisher:
- European Language Resources Association (ELRA)
- Pages:
- 33–38
- Language:
- English
- URL:
- https://preview.aclanthology.org/add_missing_videos/2020.wildre-1.7/
- Cite (ACL):
- Atul Kr. Ojha and Daniel Zeman. 2020. Universal Dependency Treebanks for Low-Resource Indian Languages: The Case of Bhojpuri. In Proceedings of the WILDRE5– 5th Workshop on Indian Language Data: Resources and Evaluation, pages 33–38, Marseille, France. European Language Resources Association (ELRA).
- Cite (Informal):
- Universal Dependency Treebanks for Low-Resource Indian Languages: The Case of Bhojpuri (Ojha & Zeman, WILDRE 2020)
- PDF:
- https://preview.aclanthology.org/add_missing_videos/2020.wildre-1.7.pdf