Abstract
In this paper, we introduce SiPOS, a dataset for part-of-speech tagging in the low-resource Sindhi language, together with quality baselines. The dataset consists of more than 293K tokens annotated with sixteen universal part-of-speech categories. Two experienced native annotators annotated SiPOS using the Doccano text annotation tool, with an inter-annotator agreement of 0.872. We evaluate the proposed dataset with a conditional random field, the popular bidirectional long short-term memory neural model, and a self-attention mechanism under various settings. Besides pre-trained GloVe and fastText representations, character-level representations are incorporated, using a bidirectional long short-term memory encoder to extract character-level information. A high accuracy of 96.25% is achieved with task-specific joint word-level and character-level representations. The SiPOS dataset is likely to be a significant resource for the low-resource Sindhi language.

- Anthology ID:
- 2021.ranlp-srw.4
- Volume:
- Proceedings of the Student Research Workshop Associated with RANLP 2021
- Month:
- September
- Year:
- 2021
- Address:
- Online
- Venue:
- RANLP
- Publisher:
- INCOMA Ltd.
- Pages:
- 22–30
- URL:
- https://aclanthology.org/2021.ranlp-srw.4
- Cite (ACL):
- Wazir Ali, Zenglin Xu, and Jay Kumar. 2021. SiPOS: A Benchmark Dataset for Sindhi Part-of-Speech Tagging. In Proceedings of the Student Research Workshop Associated with RANLP 2021, pages 22–30, Online. INCOMA Ltd.
- Cite (Informal):
- SiPOS: A Benchmark Dataset for Sindhi Part-of-Speech Tagging (Ali et al., RANLP 2021)
- PDF:
- https://preview.aclanthology.org/paclic-22-ingestion/2021.ranlp-srw.4.pdf