SiPOS: A Benchmark Dataset for Sindhi Part-of-Speech Tagging

Wazir Ali; Zenglin Xu; Jay Kumar

Abstract

In this paper, we introduce SiPOS, a dataset for part-of-speech tagging in the low-resource Sindhi language, together with quality baselines. The dataset consists of more than 293K tokens annotated with sixteen universal part-of-speech categories. Two experienced native-speaker annotators labeled SiPOS using the Doccano text annotation tool, with an inter-annotator agreement of 0.872. We evaluate the dataset with a conditional random field, the popular bidirectional long short-term memory (BiLSTM) neural model, and a self-attention mechanism under various settings. In addition to pre-trained GloVe and fastText word representations, character-level representations are incorporated, with a BiLSTM encoder extracting the character-level information. A high accuracy of 96.25% is achieved with task-specific joint word-level and character-level representations. The SiPOS dataset is likely to be a significant resource for the low-resource Sindhi language.

Anthology ID: 2021.ranlp-srw.4
Volume: Proceedings of the Student Research Workshop Associated with RANLP 2021
Month: September
Year: 2021
Address: Online
Venue: RANLP
Publisher: INCOMA Ltd.
Pages: 22–30
URL: https://aclanthology.org/2021.ranlp-srw.4
Cite (ACL): Wazir Ali, Zenglin Xu, and Jay Kumar. 2021. SiPOS: A Benchmark Dataset for Sindhi Part-of-Speech Tagging. In Proceedings of the Student Research Workshop Associated with RANLP 2021, pages 22–30, Online. INCOMA Ltd.
Cite (Informal): SiPOS: A Benchmark Dataset for Sindhi Part-of-Speech Tagging (Ali et al., RANLP 2021)
PDF: https://preview.aclanthology.org/paclic-22-ingestion/2021.ranlp-srw.4.pdf