Abstract
Dependency annotation can be a laborious process for under-resourced languages. However, in some cases, other resources are available. We investigate whether we can leverage such resources in the case of Swahili: We use the Helsinki Corpus of Swahili for creating a Universal Dependencies treebank for Swahili. The Helsinki Corpus of Swahili provides word-level annotations for part-of-speech tags, morphological features, and functional syntactic tags. We train neural taggers for these types of annotations, then use those models to annotate our target corpus, the Swahili portion of the OPUS Global Voices Corpus. Based on those annotations, we then manually create constraint grammar rules to annotate the target corpus for Universal Dependencies. In this paper, we describe the process, discuss the annotation decisions we had to make, and evaluate the approach.
- Anthology ID:
- 2023.rail-1.10
- Volume:
- Proceedings of the Fourth workshop on Resources for African Indigenous Languages (RAIL 2023)
- Month:
- May
- Year:
- 2023
- Address:
- Dubrovnik, Croatia
- Editors:
- Rooweither Mabuya, Don Mthobela, Mmasibidi Setaka, Menno Van Zaanen
- Venue:
- RAIL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 86–96
- URL:
- https://aclanthology.org/2023.rail-1.10
- DOI:
- 10.18653/v1/2023.rail-1.10
- Cite (ACL):
- Kenneth Steimel and Sandra Kübler. 2023. Towards a Swahili Universal Dependency Treebank: Leveraging the Annotations of the Helsinki Corpus of Swahili. In Proceedings of the Fourth workshop on Resources for African Indigenous Languages (RAIL 2023), pages 86–96, Dubrovnik, Croatia. Association for Computational Linguistics.
- Cite (Informal):
- Towards a Swahili Universal Dependency Treebank: Leveraging the Annotations of the Helsinki Corpus of Swahili (Steimel & Kübler, RAIL 2023)
- PDF:
- https://preview.aclanthology.org/ingest-acl-2023-videos/2023.rail-1.10.pdf