Contemporary Amharic Corpus: Automatically Morpho-Syntactically Tagged Amharic Corpus
Andargachew Mekonnen Gezmu, Binyam Ephrem Seyoum, Michael Gasser, Andreas Nürnberger
Abstract
We introduce the Contemporary Amharic Corpus, which is automatically tagged for morpho-syntactic information. The texts come from 25,199 documents in different domains, and tokenization yields about 24 million orthographic words. Since it is partly a web corpus, we applied automatic spelling error correction. We also modified the existing morphological analyzer, HornMorpho, to use it for the automatic tagging.
- Anthology ID:
- W18-3809
- Volume:
- Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing
- Month:
- August
- Year:
- 2018
- Address:
- Santa Fe, New Mexico, USA
- Editors:
- Peter Machonis, Anabela Barreiro, Kristina Kocijan, Max Silberztein
- Venue:
- LR4NLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 65–70
- URL:
- https://preview.aclanthology.org/jlcl-multiple-ingestion/W18-3809/
- Cite (ACL):
- Andargachew Mekonnen Gezmu, Binyam Ephrem Seyoum, Michael Gasser, and Andreas Nürnberger. 2018. Contemporary Amharic Corpus: Automatically Morpho-Syntactically Tagged Amharic Corpus. In Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing, pages 65–70, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
- Cite (Informal):
- Contemporary Amharic Corpus: Automatically Morpho-Syntactically Tagged Amharic Corpus (Gezmu et al., LR4NLP 2018)
- PDF:
- https://preview.aclanthology.org/jlcl-multiple-ingestion/W18-3809.pdf