Building a User-Generated Content North-African Arabizi Treebank: Tackling Hell
Djamé Seddah, Farah Essaidi, Amal Fethi, Matthieu Futeral, Benjamin Muller, Pedro Javier Ortiz Suárez, Benoît Sagot, Abhishek Srivastava
Abstract
We introduce the first treebank for a romanized user-generated content variety of Algerian, a North-African Arabic dialect known for its frequent usage of code-switching. Made of 1500 sentences, fully annotated in morpho-syntax and Universal Dependency syntax, with full translation at both the word and the sentence levels, this treebank is made freely available. It is supplemented with 50k unlabeled sentences collected from Common Crawl and web-crawled data using intensive data-mining techniques. Preliminary experiments demonstrate its usefulness for POS tagging and dependency parsing. We believe that what we present in this paper is useful beyond the low-resource language community. This is the first time that enough unlabeled and annotated data is provided for an emerging user-generated content dialectal language with rich morphology and code switching, making it an challenging test-bed for most recent NLP approaches.- Anthology ID:
- 2020.acl-main.107
- Volume:
- Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
- Month:
- July
- Year:
- 2020
- Address:
- Online
- Editors:
- Dan Jurafsky, Joyce Chai, Natalie Schluter, Joel Tetreault
- Venue:
- ACL
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 1139–1150
- Language:
- URL:
- https://aclanthology.org/2020.acl-main.107
- DOI:
- 10.18653/v1/2020.acl-main.107
- Cite (ACL):
- Djamé Seddah, Farah Essaidi, Amal Fethi, Matthieu Futeral, Benjamin Muller, Pedro Javier Ortiz Suárez, Benoît Sagot, and Abhishek Srivastava. 2020. Building a User-Generated Content North-African Arabizi Treebank: Tackling Hell. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1139–1150, Online. Association for Computational Linguistics.
- Cite (Informal):
- Building a User-Generated Content North-African Arabizi Treebank: Tackling Hell (Seddah et al., ACL 2020)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-2/2020.acl-main.107.pdf