Abstract
Recent impressive improvements in NLP, largely based on the success of contextual neural language models, have mostly been demonstrated on at most a couple dozen high-resource languages. Building language models and, more generally, NLP systems for non-standardized and low-resource languages remains a challenging task. In this work, we focus on North-African colloquial dialectal Arabic written using an extension of the Latin script, called NArabizi, found mostly on social media and in messaging communication. In this low-resource scenario with data displaying a high level of variability, we compare the downstream performance of a character-based language model on part-of-speech tagging and dependency parsing to that of monolingual and multilingual models. We show that a character-based model trained on only 99k sentences of NArabizi and fine-tuned on a small treebank of this language leads to performance close to that obtained with the same architecture pre-trained on large multilingual and monolingual models. Confirming these results on a much larger dataset of noisy French user-generated content, we argue that such character-based language models can be an asset for NLP in low-resource and high language variability settings.
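To make "character-based" concrete, below is a minimal sketch of character-level tokenization, the input representation such models build on: every distinct character is a vocabulary item, so noisy spelling variants of the same word share most of their encoding. The sentences, names, and helper functions here are illustrative assumptions, not taken from the paper's code or data.

```python
# Minimal sketch of character-level tokenization for NArabizi-like text.
# All names and example sentences are hypothetical, not from the paper.

from collections import Counter

# Toy NArabizi-style sentences (dialectal Arabic in Latin script, where
# digits stand in for Arabic sounds, e.g. "3" for ayn, "7" for ha).
corpus = [
    "sa7a khouya rak mli7",
    "win rak ghadi lyoum",
]

# Build the character vocabulary: one token per distinct character.
PAD, UNK = "<pad>", "<unk>"
counts = Counter(ch for sent in corpus for ch in sent)
itos = [PAD, UNK] + sorted(counts)
stoi = {ch: i for i, ch in enumerate(itos)}

def encode(text: str) -> list[int]:
    """Map a sentence to character ids; unseen characters fall back to <unk>."""
    return [stoi.get(ch, stoi[UNK]) for ch in text]

def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)

# Spelling variants of the same word yield overlapping id sequences,
# which is what makes the representation robust to high variability.
print(encode("mli7"))  # ids for m, l, i, 7
print(encode("mlih"))  # differs only in the final id ('h' vs '7')
```

In the paper's setup, such character sequences feed a contextual language model that is then fine-tuned for POS tagging and dependency parsing; the exact architecture and training configuration are described in the full text.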
- Anthology ID: 2021.wnut-1.47
- Volume: Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)
- Month: November
- Year: 2021
- Address: Online
- Editors: Wei Xu, Alan Ritter, Tim Baldwin, Afshin Rahimi
- Venue: WNUT
- Publisher: Association for Computational Linguistics
- Pages: 423–436
- URL: https://aclanthology.org/2021.wnut-1.47
- DOI: 10.18653/v1/2021.wnut-1.47
- Cite (ACL): Arij Riabi, Benoît Sagot, and Djamé Seddah. 2021. Can Character-based Language Models Improve Downstream Task Performances In Low-Resource And Noisy Language Scenarios?. In Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021), pages 423–436, Online. Association for Computational Linguistics.
- Cite (Informal): Can Character-based Language Models Improve Downstream Task Performances In Low-Resource And Noisy Language Scenarios? (Riabi et al., WNUT 2021)
- PDF: https://aclanthology.org/2021.wnut-1.47.pdf