AraGPT2: Pre-Trained Transformer for Arabic Language Generation

Wissam Antoun, Fady Baly, Hazem Hajj


Abstract
Recently, pre-trained transformer-based architectures have proven to be very effective at language modeling and understanding, provided they are trained on a sufficiently large corpus. Applications in language generation for Arabic still lag behind other NLP advances, primarily due to the lack of advanced Arabic language generation models. In this paper, we develop the first advanced Arabic language generation model, AraGPT2, trained from scratch on a large Arabic corpus of internet text and news articles. Our largest model, AraGPT2-mega, has 1.46 billion parameters, making it the largest Arabic language model available. The mega model was evaluated and showed success on different tasks, including synthetic news generation and zero-shot question answering. For text generation, our best model achieves a perplexity of 29.8 on held-out Wikipedia articles. A study conducted with human evaluators showed the significant success of AraGPT2-mega in generating news articles that are difficult to distinguish from articles written by humans. We thus develop and release an automatic discriminator model with 98% accuracy in detecting model-generated text. The models are also publicly available, in the hope of encouraging new research directions and applications for Arabic NLP.
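Since the paper states that the models are publicly released, a minimal sketch of how one might generate Arabic text with a released checkpoint via the Hugging Face transformers library is shown below. The model identifier "aubmindlab/aragpt2-base" and the sampling settings are assumptions, not the paper's exact setup.

```python
# Minimal sketch: open-ended Arabic text generation with an AraGPT2 checkpoint.
# The model ID "aubmindlab/aragpt2-base" is an assumption based on the authors'
# public release; decoding parameters are illustrative, not the paper's settings.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="aubmindlab/aragpt2-base",  # assumed Hugging Face model ID
)

prompt = "يذكر المؤرخون أن"  # "Historians mention that ..."
outputs = generator(
    prompt,
    max_length=100,          # total length in tokens, including the prompt
    do_sample=True,          # sample rather than greedy decoding
    top_p=0.95,              # nucleus sampling
    num_return_sequences=1,
)
print(outputs[0]["generated_text"])
```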
Anthology ID:
2021.wanlp-1.21
Volume:
Proceedings of the Sixth Arabic Natural Language Processing Workshop
Month:
April
Year:
2021
Address:
Kyiv, Ukraine (Virtual)
Editors:
Nizar Habash, Houda Bouamor, Hazem Hajj, Walid Magdy, Wajdi Zaghouani, Fethi Bougares, Nadi Tomeh, Ibrahim Abu Farha, Samia Touileb
Venue:
WANLP
Publisher:
Association for Computational Linguistics
Pages:
196–207
URL:
https://aclanthology.org/2021.wanlp-1.21
Cite (ACL):
Wissam Antoun, Fady Baly, and Hazem Hajj. 2021. AraGPT2: Pre-Trained Transformer for Arabic Language Generation. In Proceedings of the Sixth Arabic Natural Language Processing Workshop, pages 196–207, Kyiv, Ukraine (Virtual). Association for Computational Linguistics.
Cite (Informal):
AraGPT2: Pre-Trained Transformer for Arabic Language Generation (Antoun et al., WANLP 2021)
PDF:
https://aclanthology.org/2021.wanlp-1.21.pdf
Code
 aub-mind/araBERT
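
The abstract also reports a perplexity of 29.8 on held-out Wikipedia articles. A minimal sketch of how held-out perplexity could be estimated with a released checkpoint is given below; the model ID is an assumption, and the paper's exact evaluation corpus, tokenization, and stride are not reproduced here.

```python
# Minimal sketch: estimating perplexity of a held-out Arabic text with an
# AraGPT2 checkpoint. Model ID is assumed; evaluation details differ from the paper.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "aubmindlab/aragpt2-base"  # assumed Hugging Face model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

text = "..."  # a held-out Wikipedia article would go here
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # With labels equal to the inputs, the model returns the mean token-level
    # cross-entropy loss; exp(loss) is the perplexity of the sequence.
    out = model(**enc, labels=enc["input_ids"])

print(f"perplexity: {torch.exp(out.loss).item():.2f}")
```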