Parallel resources for Tunisian Arabic Dialect Translation

Saméh Kchaou; Rahma Boujelbane; Lamia Hadrich Belguith

Abstract

The difficulty of processing dialects is evident in the high cost of building a representative corpus, in particular for machine translation. Machine translation systems require large amounts of well-managed training data, which is a challenge in a low-resource setting such as the Tunisian Arabic dialect. In this paper, we present a data augmentation technique to create a parallel corpus between the Tunisian Arabic dialect as written on social media and Standard Arabic, in order to build a Machine Translation (MT) model. The created corpus was used to build a sentence-based translation model. This model reached a BLEU score of 15.03% on a test set, compared with 13.27% when using the corpus without augmentation.

Anthology ID: 2020.wanlp-1.18
Volume: Proceedings of the Fifth Arabic Natural Language Processing Workshop
Month: December
Year: 2020
Address: Barcelona, Spain (Online)
Editors: Imed Zitouni, Muhammad Abdul-Mageed, Houda Bouamor, Fethi Bougares, Mahmoud El-Haj, Nadi Tomeh, Wajdi Zaghouani
Venue: WANLP
SIG: SIGARAB
Publisher: Association for Computational Linguistics
Pages: 200–206
URL: https://preview.aclanthology.org/jlcl-multiple-ingestion/2020.wanlp-1.18/
Cite (ACL): Saméh Kchaou, Rahma Boujelbane, and Lamia Hadrich-Belguith. 2020. Parallel resources for Tunisian Arabic Dialect Translation. In Proceedings of the Fifth Arabic Natural Language Processing Workshop, pages 200–206, Barcelona, Spain (Online). Association for Computational Linguistics.
Cite (Informal): Parallel resources for Tunisian Arabic Dialect Translation (Kchaou et al., WANLP 2020)
PDF: https://preview.aclanthology.org/jlcl-multiple-ingestion/2020.wanlp-1.18.pdf
