Abstract
The difficulty of processing dialects is evident in the high cost of building a representative corpus, in particular for machine translation. Indeed, all machine translation systems require a large amount of well-managed training data, which is a challenge in a low-resource setting such as the Tunisian Arabic dialect. In this paper, we present a data augmentation technique to create a parallel corpus between the Tunisian Arabic dialect as written on social media and standard Arabic, in order to build a Machine Translation (MT) model. The created corpus was used to build a sentence-based translation model. This model reached a BLEU score of 15.03% on a test set, compared with 13.27% when using the corpus without augmentation.
- Anthology ID:
- 2020.wanlp-1.18
- Volume:
- Proceedings of the Fifth Arabic Natural Language Processing Workshop
- Month:
- December
- Year:
- 2020
- Address:
- Barcelona, Spain (Online)
- Venue:
- WANLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 200–206
- URL:
- https://aclanthology.org/2020.wanlp-1.18
- Cite (ACL):
- Saméh Kchaou, Rahma Boujelbane, and Lamia Hadrich-Belguith. 2020. Parallel resources for Tunisian Arabic Dialect Translation. In Proceedings of the Fifth Arabic Natural Language Processing Workshop, pages 200–206, Barcelona, Spain (Online). Association for Computational Linguistics.
- Cite (Informal):
- Parallel resources for Tunisian Arabic Dialect Translation (Kchaou et al., WANLP 2020)
- PDF:
- https://preview.aclanthology.org/author-url/2020.wanlp-1.18.pdf