Multi-Parallel Corpus of North Levantine Arabic
Mateusz Krubiński, Hashem Sellat, Shadi Saleh, Adam Pospíšil, Petr Zemánek, Pavel Pecina
Abstract
Low-resource Machine Translation (MT) is characterized by the scarce availability of training data and/or standardized evaluation benchmarks. In the context of Dialectal Arabic, recent works introduced several evaluation benchmarks covering both Modern Standard Arabic (MSA) and dialects; these, however, mostly map to a single Indo-European language, English. In this work, we introduce a multi-lingual corpus consisting of 120,600 multi-parallel sentences in English, French, German, Greek, Spanish, and MSA selected from the OpenSubtitles corpus, which were manually translated into North Levantine Arabic. By conducting a series of training and fine-tuning experiments, we explore how this novel resource can contribute to research on Arabic MT.
- Anthology ID:
- 2023.arabicnlp-1.34
- Volume:
- Proceedings of ArabicNLP 2023
- Month:
- December
- Year:
- 2023
- Address:
- Singapore (Hybrid)
- Editors:
- Hassan Sawaf, Samhaa El-Beltagy, Wajdi Zaghouani, Walid Magdy, Ahmed Abdelali, Nadi Tomeh, Ibrahim Abu Farha, Nizar Habash, Salam Khalifa, Amr Keleg, Hatem Haddad, Imed Zitouni, Khalil Mrini, Rawan Almatham
- Venues:
- ArabicNLP | WS
- SIG:
- SIGARAB
- Publisher:
- Association for Computational Linguistics
- Pages:
- 411–417
- URL:
- https://preview.aclanthology.org/build-pipeline-with-new-library/2023.arabicnlp-1.34/
- DOI:
- 10.18653/v1/2023.arabicnlp-1.34
- Cite (ACL):
- Mateusz Krubiński, Hashem Sellat, Shadi Saleh, Adam Pospíšil, Petr Zemánek, and Pavel Pecina. 2023. Multi-Parallel Corpus of North Levantine Arabic. In Proceedings of ArabicNLP 2023, pages 411–417, Singapore (Hybrid). Association for Computational Linguistics.
- Cite (Informal):
- Multi-Parallel Corpus of North Levantine Arabic (Krubiński et al., ArabicNLP 2023)
- PDF:
- https://preview.aclanthology.org/build-pipeline-with-new-library/2023.arabicnlp-1.34.pdf