TURJUMAN: A Public Toolkit for Neural Arabic Machine Translation

El Moatez Billah Nagoudi, AbdelRahim Elmadany, Muhammad Abdul-Mageed


Abstract
We present TURJUMAN, a neural toolkit for translating from 20 languages into Modern Standard Arabic (MSA). TURJUMAN builds on the recently introduced text-to-text Transformer model AraT5, which gives it a strong ability to decode into Arabic. The toolkit supports several decoding methods, which also makes it suitable for generating paraphrases of the MSA translations. To train TURJUMAN, we sample from publicly available parallel data, using a simple semantic similarity method to ensure data quality. This allows us to prepare and release AraOPUS-20, a new machine translation benchmark. We publicly release our translation toolkit (TURJUMAN) as well as our benchmark dataset (AraOPUS-20).
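The abstract does not spell out the semantic similarity method, so the following is only a minimal sketch of one common way such filtering can work: keep a source/target pair when the cosine similarity of its two sentence embeddings reaches a threshold. The function names, the 0.7 threshold, and the tiny hand-written vectors are all assumptions made for illustration; in practice the vectors would come from a multilingual sentence encoder, and the paper's actual procedure may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_pairs(pairs, threshold=0.7):
    """Keep (source, target) pairs whose embedding similarity is >= threshold.

    Each item is (source_text, target_text, source_vec, target_vec).
    The threshold value is hypothetical, not taken from the paper.
    """
    kept = []
    for src, tgt, src_vec, tgt_vec in pairs:
        if cosine(src_vec, tgt_vec) >= threshold:
            kept.append((src, tgt))
    return kept


# Toy data: hand-written stand-ins for real sentence embeddings.
toy_pairs = [
    ("Good morning", "صباح الخير", [0.9, 0.1, 0.2], [0.8, 0.2, 0.2]),   # aligned pair
    ("Good morning", "أين المحطة؟", [0.9, 0.1, 0.2], [0.0, 1.0, 0.1]),  # misaligned pair
]

kept = filter_pairs(toy_pairs)
print(kept)
```

With these toy vectors only the first pair passes the filter; lowering the threshold to 0.0 would keep both.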
Anthology ID:
2022.osact-1.1
Volume:
Proceedings of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur'an QA and Fine-Grained Hate Speech Detection
Month:
June
Year:
2022
Address:
Marseille, France
Venue:
OSACT
Publisher:
European Language Resources Association
Pages:
1–11
URL:
https://aclanthology.org/2022.osact-1.1
Cite (ACL):
El Moatez Billah Nagoudi, AbdelRahim Elmadany, and Muhammad Abdul-Mageed. 2022. TURJUMAN: A Public Toolkit for Neural Arabic Machine Translation. In Proceedings of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur'an QA and Fine-Grained Hate Speech Detection, pages 1–11, Marseille, France. European Language Resources Association.
Cite (Informal):
TURJUMAN: A Public Toolkit for Neural Arabic Machine Translation (Nagoudi et al., OSACT 2022)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2022.osact-1.1.pdf
Code:
ubc-nlp/turjuman