TURJUMAN: A Public Toolkit for Neural Arabic Machine Translation
El Moatez Billah Nagoudi, AbdelRahim Elmadany, Muhammad Abdul-Mageed
Abstract
We present TURJUMAN, a neural toolkit for translating from 20 languages into Modern Standard Arabic (MSA). TURJUMAN exploits the recently introduced text-to-text Transformer AraT5 model, endowing it with a powerful ability to decode into Arabic. The toolkit offers a number of diverse decoding methods, which also makes it suitable for acquiring paraphrases of the MSA translations as an added value. To train TURJUMAN, we sample from publicly available parallel data, employing a simple semantic similarity method to ensure data quality. This allows us to prepare and release AraOPUS-20, a new machine translation benchmark. We publicly release our translation toolkit (TURJUMAN) as well as our benchmark dataset (AraOPUS-20).
- Anthology ID:
- 2022.osact-1.1
- Volume:
- Proceedings of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur'an QA and Fine-Grained Hate Speech Detection
- Month:
- June
- Year:
- 2022
- Address:
- Marseille, France
- Venue:
- OSACT
- Publisher:
- European Language Resources Association
- Pages:
- 1–11
- URL:
- https://aclanthology.org/2022.osact-1.1
- Cite (ACL):
- El Moatez Billah Nagoudi, AbdelRahim Elmadany, and Muhammad Abdul-Mageed. 2022. TURJUMAN: A Public Toolkit for Neural Arabic Machine Translation. In Proceedings of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur'an QA and Fine-Grained Hate Speech Detection, pages 1–11, Marseille, France. European Language Resources Association.
- Cite (Informal):
- TURJUMAN: A Public Toolkit for Neural Arabic Machine Translation (Nagoudi et al., OSACT 2022)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/2022.osact-1.1.pdf
- Code:
- ubc-nlp/turjuman
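
The abstract mentions sampling parallel data with a simple semantic similarity method to ensure quality. A minimal sketch of that general idea follows, not the authors' implementation: it assumes each source and target sentence already has an embedding from some multilingual sentence encoder, and it keeps only pairs whose cosine similarity meets a threshold. The function names, the toy vectors, and the 0.7 threshold are all hypothetical and chosen for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero embedding as dissimilar to everything
    return dot / (norm_u * norm_v)


def filter_pairs(pairs, threshold=0.7):
    """Keep (source, target) pairs whose embeddings are similar enough.

    `pairs` holds (source_text, target_text, source_vec, target_vec) tuples.
    The threshold is a hypothetical value, not one taken from the paper.
    """
    return [
        (src, tgt)
        for src, tgt, src_vec, tgt_vec in pairs
        if cosine(src_vec, tgt_vec) >= threshold
    ]


# Toy embeddings standing in for the output of a real sentence encoder.
pairs = [
    ("Good morning", "صباح الخير", [0.9, 0.1, 0.0], [0.8, 0.2, 0.1]),
    ("Good morning", "شكرا", [0.9, 0.1, 0.0], [0.0, 0.2, 0.9]),
]
kept = filter_pairs(pairs)
```

With these toy vectors, the first pair is kept and the mismatched second pair is dropped. In practice the threshold would need tuning per language pair, since a value that is too strict discards valid but loosely phrased translations.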