PARME: Parallel Corpora for Low-Resourced Middle Eastern Languages
Sina Ahmadi, Rico Sennrich, Erfan Karami, Ako Marani, Parviz Fekrazad, Gholamreza Akbarzadeh Baghban, Hanah Hadi, Semko Heidari, Mahîr Dogan, Pedram Asadi, Dashne Bashir, Mohammad Amin Ghodrati, Kourosh Amini, Zeynab Ashourinezhad, Mana Baladi, Farshid Ezzati, Alireza Ghasemifar, Daryoush Hosseinpour, Behrooz Abbaszadeh, Amin Hassanpour, Bahaddin Jalal Hamaamin, Saya Kamal Hama, Ardeshir Mousavi, Sarko Nazir Hussein, Isar Nejadgholi, Mehmet Ölmez, Horam Osmanpour, Rashid Roshan Ramezani, Aryan Sediq Aziz, Ali Salehi, Mohammadreza Yadegari, Kewyar Yadegari, Sedighe Zamani Roodsari
Abstract
The Middle East is characterized by remarkable linguistic diversity, with over 400 million inhabitants speaking more than 60 languages across multiple language families. This study presents a pioneering work in developing the first parallel corpora for eight severely under-resourced varieties in the region–PARME, addressing fundamental challenges in low-resource scenarios including non-standardized writing and dialectal complexity. Through an extensive community-driven initiative, volunteers contributed to the creation of over 36,000 translated sentences, marking a significant milestone in resource development. We evaluate machine translation capabilities through zero-shot approaches and fine-tuning experiments with pretrained machine translation models and provide a comprehensive analysis of limitations. Our findings reveal significant gaps in existing technologies for processing the selected languages, highlighting critical areas for improvement in language technology for Middle Eastern languages.- Anthology ID:
- 2025.acl-long.1451
- Volume:
- Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month:
- July
- Year:
- 2025
- Address:
- Vienna, Austria
- Editors:
- Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
- Venue:
- ACL
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 30032–30053
- Language:
- URL:
- https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1451/
- DOI:
- Cite (ACL):
- Sina Ahmadi, Rico Sennrich, Erfan Karami, Ako Marani, Parviz Fekrazad, Gholamreza Akbarzadeh Baghban, Hanah Hadi, Semko Heidari, Mahîr Dogan, Pedram Asadi, Dashne Bashir, Mohammad Amin Ghodrati, Kourosh Amini, Zeynab Ashourinezhad, Mana Baladi, Farshid Ezzati, Alireza Ghasemifar, Daryoush Hosseinpour, Behrooz Abbaszadeh, Amin Hassanpour, Bahaddin Jalal Hamaamin, Saya Kamal Hama, Ardeshir Mousavi, Sarko Nazir Hussein, Isar Nejadgholi, Mehmet Ölmez, Horam Osmanpour, Rashid Roshan Ramezani, Aryan Sediq Aziz, Ali Salehi, Mohammadreza Yadegari, Kewyar Yadegari, and Sedighe Zamani Roodsari. 2025. PARME: Parallel Corpora for Low-Resourced Middle Eastern Languages. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 30032–30053, Vienna, Austria. Association for Computational Linguistics.
- Cite (Informal):
- PARME: Parallel Corpora for Low-Resourced Middle Eastern Languages (Ahmadi et al., ACL 2025)
- PDF:
- https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1451.pdf