Literary Translations and Synthetic Data for Machine Translation of Low-resourced Middle Eastern Languages

Sina Ahmadi, Razhan Hameed, Rico Sennrich


Abstract
Middle Eastern languages represent a linguistically diverse landscape, yet few have received substantial attention in language and speech technology outside those with official status. Machine translation, a cornerstone application in computational linguistics, remains particularly underexplored for these predominantly non-standardized, spoken varieties. This paper proposes data alignment and augmentation techniques that leverage monolingual corpora and large language models to create high-quality parallel corpora for low-resource Middle Eastern languages. Through systematic fine-tuning of a pretrained machine translation model in a multilingual framework, our results demonstrate that corpus quality consistently outperforms quantity as a determinant of translation accuracy. Furthermore, we provide empirical evidence that strategic data selection significantly enhances cross-lingual transfer in multilingual translation systems. These findings offer valuable insights for developing machine translation solutions in linguistically diverse, resource-constrained environments.
Anthology ID:
2025.iwslt-1.10
Volume:
Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025)
Month:
July
Year:
2025
Address:
Vienna, Austria (in-person and online)
Editors:
Elizabeth Salesky, Marcello Federico, Antonis Anastasopoulos
Venues:
IWSLT | WS
Publisher:
Association for Computational Linguistics
Pages:
110–118
URL:
https://preview.aclanthology.org/landing_page/2025.iwslt-1.10/
Cite (ACL):
Sina Ahmadi, Razhan Hameed, and Rico Sennrich. 2025. Literary Translations and Synthetic Data for Machine Translation of Low-resourced Middle Eastern Languages. In Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025), pages 110–118, Vienna, Austria (in-person and online). Association for Computational Linguistics.
Cite (Informal):
Literary Translations and Synthetic Data for Machine Translation of Low-resourced Middle Eastern Languages (Ahmadi et al., IWSLT 2025)
PDF:
https://preview.aclanthology.org/landing_page/2025.iwslt-1.10.pdf