Filling the Gap for Uzbek: Creating Translation Resources for Southern Uzbek
Mukhammadsaid Mamasaidov, Azizullah Aral, Abror Shopulatov, Mironshoh Inomjonov
Abstract
Southern Uzbek (uzs) is a Turkic language variety spoken by around 5 million people in Afghanistan and differs significantly from Northern Uzbek (uzn) in phonology, lexicon, and orthography. Despite the large number of speakers, Southern Uzbek is underrepresented in natural language processing. We present new resources for Southern Uzbek machine translation, including a 997-sentence FLORES+ dev set, 39,994 parallel sentences from dictionary, literary, and web sources, and a fine-tuned NLLB-200 model (lutfiy). We also propose a post-processing method for restoring Arabic-script half-space characters, which improves handling of morphological boundaries. All datasets, models, and tools are released publicly to support future work on Southern Uzbek and other low-resource languages.
- Anthology ID: 2025.wmt-1.83
- Volume: Proceedings of the Tenth Conference on Machine Translation
- Month: November
- Year: 2025
- Address: Suzhou, China
- Editors: Barry Haddow, Tom Kocmi, Philipp Koehn, Christof Monz
- Venue: WMT
- Publisher: Association for Computational Linguistics
- Pages: 1081–1087
- URL: https://preview.aclanthology.org/ingest-emnlp/2025.wmt-1.83/
- Cite (ACL): Mukhammadsaid Mamasaidov, Azizullah Aral, Abror Shopulatov, and Mironshoh Inomjonov. 2025. Filling the Gap for Uzbek: Creating Translation Resources for Southern Uzbek. In Proceedings of the Tenth Conference on Machine Translation, pages 1081–1087, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal): Filling the Gap for Uzbek: Creating Translation Resources for Southern Uzbek (Mamasaidov et al., WMT 2025)
- PDF: https://preview.aclanthology.org/ingest-emnlp/2025.wmt-1.83.pdf