Automatic dialect classification for statistical machine translation

Saab Mansour, Yaser Al-Onaizan, Graeme Blackwood, Christoph Tillmann


Abstract
The training data for statistical machine translation are gathered from various sources representing a mixture of domains. In this work, we argue that when translating dialects representing varieties of the same language, a manually assigned data source is not a reliable indicator of the dialect. We resort to automatic dialect classification to refine the training corpora according to the different dialects and build improved dialect-specific systems. A fairly standard classifier for Arabic developed within this work achieves state-of-the-art performance, with classification precision above 90%, making it usefully accurate for our application. The classification of the data is then used to distinguish between the different dialects, split the data accordingly, and utilize the new splits for several adaptation techniques. In translation experiments on a large-scale dialectal Arabic-to-English translation task, the classifier generates better contrast between the dialects and achieves better translation quality than the original manual corpora splits.
Anthology ID:
2014.amta-researchers.26
Volume:
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Researchers Track
Month:
October 22-26
Year:
2014
Address:
Vancouver, Canada
Editors:
Yaser Al-Onaizan, Michel Simard
Venue:
AMTA
Publisher:
Association for Machine Translation in the Americas
Pages:
342–355
URL:
https://aclanthology.org/2014.amta-researchers.26
Cite (ACL):
Saab Mansour, Yaser Al-Onaizan, Graeme Blackwood, and Christoph Tillmann. 2014. Automatic dialect classification for statistical machine translation. In Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Researchers Track, pages 342–355, Vancouver, Canada. Association for Machine Translation in the Americas.
Cite (Informal):
Automatic dialect classification for statistical machine translation (Mansour et al., AMTA 2014)
PDF:
https://preview.aclanthology.org/nschneid-patch-2/2014.amta-researchers.26.pdf