Training Deployable General Domain MT for a Low Resource Language Pair: English-Bangla

Sandipan Dandapat, William Lewis


Abstract
Upwards of 1.5 billion people, a large percentage of the world’s population, speak a language of the Indian subcontinent. We call these Indic languages here; they comprise languages from both the Indo-European (e.g., Hindi, Bangla, Gujarati) and Dravidian (e.g., Tamil, Telugu, Malayalam) families. A universal characteristic of Indic languages is their complex morphology, which, combined with the general lack of sufficient quantities of high-quality parallel data, can make developing machine translation (MT) for these languages difficult. In this paper, we describe our efforts towards developing general-domain English–Bangla MT systems that are deployable to the Web. We initially developed and deployed SMT-based systems, but over time migrated to NMT-based systems. Our initial SMT-based systems had reasonably good BLEU scores; however, with NMT systems we gained significant improvements over the SMT baselines. We achieved this using a number of ideas to boost the data store and counter data sparsity: crowd translation of intelligently selected monolingual data (with throughput enhanced by an IME (Input Method Editor) designed specifically for QWERTY keyboard entry for Devanagari-scripted languages), back-translation, different regularization techniques, dataset augmentation, and early stopping.
Anthology ID:
2018.eamt-main.11
Volume:
Proceedings of the 21st Annual Conference of the European Association for Machine Translation
Month:
May
Year:
2018
Address:
Alicante, Spain
Editors:
Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez, Miquel Esplà-Gomis, Maja Popović, Celia Rico, André Martins, Joachim Van den Bogaert, Mikel L. Forcada
Venue:
EAMT
Pages:
129–138
URL:
https://aclanthology.org/2018.eamt-main.11
Cite (ACL):
Sandipan Dandapat and William Lewis. 2018. Training Deployable General Domain MT for a Low Resource Language Pair: English-Bangla. In Proceedings of the 21st Annual Conference of the European Association for Machine Translation, pages 129–138, Alicante, Spain.
Cite (Informal):
Training Deployable General Domain MT for a Low Resource Language Pair: English-Bangla (Dandapat & Lewis, EAMT 2018)
PDF:
https://preview.aclanthology.org/nschneid-patch-2/2018.eamt-main.11.pdf