Low Resource Chat Translation: A Benchmark for Hindi–English Language Pair

Baban Gain; Ramakrishna Appicharla; Soumya Chennabasavaraj; Nikesh Garera; Asif Ekbal; Muthusamy Chelliah

Abstract

Chatbots and conversational systems are used in sectors such as banking, healthcare, e-commerce, and customer support. These chatbots are mainly available for resource-rich languages like English, which limits their use among multilingual users. Making these services or agents available in non-English languages has therefore become essential for their broader applicability. Machine Translation (MT) could be an effective way to develop multilingual chatbots. Further, to help users be confident about a product, feedback and recommendations from the end-user community are essential. However, these question-answer (QnA) pairs can be in a language different from the user's. MT systems can reduce these issues to a large extent. In this paper, we provide a benchmark setup for Chat and QnA translation for English-Hindi, a relatively low-resource language pair. We first create an English-Hindi parallel corpus comprising synthetic and gold-standard parallel sentences. Thereafter, we develop several sentence-level and context-level neural machine translation (NMT) models and measure their effectiveness on the newly created datasets. We achieve BLEU scores of 58.7 and 62.6 on the English-Hindi and Hindi-English subsets, respectively, of the gold-standard version of the WMT20 Chat dataset. Further, we achieve BLEU scores of 52.9 and 76.9 on the gold-standard Multi-modal Dialogue Dataset (MMD) English-Hindi and Hindi-English datasets, respectively. For QnA, we achieve a BLEU score of 49.9, with scores of 50.3 and 50.4 on the question and answer subsets, respectively. We also perform a thorough qualitative analysis of the outputs by real users.

Anthology ID: 2022.amta-research.7
Volume: Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)
Month: September
Year: 2022
Address: Orlando, USA
Editors: Kevin Duh, Francisco Guzmán
Venue: AMTA
Publisher: Association for Machine Translation in the Americas
Pages: 83–96
URL: https://aclanthology.org/2022.amta-research.7
Cite (ACL): Baban Gain, Ramakrishna Appicharla, Soumya Chennabasavaraj, Nikesh Garera, Asif Ekbal, and Muthusamy Chelliah. 2022. Low Resource Chat Translation: A Benchmark for Hindi–English Language Pair. In Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), pages 83–96, Orlando, USA. Association for Machine Translation in the Americas.
Cite (Informal): Low Resource Chat Translation: A Benchmark for Hindi–English Language Pair (Gain et al., AMTA 2022)
PDF: https://preview.aclanthology.org/nschneid-patch-1/2022.amta-research.7.pdf
Code: babangain/en_hi_chat_qna_translation
Data: BMELD, MMD, Samanantar, Taskmaster-1