Large-scale Machine Translation for Indian Languages in E-commerce under Low Resource Constraints

Amey Patil, Nikesh Garera


Abstract
The democratization of e-commerce platforms has moved an increasingly diversified Indian user base to shop online. In this work, we have deployed reliable and precise large-scale Machine Translation systems for several Indian regional languages. Building such systems is a challenge because of the low-resource nature of Indian languages. We develop a structured model development pipeline as a closed feedback loop with external manual feedback through an Active Learning component. We show strong synthetic parallel data generation capability and consistent improvements to the model over iterations. Starting with 1.2M parallel pairs for English-Hindi, we have compiled a corpus of 400M+ high-quality synthetic parallel pairs across different domains. Further, we need colloquial translations to preserve the intent and friendliness of English content in regional languages and to make it easier for our users to understand. We perform robust and effective domain adaptation steps to achieve such colloquial translations. Over iterations, we show a 9.02 BLEU point improvement for the English-to-Hindi translation model. Along with Hindi, we show that the overall approach and best practices extend well to other Indian languages, resulting in the deployment of our models across 7 Indian languages.
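The closed feedback loop described in the abstract can be pictured with a minimal sketch of one iteration. This is an illustration under stated assumptions, not the authors' implementation: the names `Scored`, `split_for_review`, and `run_iteration`, the `translate`, `score`, and `annotate` callables, and the `keep_threshold` and `review_budget` parameters are all hypothetical. The sketch assumes that synthetic pairs are produced by translating monolingual text with the current model, that each pair receives a quality score in [0, 1], and that the lowest-scoring pairs are the ones routed to manual feedback.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


@dataclass
class Scored:
    pair: Pair
    score: float  # hypothetical quality score in [0, 1]


def split_for_review(
    scored: List[Scored], keep_threshold: float, review_budget: int
) -> Tuple[List[Pair], List[Pair]]:
    """Keep confident synthetic pairs; pick the least confident rest for review."""
    kept = [s.pair for s in scored if s.score >= keep_threshold]
    uncertain = sorted(
        (s for s in scored if s.score < keep_threshold), key=lambda s: s.score
    )
    to_review = [s.pair for s in uncertain[:review_budget]]
    return kept, to_review


def run_iteration(
    corpus: List[Pair],
    monolingual: List[str],
    translate: Callable[[str], str],      # assumed: current model's translate function
    score: Callable[[str, str], float],   # assumed: quality estimator for a pair
    annotate: Callable[[str, str], str],  # assumed: manual correction of a translation
    keep_threshold: float = 0.8,
    review_budget: int = 2,
) -> List[Pair]:
    """Return the training corpus for the next iteration of the loop."""
    scored = []
    for src in monolingual:
        tgt = translate(src)
        scored.append(Scored((src, tgt), score(src, tgt)))
    kept, to_review = split_for_review(scored, keep_threshold, review_budget)
    # Manual feedback replaces the model output for the reviewed pairs.
    corrected = [(src, annotate(src, tgt)) for src, tgt in to_review]
    return corpus + kept + corrected
```

In this sketch the model would be retrained on the returned corpus before the next iteration, so that later rounds of synthetic data come from a stronger model. Pairs that fall below the threshold and outside the review budget are simply dropped.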
Anthology ID:
2022.emnlp-industry.64
Volume:
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track
Month:
December
Year:
2022
Address:
Abu Dhabi, UAE
Editors:
Yunyao Li, Angeliki Lazaridou
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
627–634
URL:
https://aclanthology.org/2022.emnlp-industry.64
DOI:
10.18653/v1/2022.emnlp-industry.64
Cite (ACL):
Amey Patil and Nikesh Garera. 2022. Large-scale Machine Translation for Indian Languages in E-commerce under Low Resource Constraints. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 627–634, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Large-scale Machine Translation for Indian Languages in E-commerce under Low Resource Constraints (Patil & Garera, EMNLP 2022)
PDF:
https://preview.aclanthology.org/ingest-2024-clasp/2022.emnlp-industry.64.pdf