Abstract
The democratization of e-commerce platforms has moved an increasingly diversified Indian user base to shop online. In this work, we have deployed reliable and precise large-scale Machine Translation systems for several Indian regional languages. Building such systems is a challenge because of the low-resource nature of the Indian languages. We develop a structured model development pipeline as a closed feedback loop with external manual feedback through an Active Learning component. We show strong synthetic parallel data generation capability and consistent improvements to the model over iterations. Starting with 1.2M parallel pairs for English-Hindi, we have compiled a corpus of 400M+ high-quality synthetic parallel pairs across different domains. Further, we need colloquial translations to preserve the intent and friendliness of English content in regional languages, and to make it easier for our users to understand. We perform robust and effective domain adaptation steps to achieve such colloquial translations. Over iterations, we show an improvement of 9.02 BLEU points for the English-to-Hindi translation model. Along with Hindi, we show that the overall approach and best practices extend well to other Indian languages, resulting in deployment of our models across 7 Indian languages.
- Anthology ID:
- 2022.emnlp-industry.64
- Volume:
- Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track
- Month:
- December
- Year:
- 2022
- Address:
- Abu Dhabi, UAE
- Editors:
- Yunyao Li, Angeliki Lazaridou
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 627–634
- URL:
- https://aclanthology.org/2022.emnlp-industry.64
- DOI:
- 10.18653/v1/2022.emnlp-industry.64
- Cite (ACL):
- Amey Patil and Nikesh Garera. 2022. Large-scale Machine Translation for Indian Languages in E-commerce under Low Resource Constraints. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 627–634, Abu Dhabi, UAE. Association for Computational Linguistics.
- Cite (Informal):
- Large-scale Machine Translation for Indian Languages in E-commerce under Low Resource Constraints (Patil & Garera, EMNLP 2022)
- PDF:
- https://preview.aclanthology.org/ingest-2024-clasp/2022.emnlp-industry.64.pdf