Lost in Translation No More: Fine-tuned transformer-based models for CodeMix to English Machine Translation

Arindam Chatterjee, Chhavi Sharma, Yashwanth V.p., Niraj Kumar, Ayush Raj, Asif Ekbal


Abstract
Codemixing, the linguistic phenomenon where a speaker alternates between two or more languages within a conversation or even a single utterance, presents a significant challenge for machine translation systems due to its syntactic complexity and contextual nuances. This paper introduces a set of advanced transformer-based models fine-tuned specifically for translating codemixed text, in particular Hindi-English (colloquially referred to as Hinglish) codemixed text, into English. Unlike standard bilingual corpora, codemixed data requires an understanding of the intricacies of grammatical structures and cultural contexts embedded within the language blend. Existing machine translation efforts in codemixed languages have largely been constrained by the paucity of robust datasets and models that can capture the nuanced semantic and syntactic interplay characteristic of such languages. We present a novel dataset, PACMAN trans, for Hinglish to English machine translation, based on the PACMAN strategy and meticulously curated to represent natural codemixing patterns. Our generic fine-tuned translation models trained on the novel data outperform current state-of-the-art Large Language Models (LLMs) by 38% in terms of BLEU score. Further, when fine-tuned on custom benchmark datasets, our focused dual fine-tuned models surpass the PHINC dataset BLEU score benchmark by 22%. Our comparative analysis illustrates significant improvements in translation quality, showcasing the potential of fine-tuning transformer models in bridging the linguistic divide in codemixed language translation. The success of our models reflects a promising step forward in the quest to provide seamless translation services for the ever-growing multilingual population and the complex linguistic phenomena they generate.
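To make the fine-tuning setup described above concrete, the following is a minimal sketch of fine-tuning a pretrained sequence-to-sequence transformer on Hinglish-to-English pairs and scoring with corpus-level BLEU, using the Hugging Face transformers/datasets libraries and sacrebleu. The checkpoint name and the toy in-memory pairs are illustrative placeholders, not the paper's actual PACMAN trans data or model configuration.

```python
# Hedged sketch: fine-tune a generic seq2seq checkpoint on Hinglish -> English
# pairs, then compute corpus BLEU. Checkpoint and data are placeholders.
import sacrebleu
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

# Placeholder checkpoint; any encoder-decoder MT checkpoint could stand in here.
checkpoint = "Helsinki-NLP/opus-mt-hi-en"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Toy parallel pairs standing in for a Hinglish-English corpus.
pairs = {
    "hinglish": ["mujhe yeh movie bahut pasand aayi"],
    "english": ["I liked this movie a lot"],
}
raw = Dataset.from_dict(pairs)

def preprocess(batch):
    # text_target tokenizes the English references as decoder labels.
    return tokenizer(batch["hinglish"], text_target=batch["english"],
                     truncation=True, max_length=128)

tokenized = raw.map(preprocess, batched=True, remove_columns=raw.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="hinglish-en-mt",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    predict_with_generate=True,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()

# Corpus-level BLEU: one hypothesis stream vs. one reference stream.
hyps = ["I liked this movie a lot"]
refs = [["I liked this movie a lot"]]
print(sacrebleu.corpus_bleu(hyps, refs).score)
```

The same BLEU computation pattern applies when comparing model outputs against the PHINC references mentioned in the abstract.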
Anthology ID:
2023.icon-1.25
Volume:
Proceedings of the 20th International Conference on Natural Language Processing (ICON)
Month:
December
Year:
2023
Address:
Goa University, Goa, India
Editors:
Jyoti D. Pawar, Sobha Lalitha Devi
Venue:
ICON
SIG:
SIGLEX
Publisher:
NLP Association of India (NLPAI)
Pages:
326–335
URL:
https://aclanthology.org/2023.icon-1.25
Cite (ACL):
Arindam Chatterjee, Chhavi Sharma, Yashwanth V.p., Niraj Kumar, Ayush Raj, and Asif Ekbal. 2023. Lost in Translation No More: Fine-tuned transformer-based models for CodeMix to English Machine Translation. In Proceedings of the 20th International Conference on Natural Language Processing (ICON), pages 326–335, Goa University, Goa, India. NLP Association of India (NLPAI).
Cite (Informal):
Lost in Translation No More: Fine-tuned transformer-based models for CodeMix to English Machine Translation (Chatterjee et al., ICON 2023)
PDF:
https://preview.aclanthology.org/nschneid-patch-5/2023.icon-1.25.pdf