Translate and Classify: Improving Sequence Level Classification for English-Hindi Code-Mixed Data

Devansh Gautam; Kshitij Gupta; Manish Shrivastava

doi:10.18653/v1/2021.calcs-1.3

Abstract

Code-mixing is a common phenomenon in multilingual societies around the world and is especially common in social media texts. Traditional NLP systems, usually trained on monolingual corpora, do not perform well on code-mixed texts. Training specialized models for code-switched texts is difficult due to the lack of large-scale datasets. Translating code-mixed data into standard languages like English could improve performance on various code-mixed tasks, since we can use transfer learning from state-of-the-art English models for processing the translated data. This paper focuses on two sequence-level classification tasks for English-Hindi code-mixed texts, both part of the GLUECoS benchmark: Natural Language Inference and Sentiment Analysis. We propose using various pre-trained models that have been fine-tuned for similar English-only tasks and have shown state-of-the-art performance. We further fine-tune these models on the translated code-mixed datasets and achieve state-of-the-art performance in both tasks. To translate English-Hindi code-mixed data to English, we use mBART, a pre-trained multilingual sequence-to-sequence model that has shown competitive performance on various low-resource machine translation pairs and has also shown performance gains in languages that were not in its pre-training corpus.
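
The two-stage idea described in the abstract (translate the code-mixed input to English, then hand the translation to an English classifier) can be illustrated with a minimal sketch. This is not the authors' implementation: the class `TranslateAndClassify`, the toy lexicon, and the keyword-based sentiment rule are all hypothetical stand-ins. In the paper the translator is a fine-tuned mBART model and the classifiers are fine-tuned English models; here both are plain functions so the example runs without any model downloads. Only the single-sentence sentiment case is shown; Natural Language Inference would take premise-hypothesis pairs instead.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type aliases: any function with these shapes can be plugged in.
Translator = Callable[[str], str]   # code-mixed text -> English text
Classifier = Callable[[str], str]   # English text -> label


@dataclass
class TranslateAndClassify:
    """Two-stage pipeline: translate first, then classify the translation."""
    translate: Translator
    classify: Classifier

    def predict(self, texts: List[str]) -> List[str]:
        english = [self.translate(t) for t in texts]
        return [self.classify(e) for e in english]


# Toy stand-in for the translation model (assumption: a tiny word lexicon,
# not a real machine translation system).
TOY_LEXICON = {"bahut": "very", "accha": "good", "bura": "bad", "hai": "is"}


def toy_translate(text: str) -> str:
    # Replace known romanized Hindi words, leave everything else untouched.
    return " ".join(TOY_LEXICON.get(w.lower(), w) for w in text.split())


# Toy stand-in for the English sentiment classifier (assumption: keyword rule).
def toy_sentiment(text: str) -> str:
    words = set(text.lower().split())
    if "good" in words:
        return "positive"
    if "bad" in words:
        return "negative"
    return "neutral"


pipeline = TranslateAndClassify(translate=toy_translate, classify=toy_sentiment)
print(pipeline.predict(["movie bahut accha hai", "service bura hai"]))
```

Keeping the translator and classifier as interchangeable callables mirrors the separation in the abstract: either stage could be swapped for a stronger model without touching the other.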

Anthology ID: 2021.calcs-1.3
Volume: Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching
Month: June
Year: 2021
Address: Online
Editors: Thamar Solorio, Shuguang Chen, Alan W. Black, Mona Diab, Sunayana Sitaram, Victor Soto, Emre Yilmaz, Anirudh Srinivasan
Venue: CALCS
Publisher: Association for Computational Linguistics
Pages: 15–25
URL: https://aclanthology.org/2021.calcs-1.3
DOI: 10.18653/v1/2021.calcs-1.3
Cite (ACL): Devansh Gautam, Kshitij Gupta, and Manish Shrivastava. 2021. Translate and Classify: Improving Sequence Level Classification for English-Hindi Code-Mixed Data. In Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching, pages 15–25, Online. Association for Computational Linguistics.
Cite (Informal): Translate and Classify: Improving Sequence Level Classification for English-Hindi Code-Mixed Data (Gautam et al., CALCS 2021)
PDF: https://preview.aclanthology.org/naacl-24-ws-corrections/2021.calcs-1.3.pdf
Code: devanshg27/cm_translatify
Data: MultiNLI, SNLI
