Abstract
Code-mixing is a common phenomenon in multilingual societies around the world and is especially prevalent in social media texts. Traditional NLP systems, usually trained on monolingual corpora, do not perform well on code-mixed texts. Training specialized models for code-mixed texts is difficult due to the lack of large-scale datasets. Translating code-mixed data into standard languages like English could improve performance on various code-mixed tasks, since we can use transfer learning from state-of-the-art English models to process the translated data. This paper focuses on two sequence-level classification tasks for English-Hindi code-mixed texts, which are part of the GLUECoS benchmark: Natural Language Inference and Sentiment Analysis. We propose using various pre-trained models that have been fine-tuned for similar English-only tasks and have shown state-of-the-art performance. We further fine-tune these models on the translated code-mixed datasets and achieve state-of-the-art performance in both tasks. To translate English-Hindi code-mixed data to English, we use mBART, a pre-trained multilingual sequence-to-sequence model that has shown competitive performance on various low-resource machine translation pairs and has also shown performance gains in languages that were not in its pre-training corpus.
- Anthology ID:
- 2021.calcs-1.3
- Volume:
- Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching
- Month:
- June
- Year:
- 2021
- Address:
- Online
- Editors:
- Thamar Solorio, Shuguang Chen, Alan W. Black, Mona Diab, Sunayana Sitaram, Victor Soto, Emre Yilmaz, Anirudh Srinivasan
- Venue:
- CALCS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 15–25
- URL:
- https://aclanthology.org/2021.calcs-1.3
- DOI:
- 10.18653/v1/2021.calcs-1.3
- Cite (ACL):
- Devansh Gautam, Kshitij Gupta, and Manish Shrivastava. 2021. Translate and Classify: Improving Sequence Level Classification for English-Hindi Code-Mixed Data. In Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching, pages 15–25, Online. Association for Computational Linguistics.
- Cite (Informal):
- Translate and Classify: Improving Sequence Level Classification for English-Hindi Code-Mixed Data (Gautam et al., CALCS 2021)
- PDF:
- https://preview.aclanthology.org/naacl-24-ws-corrections/2021.calcs-1.3.pdf
- Code:
- devanshg27/cm_translatify
- Data:
- MultiNLI, SNLI
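
The two-stage approach described in the abstract (translate the code-mixed text to English, then apply an English classifier) can be sketched as follows. This is a minimal illustration of the control flow only, not the authors' implementation: the function `translate_and_classify`, the toy lexicon, and both toy stand-ins are hypothetical names introduced here. In a real system, the `translate` argument would wrap a fine-tuned mBART model and the `classify` argument would wrap a fine-tuned English classifier.

```python
from typing import Callable, List, Tuple

# Type aliases for the two pluggable stages (hypothetical names).
Translator = Callable[[str], str]   # code-mixed text -> English text
Classifier = Callable[[str], str]   # English text -> label


def translate_and_classify(
    texts: List[str], translate: Translator, classify: Classifier
) -> List[Tuple[str, str]]:
    """Translate each code-mixed text to English, then classify the translation."""
    results = []
    for text in texts:
        english = translate(text)    # stage 1: translation
        label = classify(english)    # stage 2: English-only classification
        results.append((english, label))
    return results


# Toy stand-ins so the sketch runs without any model downloads.
# ASSUMPTION: a word-by-word lexicon is NOT how mBART translates; it only
# mimics the interface of a translation model.
TOY_LEXICON = {"bahut": "very", "accha": "good", "bura": "bad", "hai": "is"}


def toy_translate(text: str) -> str:
    """Replace known romanized Hindi tokens with English glosses."""
    return " ".join(TOY_LEXICON.get(tok.lower(), tok) for tok in text.split())


def toy_sentiment(english_text: str) -> str:
    """Keyword-based stand-in for a fine-tuned English sentiment model."""
    tokens = english_text.lower().split()
    if "good" in tokens:
        return "positive"
    if "bad" in tokens:
        return "negative"
    return "neutral"


predictions = translate_and_classify(
    ["movie bahut accha hai", "service bura hai"], toy_translate, toy_sentiment
)
for english, label in predictions:
    print(f"{english!r} -> {label}")
```

Passing the two stages in as callables keeps the pipeline independent of any particular model, so the translator or the classifier can be swapped without touching the control flow.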