Abstract
Code-mixing, the act of mixing two or more languages, is a common communicative phenomenon in multilingual societies. The lack of high-quality code-mixed data is a bottleneck for NLP systems, whereas monolingual systems perform well because ample high-quality data is available. To bridge this gap, translating monolingual sentences into coherent code-mixed counterparts can improve accuracy on downstream NLP tasks in code-mixed settings. In this paper, we propose a neural machine translation approach that generates high-quality code-mixed sentences by leveraging human judgements. We train filters based on human judgements to identify natural code-mixed sentences in a larger synthetically generated code-mixed corpus, resulting in a three-way silver parallel corpus of monolingual English, a monolingual Indian language, and English code-mixed with that Indian language. Using these corpora, we fine-tune multilingual encoder-decoder models, namely mT5 and mBART, for the translation task. Our results indicate that training on filtered data outperforms existing systems for code-mixed generation in Hindi-English. Beyond Hindi-English, the approach also performs well when applied to Telugu, a low-resource language, to generate Telugu-English code-mixed sentences.
- Anthology ID:
- 2023.conll-1.15
- Volume:
- Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL)
- Month:
- December
- Year:
- 2023
- Address:
- Singapore
- Editors:
- Jing Jiang, David Reitter, Shumin Deng
- Venue:
- CoNLL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 211–220
- URL:
- https://preview.aclanthology.org/add_missing_videos/2023.conll-1.15/
- DOI:
- 10.18653/v1/2023.conll-1.15
- Cite (ACL):
- Dama Sravani and Radhika Mamidi. 2023. Enhancing Code-mixed Text Generation Using Synthetic Data Filtering in Neural Machine Translation. In Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL), pages 211–220, Singapore. Association for Computational Linguistics.
- Cite (Informal):
- Enhancing Code-mixed Text Generation Using Synthetic Data Filtering in Neural Machine Translation (Sravani & Mamidi, CoNLL 2023)
- PDF:
- https://preview.aclanthology.org/add_missing_videos/2023.conll-1.15.pdf