CONFLATOR: Incorporating Switching Point based Rotatory Positional Encodings for Code-Mixed Language Modeling

Mohsin Mohammed, Sai Kandukuri, Neeharika Gupta, Parth Patwa, Anubhab Chatterjee, Vinija Jain, Aman Chadha, Amitava Das


Abstract
While code-mixing is a common linguistic practice in many parts of the world, collecting high-quality and low-cost code-mixed data remains a challenge for natural language processing (NLP) research. The recent proliferation of Large Language Models (LLMs) compels one to ask: how capable are these systems of generating code-mixed data? In this paper, we explore prompting multilingual LLMs in a zero-shot manner to generate code-mixed data for seven languages in South East Asia (SEA), namely Indonesian, Malay, Chinese, Tagalog, Vietnamese, Tamil, and Singlish. We find that publicly available multilingual instruction-tuned models such as BLOOMZ and Flan-T5-XXL are incapable of producing texts with phrases or clauses from different languages. ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, with performance that varies depending on the prompt template and language pairing. For instance, ChatGPT generates fluent and natural Singlish texts (an English-based creole spoken in Singapore), but for the English-Tamil language pair it mostly produces grammatically incorrect or semantically meaningless utterances. Furthermore, it may erroneously introduce languages not specified in the prompt. Based on our investigation, existing multilingual LLMs exhibit a wide range of proficiency in code-mixed data generation for SEA languages. As such, we advise against using LLMs in this context without extensive human checks.
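To make the setup concrete, the sketch below illustrates the kind of zero-shot prompting the abstract describes, here via the Hugging Face transformers pipeline. The prompt wording and the small Flan-T5 checkpoint are illustrative assumptions rather than the authors' exact configuration; the paper evaluates larger models such as BLOOMZ, Flan-T5-XXL, and ChatGPT.

# Minimal sketch (assumed setup, not the authors' exact pipeline):
# zero-shot prompting a multilingual instruction-tuned model to
# generate code-mixed text. The checkpoint and the prompt template
# below are illustrative placeholders.
from transformers import pipeline

generator = pipeline("text2text-generation", model="google/flan-t5-small")

# Hypothetical zero-shot prompt asking for a code-mixed (Singlish) utterance.
prompt = (
    "Write a short, natural-sounding Singlish sentence "
    "about ordering food at a hawker centre."
)

result = generator(prompt, max_new_tokens=64)
print(result[0]["generated_text"])

As the abstract cautions, outputs from such prompts still require extensive human checks before being used as code-mixed data.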
Anthology ID: 2023.calcs-1.6
Volume: Proceedings of the 6th Workshop on Computational Approaches to Linguistic Code-Switching
Month: December
Year: 2023
Address: Singapore
Editors: Genta Winata, Sudipta Kar, Marina Zhukova, Thamar Solorio, Mona Diab, Sunayana Sitaram, Monojit Choudhury, Kalika Bali
Venues: CALCS | WS
Publisher: Association for Computational Linguistics
Pages: 64–73
URL: https://aclanthology.org/2023.calcs-1.6
DOI: 10.18653/v1/2023.calcs-1.6
Cite (ACL): Mohsin Mohammed, Sai Kandukuri, Neeharika Gupta, Parth Patwa, Anubhab Chatterjee, Vinija Jain, Aman Chadha, and Amitava Das. 2023. CONFLATOR: Incorporating Switching Point based Rotatory Positional Encodings for Code-Mixed Language Modeling. In Proceedings of the 6th Workshop on Computational Approaches to Linguistic Code-Switching, pages 64–73, Singapore. Association for Computational Linguistics.
Cite (Informal): CONFLATOR: Incorporating Switching Point based Rotatory Positional Encodings for Code-Mixed Language Modeling (Mohammed et al., CALCS-WS 2023)
PDF: https://preview.aclanthology.org/ingest-acl-2023-videos/2023.calcs-1.6.pdf