Nur Insyirah Binte Imam Mujtahid
2026
Disentangling Codemixing in Chats: The NUS ABC Codemixed Corpus
Svetlana Churina | Akshat Gupta | Nur Insyirah Binte Imam Mujtahid | Kokil Jaidka
Findings of the Association for Computational Linguistics: ACL 2026
Code-mixing involves the seamless integration of linguistic elements from multiple languages within a single discourse, reflecting natural multilingual communication patterns. Despite its prominence in informal interactions such as social media, chat messages, and instant-messaging exchanges, there has been a lack of publicly available corpora that are author-labeled and suitable for modeling human conversations and relationships. This study introduces the first labeled, general-purpose corpus for understanding code-mixing in context while maintaining rigorous privacy and ethical standards. It includes over 355,641 messages spanning various code-mixing patterns, with a primary focus on English, Mandarin, and other languages. We expect the Codemix Corpus to serve as a foundational dataset for research in computational linguistics, sociolinguistics, and NLP applications.