Disentangling Codemixing in Chats: The NUS ABC Codemixed Corpus
Svetlana Churina, Akshat Gupta, Nur Insyirah Binte Imam Mujtahid, Kokil Jaidka
Abstract
Code-mixing involves the seamless integration of linguistic elements from multiple languages within a single discourse, reflecting natural multilingual communication patterns. Despite its prominence in informal interactions such as social media, chat messages and instant-messaging exchanges, there has been a lack of publicly available corpora that are author-labeled and suitable for modeling human conversations and relationships. This study introduces the first labeled and general-purpose corpus for understanding code-mixing in context while maintaining rigorous privacy and ethical standards. It includes over 355,641 messages spanning various code-mixing patterns, with a primary focus on English, Mandarin, and other languages. We expect the Codemix Corpus to serve as a foundational dataset for research in computational linguistics, sociolinguistics, and NLP applications.
- Anthology ID:
- 2026.findings-acl.80
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 1602–1624
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.80/
- Cite (ACL):
- Svetlana Churina, Akshat Gupta, Nur Insyirah Binte Imam Mujtahid, and Kokil Jaidka. 2026. Disentangling Codemixing in Chats: The NUS ABC Codemixed Corpus. In Findings of the Association for Computational Linguistics: ACL 2026, pages 1602–1624, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- Disentangling Codemixing in Chats: The NUS ABC Codemixed Corpus (Churina et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.80.pdf