EduCSW: Building a Mandarin-English Code-Switched Generation Pipeline for Computer Science Learning

Ruishi Chen, Yiling Zhao


Abstract
This paper presents EduCSW, a novel pipeline for generating Mandarin-English code-switched text to support AI-powered educational tools that adapt computer science instruction to learners’ language proficiency through mixed-language delivery. To address the scarcity of code-mixed datasets, we propose an encoder-decoder architecture that generates natural code-switched text from only a small number of existing code-mixed examples together with parallel corpora. In an evaluation on a corpus curated for computer science education, human annotators rated 60–64% of our model’s outputs as natural, significantly more than for either a baseline fine-tuned neural machine translation (NMT) model (22–24%) or the DeepSeek-R1 model (34–44%). The generated text achieves a Code-Mixing Index (CMI) of 25.28%, aligning with patterns observed in spontaneous Mandarin-English code-switching. Designed to be generalizable across language pairs and domains, this pipeline lays the groundwork for generating training data to support the development of educational tools with dynamic code-switching capabilities.
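For readers unfamiliar with the Code-Mixing Index, the sketch below shows one common way such a score can be computed for a single utterance. It assumes the utterance-level definition CMI = 100 × (1 − max(w_i) / (n − u)), where n is the number of tokens, u the number of language-independent tokens, and max(w_i) the token count of the dominant language; the abstract does not state which CMI variant or tokenization the authors used. The script-based tagger and the names `tag` and `cmi` are hypothetical simplifications for illustration, not part of EduCSW.

```python
import re

# Assumption: each CJK character counts as one Mandarin token and each run of
# Latin letters as one English token. A real pipeline would more likely use a
# word segmenter and a language identifier, which would change the score.
TOKEN_RE = re.compile(r"[\u4e00-\u9fff]|[A-Za-z]+|[^\s]")


def tag(token):
    """Return 'zh', 'en', or 'other' based only on the token's script."""
    if re.fullmatch(r"[\u4e00-\u9fff]", token):
        return "zh"
    if re.fullmatch(r"[A-Za-z]+", token):
        return "en"
    return "other"  # digits, punctuation, symbols: language-independent


def cmi(text):
    """Utterance-level Code-Mixing Index as a percentage (0 = monolingual)."""
    tokens = TOKEN_RE.findall(text)
    n = len(tokens)
    counts = {"zh": 0, "en": 0}
    u = 0  # number of language-independent tokens
    for token in tokens:
        lang = tag(token)
        if lang == "other":
            u += 1
        else:
            counts[lang] += 1
    if n == u:
        # No language-tagged tokens at all, so there is nothing to mix.
        return 0.0
    return 100.0 * (1 - max(counts.values()) / (n - u))


if __name__ == "__main__":
    # Six Mandarin characters and two English words: 100 * (1 - 6/8)
    print(cmi("我们用 stack 来实现 recursion"))
```

Under these assumptions a monolingual sentence scores 0 and an evenly balanced two-language sentence scores 50, so a corpus-level figure such as the reported 25.28% would correspond to text in which one language clearly dominates but the other appears regularly.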
Anthology ID:
2025.bea-1.68
Volume:
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Ekaterina Kochmar, Bashar Alhafni, Marie Bexte, Jill Burstein, Andrea Horbach, Ronja Laarmann-Quante, Anaïs Tack, Victoria Yaneva, Zheng Yuan
Venues:
BEA | WS
Publisher:
Association for Computational Linguistics
Pages:
908–919
URL:
https://preview.aclanthology.org/landing_page/2025.bea-1.68/
Cite (ACL):
Ruishi Chen and Yiling Zhao. 2025. EduCSW: Building a Mandarin-English Code-Switched Generation Pipeline for Computer Science Learning. In Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025), pages 908–919, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
EduCSW: Building a Mandarin-English Code-Switched Generation Pipeline for Computer Science Learning (Chen & Zhao, BEA 2025)
PDF:
https://preview.aclanthology.org/landing_page/2025.bea-1.68.pdf