The Impact of Code-switched Synthetic Data Quality is Task Dependent: Insights from MT and ASR

Injy Hamed, Thang Vu, Nizar Habash


Abstract
Code-switching, the act of alternating between languages, has emerged as a prevalent global phenomenon that must be addressed to build user-friendly language technologies. A main bottleneck in this pursuit is data scarcity, which motivates research on code-switched data augmentation. However, the current literature lacks comprehensive studies that clarify the relation between the quality of synthetic data and improvements on NLP tasks. We extend previous research conducted in this direction on machine translation (MT) with results on automatic speech recognition (ASR) and cascaded speech translation (ST) to test the generalizability of the findings. Our experiments involve a wide range of augmentation techniques, covering lexical replacements, linguistic theories, and back-translation. Based on the results for MT, ASR, and ST, we draw conclusions and insights regarding the efficacy of the various augmentation techniques and the impact of synthetic data quality on performance.
Anthology ID:
2025.calcs-1.2
Volume:
Proceedings of the 7th Workshop on Computational Approaches to Linguistic Code-Switching
Month:
May
Year:
2025
Address:
Albuquerque, New Mexico, USA
Editors:
Genta Indra Winata, Sudipta Kar, Marina Zhukova, Thamar Solorio, Xi Ai, Injy Hamed, Mahardika Krisna Ihsani, Derry Tanti Wijaya, Garry Kuwanto
Venues:
CALCS | WS
Publisher:
Association for Computational Linguistics
Pages:
6–17
URL:
https://preview.aclanthology.org/Ingest-2025-COMPUTEL/2025.calcs-1.2/
Cite (ACL):
Injy Hamed, Thang Vu, and Nizar Habash. 2025. The Impact of Code-switched Synthetic Data Quality is Task Dependent: Insights from MT and ASR. In Proceedings of the 7th Workshop on Computational Approaches to Linguistic Code-Switching, pages 6–17, Albuquerque, New Mexico, USA. Association for Computational Linguistics.
Cite (Informal):
The Impact of Code-switched Synthetic Data Quality is Task Dependent: Insights from MT and ASR (Hamed et al., CALCS 2025)
PDF:
https://preview.aclanthology.org/Ingest-2025-COMPUTEL/2025.calcs-1.2.pdf