Abstract
Code-mixing (CM), where speakers blend languages within a single expression, is prevalent in multilingual societies but poses challenges for natural language processing due to its complexity and limited data. We propose using a large language model to generate synthetic CM data, which is then used to enhance the performance of task-specific models for CM sentiment analysis. Our results show that in Spanish-English, synthetic data improved the F1 score by 9.32%, outperforming previous augmentation techniques. However, in Malayalam-English, synthetic data only helped when the baseline was low; with strong natural data, additional synthetic data offered little benefit. Human evaluation confirmed that this approach is a simple, cost-effective way to generate natural-sounding CM sentences, particularly beneficial for low baselines. Our findings suggest that few-shot prompting of large language models is a promising method for CM data augmentation and can significantly improve sentiment analysis, an important element in the development of social influence systems.
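As a rough illustration of the few-shot prompting approach described in the abstract, the sketch below shows how one might prompt an LLM to produce labeled synthetic Spanish-English code-mixed sentences for augmentation. The model name, prompt wording, and in-context examples here are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of few-shot prompting an LLM for synthetic code-mixed data.
# The model choice, prompt text, and example sentences are placeholders for
# illustration; the paper's actual configuration may differ.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FEW_SHOT_EXAMPLES = [
    ("positive", "Me encanta this new song, it's incredible!"),
    ("negative", "El servicio was so slow, nunca vuelvo a ese lugar."),
    ("neutral", "Voy a la store later to buy groceries."),
]


def build_prompt(label: str, n: int = 5) -> str:
    """Assemble a few-shot prompt asking for n new sentences with the given label."""
    examples = "\n".join(
        f"Sentiment: {lab}\nSentence: {sent}" for lab, sent in FEW_SHOT_EXAMPLES
    )
    return (
        "You write natural-sounding Spanish-English code-mixed sentences.\n"
        f"Here are some examples:\n{examples}\n\n"
        f"Write {n} new code-mixed sentences with {label} sentiment, one per line."
    )


def generate_synthetic(label: str, n: int = 5) -> list[str]:
    """Call the LLM and return the generated sentences as a list of strings."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": build_prompt(label, n)}],
        temperature=0.9,  # higher temperature for more varied sentences
    )
    text = response.choices[0].message.content
    return [line.strip() for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    for sentence in generate_synthetic("positive"):
        print(sentence)
```

The generated sentences, paired with their target sentiment labels, can then be mixed into the natural training data used to fine-tune a task-specific sentiment classifier.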
- Anthology ID: 2024.sicon-1.6
- Volume: Proceedings of the Second Workshop on Social Influence in Conversations (SICon 2024)
- Month: November
- Year: 2024
- Address: Miami, Florida, USA
- Editors: James Hale, Kushal Chawla, Muskan Garg
- Venue: SICon
- Publisher: Association for Computational Linguistics
- Pages: 85–101
- URL: https://aclanthology.org/2024.sicon-1.6
- DOI: 10.18653/v1/2024.sicon-1.6
- Cite (ACL): Linda Zeng. 2024. Leveraging Large Language Models for Code-Mixed Data Augmentation in Sentiment Analysis. In Proceedings of the Second Workshop on Social Influence in Conversations (SICon 2024), pages 85–101, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal): Leveraging Large Language Models for Code-Mixed Data Augmentation in Sentiment Analysis (Zeng, SICon 2024)
- PDF: https://preview.aclanthology.org/dois-2013-emnlp/2024.sicon-1.6.pdf