Evaluating Word Embeddings for Indonesian–English Code-Mixed Text Based on Synthetic Data

Arra’Di Nur Rizal; Sara Stymne

Abstract

Code-mixed texts are abundant, especially in social media, and pose a problem for NLP tools, which are typically trained on monolingual corpora. In this paper, we explore and evaluate different types of word embeddings for Indonesian–English code-mixed text. We propose the use of code-mixed embeddings, i.e. embeddings trained on code-mixed text. Because large corpora of code-mixed text are required to train embeddings, we describe a method for synthesizing a code-mixed corpus, grounded in literature and a survey. Using sentiment analysis as a case study, we show that code-mixed embeddings trained on synthesized data are at least as good as cross-lingual embeddings and better than monolingual embeddings.

Anthology ID: 2020.calcs-1.4
Volume: Proceedings of the The 4th Workshop on Computational Approaches to Code Switching
Month: May
Year: 2020
Address: Marseille, France
Venue: CALCS
Publisher: European Language Resources Association
Pages: 26–35
Language: English
URL: https://aclanthology.org/2020.calcs-1.4
Cite (ACL): Arra’Di Nur Rizal and Sara Stymne. 2020. Evaluating Word Embeddings for Indonesian–English Code-Mixed Text Based on Synthetic Data. In Proceedings of the The 4th Workshop on Computational Approaches to Code Switching, pages 26–35, Marseille, France. European Language Resources Association.
Cite (Informal): Evaluating Word Embeddings for Indonesian–English Code-Mixed Text Based on Synthetic Data (Rizal & Stymne, CALCS 2020)
PDF: https://preview.aclanthology.org/auto-file-uploads/2020.calcs-1.4.pdf