Abstract
Deploying recent natural language processing innovations to low-resource settings allows for state-of-the-art research findings and applications to be accessed across cultural and linguistic borders. One low-resource setting of increasing interest is code-switching, the phenomenon of combining, swapping, or alternating the use of two or more languages in continuous dialogue. In this paper, we introduce a large dataset (20k+ instances) to facilitate investigation of Tagalog-English code-switching, which has become a popular mode of discourse in Philippine culture. Tagalog is an Austronesian language and former official language of the Philippines spoken by over 23 million people worldwide, but it and Tagalog-English are under-represented in NLP research and practice. We describe our methods for data collection, as well as our labeling procedures. We analyze our resulting dataset, and finally conclude by providing results from a proof-of-concept regression task to establish dataset validity, achieving a strong performance benchmark (R²=0.797-0.909; RMSE=0.068-0.057).
- Anthology ID:
- 2022.lrec-1.225
- Volume:
- Proceedings of the Thirteenth Language Resources and Evaluation Conference
- Month:
- June
- Year:
- 2022
- Address:
- Marseille, France
- Editors:
- Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
- Venue:
- LREC
- Publisher:
- European Language Resources Association
- Pages:
- 2090–2097
- URL:
- https://aclanthology.org/2022.lrec-1.225
- Cite (ACL):
- Megan Herrera, Ankit Aich, and Natalie Parde. 2022. TweetTaglish: A Dataset for Investigating Tagalog-English Code-Switching. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 2090–2097, Marseille, France. European Language Resources Association.
- Cite (Informal):
- TweetTaglish: A Dataset for Investigating Tagalog-English Code-Switching (Herrera et al., LREC 2022)
- PDF:
- https://preview.aclanthology.org/landing_page/2022.lrec-1.225.pdf
- Code:
- meg2121/tweettaglish-dataset