Abstract
In this paper, we compare word-level language identification systems on code-switched Hindi-English data and a standard Spanish-English dataset. To this end, we build a new code-switched dataset for Hindi-English. To understand the code-switching patterns in these language pairs, we investigate several code-switching metrics. We find that the CRF model outperforms the neural-network-based models by a margin of 2-5 percentage points for Spanish-English and 3-5 percentage points for Hindi-English.
- Anthology ID:
- W18-3206
- Volume:
- Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching
- Month:
- July
- Year:
- 2018
- Address:
- Melbourne, Australia
- Editors:
- Gustavo Aguilar, Fahad AlGhamdi, Victor Soto, Thamar Solorio, Mona Diab, Julia Hirschberg
- Venue:
- ACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 51–61
- URL:
- https://preview.aclanthology.org/add_missing_videos/W18-3206/
- DOI:
- 10.18653/v1/W18-3206
- Cite (ACL):
- Deepthi Mave, Suraj Maharjan, and Thamar Solorio. 2018. Language Identification and Analysis of Code-Switched Social Media Text. In Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, pages 51–61, Melbourne, Australia. Association for Computational Linguistics.
- Cite (Informal):
- Language Identification and Analysis of Code-Switched Social Media Text (Mave et al., ACL 2018)
- PDF:
- https://preview.aclanthology.org/add_missing_videos/W18-3206.pdf