Minority Positive Sampling for Switching Points - an Anecdote for the Code-Mixing Language Modeling
Arindam Chatterjere, Vineeth Guptha, Parul Chopra, Amitava Das
Abstract
Code-Mixing (CM), or language mixing, is a social norm in multilingual societies. CM is quite prevalent in social media conversations in multilingual regions such as India, Europe, Canada, and Mexico. In this paper, we explore the problem of Language Modeling (LM) for code-mixed Hinglish text. In recent times, there have been several success stories with neural language modeling, such as the Generative Pre-trained Transformer (GPT) (Radford et al., 2019) and Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018). Hence, neural language models have become the new holy grail of modern NLP, although LM for CM remains an unexplored area. To better understand the problem of LM for CM, we initially experimented with several statistical language modeling techniques and subsequently with contemporary neural language models. Our analysis shows that switching points are the main cause of the performance drop in LM for CM (LMCM); therefore, in this paper we introduce the idea of minority positive sampling to selectively induce more samples and achieve better performance. Neural language models, by contrast, demand a huge corpus to train on for better performance. Finally, we report a perplexity of 139 for Hinglish (Hindi-English language pair) LMCM using statistical bi-directional techniques.
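The abstract only sketches the idea, so the following is a minimal, hypothetical illustration of minority positive sampling for switching points: code-mixed sentences that contain a language switch are oversampled before fitting a simple add-one-smoothed bigram model, which is then scored by perplexity. The corpus format, language tags, oversampling factor, and all function names below are assumptions made for illustration and are not taken from the paper.

```python
# Illustrative sketch only (not the authors' implementation): oversample the
# minority of sentences that contain switching points, fit a smoothed bigram
# LM, and report perplexity on held-out text.
import math
import random
from collections import Counter, defaultdict

def has_switching_point(tags):
    """A switching point is any position where the language tag changes,
    e.g. ['hi', 'hi', 'en'] switches between positions 1 and 2."""
    return any(a != b for a, b in zip(tags, tags[1:]))

def minority_positive_sample(corpus, factor=3, seed=0):
    """Duplicate (oversample) sentences containing switching points so the
    LM sees more cross-language bigrams during training. `factor` is an
    assumed hyperparameter, not a value reported in the paper."""
    random.seed(seed)
    minority = [(toks, tags) for toks, tags in corpus if has_switching_point(tags)]
    boosted = list(corpus) + [random.choice(minority) for _ in range(factor * len(minority))]
    random.shuffle(boosted)
    return boosted

def train_bigram_lm(corpus):
    """Add-one-smoothed bigram counts over tokenised sentences."""
    unigrams, bigrams = Counter(), defaultdict(Counter)
    for tokens, _tags in corpus:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        for w1, w2 in zip(padded, padded[1:]):
            bigrams[w1][w2] += 1
    vocab = len(unigrams)
    def prob(w1, w2):
        return (bigrams[w1][w2] + 1) / (unigrams[w1] + vocab)
    return prob

def perplexity(prob, corpus):
    """Exponentiated average negative log probability over all bigrams."""
    log_sum, n = 0.0, 0
    for tokens, _tags in corpus:
        padded = ["<s>"] + tokens + ["</s>"]
        for w1, w2 in zip(padded, padded[1:]):
            log_sum += -math.log(prob(w1, w2))
            n += 1
    return math.exp(log_sum / n)

# Toy Hinglish-style corpus: (tokens, per-token language tags).
train = [
    (["yeh", "movie", "bahut", "achhi", "hai"], ["hi", "en", "hi", "hi", "hi"]),
    (["i", "am", "going", "home"], ["en", "en", "en", "en"]),
    (["kal", "office", "mein", "meeting", "hai"], ["hi", "en", "hi", "en", "hi"]),
]
test = [(["movie", "achhi", "hai"], ["en", "hi", "hi"])]

lm = train_bigram_lm(minority_positive_sample(train))
print("perplexity:", round(perplexity(lm, test), 2))
```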
- Anthology ID: 2020.lrec-1.764
- Volume: Proceedings of the Twelfth Language Resources and Evaluation Conference
- Month: May
- Year: 2020
- Address: Marseille, France
- Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
- Venue: LREC
- Publisher: European Language Resources Association
- Pages: 6228–6236
- Language: English
- URL: https://aclanthology.org/2020.lrec-1.764
- Cite (ACL): Arindam Chatterjere, Vineeth Guptha, Parul Chopra, and Amitava Das. 2020. Minority Positive Sampling for Switching Points - an Anecdote for the Code-Mixing Language Modeling. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6228–6236, Marseille, France. European Language Resources Association.
- Cite (Informal): Minority Positive Sampling for Switching Points - an Anecdote for the Code-Mixing Language Modeling (Chatterjere et al., LREC 2020)
- PDF: https://preview.aclanthology.org/ingest-acl-2023-videos/2020.lrec-1.764.pdf