Mocktails of Translation, Ensemble Learning and Embeddings to tackle Hinglish NLP challenges

Lov Kumar, Vikram Singh, Proksh, Pratyush Mishra


Abstract
Social media has become a global platform where users express opinions on diverse contemporary topics, often blending a dominant language with their native tongue and producing code-mixed, context-rich content. A typical example is Hinglish, in which Hindi elements are embedded in English text. This linguistic mixture challenges traditional NLP systems, which rely on monolingual resources and struggle to process multilingual content. Sentiment analysis for code-mixed data, particularly data involving Indian languages, remains largely unexplored. This paper introduces a novel approach to sentiment analysis of code-mixed Hinglish data that combines translation, different stacking classifier architectures, and embedding techniques. We use pre-trained LoRA weights of a fine-tuned Gemma-2B model to translate Hinglish into English, and then apply four pre-trained meta-embeddings: GloVe-T, Word2Vec, TF-IDF, and fastText. SMOTE is applied to balance the skewed data, and dimensionality reduction is performed before training machine learning models and stacking classifier ensembles. Three ensemble architectures, each combining 22 base classifiers with a Logistic Regression meta-classifier, are used to test different meta-embedding combinations. Experimental results show that the TF-W2V-FST (TF-IDF, Word2Vec, fastText) combination performs best, with an SVM using a radial basis function (RBF) kernel achieving the highest accuracy (91.53%) and AUC (0.96). This research contributes a novel and effective technique for sentiment analysis of code-mixed data.
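The classification stage described in the abstract (embedding, dimensionality reduction, then a stacking ensemble with a Logistic Regression meta-classifier) can be illustrated with a minimal scikit-learn sketch. This is not the authors' implementation: the toy sentences and labels are hypothetical stand-ins for already-translated text, only TF-IDF is used as the embedding, TruncatedSVD is an assumed choice of dimensionality reduction, two base classifiers stand in for the 22 used in the paper, and the translation and SMOTE steps are omitted.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy data standing in for translated text
# (1 = positive sentiment, 0 = negative sentiment).
texts = [
    "this movie was really great",
    "i loved the songs and the story",
    "what a great and wonderful film",
    "the acting was wonderful i loved it",
    "really great story and great acting",
    "i loved this wonderful movie",
    "this movie was really bad",
    "i hated the songs and the story",
    "what a bad and boring film",
    "the acting was boring i hated it",
    "really bad story and bad acting",
    "i hated this boring movie",
]
labels = [1] * 6 + [0] * 6

# Stacking ensemble: base classifiers feed a Logistic Regression
# meta-classifier. Only two base learners are used here for brevity.
stack = StackingClassifier(
    estimators=[
        ("svm_rbf", SVC(kernel="rbf", random_state=0)),
        ("naive_bayes", GaussianNB()),
    ],
    final_estimator=LogisticRegression(),
    cv=2,  # small fold count because the toy data set is tiny
)

# TF-IDF features -> dimensionality reduction -> stacking classifier.
# TruncatedSVD is an assumption; the abstract does not name the method.
model = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=4, random_state=0),
    stack,
)

model.fit(texts, labels)
preds = model.predict(["a great film", "a boring film"])
print(preds)
```

On real data, class balancing (for example SMOTE from the separate imbalanced-learn package) would be applied to the training features before fitting, and the other embeddings would be concatenated with or substituted for the TF-IDF features.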
Anthology ID:
2024.icon-1.40
Volume:
Proceedings of the 21st International Conference on Natural Language Processing (ICON)
Month:
December
Year:
2024
Address:
AU-KBC Research Centre, Chennai, India
Editors:
Sobha Lalitha Devi, Karunesh Arora
Venue:
ICON
Publisher:
NLP Association of India (NLPAI)
Pages:
593–601
URL:
https://preview.aclanthology.org/icon-24-ingestion/2024.icon-1.40/
Cite (ACL):
Lov Kumar, Vikram Singh, Proksh, and Pratyush Mishra. 2024. Mocktails of Translation, Ensemble Learning and Embeddings to tackle Hinglish NLP challenges. In Proceedings of the 21st International Conference on Natural Language Processing (ICON), pages 593–601, AU-KBC Research Centre, Chennai, India. NLP Association of India (NLPAI).
Cite (Informal):
Mocktails of Translation, Ensemble Learning and Embeddings to tackle Hinglish NLP challenges (Kumar et al., ICON 2024)
PDF:
https://preview.aclanthology.org/icon-24-ingestion/2024.icon-1.40.pdf