Beyond Benchmark Accuracy: Robustness Evaluation of Hinglish Sentiment Models

Chennuru Rahul; Kolawole Adebayo


Abstract

Multilingual transformers have achieved remarkable performance on code-mixed sentiment benchmarks, but their robustness under linguistic stress and domain shift remains underexplored. We fine-tune XLM-RoBERTa and mBERT on a carefully cleaned 25,543-tweet Hinglish sentiment dataset, where XLM-R achieves near-perfect in-distribution accuracy (99.7%). The integrity of this result is confirmed by rigorous hash-based and 3-gram Jaccard deduplication, ruling out data leakage. However, when evaluated on a 400-example human-validated adversarial benchmark spanning negation, sarcasm, contrast, subtle sentiment, and true neutral, XLM-R performance collapses to 42.5% – a drop of over 57 percentage points. Zero-shot transfer to English TweetEval yields only 50.8% accuracy (40.8% macro F1), above . Our results highlight a critical gap between benchmark scores and real-world reliability, underscoring the need for adversarial evaluation and cross-domain stress-testing before deploying sentiment models in practical, safety-sensitive applications.
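
The leakage check mentioned in the abstract (hash-based plus 3-gram Jaccard deduplication) could look roughly like the Python sketch below. This is an illustration only, not the authors' code: the abstract does not say whether the 3-grams are over words or characters, how text is normalized, or what similarity threshold is used, so the word-level 3-grams, the lowercase/whitespace normalization, the `threshold` default of 0.8, and the helper names (`normalize`, `content_hash`, `word_trigrams`, `jaccard`, `find_leaks`) are all assumptions made here.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (assumed normalization)."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def content_hash(text: str) -> str:
    """SHA-256 of the normalized text, used to catch exact duplicates."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def word_trigrams(text: str) -> set:
    """Set of word-level 3-grams (assumption: words, not characters)."""
    tokens = normalize(text).split()
    if len(tokens) < 3:
        # Very short texts fall back to a single n-gram of all tokens.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_leaks(train, test, threshold=0.8):
    """Return (test_index, kind) for test items duplicated in train.

    kind is "exact" for a hash match and "near" when the 3-gram Jaccard
    similarity with any training item reaches the (assumed) threshold.
    """
    train_hashes = {content_hash(t) for t in train}
    train_grams = [word_trigrams(t) for t in train]
    leaks = []
    for idx, text in enumerate(test):
        if content_hash(text) in train_hashes:
            leaks.append((idx, "exact"))
            continue
        grams = word_trigrams(text)
        if any(jaccard(grams, g) >= threshold for g in train_grams):
            leaks.append((idx, "near"))
    return leaks
```

The pairwise comparison is quadratic in the number of examples; for a corpus of roughly 25,000 tweets, an approximate method such as MinHash would likely be preferable, though the abstract does not say which approach was taken.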

Anthology ID: 2026.dravidianlangtech-1.2
Volume: Proceedings of the Sixth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages
Month: July
Year: 2026
Address: Underline (Virtual)
Editors: Bharathi Raja Chakravarthi, Ruba Priyadharshini, Anand Kumar Madasamy, Sajeetha Thavareesan, Saranya Rajiakodi, Subalalitha Navaneethakrishnan, Dhivya Chinnappa, Balasubramanian Palani, Malliga Subramanian, Kogilavani Shanmugavadivel, Ratnavel Rajalakshmi
Venues: DravidianLangTech | WS
Publisher: Association for Computational Linguistics
Pages: 6–13
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.dravidianlangtech-1.2/
Cite (ACL): Chennuru Rahul and Kolawole Adebayo. 2026. Beyond Benchmark Accuracy: Robustness Evaluation of Hinglish Sentiment Models. In Proceedings of the Sixth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages, pages 6–13, Underline (Virtual). Association for Computational Linguistics.
Cite (Informal): Beyond Benchmark Accuracy: Robustness Evaluation of Hinglish Sentiment Models (Rahul & Adebayo, DravidianLangTech 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.dravidianlangtech-1.2.pdf
