Chennuru Rahul
2026
Beyond Benchmark Accuracy: Robustness Evaluation of Hinglish Sentiment Models
Chennuru Rahul | Kolawole Adebayo
Proceedings of the Sixth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages
Multilingual transformers have achieved remarkable performance on code-mixed sentiment benchmarks, but their robustness under linguistic stress and domain shift remains underexplored. We fine-tune XLM-RoBERTa and mBERT on a carefully cleaned 25,543-tweet Hinglish sentiment dataset, where XLM-R achieves near-perfect in-distribution accuracy (99.7%). The integrity of this result is confirmed by rigorous hash-based and 3-gram Jaccard deduplication, ruling out data leakage. However, when evaluated on a 400-example human-validated adversarial benchmark spanning negation, sarcasm, contrast, subtle sentiment, and true neutral, XLM-R performance collapses to 42.5% – a drop of over 57 percentage points. Zero-shot transfer to English TweetEval yields only 50.8% accuracy (40.8% macro F1), above . Our results highlight a critical gap between benchmark scores and real-world reliability, underscoring the need for adversarial evaluation and cross-domain stress-testing before deploying sentiment models in practical, safety-sensitive applications.
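The leakage check named in the abstract (hash-based plus 3-gram Jaccard deduplication) could look roughly like the sketch below. It is illustrative only, not the authors' pipeline: the abstract does not state the normalization, the n-gram granularity, or the similarity cutoff, so the word-level trigrams, the `normalize` helper, the `find_leaks` function, and the 0.8 threshold are all assumptions.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace (assumed normalization)."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def text_hash(text):
    """SHA-256 digest of the normalized text, used for exact-duplicate checks."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def word_trigrams(text):
    """Set of word-level 3-grams (assumed granularity; could be characters)."""
    tokens = normalize(text).split()
    if not tokens:
        return set()
    if len(tokens) < 3:
        return {tuple(tokens)}  # short texts become a single n-gram
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B|; 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def find_leaks(train, test, threshold=0.8):
    """Return (test_index, kind) for test items that duplicate a train item.

    kind is "exact" for a hash match and "near" when the 3-gram Jaccard
    similarity reaches the (hypothetical) threshold.
    """
    train_hashes = {text_hash(t) for t in train}
    train_grams = [word_trigrams(t) for t in train]
    leaks = []
    for idx, text in enumerate(test):
        if text_hash(text) in train_hashes:
            leaks.append((idx, "exact"))
            continue
        grams = word_trigrams(text)
        if any(jaccard(grams, g) >= threshold for g in train_grams):
            leaks.append((idx, "near"))
    return leaks
```

An empty result from `find_leaks(train_texts, test_texts)` would support a no-leakage claim under these assumptions. The pairwise comparison is quadratic, which is workable at the scale of a few tens of thousands of tweets but would call for MinHash or a similar index on larger corpora.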