Kolawole Adebayo
2026
Beyond Benchmark Accuracy: Robustness Evaluation of Hinglish Sentiment Models
Chennuru Rahul | Kolawole Adebayo
Proceedings of the Sixth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages
Multilingual transformers have achieved remarkable performance on code-mixed sentiment benchmarks, but their robustness under linguistic stress and domain shift remains underexplored. We fine-tune XLM-RoBERTa and mBERT on a carefully cleaned 25,543-tweet Hinglish sentiment dataset, where XLM-R achieves near-perfect in-distribution accuracy (99.7%). The integrity of this result is confirmed by rigorous hash-based and 3-gram Jaccard deduplication, ruling out data leakage. However, when evaluated on a 400-example human-validated adversarial benchmark spanning negation, sarcasm, contrast, subtle sentiment, and true neutral, XLM-R performance collapses to 42.5% – a drop of over 57 percentage points. Zero-shot transfer to English TweetEval yields only 50.8% accuracy (40.8% macro F1), above . Our results highlight a critical gap between benchmark scores and real-world reliability, underscoring the need for adversarial evaluation and cross-domain stress-testing before deploying sentiment models in practical, safety-sensitive applications.
2023
DCU at SemEval-2023 Task 10: A Comparative Analysis of Encoder-only and Decoder-only Language Models with Insights into Interpretability
Kanishk Verma | Kolawole Adebayo | Joachim Wagner | Brian Davis
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)
We compare pre-trained encoder-only and decoder-only language models, with and without continued pre-training, for detecting online sexism. Our fine-tuning-based classifier system ranked 16th in SemEval-2023 Task 10 Subtask A, which asks systems to distinguish sexist from non-sexist texts. Additionally, we conduct experiments aimed at enhancing the interpretability of systems designed to detect online sexism. Our findings provide insights into the features and decision-making processes underlying our classifier system, thereby contributing to a broader effort to develop explainable AI models to detect online sexism.
2022
Proceedings of the First Workshop on Language Technology and Resources for a Fair, Inclusive, and Safe Society within the 13th Language Resources and Evaluation Conference
Kolawole Adebayo | Rohan Nanda | Kanishk Verma | Brian Davis
Proceedings of the First Workshop on Language Technology and Resources for a Fair, Inclusive, and Safe Society within the 13th Language Resources and Evaluation Conference