Tom Bidewell
2026
SMASH at SemEval-2026 Task 9: Detecting Multilingual Polarisation with Encoder Ensembles and Calibrated Decision Thresholds
Zahra Bokaei | Alessandra Terranova | Yi Zheng | Tom Bidewell | Bjorn Ross
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)
This paper describes the SMASH submission to SemEval-2026 Task 9 on multilingual, multicultural, and multi-event polarisation detection. The task comprises (i) binary polarisation detection, (ii) multi-label classification of polarisation types, and (iii) multi-label identification of polarisation manifestations across all available languages. We propose a language-adaptive ensemble framework that combines monolingual and multilingual encoder-only transformers with a principled out-of-fold (OOF) threshold tuning strategy. Instead of relying on fixed probability thresholds, we jointly tune ensemble weights and class-wise decision thresholds to directly optimise macro-F1, the official evaluation metric. Our experiments show that (1) monolingual encoders dominate in several high-resource languages but benefit from complementary multilingual signals, (2) no single multilingual backbone universally outperforms the others across languages and subtasks, and (3) language-specific class threshold tuning substantially improves performance because class distributions vary widely across languages. Our results demonstrate that careful logit-level ensembling and threshold tuning provide strong performance for multilingual, imbalanced, multi-label polarisation detection. Across 22 evaluation languages, SMASH ranks in the top three for 5 languages in Subtask 1, 14 languages in Subtask 2, and 16 languages in Subtask 3. Our system achieves average macro-F1 scores of 0.81, 0.62, and 0.53 for Subtasks 1, 2, and 3, respectively.
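The joint tuning of ensemble weights and class-wise thresholds described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two-model setup, the grid search, the sigmoid output layer, and all names (`tune_ensemble`, `logits_a`, `logits_b`) are assumptions made for illustration. It covers the multi-label case, where macro-F1 is the mean of per-label F1 scores, so for a fixed ensemble weight each label's threshold can be chosen independently.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def f1_binary(y_true, y_pred):
    # F1 for one label; defined as 0 when there are no positives at all.
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def tune_ensemble(logits_a, logits_b, y_true, weights=None, thresholds=None):
    """Grid-search a logit-level mixing weight and per-label thresholds.

    logits_a, logits_b: out-of-fold logits from two hypothetical models,
    each of shape (n_samples, n_labels). y_true: 0/1 array of same shape.
    Returns (best macro-F1, best weight, per-label thresholds).
    """
    weights = np.linspace(0.0, 1.0, 11) if weights is None else weights
    thresholds = np.linspace(0.05, 0.95, 19) if thresholds is None else thresholds
    best = (-1.0, None, None)
    for w in weights:
        # Logit-level ensembling: average logits, then apply the sigmoid.
        probs = sigmoid(w * logits_a + (1.0 - w) * logits_b)
        label_thr, label_f1 = [], []
        for c in range(y_true.shape[1]):
            scores = [
                f1_binary(y_true[:, c], (probs[:, c] >= t).astype(int))
                for t in thresholds
            ]
            k = int(np.argmax(scores))
            label_thr.append(thresholds[k])
            label_f1.append(scores[k])
        macro = float(np.mean(label_f1))
        if macro > best[0]:
            best = (macro, float(w), np.array(label_thr))
    return best
```

Because both the weight and the thresholds are selected on out-of-fold predictions, the held-out test data is never used for tuning. In a per-language setup, one would presumably call such a routine once per language and subtask, which is what allows thresholds to track each language's class distribution.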