Milena Stróżyna

2026

Evaluating Multilingual Sentiment Classifiers Using an LLM-Annotated Wikipedia Benchmark
Milena Stróżyna | Włodzimierz Lewoniewski | Izabela Czumałowska
Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)

We present a multilingual study of sentiment evaluation on Wikipedia articles from various topics in five languages (German, English, Spanish, Polish, and Russian). In this paper, we compare three large language models (Gemini Pro 3.1, Claude Opus 4.6, and GPT 5.2), each queried three times per sentence, with two popular multilingual sentiment classifiers. This setup allows us to analyze not only inter-model differences but also intra-model stability as a proxy for confidence. To support systematic evaluation, we construct a benchmark dataset based on strict consensus across evaluators and analyze sentiment distributions across topics and languages. We show substantial variation in sentiment distributions, agreement, and consistency across models and languages. Our results suggest that sentiment evaluation on encyclopedic text remains an underexplored challenge for multilingual NLP.
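
The abstract does not spell out how intra-model stability or strict consensus are computed. As a rough illustration only, the sketch below assumes that stability is the share of repeated runs matching a model's most frequent label, and that strict consensus means every run of every evaluator returns the same label. The function names, evaluator names, and labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def intra_model_stability(runs):
    """Share of repeated runs that match the most frequent label (assumed definition)."""
    counts = Counter(runs)
    return counts.most_common(1)[0][1] / len(runs)


def strict_consensus(labels_by_evaluator):
    """Return the shared label if all runs of all evaluators agree, else None (assumed definition)."""
    all_labels = {label for runs in labels_by_evaluator.values() for label in runs}
    return next(iter(all_labels)) if len(all_labels) == 1 else None


# Hypothetical annotations for a single sentence: three evaluators, three runs each.
example = {
    "llm_a": ["neutral", "neutral", "neutral"],
    "llm_b": ["neutral", "neutral", "positive"],
    "llm_c": ["neutral", "neutral", "neutral"],
}

stability = {name: intra_model_stability(runs) for name, runs in example.items()}
benchmark_label = strict_consensus(example)  # None here, because one run disagrees
```

Under these assumptions, a sentence would enter the benchmark only when `strict_consensus` returns a label, while the per-model stability scores could be read as a coarse confidence signal.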

