Evaluating Multilingual Sentiment Classifiers Using an LLM-Annotated Wikipedia Benchmark
Milena Stróżyna, Włodzimierz Lewoniewski, Izabela Czumałowska
Abstract
We present a multilingual study of sentiment evaluation on Wikipedia articles from various topics in five languages (German, English, Spanish, Polish, and Russian). In this paper, we compare three large language models (Gemini Pro 3.1, Claude Opus 4.6, and GPT 5.2), each queried three times per sentence, with two popular multilingual sentiment classifiers. This setup allows us to analyze not only inter-model differences but also intra-model stability as a proxy for confidence. To support systematic evaluation, we construct a benchmark dataset based on strict consensus across evaluators and analyze sentiment distributions across topics and languages. We show substantial variation in sentiment distributions, agreement, and consistency across models and languages. Our results suggest that sentiment evaluation on encyclopedic text remains an underexplored challenge for multilingual NLP.
- Anthology ID:
- 2026.gem-main.63
- Volume:
- Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, USA
- Editors:
- Simon Mille, Sebastian Gehrmann, Patrícia Schmidtová, Ondřej Dušek, Marzieh Fadaee, Kyle Lo, Enrico Santus, Gabriel Stanovsky
- Venues:
- GEM | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 692–703
- URL:
- https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.63/
- Cite (ACL):
- Milena Stróżyna, Włodzimierz Lewoniewski, and Izabela Czumałowska. 2026. Evaluating Multilingual Sentiment Classifiers Using an LLM-Annotated Wikipedia Benchmark. In Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM), pages 692–703, San Diego, California, USA. Association for Computational Linguistics.
- Cite (Informal):
- Evaluating Multilingual Sentiment Classifiers Using an LLM-Annotated Wikipedia Benchmark (Stróżyna et al., GEM 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.63.pdf
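
The abstract mentions two ideas that a small example may make concrete: intra-model stability across repeated queries, and a benchmark built on strict consensus across evaluators. The sketch below is only an illustration under stated assumptions. The abstract does not give the paper's exact definitions, so stability is assumed here to be the share of runs matching a model's most frequent label, and strict consensus is assumed to mean that every run of every model returns the same label. The function names, model names, and labels are all hypothetical.

```python
from collections import Counter


def intra_model_stability(labels):
    """Share of runs that match the most frequent label for one model.

    Assumed definition, not taken from the paper.
    """
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def strict_consensus(runs_by_model):
    """Return the shared label if every run of every model agrees, else None.

    Assumed definition, not taken from the paper.
    """
    all_labels = {label for runs in runs_by_model.values() for label in runs}
    return all_labels.pop() if len(all_labels) == 1 else None


# Hypothetical labels for one sentence: three models, three runs each.
example = {
    "model_a": ["neutral", "neutral", "neutral"],
    "model_b": ["neutral", "neutral", "neutral"],
    "model_c": ["neutral", "negative", "neutral"],
}

stability = {name: intra_model_stability(runs) for name, runs in example.items()}
consensus = strict_consensus(example)
```

In this hypothetical case, one disagreeing run from `model_c` is enough to keep the sentence out of a strict-consensus benchmark, which suggests why such a benchmark would cover only a subset of the annotated sentences.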