100,000+ Movie Reviews from Kazakhstan: Russian, Kazakh, and Code-Switched Texts

Rustem Yeshpanov


Abstract
We present a new publicly available corpus of 100,502 movie reviews from Kazakhstan, collected from kino.kz, spanning 2001–2025 and covering 4,943 unique titles. The dataset is multilingual, consisting mainly of Russian reviews alongside Kazakh and code-switched texts. Reviews are manually annotated for language and sentiment polarity, and 11,309 reviews additionally contain explicit user-provided ratings. We define two sentiment tasks (three-way polarity classification and five-class score classification) and benchmark classical BoW/TF-IDF baselines against multilingual transformer models (mBERT, XLM-RoBERTa, RemBERT). Transformer models consistently outperform the classical baselines on polarity classification, while score classification remains challenging under leakage-controlled evaluation because of severe class imbalance and subtle distinctions between adjacent rating levels.
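The effect of class imbalance on the five-class score task can be illustrated with a small sketch. The abstract does not state which evaluation metrics or label distribution the paper uses, so everything below is an assumption: the label counts are invented toy values (not taken from the corpus), and accuracy and macro-averaged F1 are chosen only as common examples of an imbalance-insensitive and an imbalance-sensitive metric.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of predictions that match the gold labels."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # A class that is never predicted and never correct gets F1 = 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical, heavily skewed five-class rating distribution (toy data only).
gold = [5] * 70 + [4] * 15 + [3] * 8 + [2] * 4 + [1] * 3

# Trivial baseline: always predict the most frequent rating.
majority_label = Counter(gold).most_common(1)[0][0]
pred = [majority_label] * len(gold)

acc = accuracy(gold, pred)
mf1 = macro_f1(gold, pred)
print(f"accuracy={acc:.3f} macro_f1={mf1:.3f}")
```

On this toy distribution, a classifier that always predicts the majority rating looks strong on accuracy but weak on macro-F1, because four of the five classes contribute an F1 of zero. This is one reason results on imbalanced rating data are usually easier to interpret with a per-class or macro-averaged metric.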
Anthology ID:
2026.nlp4dh-1.4
Volume:
Proceedings of the 6th International Conference on Natural Language Processing for the Digital Humanities
Month:
July
Year:
2026
Address:
San Diego, USA
Editors:
Sil Hamilton, Emily Öhman, Rebecca M. M. Hicke, Yuri Bizzoni, Axel Bax, Jacob A. Matthews, Mika Hämäläinen
Venues:
NLP4DH | WS
Publisher:
Association for Computational Linguistics
Pages:
31–40
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.nlp4dh-1.4/
Cite (ACL):
Rustem Yeshpanov. 2026. 100,000+ Movie Reviews from Kazakhstan: Russian, Kazakh, and Code-Switched Texts. In Proceedings of the 6th International Conference on Natural Language Processing for the Digital Humanities, pages 31–40, San Diego, USA. Association for Computational Linguistics.
Cite (Informal):
100,000+ Movie Reviews from Kazakhstan: Russian, Kazakh, and Code-Switched Texts (Yeshpanov, NLP4DH 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.nlp4dh-1.4.pdf