BENCHić-lang: A Benchmark for Discriminating between Bosnian, Croatian, Montenegrin and Serbian

Peter Rupnik; Taja Kuzman; Nikola Ljubešić

doi:10.18653/v1/2023.vardial-1.11

Abstract

Automatic discrimination between Bosnian, Croatian, Montenegrin and Serbian is a hard task due to the mutual intelligibility of these South-Slavic languages. In this paper, we introduce the BENCHić-lang benchmark for discriminating between these four languages. The benchmark consists of two datasets from different domains, a Twitter dataset and a news dataset, selected with the aim of fostering cross-dataset evaluation of different modelling approaches. We experiment with baseline SVM models based on character n-grams, which perform well in-dataset but do not generalize well in cross-dataset experiments. We therefore introduce another approach that exploits only web-crawled data and the weak supervision signal coming from the respective country/language top-level domains. The resulting simple Naive Bayes model, based on fewer than a thousand word features extracted from web data, outperforms the baseline models in the cross-dataset scenario and generalizes well across datasets.
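As a rough illustration of the weakly supervised idea described in the abstract, the sketch below labels documents by the top-level domain of their URL and trains a multinomial Naive Bayes classifier over a small fixed word vocabulary. It is a minimal sketch under stated assumptions, not the authors' implementation: the `TLD_TO_LANG` mapping, the function names, the whitespace tokenization, the Laplace smoothing, and the toy documents and vocabulary are all invented here for illustration and are not taken from the benchmark.

```python
import math
from collections import Counter, defaultdict

# Assumed mapping from country top-level domain to a language label.
TLD_TO_LANG = {"ba": "bs", "hr": "hr", "me": "cnr", "rs": "sr"}


def weak_label(url):
    """Derive a language label from the URL's top-level domain, or None."""
    host = url.split("//")[-1].split("/")[0]
    tld = host.rsplit(".", 1)[-1].lower()
    return TLD_TO_LANG.get(tld)


def train_nb(docs, vocab, alpha=1.0):
    """Train multinomial Naive Bayes restricted to a fixed word vocabulary.

    docs is an iterable of (url, text) pairs; the only supervision is the
    label derived from each URL's top-level domain.
    """
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for url, text in docs:
        label = weak_label(url)
        if label is None:
            continue  # skip documents from domains without a label
        doc_counts[label] += 1
        word_counts[label].update(w for w in text.lower().split() if w in vocab)

    total_docs = sum(doc_counts.values())
    model = {}
    for label, n_docs in doc_counts.items():
        # Laplace-smoothed denominator over the fixed vocabulary.
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        log_prior = math.log(n_docs / total_docs)
        log_lik = {w: math.log((word_counts[label][w] + alpha) / total) for w in vocab}
        model[label] = (log_prior, log_lik)
    return model


def predict(model, text):
    """Return the label with the highest log-posterior score for the text."""
    tokens = text.lower().split()
    scores = {}
    for label, (log_prior, log_lik) in model.items():
        scores[label] = log_prior + sum(log_lik[w] for w in tokens if w in log_lik)
    return max(scores, key=scores.get)


# Invented toy documents and vocabulary, for illustration only.
toy_docs = [
    ("https://example.hr/a", "tko je kupio kruh ovaj tjedan"),
    ("https://example.rs/b", "ko je kupio hleb ove nedelje"),
    ("https://example.com/c", "this document has no usable label"),
]
toy_vocab = {"tko", "ko", "kruh", "hleb", "tjedan", "nedelje"}

model = train_nb(toy_docs, toy_vocab)
print(predict(model, "tko ima kruh"))
print(predict(model, "ko ima hleb"))
```

Restricting the model to a small, fixed vocabulary mirrors the abstract's mention of fewer than a thousand word features; how those features are actually selected from the web data is not specified here and is left out of the sketch.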

Anthology ID: 2023.vardial-1.11
Volume: Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)
Month: May
Year: 2023
Address: Dubrovnik, Croatia
Editors: Yves Scherrer, Tommi Jauhiainen, Nikola Ljubešić, Preslav Nakov, Jörg Tiedemann, Marcos Zampieri
Venue: VarDial
Publisher: Association for Computational Linguistics
Pages: 113–120
URL: https://preview.aclanthology.org/jlcl-multiple-ingestion/2023.vardial-1.11/
DOI: 10.18653/v1/2023.vardial-1.11
Cite (ACL): Peter Rupnik, Taja Kuzman, and Nikola Ljubešić. 2023. BENCHić-lang: A Benchmark for Discriminating between Bosnian, Croatian, Montenegrin and Serbian. In Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023), pages 113–120, Dubrovnik, Croatia. Association for Computational Linguistics.
Cite (Informal): BENCHić-lang: A Benchmark for Discriminating between Bosnian, Croatian, Montenegrin and Serbian (Rupnik et al., VarDial 2023)
PDF: https://preview.aclanthology.org/jlcl-multiple-ingestion/2023.vardial-1.11.pdf
Video: https://preview.aclanthology.org/jlcl-multiple-ingestion/2023.vardial-1.11.mp4
