Abstract
Automatic discrimination between Bosnian, Croatian, Montenegrin and Serbian is a hard task due to the mutual intelligibility of these South-Slavic languages. In this paper, we introduce the BENCHić-lang benchmark for discriminating between these four languages. The benchmark consists of two datasets from different domains - a Twitter and a news dataset - selected with the aim of fostering cross-dataset evaluation of different modelling approaches. We experiment with the baseline SVM models, based on character n-grams, which perform nicely in-dataset, but do not generalize well in cross-dataset experiments. Thus, we introduce another approach, exploiting only web-crawled data and the weak supervision signal coming from the respective country/language top-level domains. The resulting simple Naive Bayes model, based on less than a thousand word features extracted from web data, outperforms the baseline models in the cross-dataset scenario and achieves good levels of generalization across datasets.
- Anthology ID:
- 2023.vardial-1.11
- Volume:
- Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)
- Month:
- May
- Year:
- 2023
- Address:
- Dubrovnik, Croatia
- Editors:
- Yves Scherrer, Tommi Jauhiainen, Nikola Ljubešić, Preslav Nakov, Jörg Tiedemann, Marcos Zampieri
- Venue:
- VarDial
- Publisher:
- Association for Computational Linguistics
- Pages:
- 113–120
- URL:
- https://aclanthology.org/2023.vardial-1.11
- DOI:
- 10.18653/v1/2023.vardial-1.11
- Cite (ACL):
- Peter Rupnik, Taja Kuzman, and Nikola Ljubešić. 2023. BENCHić-lang: A Benchmark for Discriminating between Bosnian, Croatian, Montenegrin and Serbian. In Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023), pages 113–120, Dubrovnik, Croatia. Association for Computational Linguistics.
- Cite (Informal):
- BENCHić-lang: A Benchmark for Discriminating between Bosnian, Croatian, Montenegrin and Serbian (Rupnik et al., VarDial 2023)
- PDF:
- https://preview.aclanthology.org/naacl24-info/2023.vardial-1.11.pdf
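
The weak-supervision idea summarized in the abstract (labelling web text by its country top-level domain, then fitting a Naive Bayes model over a small word vocabulary) could look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the TLD-to-language mapping, the function names `weak_label`, `train` and `predict`, the whitespace tokenization, the add-one smoothing, the uniform prior, and the toy documents and vocabulary are all hypothetical choices made here.

```python
import math
from collections import Counter, defaultdict
from urllib.parse import urlparse

# Assumed mapping from country top-level domain to language label.
TLD_TO_LANG = {"ba": "Bosnian", "hr": "Croatian", "me": "Montenegrin", "rs": "Serbian"}


def weak_label(url):
    """Return the language implied by the URL's top-level domain, or None."""
    host = urlparse(url).hostname or ""
    return TLD_TO_LANG.get(host.rsplit(".", 1)[-1])


def train(docs, vocab):
    """Count vocabulary words per weakly labelled language.

    docs is an iterable of (url, text) pairs; documents whose domain
    is not in TLD_TO_LANG are skipped.
    """
    counts = defaultdict(Counter)
    for url, text in docs:
        lang = weak_label(url)
        if lang is None:
            continue
        counts[lang].update(w for w in text.lower().split() if w in vocab)
    return counts


def predict(counts, vocab, text):
    """Multinomial Naive Bayes with add-one smoothing and a uniform prior."""
    words = [w for w in text.lower().split() if w in vocab]

    def score(lang):
        total = sum(counts[lang].values()) + len(vocab)
        return sum(math.log((counts[lang][w] + 1) / total) for w in words)

    return max(sorted(counts), key=score)


# Hypothetical toy data; real training text would come from a web crawl.
vocab = {"tko", "ko", "kruh", "hleb"}
docs = [
    ("https://example.hr/a", "tko je kupio kruh"),
    ("https://example.rs/b", "ko je kupio hleb"),
]
model = train(docs, vocab)
print(predict(model, vocab, "tko jede kruh"))
```

Restricting the features to a small vocabulary of discriminative words mirrors the "less than a thousand word features" mentioned in the abstract; how that vocabulary is selected is not described there, so the sketch takes it as given.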