Abstract
In this paper we describe a transformer model pre-trained on 8 billion tokens of crawled text from the Croatian, Bosnian, Serbian and Montenegrin web domains. We evaluate the model on the tasks of part-of-speech tagging, named entity recognition, geo-location prediction and commonsense causal reasoning, showing improvements over state-of-the-art models on all tasks. For the commonsense reasoning evaluation we introduce COPA-HR, a translation of the Choice of Plausible Alternatives (COPA) dataset into Croatian. The BERTić model is made freely available for use and further task-specific fine-tuning through HuggingFace.
- Anthology ID:
- 2021.bsnlp-1.5
- Volume:
- Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing
- Month:
- April
- Year:
- 2021
- Address:
- Kyiv, Ukraine
- Venue:
- BSNLP
- SIG:
- SIGSLAV
- Publisher:
- Association for Computational Linguistics
- Pages:
- 37–42
- URL:
- https://aclanthology.org/2021.bsnlp-1.5
- Cite (ACL):
- Nikola Ljubešić and Davor Lauc. 2021. BERTić - The Transformer Language Model for Bosnian, Croatian, Montenegrin and Serbian. In Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing, pages 37–42, Kyiv, Ukraine. Association for Computational Linguistics.
- Cite (Informal):
- BERTić - The Transformer Language Model for Bosnian, Croatian, Montenegrin and Serbian (Ljubešić & Lauc, BSNLP 2021)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/2021.bsnlp-1.5.pdf
- Data
- COPA
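The abstract notes that the model is available for task-specific fine-tuning through HuggingFace. A minimal sketch of how it could be loaded with the `transformers` library is shown below; the Hub model ID `classla/bcms-bertic` and the choice of a token-classification head are assumptions not stated in this record, so verify them on the Hub before use.

```python
# Sketch: loading BERTić from the Hugging Face Hub for fine-tuning on a
# token-level task such as NER or POS tagging. The model ID below is an
# assumption, not taken from this record; check the Hub before relying on it.
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_ID = "classla/bcms-bertic"  # assumed Hub identifier


def load_bertic_for_token_classification(num_labels: int):
    """Return a tokenizer and a BERTić model with a freshly initialised
    token-classification head sized for `num_labels` output tags."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForTokenClassification.from_pretrained(
        MODEL_ID, num_labels=num_labels
    )
    return tokenizer, model
```

The same pattern applies to the other evaluated tasks by swapping the head class (e.g. a sequence-classification head for COPA-style choice tasks).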