Czert – Czech BERT-like Model for Language Representation

Jakub Sido, Ondřej Pražák, Pavel Přibáň, Jan Pašek, Michal Seják, Miloslav Konopík


Abstract
This paper describes the training process of the first Czech monolingual language representation models based on the BERT and ALBERT architectures. We pre-train our models on more than 340K sentences, which is 50 times more Czech data than was used to train the multilingual models that include Czech. We outperform the multilingual models on 9 out of 11 datasets and establish new state-of-the-art results on nine of them. Finally, we discuss the properties of monolingual and multilingual models in light of our results. We publish all pre-trained and fine-tuned models freely for the research community.
Anthology ID:
2021.ranlp-1.149
Volume:
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
Month:
September
Year:
2021
Address:
Held Online
Editors:
Ruslan Mitkov, Galia Angelova
Venue:
RANLP
Publisher:
INCOMA Ltd.
Pages:
1326–1338
URL:
https://aclanthology.org/2021.ranlp-1.149
Cite (ACL):
Jakub Sido, Ondřej Pražák, Pavel Přibáň, Jan Pašek, Michal Seják, and Miloslav Konopík. 2021. Czert – Czech BERT-like Model for Language Representation. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), pages 1326–1338, Held Online. INCOMA Ltd.
Cite (Informal):
Czert – Czech BERT-like Model for Language Representation (Sido et al., RANLP 2021)
PDF:
https://preview.aclanthology.org/emnlp22-frontmatter/2021.ranlp-1.149.pdf
Code:
 kiv-air/Czert
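
Since the pre-trained models are published for the research community, a minimal sketch of loading one of them with the Hugging Face transformers library is shown below. The model identifier used here is an assumption for illustration; consult the kiv-air/Czert repository for the actual checkpoint names or download links.

# Minimal sketch: load a released Czert checkpoint and compute contextual embeddings.
# The Hub identifier below is an assumption, not confirmed by this page.
from transformers import AutoTokenizer, AutoModel

model_name = "UWB-AIR/Czert-B-base-cased"  # assumed model identifier

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode a Czech sentence and run it through the encoder.
inputs = tokenizer("Ahoj, jak se máš?", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)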