Transformer-based models have become the de facto standard in the field of Natural Language Processing (NLP). By leveraging large unlabeled text corpora, they enable efficient transfer learning, leading to state-of-the-art results on numerous NLP tasks. Nevertheless, for low-resource languages and highly specialized tasks, transformer models tend to lag behind more classical approaches (e.g. SVM, LSTM) because such corpora are lacking. In this paper, we focus on the legal domain and introduce a Romanian BERT model pre-trained on a large specialized corpus. Our model outperforms several strong baselines for legal judgement prediction on two different corpora consisting of cases from trials involving banks in Romania.
Deep pre-trained language models are becoming ubiquitous in the field of Natural Language Processing (NLP). These models learn contextualized representations from large amounts of unlabeled text data and, by enabling efficient transfer learning, obtain state-of-the-art results on a multitude of NLP tasks. For languages other than English, few such models are available, and most of them are trained only on multi-lingual corpora. In this paper, we introduce a Romanian-only pre-trained BERT model, RoBERT, and compare it with different multi-lingual models on seven Romanian-specific NLP tasks grouped into three categories, namely: sentiment analysis, dialect and cross-dialect topic identification, and diacritics restoration. Our model surpasses the multi-lingual models, as well as another mono-lingual implementation of BERT, on all tasks.