Clément Lefebvre


2022

LuxemBERT: Simple and Practical Data Augmentation in Language Model Pre-Training for Luxembourgish
Cedric Lothritz | Bertrand Lebichot | Kevin Allix | Lisa Veiber | Tegawende Bissyande | Jacques Klein | Andrey Boytsov | Clément Lefebvre | Anne Goujon
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Pre-trained language models such as BERT have become ubiquitous in NLP, where they have achieved state-of-the-art performance in most tasks. While these models are readily available for English and other widely spoken languages, they remain scarce for low-resource languages such as Luxembourgish. In this paper, we present LuxemBERT, a BERT model for the Luxembourgish language that we create using the following approach: we augment the pre-training dataset with text data from a closely related language, which we partially translate using a simple and straightforward method. We are then able to produce the LuxemBERT model, which we show to be effective for various NLP tasks: it outperforms a simple baseline built with the available Luxembourgish text data as well as the multilingual mBERT model, which is currently the only option for transformer-based language models in Luxembourgish. Furthermore, we present datasets for various downstream NLP tasks that we created for this study and will make available to researchers on request.