LuxemBERT: Simple and Practical Data Augmentation in Language Model Pre-Training for Luxembourgish

Cedric Lothritz | Bertrand Lebichot | Kevin Allix | Lisa Veiber | Tegawende Bissyande | Jacques Klein | Andrey Boytsov | Clément Lefebvre | Anne Goujon

Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022
Pre-trained language models such as BERT have become ubiquitous in NLP, where they achieve state-of-the-art performance on most tasks. While these models are readily available for English and other widely spoken languages, they remain scarce for low-resource languages such as Luxembourgish. In this paper, we present LuxemBERT, a BERT model for the Luxembourgish language that we create using the following approach: we augment the pre-training dataset with text data from a closely related language that we partially translate using a simple and straightforward method. We are then able to produce the LuxemBERT model, which we show to be effective for various NLP tasks: it outperforms a simple baseline built with the available Luxembourgish text data as well as the multilingual mBERT model, which is currently the only option for transformer-based language models in Luxembourgish. Furthermore, we present datasets for various downstream NLP tasks that we created for this study and will make available to researchers on request.
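The abstract does not spell out the translation method, but the idea of partially translating text from a closely related language can be sketched as simple dictionary-based word substitution. The lexicon entries and function below are illustrative assumptions, not the paper's actual resources:

```python
# Hedged sketch: partial word-level translation for data augmentation.
# A toy bilingual lexicon mapping related-language words to
# Luxembourgish; words absent from the lexicon are left untouched,
# yielding partially translated sentences. All entries are
# illustrative, not the paper's actual lexicon.
LEXICON = {
    "und": "an",    # "and"
    "ich": "ech",   # "I"
    "nicht": "net", # "not"
}

def partially_translate(sentence: str, lexicon: dict) -> str:
    """Replace each word found in the lexicon; keep the rest as-is."""
    return " ".join(lexicon.get(w.lower(), w) for w in sentence.split())

print(partially_translate("ich gehe und du nicht", LEXICON))
# -> "ech gehe an du net"
```

Only words covered by the lexicon are swapped, so the output mixes both languages — the "partial" translation the abstract refers to.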