LuxemBERT: Simple and Practical Data Augmentation in Language Model Pre-Training for Luxembourgish

Cedric Lothritz, Bertrand Lebichot, Kevin Allix, Lisa Veiber, Tegawende Bissyande, Jacques Klein, Andrey Boytsov, Clément Lefebvre, Anne Goujon


Abstract
Pre-trained Language Models such as BERT have become ubiquitous in NLP where they have achieved state-of-the-art performance in most NLP tasks. While these models are readily available for English and other widely spoken languages, they remain scarce for low-resource languages such as Luxembourgish. In this paper, we present LuxemBERT, a BERT model for the Luxembourgish language that we create using the following approach: we augment the pre-training dataset by considering text data from a closely related language that we partially translate using a simple and straightforward method. We are then able to produce the LuxemBERT model, which we show to be effective for various NLP tasks: it outperforms a simple baseline built with the available Luxembourgish text data as well the multilingual mBERT model, which is currently the only option for transformer-based language models in Luxembourgish. Furthermore, we present datasets for various downstream NLP tasks that we created for this study and will make available to researchers on request.
Anthology ID:
2022.lrec-1.543
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
SIG:
Publisher:
European Language Resources Association
Note:
Pages:
5080–5089
Language:
URL:
https://preview.aclanthology.org/icon-24-ingestion/2022.lrec-1.543/
DOI:
Bibkey:
Cite (ACL):
Cedric Lothritz, Bertrand Lebichot, Kevin Allix, Lisa Veiber, Tegawende Bissyande, Jacques Klein, Andrey Boytsov, Clément Lefebvre, and Anne Goujon. 2022. LuxemBERT: Simple and Practical Data Augmentation in Language Model Pre-Training for Luxembourgish. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 5080–5089, Marseille, France. European Language Resources Association.
Cite (Informal):
LuxemBERT: Simple and Practical Data Augmentation in Language Model Pre-Training for Luxembourgish (Lothritz et al., LREC 2022)
Copy Citation:
PDF:
https://preview.aclanthology.org/icon-24-ingestion/2022.lrec-1.543.pdf