Neural Text Normalization for Luxembourgish Using Real-Life Variation Data
Anne-Marie Lutgen, Alistair Plum, Christoph Purschke, Barbara Plank
Abstract
Orthographic variation is very common in Luxembourgish texts due to the absence of a fully-fledged standard variety. Additionally, developing NLP tools for Luxembourgish is a difficult task given the lack of annotated and parallel data, which is exacerbated by ongoing standardization. In this paper, we propose the first sequence-to-sequence normalization models using the ByT5 and mT5 architectures with training data obtained from word-level real-life variation data. We perform a fine-grained, linguistically-motivated evaluation to test byte-based, word-based and pipeline-based models for their strengths and weaknesses in text normalization. We show that our sequence model using real-life variation data is an effective approach for tailor-made normalization in Luxembourgish.
- Anthology ID:
- 2025.vardial-1.9
- Volume:
- Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects
- Month:
- January
- Year:
- 2025
- Address:
- Abu Dhabi, UAE
- Editors:
- Yves Scherrer, Tommi Jauhiainen, Nikola Ljubešić, Preslav Nakov, Jörg Tiedemann, Marcos Zampieri
- Venues:
- VarDial | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 115–127
- URL:
- https://preview.aclanthology.org/jlcl-multiple-ingestion/2025.vardial-1.9/
- Cite (ACL):
- Anne-Marie Lutgen, Alistair Plum, Christoph Purschke, and Barbara Plank. 2025. Neural Text Normalization for Luxembourgish Using Real-Life Variation Data. In Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects, pages 115–127, Abu Dhabi, UAE. Association for Computational Linguistics.
- Cite (Informal):
- Neural Text Normalization for Luxembourgish Using Real-Life Variation Data (Lutgen et al., VarDial 2025)
- PDF:
- https://preview.aclanthology.org/jlcl-multiple-ingestion/2025.vardial-1.9.pdf