Neural Text Normalization for Luxembourgish Using Real-Life Variation Data

Anne-Marie Lutgen; Alistair Plum; Christoph Purschke; Barbara Plank

Abstract

Orthographic variation is very common in Luxembourgish texts due to the absence of a fully-fledged standard variety. Additionally, developing NLP tools for Luxembourgish is a difficult task given the lack of annotated and parallel data, which is exacerbated by ongoing standardization. In this paper, we propose the first sequence-to-sequence normalization models using the ByT5 and mT5 architectures with training data obtained from word-level real-life variation data. We perform a fine-grained, linguistically-motivated evaluation to test byte-based, word-based and pipeline-based models for their strengths and weaknesses in text normalization. We show that our sequence model using real-life variation data is an effective approach for tailor-made normalization in Luxembourgish.

Anthology ID: 2025.vardial-1.9
Volume: Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects
Month: January
Year: 2025
Address: Abu Dhabi, UAE
Editors: Yves Scherrer, Tommi Jauhiainen, Nikola Ljubešić, Preslav Nakov, Jörg Tiedemann, Marcos Zampieri
Venues: VarDial | WS
Publisher: Association for Computational Linguistics
Pages: 115–127
URL: https://preview.aclanthology.org/jlcl-multiple-ingestion/2025.vardial-1.9/
Cite (ACL): Anne-Marie Lutgen, Alistair Plum, Christoph Purschke, and Barbara Plank. 2025. Neural Text Normalization for Luxembourgish Using Real-Life Variation Data. In Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects, pages 115–127, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal): Neural Text Normalization for Luxembourgish Using Real-Life Variation Data (Lutgen et al., VarDial 2025)
PDF: https://preview.aclanthology.org/jlcl-multiple-ingestion/2025.vardial-1.9.pdf
