Modeling Orthographic Variation in Occitan’s Dialects

Zachary Hopton, Noëmi Aepli


Abstract
Effectively normalizing spellings in textual data poses a considerable challenge, especially for low-resource languages lacking standardized writing systems. In this study, we fine-tuned a multilingual model with data from several Occitan dialects and conducted a series of experiments to assess the model’s representations of these dialects. For evaluation purposes, we compiled a parallel lexicon encompassing four Occitan dialects. Intrinsic evaluations of the model’s embeddings revealed that surface similarity between the dialects strengthened representations. When the model was further fine-tuned for part-of-speech tagging, its performance was robust to dialectal variation, even when trained solely on part-of-speech data from a single dialect. Our findings suggest that large multilingual models minimize the need for spelling normalization during pre-processing.
Anthology ID:
2024.vardial-1.6
Volume:
Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Yves Scherrer, Tommi Jauhiainen, Nikola Ljubešić, Marcos Zampieri, Preslav Nakov, Jörg Tiedemann
Venues:
VarDial | WS
Publisher:
Association for Computational Linguistics
Pages:
78–88
URL:
https://aclanthology.org/2024.vardial-1.6
Cite (ACL):
Zachary Hopton and Noëmi Aepli. 2024. Modeling Orthographic Variation in Occitan’s Dialects. In Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024), pages 78–88, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
Modeling Orthographic Variation in Occitan’s Dialects (Hopton & Aepli, VarDial-WS 2024)
PDF:
https://preview.aclanthology.org/jeptaln-2024-ingestion/2024.vardial-1.6.pdf
Supplementary material:
2024.vardial-1.6.SupplementaryMaterial.txt