Unsupervised Data Augmentation for Less-Resourced Languages with no Standardized Spelling

Alice Millour, Karën Fort


Abstract
Building representative linguistic resources and NLP tools for non-standardized languages is challenging: when spelling is not determined by a norm, multiple written forms can be encountered for a given word, inducing a large proportion of out-of-vocabulary (OOV) words. To embrace this diversity, we propose a methodology based on crowdsourced alternative spellings, from which we extract rules that are then applied to match OOV words with one of their spelling variants. This virtuous process enables the unsupervised augmentation of multi-variant lexicons without expert rule definition. We apply this multilingual methodology to Alsatian, a French regional language, and provide an intrinsic evaluation of the correctness of the variant pairs, as well as an extrinsic evaluation on a downstream task. We show that in a low-resource scenario, 145 initial pairs can lead to the generation of 876 additional variant pairs, and to a reduction in OOV words that improves part-of-speech tagging performance by 1 to 4%.
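The abstract does not spell out how rules are extracted or applied, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that a rule is a character-level substitution observed between two spellings of the same word, and that an OOV word is matched when applying a single rule yields a word already in the lexicon. The function names `extract_rules` and `match_oov`, and the toy word pairs, are hypothetical and chosen for illustration only.

```python
from difflib import SequenceMatcher


def extract_rules(pairs):
    """Collect character-level substitutions observed in variant pairs.

    Assumption: a "rule" is a (source, target) substring substitution.
    """
    rules = set()
    for left, right in pairs:
        matcher = SequenceMatcher(None, left, right, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "replace":
                # Variants are symmetric, so keep the rule in both directions.
                rules.add((left[i1:i2], right[j1:j2]))
                rules.add((right[j1:j2], left[i1:i2]))
    return rules


def match_oov(word, lexicon, rules):
    """Return lexicon words reachable from `word` by one rule at one position."""
    candidates = set()
    for src, dst in rules:
        start = word.find(src)
        while start != -1:
            candidate = word[:start] + dst + word[start + len(src):]
            if candidate in lexicon and candidate != word:
                candidates.add(candidate)
            start = word.find(src, start + 1)
    return sorted(candidates)


# Toy seed pairs and lexicon, made up for illustration (not data from the paper).
seed_pairs = [("hus", "hüs"), ("zit", "zyt")]
lexicon = {"hüs", "zyt", "müs"}

rules = extract_rules(seed_pairs)
print(match_oov("mus", lexicon, rules))
```

In this sketch, each newly matched pair could be fed back into `extract_rules` to grow the rule set, which is one possible reading of the "virtuous process" mentioned in the abstract.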
Anthology ID: R19-1090
Volume: Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)
Month: September
Year: 2019
Address: Varna, Bulgaria
Venue: RANLP
Publisher: INCOMA Ltd.
Pages: 776–784
URL: https://aclanthology.org/R19-1090
DOI: 10.26615/978-954-452-056-4_090
Cite (ACL): Alice Millour and Karën Fort. 2019. Unsupervised Data Augmentation for Less-Resourced Languages with no Standardized Spelling. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 776–784, Varna, Bulgaria. INCOMA Ltd.
Cite (Informal): Unsupervised Data Augmentation for Less-Resourced Languages with no Standardized Spelling (Millour & Fort, RANLP 2019)
PDF: https://preview.aclanthology.org/paclic-22-ingestion/R19-1090.pdf