Abstract
Building representative linguistic resources and NLP tools for non-standardized languages is challenging: when spelling is not determined by a norm, multiple written forms can be encountered for a given word, inducing a large proportion of out-of-vocabulary (OOV) words. To embrace this diversity, we propose a methodology based on crowdsourced alternative spellings, from which we extract rules that are applied to match OOV words with one of their spelling variants. This virtuous process enables the unsupervised augmentation of multi-variant lexicons without expert rule definition. We apply this multilingual methodology to Alsatian, a French regional language, and provide an intrinsic evaluation of the correctness of the variant pairs and an extrinsic evaluation on a downstream task. We show that in a low-resource scenario, 145 initial pairs can lead to the generation of 876 additional variant pairs, and to a reduction in OOV words that improves part-of-speech tagging performance by 1 to 4%.
- Anthology ID:
- R19-1090
- Volume:
- Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)
- Month:
- September
- Year:
- 2019
- Address:
- Varna, Bulgaria
- Venue:
- RANLP
- Publisher:
- INCOMA Ltd.
- Pages:
- 776–784
- URL:
- https://aclanthology.org/R19-1090
- DOI:
- 10.26615/978-954-452-056-4_090
- Cite (ACL):
- Alice Millour and Karën Fort. 2019. Unsupervised Data Augmentation for Less-Resourced Languages with no Standardized Spelling. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 776–784, Varna, Bulgaria. INCOMA Ltd.
- Cite (Informal):
- Unsupervised Data Augmentation for Less-Resourced Languages with no Standardized Spelling (Millour & Fort, RANLP 2019)
- PDF:
- https://preview.aclanthology.org/paclic-22-ingestion/R19-1090.pdf
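
The abstract describes extracting rules from known spelling-variant pairs and applying them to match OOV words with a variant. The following is a minimal sketch of that general idea, not the authors' implementation: it assumes rules are plain character-substitution pairs obtained from a `difflib` alignment, and the function names `extract_rules` and `match_oov`, as well as the toy strings, are invented here for illustration.

```python
from difflib import SequenceMatcher


def extract_rules(pairs):
    """Collect bidirectional substitution rules from variant pairs.

    Assumption: a "rule" is simply a (source, target) substring pair taken
    from the 'replace' segments of a character-level alignment.
    """
    rules = set()
    for a, b in pairs:
        matcher = SequenceMatcher(None, a, b)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "replace":
                rules.add((a[i1:i2], b[j1:j2]))
                rules.add((b[j1:j2], a[i1:i2]))
    return rules


def match_oov(word, rules, lexicon):
    """Apply each rule once at each position; keep candidates found in the lexicon."""
    candidates = set()
    for src, dst in rules:
        start = word.find(src)
        while start != -1:
            candidates.add(word[:start] + dst + word[start + len(src):])
            start = word.find(src, start + 1)
    return sorted(candidates & set(lexicon))


# Toy strings invented for illustration; not data from the paper.
rules = extract_rules([("hüs", "hus")])
print(match_oov("mus", rules, {"müs"}))
```

Applying a single rule per candidate keeps the search space small; a fuller system would presumably need to rank or filter rules, which this sketch does not attempt.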