Abstract
This paper reports on challenges and results in developing NLP resources for spoken Rusyn. Being a Slavic minority language, Rusyn does not have any resources to make use of. We propose to build a morphosyntactic dictionary for Rusyn, combining existing resources from the etymologically close Slavic languages Russian, Ukrainian, Slovak, and Polish. We adapt these resources to Rusyn by using vowel-sensitive Levenshtein distance, hand-written language-specific transformation rules, and combinations of the two. Compared to an exact match baseline, we increase the coverage of the resulting morphological dictionary by up to 77.4% relative (42.9% absolute), which results in a tagging recall increased by 11.6% relative (9.1% absolute). Our research confirms and expands the results of previous studies showing the efficiency of using NLP resources from neighboring languages for low-resourced languages.
- Anthology ID:
- W17-1405
- Volume:
- Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing
- Month:
- April
- Year:
- 2017
- Address:
- Valencia, Spain
- Editors:
- Tomaž Erjavec, Jakub Piskorski, Lidia Pivovarova, Jan Šnajder, Josef Steinberger, Roman Yangarber
- Venue:
- BSNLP
- SIG:
- SIGSLAV
- Publisher:
- Association for Computational Linguistics
- Pages:
- 27–32
- URL:
- https://aclanthology.org/W17-1405
- DOI:
- 10.18653/v1/W17-1405
- Cite (ACL):
- Achim Rabus and Yves Scherrer. 2017. Lexicon Induction for Spoken Rusyn – Challenges and Results. In Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing, pages 27–32, Valencia, Spain. Association for Computational Linguistics.
- Cite (Informal):
- Lexicon Induction for Spoken Rusyn – Challenges and Results (Rabus & Scherrer, BSNLP 2017)
- PDF:
- https://preview.aclanthology.org/ml4al-ingestion/W17-1405.pdf
- Data
- MULTEXT-East
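
The abstract mentions adapting neighboring-language resources with a vowel-sensitive Levenshtein distance. The sketch below illustrates one possible reading of that idea and is not the authors' implementation: it assumes that substituting one vowel for another costs less than any other edit. The vowel inventory `VOWELS`, the cost `vowel_sub_cost=0.5`, and the function names `vowel_sensitive_distance` and `closest_entry` are all hypothetical choices made for illustration.

```python
# Hypothetical lowercase Cyrillic vowel inventory; an assumption, not taken from the paper.
VOWELS = set("аеиіоуыэюяєї")


def vowel_sensitive_distance(a: str, b: str, vowel_sub_cost: float = 0.5) -> float:
    """Levenshtein distance in which vowel-for-vowel substitutions are cheaper.

    Insertions, deletions, and all other substitutions cost 1.0.
    The reduced vowel cost (0.5 by default) is an assumed value.
    """
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [float(i)]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                sub = 0.0
            elif ca in VOWELS and cb in VOWELS:
                sub = vowel_sub_cost
            else:
                sub = 1.0
            curr.append(min(prev[j] + 1.0,        # deletion from a
                            curr[j - 1] + 1.0,    # insertion into a
                            prev[j - 1] + sub))   # substitution or match
        prev = curr
    return prev[-1]


def closest_entry(word: str, lexicon: list[str]) -> str:
    """Return the lexicon entry with the smallest vowel-sensitive distance to word."""
    return min(lexicon, key=lambda entry: vowel_sensitive_distance(word, entry))
```

Under these assumptions, a word form that differs from a neighboring-language entry only in a vowel is ranked closer than one that differs in a consonant, so `closest_entry("хліб", ["хлів", "хлеб"])` selects `"хлеб"`. The sketch compares lowercase strings only; input would need to be lowercased first.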