Abstract
This paper presents a detailed foundational empirical case study of the nature of out-of-vocabulary words encountered in modern text in a moderate-resource language such as Bulgarian, and a multi-faceted distributional analysis of the underlying word-formation processes that can aid in their compositional translation, tagging, parsing, language modeling, and other NLP tasks. Given that out-of-vocabulary (OOV) words generally present a key open challenge to NLP and machine translation systems, especially toward the lower limit of resource availability, there are useful practical insights, as well as corpus-linguistic insights, from both a detailed manual and automatic taxonomic analysis of the types, multidimensional properties, and processing potential for multiple representative OOV data samples.
- Anthology ID:
- 2022.coling-1.472
- Volume:
- Proceedings of the 29th International Conference on Computational Linguistics
- Month:
- October
- Year:
- 2022
- Address:
- Gyeongju, Republic of Korea
- Venue:
- COLING
- Publisher:
- International Committee on Computational Linguistics
- Pages:
- 5309–5326
- URL:
- https://aclanthology.org/2022.coling-1.472
- Cite (ACL):
- Georgie Botev, Arya D. McCarthy, Winston Wu, and David Yarowsky. 2022. Deciphering and Characterizing Out-of-Vocabulary Words for Morphologically Rich Languages. In Proceedings of the 29th International Conference on Computational Linguistics, pages 5309–5326, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
- Cite (Informal):
- Deciphering and Characterizing Out-of-Vocabulary Words for Morphologically Rich Languages (Botev et al., COLING 2022)
- PDF:
- https://preview.aclanthology.org/paclic-22-ingestion/2022.coling-1.472.pdf