Deciphering and Characterizing Out-of-Vocabulary Words for Morphologically Rich Languages

Georgie Botev; Arya D. McCarthy; Winston Wu; David Yarowsky

Abstract

This paper presents a detailed, foundational empirical case study of the nature of out-of-vocabulary (OOV) words encountered in modern text in a moderate-resource language such as Bulgarian, together with a multi-faceted distributional analysis of the underlying word-formation processes that can aid in their compositional translation, tagging, parsing, language modeling, and other NLP tasks. Because OOV words present a key open challenge to NLP and machine translation systems, especially toward the lower limit of resource availability, both manual and automatic taxonomic analyses of the types, multidimensional properties, and processing potential of multiple representative OOV data samples yield useful practical and corpus-linguistic insights.

Anthology ID: 2022.coling-1.472
Volume: Proceedings of the 29th International Conference on Computational Linguistics
Month: October
Year: 2022
Address: Gyeongju, Republic of Korea
Editors: Nicoletta Calzolari, Chu-Ren Huang, Hansaem Kim, James Pustejovsky, Leo Wanner, Key-Sun Choi, Pum-Mo Ryu, Hsin-Hsi Chen, Lucia Donatelli, Heng Ji, Sadao Kurohashi, Patrizia Paggio, Nianwen Xue, Seokhwan Kim, Younggyun Hahm, Zhong He, Tony Kyungil Lee, Enrico Santus, Francis Bond, Seung-Hoon Na
Venue: COLING
Publisher: International Committee on Computational Linguistics
Pages: 5309–5326
URL: https://preview.aclanthology.org/jlcl-multiple-ingestion/2022.coling-1.472/
Cite (ACL): Georgie Botev, Arya D. McCarthy, Winston Wu, and David Yarowsky. 2022. Deciphering and Characterizing Out-of-Vocabulary Words for Morphologically Rich Languages. In Proceedings of the 29th International Conference on Computational Linguistics, pages 5309–5326, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Cite (Informal): Deciphering and Characterizing Out-of-Vocabulary Words for Morphologically Rich Languages (Botev et al., COLING 2022)
PDF: https://preview.aclanthology.org/jlcl-multiple-ingestion/2022.coling-1.472.pdf
