Abstract
This study addresses the challenges of unsupervised learning of word representations for Ukrainian, a morphologically rich and low-resource language. Traditional models that perform decently on English do not generalize well to such languages because of insufficient data and the complexity of their grammatical structures. To overcome these challenges, we used a large, high-quality dataset covering different genres to learn Ukrainian word vector representations. We searched for the best hyperparameters for training fastText language models on this dataset and performed intrinsic and extrinsic evaluations of the resulting word embeddings using established methods and metrics. The results indicate that the trained vectors outperform existing embeddings for Ukrainian on intrinsic tests. Our best model achieves 62% accuracy on the word analogy task. Extrinsic evaluations were performed on two sequence labeling tasks, NER and POS tagging (83% spaCy NER F-score, 83% spaCy POS accuracy, 92% Flair POS accuracy).
- Anthology ID:
- 2023.unlp-1.3
- Volume:
- Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP)
- Month:
- May
- Year:
- 2023
- Address:
- Dubrovnik, Croatia
- Editor:
- Mariana Romanyshyn
- Venue:
- UNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 20–31
- URL:
- https://aclanthology.org/2023.unlp-1.3
- DOI:
- 10.18653/v1/2023.unlp-1.3
- Cite (ACL):
- Nataliia Romanyshyn, Dmytro Chaplynskyi, and Kyrylo Zakharov. 2023. Learning Word Embeddings for Ukrainian: A Comparative Study of FastText Hyperparameters. In Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP), pages 20–31, Dubrovnik, Croatia. Association for Computational Linguistics.
- Cite (Informal):
- Learning Word Embeddings for Ukrainian: A Comparative Study of FastText Hyperparameters (Romanyshyn et al., UNLP 2023)
- PDF:
- https://preview.aclanthology.org/naacl24-info/2023.unlp-1.3.pdf