KoFREN: Comprehensive Korean Word Frequency Norms Derived from Large Scale Free Speech Corpora

Jin-seo Kim, Anna Seo Gyeong Choi, Sunghye Cho


Abstract
Word frequencies are integral to linguistic studies, showing strong correlations with speakers’ cognitive abilities and other important linguistic parameters, including the Age of Acquisition (AoA). However, the formulation of credible Korean word frequency norms has been obstructed by the lack of expansive speech data and a reliable part-of-speech (POS) tagger. In this study, we unveil Korean word frequency norms (KoFREN), derived from large-scale spontaneous speech corpora (41 million words) that include a balanced representation of gender and age. We employed a machine learning-powered POS tagger, showcasing accuracy on par with human annotators. Our frequency norms correlate significantly with external studies’ lexical decision time (LDT) and AoA measures. KoFREN also aligns with English counterparts sourced from SUBTLEX_US, an English word frequency measure that has been frequently used in the literature. KoFREN is poised to facilitate research in spontaneous Contemporary Korean and can be utilized in many fields, including clinical studies of Korean patients.
Anthology ID:
2024.lrec-main.866
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
9926–9931
URL:
https://aclanthology.org/2024.lrec-main.866
Cite (ACL):
Jin-seo Kim, Anna Seo Gyeong Choi, and Sunghye Cho. 2024. KoFREN: Comprehensive Korean Word Frequency Norms Derived from Large Scale Free Speech Corpora. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 9926–9931, Torino, Italia. ELRA and ICCL.
Cite (Informal):
KoFREN: Comprehensive Korean Word Frequency Norms Derived from Large Scale Free Speech Corpora (Kim et al., LREC-COLING 2024)
PDF:
https://preview.aclanthology.org/nschneid-patch-4/2024.lrec-main.866.pdf