Gyu-Ho Shin


2024

pdf
Constructing a Dependency Treebank for Second Language Learners of Korean
Hakyung Sung | Gyu-Ho Shin
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We introduce a manually annotated syntactic treebank based on Universal Dependencies, derived from the written data of second language (L2) Korean learners. In developing this new dataset, we critically evaluated previous works and revised the annotation guidelines to better reflect the linguistic properties of Korean and the characteristics of L2 learners. The L2 Korean treebank encompasses 7,530 sentences (66,982 words; 129,333 morphemes) and is publicly available at: https://github.com/NLPxL2Korean/L2KW-corpus.

2023

pdf
Diversifying language models for lesser-studied languages and language-usage contexts: A case of second language Korean
Hakyung Sung | Gyu-Ho Shin
Findings of the Association for Computational Linguistics: EMNLP 2023

This study investigates the extent to which currently available morpheme parsers/taggers apply to lesser-studied languages and language-usage contexts, with a focus on second language (L2) Korean. We pursue this inquiry by (1) training a neural-network model (pre-trained on first language [L1] Korean data) on varying L2 datasets and (2) measuring its morpheme parsing/POS tagging performance on L2 test sets from both the same and different sources of the L2 train sets. Results show that the L2 trained models generally excel in domain-specific tokenization and POS tagging compared to the L1 pre-trained baseline model. Interestingly, increasing the size of the L2 training data does not lead to improving model performance consistently.

pdf
Towards L2-friendly pipelines for learner corpora: A case of written production by L2-Korean learners
Hakyung Sung | Gyu-Ho Shin
Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)

We introduce the Korean-Learner-Morpheme (KLM) corpus, a manually annotated dataset consisting of 129,784 morphemes from second language (L2) learners of Korean, featuring morpheme tokenization and part-of-speech (POS) tagging. We evaluate the performance of four Korean morphological analyzers in tokenization and POS tagging on the L2- Korean corpus. Results highlight the analyzers’ reduced performance on L2 data, indicating the limitation of advanced deep-learning models when dealing with L2-Korean corpora. We further show that fine-tuning one of the models with the KLM corpus improves its accuracy of tokenization and POS tagging on L2-Korean dataset.