Gyu-Ho Shin

2025

Polysemy Interpretation and Transformer Language Models: A Case of Korean Adverbial Postposition -(u)lo
Seongmin Mun | Gyu-Ho Shin
Proceedings of the 31st International Conference on Computational Linguistics

This study examines how Transformer language models utilise lexico-phrasal information to interpret the polysemy of the Korean adverbial postposition -(u)lo. We analysed the attention weights of both a Korean pre-trained BERT model and a fine-tuned version. Results show a general reduction in attention weights following fine-tuning, alongside changes in the lexico-phrasal information used, depending on the specific function of -(u)lo. These findings suggest that, while fine-tuning broadly affects a model’s syntactic sensitivity, it may also alter its capacity to leverage lexico-phrasal features according to the function of the target word.

Second language Korean Universal Dependency treebank v1.2: Focus on Data Augmentation and Annotation Scheme Refinement
Hakyung Sung | Gyu-Ho Shin
Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025)

We expand the second language (L2) Korean Universal Dependencies (UD) treebank with 5,454 manually annotated sentences. The annotation guidelines are also revised to better align with the UD framework. Using this enhanced treebank, we fine-tune three Korean language models—Stanza, spaCy, and Trankit—and evaluate their performance on in-domain and out-of-domain L2-Korean datasets. The results show that fine-tuning significantly improves their performance across various metrics, thus highlighting the importance of using well-tailored L2 datasets for fine-tuning first-language-based, general-purpose language models for the morphosyntactic analysis of L2 data.

2024

Constructing a Dependency Treebank for Second Language Learners of Korean
Hakyung Sung | Gyu-Ho Shin
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We introduce a manually annotated syntactic treebank based on Universal Dependencies, derived from the written data of second language (L2) Korean learners. In developing this new dataset, we critically evaluated previous works and revised the annotation guidelines to better reflect the linguistic properties of Korean and the characteristics of L2 learners. The L2 Korean treebank encompasses 7,530 sentences (66,982 words; 129,333 morphemes) and is publicly available at: https://github.com/NLPxL2Korean/L2KW-corpus.

2023

Towards L2-friendly pipelines for learner corpora: A case of written production by L2-Korean learners
Hakyung Sung | Gyu-Ho Shin
Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)

We introduce the Korean-Learner-Morpheme (KLM) corpus, a manually annotated dataset consisting of 129,784 morphemes from second language (L2) learners of Korean, featuring morpheme tokenization and part-of-speech (POS) tagging. We evaluate the performance of four Korean morphological analyzers in tokenization and POS tagging on the L2-Korean corpus. Results highlight the analyzers’ reduced performance on L2 data, indicating the limitation of advanced deep-learning models when dealing with L2-Korean corpora. We further show that fine-tuning one of the models with the KLM corpus improves its tokenization and POS tagging accuracy on the L2-Korean dataset.

Diversifying language models for lesser-studied languages and language-usage contexts: A case of second language Korean
Hakyung Sung | Gyu-Ho Shin
Findings of the Association for Computational Linguistics: EMNLP 2023

This study investigates the extent to which currently available morpheme parsers/taggers apply to lesser-studied languages and language-usage contexts, with a focus on second language (L2) Korean. We pursue this inquiry by (1) training a neural-network model (pre-trained on first language [L1] Korean data) on varying L2 datasets and (2) measuring its morpheme parsing/POS tagging performance on L2 test sets drawn from both the same sources as the L2 training sets and different ones. Results show that the L2-trained models generally excel in domain-specific tokenization and POS tagging compared to the L1 pre-trained baseline model. Interestingly, increasing the size of the L2 training data does not consistently improve model performance.

