Jungyeul Park

2023

We present a new Universal Morphology dataset for Korean. Korean has previously been underrepresented among the hundreds of diverse world languages covered by morphological paradigm resources. We therefore propose Universal Morphological paradigms for Korean that preserve its distinct characteristics. For our K-UniMorph dataset, we outline each grammatical criterion for the verbal endings in detail, clarify how to extract inflected forms, and demonstrate how we generate the morphological schemata. The dataset adopts the morphological feature schema from CITATION and CITATION for Korean, and we extract inflected verb forms from the Sejong morphologically analyzed corpus, one of the largest annotated corpora for Korean. During data creation, our methodology also includes verifying the correctness of the conversion from the Sejong corpus. Furthermore, we carry out the inflection task using three different Korean word forms: letters, syllables, and morphemes. Finally, we discuss future perspectives on Korean morphological paradigms and the dataset.

2022

In this study, we propose a morpheme-based scheme for Korean dependency parsing and apply it to Universal Dependencies. We present the linguistic rationale that motivates the morpheme-based format and shows why it is necessary, and we develop scripts that automatically convert between the original format used by Universal Dependencies and the proposed morpheme-based format. We then verify the effectiveness of the proposed format for Korean dependency parsing with both statistical and neural models, including UDPipe and Stanza, using our carefully constructed morpheme-based word embeddings for Korean. morphUD improves parsing results on all Korean UD treebanks, and we also present a detailed error analysis.

2019

A New Annotation Scheme for the Sejong Part-of-speech Tagged Corpus
Jungyeul Park | Francis Tyers
Proceedings of the 13th Linguistic Annotation Workshop

In this paper we present a new annotation scheme for the Sejong part-of-speech tagged corpus based on Universal Dependencies style annotation. By using a new annotation scheme, we can produce Sejong-style morphological analysis and part-of-speech tagging results which have been the de facto standard for Korean language processing. We also explore the possibility of doing named-entity recognition and semantic-role labelling for Korean using the new annotation scheme.

Artificial Error Generation with Fluency Filtering
Mengyang Qiu | Jungyeul Park
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

The quantity and quality of training data play a crucial role in grammatical error correction (GEC). However, because obtaining human-annotated GEC data is both time-consuming and expensive, several studies have focused on generating artificial error sentences to augment the training data for GEC, and have shown significantly better performance. The present study explores how fluency filtering affects the quality of artificial errors. By comparing artificial data filtered at different levels of fluency, we find that artificial error sentences with low fluency can greatly facilitate error correction, while high-fluency errors introduce more noise.

Improving Precision of Grammatical Error Correction with a Cheat Sheet
Mengyang Qiu | Xuejiao Chen | Maggie Liu | Krishna Parvathala | Apurva Patil | Jungyeul Park
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

In this paper, we explore two approaches to generating error-focused phrases and examine whether these phrases can lead to better performance in grammatical error correction for the restricted track of the BEA 2019 Shared Task on GEC. Our results show that phrases directly extracted from GEC corpora outperform phrases from a statistical machine translation phrase table by a large margin. Appending error+context phrases to the original GEC corpora yields comparably high precision. We also explore the generation of artificial syntactic error sentences using error+context phrases for the unrestricted track. The additional training data greatly facilitates syntactic error correction (e.g., verb form) and contributes to better overall performance.

2018

Data Anonymization for Requirements Quality Analysis: a Reproducible Automatic Error Detection Task
Juyeon Kang | Jungyeul Park
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Le benchmarking de la reconnaissance d’entités nommées pour le français (Benchmarking for French NER)
Jungyeul Park
Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN

This paper presents a benchmarking task for named entity recognition (NER) for French. We train and evaluate several sequence labeling algorithms, and we improve the NER results with an approach based on semi-supervised learning and reranking. We obtain up to 77.95%, improving the result by more than 34 points over the baseline model.

Une note sur l’analyse du constituant pour le français (A Note on constituent parsing for French)
Jungyeul Park
Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN

This paper presents quantitative and qualitative error analyses of constituent parsing results for French. To this end, we extend the approach of Kummerfeld et al. (2012) to French and present the details of the analysis. We train statistical and neural parsing systems on the French treebank and evaluate the parsing results. The French treebank provides fine-grained phrase labels, and the grammatical characteristics of the treebank affect the parsing errors.

L’optimisation du plongement de mots pour le français : une application de la classification des phrases (Optimization of Word Embeddings for French : an Application of Sentence Classification)
Jungyeul Park
Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN

We propose three new methods for building and optimizing word embeddings for French. We use the results of part-of-speech tagging, multiword expression detection, and lemmatization to build a continuous vector space. For evaluation, we use these vectors in a sentence classification task and compare them with the vectors of the baseline system. We also explore a domain adaptation approach to building vectors. Despite a small vocabulary and a small training corpus, the domain-specific vectors perform better than the out-of-domain vectors.

2017

Building a Better Bitext for Structurally Different Languages through Self-training
Jungyeul Park | Loïc Dugast | Jeen-Pyo Hong | Chang-Uk Shin | Jeong-Won Cha
Proceedings of the First Workshop on Curation and Applications of Parallel and Comparable Corpora

We propose a novel method to bootstrap the construction of parallel corpora for new pairs of structurally different languages. We do so by combining a pivot language with self-training. The pivot language lets us use existing translation models to bootstrap the alignment, and the self-training procedure yields better alignment at both the document and sentence levels. We also propose several evaluation methods for the resulting alignment.

Segmentation Granularity in Dependency Representations for Korean
Jungyeul Park
Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017)

Corpus Selection Approaches for Multilingual Parsing from Raw Text to Universal Dependencies
Ryan Hornby | Clark Taylor | Jungyeul Park
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

This paper describes UALing’s approach to the CoNLL 2017 UD Shared Task, which uses corpus selection techniques to reduce training data size. The methodology is simple: we use similarity measures to select a corpus from the available training data (even from multiple corpora for surprise languages) and use the resulting corpus to complete the parsing task. Training and parsing are done with the baseline UDPipe system (Straka et al., 2016). While our approach reduces the size of the training data significantly, it retains performance within 0.5% of the baseline system. Because of the reduction in training data size, our system runs faster than the naïve, complete-corpus method. Specifically, our system runs in less than 10 minutes, ranking it among the fastest entries for this task. Our system is available at https://github.com/CoNLL-UD-2017/UALING.

2016

Korean Language Resources for Everyone
Jungyeul Park | Jeen-Pyo Hong | Jeong-Won Cha
Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation: Oral Papers

Generating a Linguistic Model for Requirement Quality Analysis
Juyeon Kang | Jungyeul Park
Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation: Posters

2014

Named Entity Corpus Construction using Wikipedia and DBpedia Ontology
Younggyun Hahm | Jungyeul Park | Kyungtae Lim | Youngsik Kim | Dosam Hwang | Key-Sun Choi
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we propose a novel method to automatically build a named entity corpus based on the DBpedia ontology. Most named entity recognition (NER) systems require training data produced by time-consuming and labor-intensive annotation, so work on NER has thus far been limited to certain languages, such as English, that are generally resource-rich. As an alternative, we suggest that the NE corpus generated by our proposed method can be used as training data. Our approach uses Wikipedia as raw text and the DBpedia data set for named entity disambiguation. Our method is language-independent and can easily be applied to the many languages for which Wikipedia and DBpedia are available. Throughout the paper, we demonstrate that our NE corpus is of comparable quality even to a manually annotated NE corpus.

2013

Towards Fully Lexicalized Dependency Parsing for Korean
Jungyeul Park | Daisuke Kawahara | Sadao Kurohashi | Key-Sun Choi
Proceedings of the 13th International Conference on Parsing Technologies (IWPT 2013)

2012

Using the International Standard Language Resource Number: Practical and Technical Aspects
Khalid Choukri | Victoria Arranz | Olivier Hamon | Jungyeul Park
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper describes the International Standard Language Resource Number (ISLRN), a new identification schema for Language Resources where a Language Resource is provided with a unique and universal name using a standardized nomenclature. This will ensure that Language Resources be identified, accessed and disseminated in a unique manner, thus allowing them to be recognized with proper references in all activities concerning Human Language Technologies as well as in all documents and scientific papers. This would allow, for instance, the formal identification of potentially repeated resources across different repositories, the formal referencing of language resources and their correct use when different versions are processed by tools.

Korean Treebank Transformation for Parser Training
DongHyun Choi | Jungyeul Park | Key-Sun Choi
Proceedings of the ACL 2012 Joint Workshop on Statistical Parsing and Semantic Processing of Morphologically Rich Languages

Korean NLP2RDF Resources
YoungGyun Hahm | KyungTae Lim | Jungyeul Park | Yongun Yoon | Key-Sun Choi
Proceedings of the 10th Workshop on Asian Language Resources

We present the TiLT software for SMS correction and evaluate its performance on the DELIC SMS corpus. The evaluation uses the Jaccard distance and the BLEU measure. The presentation of the results is followed by a qualitative analysis of the system and its limitations.

2006

Extraction de grammaires TAG lexicalisées avec traits à partir d’un corpus arboré pour le coréen (Extraction of Lexicalized TAG Grammars with Features from a Treebank for Korean)
Jungyeul Park
Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Posters

We present an implementation of a system that extracts not only a lexicalized grammar (LTAG) but also a feature-based LTAG (FB-LTAG) from a treebank. We report practical experiments in which we extract TAG grammars from the Sejong Treebank for Korean. Above all, the 57 syntactic labels and the morphological analyses in the SJTree corpus allow us to extract syntactic features automatically. In addition, we modify the corpus for the extraction of a lexicalized grammar and convert the lexicalized grammars into tree schemata to address the limited lexical coverage of the extracted lexicalized grammars.

Extraction of Tree Adjoining Grammars from a Treebank for Korean
Jungyeul Park
Proceedings of the COLING/ACL 2006 Student Research Workshop

Extracting Syntactic Features from a Korean Treebank
Jungyeul Park
Proceedings of the Eighth International Workshop on Tree Adjoining Grammar and Related Formalisms