Katalin Ilona Simkó

Also published as: Katalin Simkó


USzeged: Identifying Verbal Multiword Expressions with POS Tagging and Parsing Techniques
Katalin Ilona Simkó | Viktória Kovács | Veronika Vincze
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)

The paper describes our system submitted to the Workshop on Multiword Expressions’ shared task on automatic identification of verbal multiword expressions. It uses POS tagging and dependency parsing to identify single- and multi-token verbal MWEs in text. Our system is language-independent and competed on nine of the eighteen languages. The paper describes how the system works and presents an error analysis for the languages it was submitted for.

Hungarian Copula Constructions in Dependency Syntax and Parsing
Katalin Ilona Simkó | Veronika Vincze
Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017)

Universal Dependencies and Morphology for Hungarian - and on the Price of Universality
Veronika Vincze | Katalin Simkó | Zsolt Szántó | Richárd Farkas
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

In this paper, we present how the principles of universal dependencies and morphology have been adapted to Hungarian. We report the most challenging grammatical phenomena and our solutions to them. On the basis of the adapted guidelines, we have converted and manually corrected 1,800 sentences from the Szeged Treebank to universal dependency format. We also introduce experiments on this manually annotated corpus for evaluating automatic conversion and the added value of language-specific, i.e. non-universal, annotations. Our results reveal that converting to universal dependencies is not necessarily trivial; moreover, using language-specific morphological features may have an impact on overall performance.


A Hungarian Sentiment Corpus Manually Annotated at Aspect Level
Martina Katalin Szabó | Veronika Vincze | Katalin Ilona Simkó | Viktor Varga | Viktor Hangya
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper we present a Hungarian sentiment corpus manually annotated at aspect level. Our corpus consists of Hungarian opinion texts written about different types of products. The main aim of creating the corpus was to produce an appropriate database for developing text mining software tools. The corpus is a unique Hungarian database: to the best of our knowledge, no digitized Hungarian sentiment corpus annotated at the level of fragments and targets has been made so far. In addition, many language elements of the corpus that are relevant to sentiment analysis received distinct types of tags in the annotation. In this paper, on the one hand, we present the method of annotation and discuss the difficulties of the text annotation process. On the other hand, we provide some quantitative and qualitative data on the corpus. We conclude with a description of the applicability of the corpus.


An Empirical Evaluation of Automatic Conversion from Constituency to Dependency in Hungarian
Katalin Ilona Simkó | Veronika Vincze | Zsolt Szántó | Richárd Farkas
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

Annotating Uncertainty in Hungarian Webtext
Veronika Vincze | Katalin Ilona Simkó | Viktor Varga
Proceedings of LAW VIII - The 8th Linguistic Annotation Workshop

Szeged Corpus 2.5: Morphological Modifications in a Manually POS-tagged Hungarian Corpus
Veronika Vincze | Viktor Varga | Katalin Ilona Simkó | János Zsibrita | Ágoston Nagy | Richárd Farkas | János Csirik
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The Szeged Corpus is the largest manually annotated database containing the possible morphological analyses and lemmas for each word form. In this work, we present its latest version, Szeged Corpus 2.5, in which the new harmonized morphological coding system of Hungarian has been employed and the majority of misspelled words have been corrected and tagged with the proper morphological code. New morphological codes are introduced for participles, causative / modal / frequentative verbs, adverbial pronouns and punctuation marks; moreover, the distinction between common and proper nouns is eliminated. We also report some statistical data on the frequency of the new morphological codes. The new version of the corpus made it possible to train magyarlanc, a data-driven POS-tagger of Hungarian, on a dataset with the new harmonized codes. According to the results, magyarlanc is able to achieve a state-of-the-art accuracy score on the 2.5 version as well.