János Csirik


Szeged Corpus 2.5: Morphological Modifications in a Manually POS-tagged Hungarian Corpus
Veronika Vincze | Viktor Varga | Katalin Ilona Simkó | János Zsibrita | Ágoston Nagy | Richárd Farkas | János Csirik
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The Szeged Corpus is the largest manually annotated database containing the possible morphological analyses and lemmas for each word form. In this work, we present its latest version, Szeged Corpus 2.5, in which the new harmonized morphological coding system of Hungarian has been employed and the majority of misspelled words have been corrected and tagged with the proper morphological code. New morphological codes are introduced for participles, causative / modal / frequentative verbs, adverbial pronouns and punctuation marks; moreover, the distinction between common and proper nouns is eliminated. We also report some statistical data on the frequency of the new morphological codes. The new version of the corpus made it possible to train magyarlanc, a data-driven POS-tagger of Hungarian, on a dataset with the new harmonized codes. According to the results, magyarlanc achieves state-of-the-art accuracy on version 2.5 as well.


Proceedings of the Fourteenth Conference on Computational Natural Language Learning – Shared Task
Richárd Farkas | Veronika Vincze | György Szarvas | György Móra | János Csirik
Proceedings of the Fourteenth Conference on Computational Natural Language Learning – Shared Task

The CoNLL-2010 Shared Task: Learning to Detect Hedges and their Scope in Natural Language Text
Richárd Farkas | Veronika Vincze | György Móra | János Csirik | György Szarvas
Proceedings of the Fourteenth Conference on Computational Natural Language Learning – Shared Task

Hungarian Corpus of Light Verb Constructions
Veronika Vincze | János Csirik
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

Hungarian Dependency Treebank
Veronika Vincze | Dóra Szauter | Attila Almási | György Móra | Zoltán Alexin | János Csirik
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Herein, we present the process of developing the first Hungarian Dependency TreeBank. First, we briefly refer to the dependency grammars we considered important in the development of our treebank. Second, we mention existing dependency corpora for other languages. Third, we present the steps of converting the Szeged Treebank into dependency-tree format: from the originally phrase-structured treebank, we produced dependency trees by automatic conversion, then checked and corrected them, thereby creating the first manually annotated dependency corpus for Hungarian. We also go into detail about the two major sets of problems, i.e. coordination, and predicative nouns and adjectives. Fourth, we give statistics on the treebank: by now, we have completed the annotation of business news, newspaper articles, legal texts and texts in informatics, and we are planning to convert the entire corpus into dependency-tree format. Finally, we give some hints on the applicability of the system: the present database may be utilized, among others, in information extraction and machine translation.


Hungarian Word-Sense Disambiguated Corpus
Veronika Vincze | György Szarvas | Attila Almási | Dóra Szauter | Róbert Ormándi | Richárd Farkas | Csaba Hatvani | János Csirik
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

To create the first Hungarian WSD corpus, 39 suitable word forms were selected for the purpose of word sense disambiguation. Among others, the selection criteria required the given word form to be frequent in Hungarian language usage and to have more than one sense considered frequent in usage. The HNC and its Heti Világgazdaság subcorpus provided the basis for corpus text selection. This way, each sample has a relevant context (the whole article), and information on the lemma, POS-tagging and automatic tokenization is also available. When planning the corpus, 300-500 samples of each word form were to be annotated. This size makes it possible to compare the subcorpora prepared for the individual word forms with data available for other languages. However, the finalized database also contains unannotated samples and samples with single annotation, which were annotated by only one of the linguists. The corpus follows the format of the ACL's SensEval/SemEval WSD tasks. The first version of the corpus was developed within the scope of the project titled The construction of Hungarian WordNet Ontology and its application in Information Extraction Systems (Hatvani et al., 2007). The corpus is available for research and educational purposes and can be downloaded free of charge.

The BioScope corpus: annotation for negation, uncertainty and their scope in biomedical texts
György Szarvas | Veronika Vincze | Richárd Farkas | János Csirik
Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing


A highly accurate Named Entity corpus for Hungarian
György Szarvas | Richárd Farkas | László Felföldi | András Kocsor | János Csirik
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC'06)

This paper introduces a highly accurate Named Entity (NE) corpus for Hungarian that is publicly available for research purposes, along with its main properties. The results of experiments that apply various Machine Learning models and classifier combination schemes are also presented to serve as a benchmark for further research based on the corpus. The data is a segment of the Szeged Corpus (Csendes et al., 2004), consisting of short business news articles collected from MTI (Hungarian News Agency, www.mti.hu). The annotation procedure was carried out with special attention to annotation accuracy. The corpus went through a parallel annotation phase done by two annotators, resulting in a tagging with an inter-annotator agreement rate of 99.89%. Controversial taggings were collected and discussed by the two annotators and a linguist with several years of experience in corpus annotation. These examples were tagged following the decision they made together, and finally all entities that had suspicious or dubious annotations were collected and checked for consistency. We consider the result of this correcting process to be virtually free of errors. Our best performing Named Entity Recognizer (NER) model attained an F measure of 92.86% on the corpus.


The Szeged Corpus. A POS Tagged and Syntactically Annotated Hungarian Natural Language Corpus
Dóra Csendes | János Csirik | Tibor Gyimóthy
Proceedings of the 5th International Workshop on Linguistically Interpreted Corpora


Annotated Hungarian National Corpus
Zoltán Alexin | János Csirik | Tibor Gyimóthy | Károly Bibok | Csaba Hatvani | Gábor Prószéky | László Tihanyi
10th Conference of the European Chapter of the Association for Computational Linguistics