Veronika Vincze

2024

Predictive and Distinctive Linguistic Features in Schizophrenia-Bipolar Spectrum Disorders
Martina Katalin Szabó | Veronika Vincze | Bernadett Dam | Csenge Guba | Anita Bagi | István Szendi
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

In this study, we analyze spontaneous speech transcripts from Hungarian patients with schizophrenia (SZ), schizoaffective disorder (SAD), and bipolar disorder (BD). Our goal is to identify distinctive linguistic features in these patient groups and controls. To our knowledge, no prior study has systematically examined the linguistic features of these disorders or explored their use in distinguishing between these patient groups. We collected recordings from 77 participants during three directed spontaneous speech tasks in a clinical setting, resulting in 458 texts. Our research group manually transcribed the recordings. We processed the written corpus texts using Natural Language Processing methods and tools. The final corpus consists of 179,515 tokens, excluding punctuation. Using this data, we analyze different linguistic features’ predictive power by computing and comparing their frequency distributions. We then attempt to automatically differentiate between patient groups and controls using our extensive set of linguistic features, employing the random forest algorithm in these experiments. Our results indicate that applying machine learning techniques based on distinctive features can effectively distinguish the SZ, SAD, and BD groups and controls, surpassing baseline results.

2023

We present version 1.3 of the PARSEME multilingual corpus annotated with verbal multiword expressions. Since the previous version, new languages have joined the undertaking of creating such a resource, some of the already existing corpora have been enriched with new annotated texts, while others have been enhanced in various ways. The PARSEME multilingual corpus now represents 26 languages. All monolingual corpora therein use the Universal Dependencies v.2 tagset. They are (re-)split observing the PARSEME v.1.2 standard, which puts emphasis on unseen VMWEs. With the current iteration, the corpus release process has been detached from shared tasks; instead, a process for continuous improvement and systematic releases has been introduced.

2022

In this article, we seek to automatically identify Hungarian patients suffering from mild cognitive impairment (MCI) or mild Alzheimer disease (mAD) based on their speech transcripts, focusing only on linguistic features. In addition to the features examined in our earlier study, we introduce syntactic, semantic, and pragmatic features of spontaneous speech that might affect the detection of dementia. In order to ascertain the most useful features for distinguishing healthy controls, MCI patients, and mAD patients, we carry out a statistical analysis of the data and investigate the significance level of the extracted features among various speaker group pairs and for various speaking tasks. In the second part of the article, we use this rich feature set as a basis for an effective discrimination among the three speaker groups. In our machine learning experiments, we analyze the efficacy of each feature group separately. Our model that uses all the features achieves competitive scores, either with or without demographic information (3-class accuracy values: 68%–70%, 2-class accuracy values: 77.3%–80%). We also analyze how different data recording scenarios affect linguistic features and how they can be productively used when distinguishing MCI patients from healthy controls.

2020

Pártélet: A Hungarian Corpus of Propaganda Texts from the Hungarian Socialist Era
Zoltán Kmetty | Veronika Vincze | Dorottya Demszky | Orsolya Ring | Balázs Nagy | Martina Katalin Szabó
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper, we present Pártélet, a digitized Hungarian corpus of Communist propaganda texts. Pártélet was the official journal of the governing party during Hungarian socialism from 1956 to 1989; hence, it represents the direct political agitation and propaganda of the dictatorial system in question. The paper has a dual purpose: first, to present a general review of the corpus compilation process and the basic statistical data of the corpus, and second, to demonstrate through two case studies what the dataset can be used for. We show that our corpus provides a unique opportunity for conducting research on Hungarian propaganda discourse, as well as analyzing changes of this discourse over a 35-year period of time with computer-assisted methods.

Automatic Detection of Hungarian Clickbait and Entertaining Fake News
Veronika Vincze | Martina Katalin Szabó
Proceedings of the 3rd International Workshop on Rumours and Deception in Social Media (RDSM)

Online news does not always come from reliable sources, and it is not always even realistic. The constantly growing amount of online textual data has recently raised the need for detecting deception and bias in texts from different domains. In this paper, we identify different types of unrealistic news (clickbait and fake news written for entertainment purposes) written in Hungarian on the basis of a rich feature set and with the help of machine learning methods. Our tool achieves competitive scores: it is able to classify clickbait, fake news written for entertainment purposes and real news with an accuracy of over 80%. We also highlight that morphological features perform the best in this classification task.

apPILcation: an Android-based Tool for Learning Mansi
Gábor Bobály | Csilla Horváth | Veronika Vincze
Proceedings of the Sixth International Workshop on Computational Linguistics of Uralic Languages

2018

SzegedKoref: A Hungarian Coreference Corpus
Veronika Vincze | Klára Hegedűs | Alex Sliz-Nagy | Richárd Farkas
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

This paper describes the PARSEME Shared Task 1.1 on automatic identification of verbal multiword expressions. We present the annotation methodology, focusing on changes from last year’s shared task. Novel aspects include enhanced annotation guidelines, additional annotated data for most languages, corpora for some new languages, and new evaluation settings. Corpora were created for 20 languages, which are also briefly discussed. We report organizational principles behind the shared task and the evaluation metrics employed for ranking. The 17 participating systems, their methods and obtained results are also presented and analysed.

2017

Universal Dependencies and Morphology for Hungarian - and on the Price of Universality
Veronika Vincze | Katalin Simkó | Zsolt Szántó | Richárd Farkas
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

In this paper, we present how the principles of universal dependencies and morphology have been adapted to Hungarian. We report the most challenging grammatical phenomena and our solutions to those. On the basis of the adapted guidelines, we have converted and manually corrected 1,800 sentences from the Szeged Treebank to universal dependency format. We also introduce experiments on this manually annotated corpus for evaluating automatic conversion and the added value of language-specific, i.e. non-universal, annotations. Our results reveal that converting to universal dependencies is not necessarily trivial; moreover, using language-specific morphological features may have an impact on overall performance.

Language technology resources and tools for Mansi: an overview
Csilla Horváth | Norbert Szilágyi | Veronika Vincze | Ágoston Nagy
Proceedings of the Third Workshop on Computational Linguistics for Uralic Languages

Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)
Stella Markantonatou | Carlos Ramisch | Agata Savary | Veronika Vincze
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)

Multiword expressions (MWEs) are known as a “pain in the neck” for NLP due to their idiosyncratic behaviour. While some categories of MWEs have been addressed by many studies, verbal MWEs (VMWEs), such as to take a decision, to break one’s heart or to turn off, have been rarely modelled. This is notably due to their syntactic variability, which hinders treating them as “words with spaces”. We describe an initiative meant to bring about substantial progress in understanding, modelling and processing VMWEs. It is a joint effort, carried out within a European research network, to elaborate universal terminologies and annotation guidelines for 18 languages. Its main outcome is a multilingual 5-million-word annotated corpus which underlies a shared task on automatic identification of VMWEs. This paper presents the corpus annotation methodology and outcome, the shared task organisation and the results of the participating systems.

USzeged: Identifying Verbal Multiword Expressions with POS Tagging and Parsing Techniques
Katalin Ilona Simkó | Viktória Kovács | Veronika Vincze
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)

The paper describes our system submitted for the Workshop on Multiword Expressions’ shared task on automatic identification of verbal multiword expressions. It uses POS tagging and dependency parsing to identify single- and multi-token verbal MWEs in text. Our system is language independent and competed on nine of the eighteen languages. Our paper describes how our system works and gives its error analysis for the languages it was submitted for.

Verb-Particle Constructions in Questions
Veronika Vincze
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)

In this paper, we investigate the behavior of verb-particle constructions in English questions. We present a small dataset that contains questions and verb-particle construction candidates. We demonstrate that there are significant differences in the distribution of WH-words, verbs and prepositions/particles in sentences that contain VPCs and sentences that contain only verb + prepositional phrase combinations both by statistical means and in machine learning experiments. Hence, VPCs and non-VPCs can be effectively separated from each other by using a rich feature set, containing several novel features.

Hungarian Copula Constructions in Dependency Syntax and Parsing
Katalin Ilona Simkó | Veronika Vincze
Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017)

2016

Universal Morphology for Old Hungarian
Eszter Simon | Veronika Vincze
Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

Detecting Uncertainty Cues in Hungarian Social Media Texts
Veronika Vincze
Proceedings of the Workshop on Extra-Propositional Aspects of Meaning in Computational Linguistics (ExProM)

In this paper, we aim to identify uncertainty cues in Hungarian social media texts. We present our machine learning based uncertainty detector, which relies on a rich feature set including lexical, morphological, syntactic, semantic and discourse-based features, and we evaluate our system on a small set of manually annotated social media texts. We also carry out cross-domain and domain adaptation experiments using an annotated corpus of standard Hungarian texts and show that domain differences significantly affect machine learning. Furthermore, we argue that differences among uncertainty cue types may also affect the efficiency of uncertainty detection.

A Hungarian Sentiment Corpus Manually Annotated at Aspect Level
Martina Katalin Szabó | Veronika Vincze | Katalin Ilona Simkó | Viktor Varga | Viktor Hangya
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper we present a Hungarian sentiment corpus manually annotated at aspect level. Our corpus consists of Hungarian opinion texts written about different types of products. The main aim of creating the corpus was to produce an appropriate database providing possibilities for developing text mining software tools. The corpus is a unique Hungarian database: to the best of our knowledge, no digitized Hungarian sentiment corpus that is annotated on the level of fragments and targets has been made so far. In addition, many language elements of the corpus that are relevant from the point of view of sentiment analysis received distinct types of tags in the annotation. In this paper, on the one hand, we present the method of annotation, and we discuss the difficulties concerning the text annotation process. On the other hand, we provide some quantitative and qualitative data on the corpus. We conclude with a description of the applicability of the corpus.

Where Bears Have the Eyes of Currant: Towards a Mansi WordNet
Csilla Horváth | Ágoston Nagy | Norbert Szilágyi | Veronika Vincze
Proceedings of the 8th Global WordNet Conference (GWC)

Here we report the construction of a wordnet for Mansi, an endangered minority language spoken in Russia. We will pay special attention to challenges that we encountered during the building process, among which the most important ones are the low number of native speakers, the lack of thesauri and the bear language. We will discuss our solutions to these issues, which might have some theoretical implications for the methodology of wordnet building in general.

2014

Szeged Corpus 2.5: Morphological Modifications in a Manually POS-tagged Hungarian Corpus
Veronika Vincze | Viktor Varga | Katalin Ilona Simkó | János Zsibrita | Ágoston Nagy | Richárd Farkas | János Csirik
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The Szeged Corpus is the largest manually annotated database containing the possible morphological analyses and lemmas for each word form. In this work, we present its latest version, Szeged Corpus 2.5, in which, on the one hand, the new harmonized morphological coding system of Hungarian has been employed and, on the other hand, the majority of misspelled words have been corrected and tagged with the proper morphological code. New morphological codes are introduced for participles, causative / modal / frequentative verbs, adverbial pronouns and punctuation marks; moreover, the distinction between common and proper nouns is eliminated. We also report some statistical data on the frequency of the new morphological codes. The new version of the corpus made it possible to train magyarlanc, a data-driven POS-tagger of Hungarian, on a dataset with the new harmonized codes. According to the results, magyarlanc is able to achieve a state-of-the-art accuracy score on the 2.5 version as well.

4FX: Light Verb Constructions in a Multilingual Parallel Corpus
Anita Rácz | István Nagy T. | Veronika Vincze
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we describe 4FX, a quadrilingual (English-Spanish-German-Hungarian) parallel corpus annotated for light verb constructions. We present the annotation process, and report statistical data on the frequency of LVCs in each language. We also offer inter-annotator agreement rates and we highlight some interesting facts and tendencies on the basis of comparing multilingual data from the four corpora. According to the frequency of LVC categories and the calculated Kendall's coefficient for the four corpora, we found that Spanish and German are very similar to each other, Hungarian is also similar to both, but English differs from all these three. The qualitative and quantitative data analysis might prove useful in theoretical linguistic research for all four languages. Moreover, the corpus will be an excellent testbed for the development and evaluation of machine learning based methods aiming at extracting or identifying light verb constructions in these four languages.

Automatic Error Detection concerning the Definite and Indefinite Conjugation in the HunLearner Corpus
Veronika Vincze | János Zsibrita | Péter Durst | Martina Katalin Szabó
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we present the results of automatic error detection concerning the definite and indefinite conjugation in the extended version of the HunLearner corpus, the learner corpus of the Hungarian language. We present the most typical structures that trigger definite or indefinite conjugation in Hungarian and we also discuss the most frequent types of errors made by language learners in the corpus texts. We also illustrate the error types with sentences taken from the corpus. Our results highlight grammatical structures that might pose problems for learners of Hungarian, which can be fruitfully applied in the teaching and practicing of such constructions from the language teacher's or learner's point of view. On the other hand, these results may be exploited in extending the functionalities of a grammar checker concerning the definiteness of the verb. Our automatic system was able to achieve perfect recall, i.e. it could find all the mismatches between the type of the object and the conjugation of the verb, which is promising for future studies in this area.

Non-Lexicalized Concepts in Wordnets: A Case Study of English and Hungarian
Veronika Vincze | Attila Almási
Proceedings of the Seventh Global Wordnet Conference

VPCTagger: Detecting Verb-Particle Constructions With Syntax-Based Methods
István Nagy T. | Veronika Vincze
Proceedings of the 10th Workshop on Multiword Expressions (MWE)

Annotating Uncertainty in Hungarian Webtext
Veronika Vincze | Katalin Ilona Simkó | Viktor Varga
Proceedings of LAW VIII - The 8th Linguistic Annotation Workshop

An Empirical Evaluation of Automatic Conversion from Constituency to Dependency in Hungarian
Katalin Ilona Simkó | Veronika Vincze | Zsolt Szántó | Richárd Farkas
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

Uncertainty Detection in Hungarian Texts
Veronika Vincze
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

2013

Identifying English and Hungarian Light Verb Constructions: A Contrastive Approach
Veronika Vincze | István Nagy T. | Richárd Farkas
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

LFG-based Features for Noun Number and Article Grammatical Errors
Gábor Berend | Veronika Vincze | Sina Zarrieß | Richárd Farkas
Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task

magyarlanc: A Tool for Morphological and Dependency Parsing of Hungarian
János Zsibrita | Veronika Vincze | Richárd Farkas
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

Dependency Parsing for Identifying Hungarian Light Verb Constructions
Veronika Vincze | János Zsibrita | István Nagy T.
Proceedings of the Sixth International Joint Conference on Natural Language Processing

Full-coverage Identification of English Light Verb Constructions
István Nagy T. | Veronika Vincze | Richárd Farkas
Proceedings of the Sixth International Joint Conference on Natural Language Processing

Weasels, Hedges and Peacocks: Discourse-level Uncertainty in Wikipedia Articles
Veronika Vincze
Proceedings of the Sixth International Joint Conference on Natural Language Processing

2012

Cross-Genre and Cross-Domain Detection of Semantic Uncertainty
György Szarvas | Veronika Vincze | Richárd Farkas | György Móra | Iryna Gurevych
Computational Linguistics, Volume 38, Issue 2 - June 2012

How to Evaluate Opinionated Keyphrase Extraction?
Gábor Berend | Veronika Vincze
Proceedings of the 3rd Workshop in Computational Approaches to Subjectivity and Sentiment Analysis

Light Verb Constructions in the SzegedParalellFX English–Hungarian Parallel Corpus
Veronika Vincze
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper, we describe the first English-Hungarian parallel corpus annotated for light verb constructions, which contains 14,261 sentence alignment units. Annotation principles and statistical data on the corpus are also provided, and English and Hungarian data are contrasted. On the basis of corpus data, a database containing pairs of English-Hungarian light verb constructions has been created as well. The corpus and the database can contribute to the automatic detection of light verb constructions and it is also shown how they can enhance performance in several fields of NLP (e.g. parsing, information extraction/retrieval and machine translation).

HunOr: A Hungarian—Russian Parallel Corpus
Martina Katalin Szabó | Veronika Vincze | István Nagy T.
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper, we present HunOr, the first multi-domain Hungarian―Russian parallel corpus. Some of the corpus texts have been manually aligned and split into sentences, and named entities have also been annotated in them, while the other parts are automatically aligned at the sentence level and POS-tagged as well. The corpus contains texts from the domains of literature, official language use and science; in addition, we would like to add texts from the news domain to the corpus. In the future, we are planning to carry out a syntactic annotation of the HunOr corpus, which will further enhance the usability of the corpus in various NLP fields such as transfer-based machine translation or cross-lingual information retrieval.

Dependency Parsing of Hungarian: Baseline Results and Challenges
Richárd Farkas | Veronika Vincze | Helmut Schmid
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

2011

Detecting Noun Compounds and Light Verb Constructions: a Contrastive Study
Veronika Vincze | István Nagy T. | Gábor Berend
Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World

Noun Compound and Named Entity Recognition and their Usability in Keyphrase Extraction
István Nagy T. | Gábor Berend | Veronika Vincze
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

Multiword Expressions and Named Entities in the Wiki50 Corpus
Veronika Vincze | István Nagy T. | Gábor Berend
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

Domain-Dependent Identification of Multiword Expressions
István Nagy T. | Veronika Vincze | Gábor Berend
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

Domain-Dependent Detection of Light Verb Constructions
István T. Nagy | Gábor Berend | György Móra | Veronika Vincze
Proceedings of the Second Student Research Workshop associated with RANLP 2011

Inter-domain Opinion Phrase Extraction Based on Feature Augmentation
Gábor Berend | István T. Nagy | György Móra | Veronika Vincze
Proceedings of the Second Student Research Workshop associated with RANLP 2011

2010

Proceedings of the Fourteenth Conference on Computational Natural Language Learning – Shared Task
Richárd Farkas | Veronika Vincze | György Szarvas | György Móra | János Csirik
Proceedings of the Fourteenth Conference on Computational Natural Language Learning – Shared Task

The CoNLL-2010 Shared Task: Learning to Detect Hedges and their Scope in Natural Language Text
Richárd Farkas | Veronika Vincze | György Móra | János Csirik | György Szarvas
Proceedings of the Fourteenth Conference on Computational Natural Language Learning – Shared Task

Speculation and negation annotation in natural language texts: what the case of BioScope might (not) reveal
Veronika Vincze
Proceedings of the Workshop on Negation and Speculation in Natural Language Processing

Hungarian Corpus of Light Verb Constructions
Veronika Vincze | János Csirik
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

Herein, we present the process of developing the first Hungarian Dependency TreeBank. First, short references are made to dependency grammars we considered important in the development of our Treebank. Second, mention is made of existing dependency corpora for other languages. Third, we present the steps of converting the Szeged Treebank into dependency-tree format: from the originally phrase-structured treebank, we produced dependency trees by automatic conversion, then checked and corrected them, thereby creating the first manually annotated dependency corpus for Hungarian. We also go into detail about the two major sets of problems, i.e. coordination and predicative nouns and adjectives. Fourth, we give statistics on the treebank: by now, we have completed the annotation of business news, newspaper articles, legal texts and texts in informatics, and we are planning to convert the entire corpus into dependency tree format. Finally, we give some hints on the applicability of the system: the present database may be utilized ― among others ― in information extraction and machine translation as well.

2008

The BioScope corpus: annotation for negation, uncertainty and their scope in biomedical texts
György Szarvas | Veronika Vincze | Richárd Farkas | János Csirik
Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing

To create the first Hungarian WSD corpus, 39 suitable word form samples were selected for the purpose of word sense disambiguation. Among others, selection criteria required the given word form to be frequent in Hungarian language usage, and to have more than one sense considered frequent in usage. HNC and its Heti Világgazdaság subcorpus provided the basis for corpus text selection. This way, each sample has a relevant context (whole article), and information on the lemma, POS-tagging and automatic tokenization is also available. When planning the corpus, 300-500 samples of each word form were to be annotated. This size makes it possible to compare the subcorpora prepared for the individual word forms with data available for other languages. However, the finalized database also contains unannotated samples and samples with single annotation, which were annotated by only one of the linguists. The corpus follows the format of ACL's SensEval/SemEval WSD tasks. The first version of the corpus was developed within the scope of the project titled The construction Hungarian WordNet Ontology and its application in Information Extraction Systems (Hatvani et al., 2007). The corpus is available for research and educational purposes and can be downloaded free of charge.