Piroska Lendvai


2024

A Workflow for HTR-Postprocessing, Labeling and Classifying Diachronic and Regional Variation in Pre-Modern Slavic Texts
Piroska Lendvai | Maarten van Gompel | Anna Jouravel | Elena Renje | Uwe Reichel | Achim Rabus | Eckhart Arnold
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We describe ongoing work on developing a workflow for the applied use case of classifying diachronic and regional language variation in Pre-Modern Slavic texts. The data were obtained via handwritten text recognition (HTR) on medieval manuscripts and printings and partly by manual transcription. Our goal is to develop a workflow for such historical language data, covering HTR-postprocessing, annotating and classifying the digitized texts. We test and adapt existing language resources to fit the pipeline with low-barrier tooling, accessible for Humanists with limited experience in research data infrastructures, computational analysis or advanced methods of natural language processing (NLP). The workflow starts by addressing ground truth (GT) data creation for diagnosing and correcting HTR errors via string metrics and data-driven methods. On GT and on HTR data, we subsequently show classification results using transfer learning on sentence-level text snippets. Next, we report on our token-level data labeling efforts. Each step of the workflow is complemented by a description of its current limitations and our corresponding work in progress.

2023

Domain-Adapting BERT for Attributing Manuscript, Century and Region in Pre-Modern Slavic Texts
Piroska Lendvai | Uwe Reichel | Anna Jouravel | Achim Rabus | Elena Renje
Proceedings of the 4th Workshop on Computational Approaches to Historical Language Change

Our study presents a stratified dataset compiled from six different Slavic bodies of text, for cross-linguistic and diachronic analyses of Slavic Pre-Modern language variants. We demonstrate unsupervised domain adaptation and supervised finetuning of BERT on these low-resource, historical Slavic variants, for the purposes of provenance attribution in terms of three downstream tasks: manuscript, century and copying region classification. The data compilation aims to capture diachronic as well as regional language variation and change: the texts were written in the course of roughly a millennium, incorporating language variants from the High Middle Ages to the Early Modern Period, and originate from a variety of geographic regions. Mechanisms of language change in relatively small portions of such data have been inspected, analyzed and typologized by Slavists manually; our contribution aims to investigate the extent to which the BERT transformer architecture and pretrained models can benefit this process. Using these datasets for domain adaptation, we could attribute temporal, geographical and manuscript origin on the level of text snippets with high F-scores. We also conducted a qualitative analysis of the models’ misclassifications.

2022

Finetuning Latin BERT for Word Sense Disambiguation on the Thesaurus Linguae Latinae
Piroska Lendvai | Claudia Wick
Proceedings of the Workshop on Cognitive Aspects of the Lexicon

The Thesaurus Linguae Latinae (TLL) is a comprehensive monolingual dictionary that records contextualized meanings and usages of Latin words in antique sources at an unprecedented scale. We created a new dataset based on a subset of sense representations in the TLL, with which we finetuned the Latin-BERT neural language model (Bamman and Burns, 2020) on a supervised Word Sense Disambiguation task. We observe that the contextualized BERT representations finetuned on TLL data score better than static embeddings used in a bidirectional LSTM classifier on the same dataset, and that our per-lemma BERT models achieve higher and more robust performance than reported by Bamman and Burns (2020) based on data from a bilingual Latin dictionary. We demonstrate the differences in sense organizational principles between these two lexical resources, and report on our dataset construction and improved evaluation methodology.

2020

Detection of Reading Absorption in User-Generated Book Reviews: Resources Creation and Evaluation
Piroska Lendvai | Sándor Darányi | Christian Geng | Moniek Kuijpers | Oier Lopez de Lacalle | Jean-Christophe Mensonides | Simone Rebora | Uwe Reichel
Proceedings of the Twelfth Language Resources and Evaluation Conference

To detect how and when readers are experiencing engagement with a literary work, we bring together empirical literary studies and language technology by focusing on the affective state of absorption. The goal of our resource development is to enable the detection of different levels of reading absorption in millions of user-generated reviews hosted on social reading platforms. We present a corpus of social book reviews in English that we annotated with reading absorption categories. Based on these data, we performed supervised, sentence-level, binary classification of the explicit presence vs. absence of the mental state of absorption. We compared the performances of classical machine learners where features comprised sentence representations obtained from a pretrained embedding model (Universal Sentence Encoder) vs. neural classifiers in which sentence embedding vector representations are adapted or fine-tuned while training for the absorption recognition task. We discuss the challenges in creating the labeled data as well as the possibilities for releasing a benchmark corpus.

2016

Towards a Formal Representation of Components of German Compounds
Thierry Declerck | Piroska Lendvai
Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

Veracity Computing from Lexical Cues and Perceived Certainty Trends
Uwe Reichel | Piroska Lendvai
Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)

We present a data-driven method for determining the veracity of a set of rumorous claims on social media data. Tweets from different sources pertaining to a rumor are processed on three levels: first, factuality values are assigned to each tweet based on four textual cue categories relevant for our journalism use case; these amalgamate speaker support in terms of polarity and commitment in terms of certainty and speculation. Next, the proportions of these lexical cues are utilized as predictors for tweet certainty in a generalized linear regression model. Subsequently, lexical cue proportions, predicted certainty, as well as their time course characteristics are used to compute veracity for each rumor in terms of the identity of the rumor-resolving tweet and its binary resolution value judgment. The system operates without access to extralinguistic resources. Evaluated on the data portion for which hand-labeled examples were available, it achieves .74 F1-score on identifying rumor-resolving tweets and .76 F1-score on predicting if a rumor is resolved as true or false.

Contradiction Detection for Rumorous Claims
Piroska Lendvai | Uwe Reichel
Proceedings of the Workshop on Extra-Propositional Aspects of Meaning in Computational Linguistics (ExProM)

The utilization of social media material in journalistic workflows is increasing, demanding automated methods for the identification of mis- and disinformation. Since textual contradiction across social media posts can be a signal of rumorousness, we seek to model how claims in Twitter posts are being textually contradicted. We identify two different contexts in which contradiction emerges: its broader form can be observed across independently posted tweets and its more specific form in threaded conversations. We define how the two scenarios differ in terms of central elements of argumentation: claims and conversation structure. We design and evaluate models for the two scenarios uniformly as 3-way Recognizing Textual Entailment tasks in order to represent claims and conversation structure implicitly in a generic inference model, while previous studies used explicit or no representation of these properties. To address noisy text, our classifiers use simple similarity features derived from the string and part-of-speech level. Corpus statistics reveal distribution differences for these features in contradictory as opposed to non-contradictory tweet relations, and the classifiers yield state-of-the-art performance.

Monolingual Social Media Datasets for Detecting Contradiction and Entailment
Piroska Lendvai | Isabelle Augenstein | Kalina Bontcheva | Thierry Declerck
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Entailment recognition approaches are useful for application domains such as information extraction, question answering or summarisation, for which evidence from multiple sentences needs to be combined. We report on a new 3-way judgement Recognizing Textual Entailment (RTE) resource that originates in the Social Media domain, and explain our semi-automatic creation method for the special purpose of information verification, which draws on manually established rumourous claims reported during crisis events. From about 500 English tweets related to 70 unique claims we compile and evaluate 5.4k RTE pairs, while we continue automating the workflow to generate similar-sized datasets in other languages.

2015

Processing and Normalizing Hashtags
Thierry Declerck | Piroska Lendvai
Proceedings of the International Conference Recent Advances in Natural Language Processing

Towards the Representation of Hashtags in Linguistic Linked Open Data Format
Thierry Declerck | Piroska Lendvai
Proceedings of the Second Workshop on Natural Language Processing and Linked Open Data

2013

Proceedings of the 7th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities
Piroska Lendvai | Kalliopi Zervanou
Proceedings of the 7th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

2012

Accessing and standardizing Wiktionary lexical entries for the translation of labels in Cultural Heritage taxonomies
Thierry Declerck | Karlheinz Mörth | Piroska Lendvai
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We describe the usefulness of Wiktionary, the freely available web-based lexical resource, in providing multilingual extensions to catalogues that serve content-based indexing of folktales and related narratives. We develop conversion tools between Wiktionary and TEI, using ISO standards (LMF, MAF), to make such resources available to both the Digital Humanities community and the Language Resources community. The converted data can be queried via a web interface, while the tools of the workflow are to be released with an open source license. We report on the actual state and functionality of our tools and analyse some shortcomings of Wiktionary, as well as potential domains of application.

2011

Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities
Kalliopi Zervanou | Piroska Lendvai
Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

2010

Integration of Linguistic Markup into Semantic Models of Folk Narratives: The Fairy Tale Use Case
Piroska Lendvai | Thierry Declerck | Sándor Darányi | Pablo Gervás | Raquel Hervás | Scott Malec | Federico Peinado
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Propp's influential structural analysis of fairy tales created a powerful schema for representing storylines in terms of character functions, which is directly exploitable for computational semantic analysis and procedural generation of stories of this genre. We tackle two resources that draw on the Proppian model (one formalizes it as a semantic markup scheme, the other as an ontology), both of which lack an explicit representation of linguistic phenomena. The need for integrating linguistic information into structured semantic resources is motivated by the emergence of suitable standards that facilitate this, as well as the benefits such joint representation would create for transdisciplinary research across Digital Humanities, Computational Linguistics, and Artificial Intelligence.

Towards a Standardized Linguistic Annotation of the Textual Content of Labels in Knowledge Representation Systems
Thierry Declerck | Piroska Lendvai
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We propose applying standardized linguistic annotation to terms included in labels of knowledge representation schemes (taxonomies or ontologies), hypothesizing that this would help improve ontology-based semantic annotation of texts. We share the view that currently used methods for including lexical and terminological information in such hierarchical networks of concepts are not satisfactory, and thus put forward, as a preliminary step to our annotation goal, a model for modular representation of conceptual, terminological and linguistic information within knowledge representation systems. Our CTL model is based on two recent initiatives that describe the representation of terminologies and lexicons in ontologies: the Terminae method for building terminological and ontological models from text (Aussenac-Gilles et al., 2008), and the LexInfo metamodel for ontology lexica (Buitelaar et al., 2009). CTL goes beyond the mere fusion of the two models and introduces an additional level of representation for the linguistic objects, which are no longer limited to lexical information but cover the full range of linguistic phenomena, including constituency and dependency. We also show that the approach benefits linguistic and semantic analysis of external documents that are often to be linked to semantic resources for enrichment with concepts that are newly extracted or inferred.

2009

Proceedings of the EACL 2009 Workshop on Language Technology and Resources for Cultural Heritage, Social Sciences, Humanities, and Education (LaTeCH – SHELT&R 2009)
Lars Borin | Piroska Lendvai
Proceedings of the EACL 2009 Workshop on Language Technology and Resources for Cultural Heritage, Social Sciences, Humanities, and Education (LaTeCH – SHELT&R 2009)

Comparing Alternative Data-Driven Ontological Vistas of Natural History (short paper)
Marieke van Erp | Piroska Lendvai | Antal van den Bosch
Proceedings of the Eighth International Conference on Computational Semantics

Towards Acquisition of Taxonomic Inference (short paper)
Piroska Lendvai
Proceedings of the Eighth International Conference on Computational Semantics

2008

From Field Notes towards a Knowledge Base
Piroska Lendvai | Steve Hunt
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We describe the process of converting plain text cultural heritage data to elements of a domain-specific knowledge base, using general machine learning techniques. First, digitised expedition field notes are segmented and labelled automatically. In order to obtain perfect records, we create an annotation tool that features selective sampling, allowing domain experts to validate automatically labelled text, which is then stored in a database. Next, the records are enriched with semi-automatically derived secondary metadata. Metadata enable fine-grained querying, the results of which are additionally visualised using maps and photos.

2007

Token-based Chunking of Turn-internal Dialogue Act Sequences
Piroska Lendvai | Jeroen Geertzen
Proceedings of the 8th SIGdial Workshop on Discourse and Dialogue

2003

Learning to Identify Fragmented Words in Spoken Discourse
Piroska Lendvai
Student Research Workshop

Machine Learning for Shallow Interpretation of User Utterances in Spoken Dialogue Systems
Piroska Lendvai | Antal van den Bosch | Emiel Krahmer
Proceedings of the 2003 EACL Workshop on Dialogue Systems: interaction, adaptation and styles of management