SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (2023)


Proceedings of the 7th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

Stefania Degaetano-Ortlieb | Anna Kazantseva | Nils Reiter | Stan Szpakowicz

Standard and Non-standard Adverbial Markers: a Diachronic Analysis in Modern Chinese Literature
John Lee | Fangqiong Zhan | Wenxiu Xie | Xiao Han | Chi-yin Chow | Kam-yiu Lam

This paper investigates the use of standard and non-standard adverbial markers in modern Chinese literature. In Chinese, adverbials can be derived from many adjectives, adverbs and verbs with the suffix “de”. The suffix has a standard and a non-standard written form, both of which are frequently used. Contrastive research on these two competing forms has mostly been qualitative or limited to small text samples. In this first large-scale quantitative study, we present statistics on 346 adverbial types from an 8-million-character text corpus drawn from Chinese literature in the 20th century. We present a semantic analysis of the verbs modified by adverbs with standard and non-standard markers, and a chronological analysis of marker choice among six prominent modern Chinese authors. We show that the non-standard form is more frequently used when the adverbial modifies an emotion verb. Further, we demonstrate that marker choice is correlated with text genre and register, as well as the writing style of the author.

GPoeT: a Language Model Trained for Rhyme Generation on Synthetic Data
Andrei Popescu-Belis | Àlex R. Atrio | Bastien Bernath | Etienne Boisson | Teo Ferrari | Xavier Theimer-Lienhard | Giorgos Vernikos

Poem generation with language models requires the modeling of rhyming patterns. We propose a novel solution for learning to rhyme, based on synthetic data generated with a rule-based rhyming algorithm. The algorithm and an evaluation metric use a phonetic dictionary and the definitions of perfect and assonant rhymes. We fine-tune a GPT-2 English model with 124M parameters on 142 MB of natural poems and find that this model generates consecutive rhymes infrequently (11%). We then fine-tune the model on 6 MB of synthetic quatrains with consecutive rhymes (AABB) and obtain nearly 60% of rhyming lines in samples generated by the model. Alternating rhymes (ABAB) are more difficult to model because of longer-range dependencies, but they are still learnable from synthetic data, reaching 45% of rhyming lines in generated samples.
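
As an illustration of the kind of dictionary-based rhyme check this abstract describes, the following minimal sketch classifies a word pair as a perfect rhyme, an assonant rhyme, or neither. The tiny pronunciation table `PRON`, the ARPAbet-style notation, the function names, and the simplified rhyme definitions (identical sounds, or identical vowels only, from the last stressed vowel onward) are assumptions made for this example and are not taken from the paper.

```python
# Hypothetical mini pronunciation dictionary (ARPAbet-style phonemes).
# A trailing digit marks a vowel and its stress: 1 primary, 2 secondary, 0 none.
PRON = {
    "night": ["N", "AY1", "T"],
    "light": ["L", "AY1", "T"],
    "time": ["T", "AY1", "M"],
    "table": ["T", "EY1", "B", "AH0", "L"],
}


def is_vowel(phoneme):
    """Vowels are the phonemes that carry a stress digit."""
    return phoneme[-1].isdigit()


def rhyme_part(phonemes):
    """Phonemes from the last stressed vowel to the end, stress digits removed."""
    start = 0  # fall back to the whole word if no stressed vowel is found
    for i in range(len(phonemes) - 1, -1, -1):
        if phonemes[i][-1] in "12":
            start = i
            break
    return [(p.rstrip("012"), is_vowel(p)) for p in phonemes[start:]]


def rhyme_type(word_a, word_b):
    """Return "perfect", "assonant", or None for two dictionary words."""
    part_a = rhyme_part(PRON[word_a])
    part_b = rhyme_part(PRON[word_b])
    if part_a == part_b:
        return "perfect"  # same vowels and consonants after the stress
    vowels_a = [p for p, vowel in part_a if vowel]
    vowels_b = [p for p, vowel in part_b if vowel]
    if vowels_a == vowels_b:
        return "assonant"  # same vowels, different consonants
    return None


print(rhyme_type("night", "light"))
print(rhyme_type("night", "time"))
print(rhyme_type("table", "night"))
```

A check of this kind could serve both to filter synthetic quatrains and to score generated ones, although a real system would need a full phonetic dictionary and a policy for out-of-vocabulary words.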

Quote Detection: A New Task and Dataset for NLP
Selma Tekir | Aybüke Güzel | Samet Tenekeci | Bekir Haman

Quotes are universally appealing. Humans recognize good quotes and save them for later reference. However, recognizing them may pose a challenge for machines. In this work, we build a new corpus of quotes and propose a new task, quote detection, as a type of span detection. We retrieve the quote set from Goodreads and collect the spans through a custom search on the Gutenberg Book Corpus. We measure unique vocabulary usage by a state-of-the-art language model and perform comparative statistical analysis against the Cornell Movie-Quotes Corpus. Furthermore, we run two types of baselines for quote detection: Conditional random field (CRF) and summarization with pointer-generator networks and Bidirectional and Auto-Regressive Transformers (BART). The results show that the neural sequence-to-sequence models perform substantially better than CRF. From the viewpoint of neural extractive summarization, quote detection seems easier than news summarization. Moreover, model fine-tuning on our corpus and the Cornell Movie-Quotes Corpus introduces incremental performance boosts.

Improving Long-Text Authorship Verification via Model Selection and Data Tuning
Trang Nguyen | Charlie Dagli | Kenneth Alperin | Courtland Vandam | Elliot Singer

Authorship verification is used to link texts written by the same author without needing a model per author, making it useful for deanonymizing users spreading text with malicious intent. In this work, we evaluated our Cross-Encoder system with four Transformers using differently tuned variants of fanfiction data and found that our BigBird pipeline outperformed Longformer, RoBERTa, and ELECTRA and performed competitively against the official top-ranked system from the PAN evaluation. We also examined the effect of authors and fandoms not seen in training on model performance. Through this, we found that fandom has the greatest influence on true trials, and that a training dataset balanced in terms of class and fandom performed the most consistently.

Fractality of informativity in 300 years of English scientific writing
Yuri Bizzoni | Stefania Degaetano-Ortlieb

Scientific writing is assumed to have become more informationally dense over time (Halliday, 1988; Biber and Gray, 2016). By means of fractal analysis, we study whether over time the degree of informativity has become more persistent with predictable patterns of gradual changes between high vs. low informational content, indicating a trend towards an optimal code for scientific communication.

Direct Speech Quote Attribution for Dutch Literature
Andreas Van Cranenburgh | Frank Van Den Berg

We present a dataset and system for quote attribution in Dutch literature. The system is implemented as a neural module in an existing NLP pipeline for Dutch literature (dutchcoref; van Cranenburgh, 2019). Our contributions are as follows. First, we provide guidelines for Dutch quote attribution and annotate 3,056 quotes in fragments of 42 Dutch literary novels, both contemporary and classic. Second, we present three neural quote attribution classifiers, optimizing for precision, recall, and F1. Third, we perform an evaluation and analysis of quote attribution performance, showing that in particular, quotes with an implicit speaker are challenging, and that such quotes are prevalent in contemporary fiction (57%, compared to 32% for classic novels). On the task of quote attribution, we achieve an improvement of 8.0% F1 points on contemporary fiction and 1.9% F1 points on classic novels. Code, data, and models are available at https://github.com/anonymized/repository.

Great Bibliographies as a Source of Data for the Humanities – NLP in the Analysis of Gender of Book Authors in German Countries and in Poland (1801-2021)
Adam Pawłowski | Tomasz Walkowiak

The subject of this article is the application of NLP and text-mining methods to the analysis of two large bibliographies: a Polish one, based on the catalogs of the National Library in Warsaw, and a German one, created by the Deutsche Nationalbibliothek. The data in both collections are stored in MARC 21 format, allowing the selection of relevant fields that are used for further processing (basically author, title, and date). After filtering out non-relevant or incomplete items, the Polish corpus comprises 1.4 million records and the German corpus 7.5 million records. The time span of both bibliographies extends from 1801 to 2021. The aim of the study is to compare the gender distribution of book authors in Polish and German databases over more than two centuries. The proportions of male and female authors since 1801 were calculated automatically, and NLP methods such as document vector embedding based on deep BERT networks were used to extract topics from titles. The gender of the Polish authors was recognized based on the morphology of the first names, and that of the German authors based on a predefined list. The study found that the proportion of female authors has been steadily increasing both in Poland and in German countries (currently around 43%). However, the topics of women’s and men’s writings have remained consistently different since 1801.

Emotion Recognition based on Psychological Components in Guided Narratives for Emotion Regulation
Gustave Cortal | Alain Finkel | Patrick Paroubek | Lina Ye

Emotion regulation is a crucial element in dealing with emotional events and has positive effects on mental health. This paper aims to provide a more comprehensive understanding of emotional events by introducing a new French corpus of emotional narratives collected using a questionnaire for emotion regulation. We follow the theoretical framework of the Component Process Model which considers emotions as dynamic processes composed of four interrelated components (behavior, feeling, thinking and territory). Each narrative is related to a discrete emotion and is structured based on all emotion components by the writers. We study the interaction of components and their impact on emotion classification with machine learning methods and pre-trained language models. Our results show that each component improves prediction performance, and that the best results are achieved by jointly considering all components. Our results also show the effectiveness of pre-trained language models in predicting discrete emotion from certain components, which reveal differences in how emotion components are expressed.

Linking the Neulateinische Wortliste to the LiLa Knowledge Base of Interoperable Resources for Latin
Federica Iurescia | Eleonora Litta | Marco Passarotti | Matteo Pellegrini | Giovanni Moretti | Paolo Ruffolo

This paper describes the process of interlinking a lexical resource consisting of a list of more than 20,000 Neo-Latin words with other resources for Latin. The resources are made interoperable thanks to their linking to the LiLa Knowledge Base, which applies Linguistic Linked Open Data practices and data categories to describe and publish on the Web both textual and lexical resources for the Latin language.

What do Humor Classifiers Learn? An Attempt to Explain Humor Recognition Models
Marcio Lima Inácio | Gabriela Wick-Pedro | Hugo Goncalo Oliveira

Towards computational systems capable of dealing with complex and general linguistic phenomena, it is essential to understand figurative language, of which verbal humor is an instance. This paper reports state-of-the-art results for Humor Recognition in Portuguese, specifically, an F1-score of 99.64% with a BERT-based classifier. However, following the surprisingly high performance in such a challenging task, we further analyzed what was actually learned by the classifiers. Our main conclusions were that classifiers based on content features achieve the best performance, but rely mostly on stylistic aspects of the text, not necessarily related to humor, such as punctuation and question words. On the other hand, for humor-related features, we identified some important aspects, such as the presence of named entities, ambiguity and incongruity.

Constructing a Credible Estimation for Overreporting of Climate Adaptation Funds in the Creditor Reporting System
Janos Borst | Thomas Wencker | Andreas Niekler

Development funds are essential to finance climate change adaptation and are thus an important part of international climate policy. However, the absence of a common reporting practice makes it difficult to assess the amount and distribution of such funds. Research has questioned the credibility of reported figures, indicating that adaptation financing is in fact lower than published figures suggest. Projects claiming a greater relevance to climate change adaptation than they target are referred to as “overreported”. To estimate realistic rates of overreporting in large data sets over time, we propose an approach based on state-of-the-art text classification. To date, assessments of credibility have relied on small, manually evaluated samples. We use such a sample data set to train a classifier with an accuracy of 89.81%±0.83% (tenfold cross-validation) and extrapolate to larger data sets to identify overreporting. Additionally, we propose a method that incorporates evidence of smaller, higher-quality data to correct predicted rates using Bayes’ theorem. This enables a comparison of different annotation schemes to estimate the degree of overreporting in climate change adaptation. Our results support findings that indicate extensive overreporting of 32.03% with a credible interval of [19.81%; 48.34%].
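
The abstract does not spell out how predicted rates are corrected, so the following is only a sketch of one simple possibility, not the authors' method. It estimates, from a small hand-labelled sample, the posterior probability that a project is truly overreported given each classifier decision, and then combines these with the classifier's positive rate on a large data set via the law of total probability. All counts, rates, and function names are hypothetical.

```python
def posteriors_from_sample(tp, fp, fn, tn):
    """Estimate P(true | predicted positive) and P(true | predicted negative).

    The four counts come from a small, manually labelled sample
    (hypothetical here) on which the classifier was also run.
    """
    p_true_given_pos = tp / (tp + fp)  # precision
    p_true_given_neg = fn / (fn + tn)  # false omission rate
    return p_true_given_pos, p_true_given_neg


def corrected_rate(predicted_positive_rate, p_true_given_pos, p_true_given_neg):
    """Law of total probability over the two possible classifier decisions."""
    q = predicted_positive_rate
    return p_true_given_pos * q + p_true_given_neg * (1.0 - q)


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
p_pos, p_neg = posteriors_from_sample(tp=40, fp=10, fn=5, tn=45)
rate = corrected_rate(0.30, p_pos, p_neg)
print(round(rate, 4))
```

This point estimate carries no uncertainty; producing a credible interval like the one reported in the abstract would require a full Bayesian treatment of the sample counts.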

“Who is the Madonna of Italian-American Literature?”: Target Entity Extraction and Analysis of Vossian Antonomasia
Michel Schwab | Robert Jäschke | Frank Fischer

In this paper, we present approaches for the automated extraction and disambiguation of a part of the stylistic device Vossian Antonomasia (VA), namely the target entity that is described by the expression. We model the problem as a coreference resolution task and a question answering task and also combine both tasks. To tackle these tasks, we utilize state-of-the-art models in these areas. In addition, we visualize the connection between the source and target entities of VA in a web demo to get a deeper understanding of the interaction of entities used in VA expressions.

Detecting intersectionality in NER models: A data-driven approach
Ida Marie S. Lassen | Mina Almasi | Kenneth Enevoldsen | Ross Deans Kristensen-McLachlan

The presence of bias is a pressing concern for both engineers and users of language technology. What is less clear is how exactly bias can be measured, so as to rank models relative to the biases they display. Using an innovative experimental method involving data augmentation, we measure the effect of intersectional biases in Danish models used for Named Entity Recognition (NER). We quantify differences in representational biases, understood as a systematic difference in error or what is called error disparity. Our analysis includes both gender and ethnicity to illustrate the effect of multiple dimensions of bias, as well as experiments which look to move beyond a narrowly binary analysis of gender. We show that all contemporary Danish NER models perform systematically worse on non-binary and minority ethnic names, while not showing significant differences for typically Danish names. Our data augmentation technique can be applied to other languages to test for biases which might be relevant for researchers applying NER models to the study of cultural heritage data.

OdyCy – A general-purpose NLP pipeline for Ancient Greek
Jan Kostkan | Márton Kardos | Jacob Palle Bliddal Mortensen | Kristoffer Laigaard Nielbo

This paper presents a general-purpose NLP pipeline that achieves state-of-the-art performance on the Ancient Greek Perseus UD Treebank for several tasks (POS Tagging, Morphological Analysis and Dependency Parsing), and close to state-of-the-art performance on the Proiel UD Treebank. Our aim is to provide a reproducible, open source language processing pipeline for Ancient Greek, capable of handling input texts of varying quality. We measure the performance of our model against other comparable tools and then evaluate lemmatization errors.

Scent Mining: Extracting Olfactory Events, Smell Sources and Qualities
Stefano Menini | Teresa Paccosi | Serra Sinem Tekiroğlu | Sara Tonelli

Olfaction is a rather understudied sense compared to the other senses. In NLP, however, there have been recent attempts to develop taxonomies and benchmarks specifically designed to capture smell-related information. In this work, we further extend this research line by presenting a supervised system for olfactory information extraction in English. We cast this problem as a token classification task and build a system that identifies smell words, smell sources and qualities. The classifier is then applied to a set of English historical corpora, covering different domains and written in a time period between the 15th and the 20th century. A qualitative analysis of the extracted data shows that they can be used to infer interesting information about smelly items such as tea and tobacco from a diachronic perspective, supporting historical investigation with corpus-based evidence.

Exploring Social Sciences Archives with Explainable Document Linkage through Question Generation
Elie Antoine | Hyun Jung Kang | Ismaël Rousseau | Ghislaine Azémard | Frederic Bechet | Geraldine Damnati

This paper proposes a new approach for exploring digitized humanities and social sciences collections based on explainable links built from questions. Our experiments show the quality of our automatically generated questions and their relevance in a local context as well as the originality of the links produced by embeddings based on these questions. Analyses have also been performed to understand the types of questions generated on our corpus, and the related uses that can enrich the exploration. We also discuss the relationships among co-references, the generated questions, and the answers extracted from the text, which open a path for future improvements in how our system resolves co-references.

Wartime Media Monitor (WarMM-2022): A Study of Information Manipulation on Russian Social Media during the Russia-Ukraine War
Maxim Alyukov | Maria Kunilovskaya | Andrei Semenov

This study relies on natural language processing to explore the nature of online communication in Russia during the war on Ukraine in 2022. The analysis of a large corpus of publications in traditional media and on social media identifies massive state interventions aimed at manipulating public opinion. The study relies on expertise in media studies and political science to trace the major themes and strategies of the propagandist narratives on three major Russian social media platforms over several months as well as their perception by the users. Distributions of several keyworded pro-war and anti-war topics are examined to reveal the cross-platform specificity of social media audiences. We release WarMM-2022, a corpus of 1.7M posts. This corpus includes publications related to the Russia-Ukraine war, which appeared in Russian mass media and on social networks between February and September 2022. The corpus can be useful for the development of NLP approaches to propaganda detection and subsequent studies of propaganda campaigns in social sciences in addition to traditional methods, such as content analysis, focus groups, surveys, and experiments.

Towards a More In-Depth Detection of Political Framing
Qi Yu

In social sciences, recent years have witnessed a growing interest in applying NLP approaches to automatically detect framing in political discourse. However, most NLP studies to date focus heavily on framing effects arising from topic coverage, whereas framing effects arising from the subtle use of linguistic devices remain understudied. In a collaboration with political science researchers, we intend to investigate framing strategies in German newspaper articles on the “European Refugee Crisis”. With the goal of a more in-depth framing analysis, we not only incorporate lexical cues for shallow topic-related framing, but also propose and operationalize a variety of framing-relevant semantic and pragmatic devices, which are theoretically derived from linguistics and political science research. We demonstrate the influential role of these linguistic devices with a large-scale quantitative analysis, bringing novel insights into the linguistic properties of framing.

Named Entity Annotation Projection Applied to Classical Languages
Tariq Yousef | Chiara Palladino | Gerhard Heyer | Stefan Jänicke

In this study, we demonstrate how to apply cross-lingual annotation projection to transfer named-entity annotations to classical languages for which limited or no resources and annotated texts are available, aiming to enrich their NER training datasets and train a model to perform NER tagging. Our method uses sentence-level aligned parallel corpora of ancient texts and their translations in a modern language for which high-quality off-the-shelf NER systems are available. We automatically annotate the text of the modern language and employ a state-of-the-art neural word alignment system to find translation equivalents. Finally, we transfer the annotations to the corresponding tokens in the ancient texts using a direct projection heuristic. We applied our method to ancient Greek, Latin, and Arabic using the Bible with the English translation as a parallel corpus. We used the resulting annotations to enhance the performance of an existing NER model for ancient Greek.
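
The projection step described in this abstract can be pictured with a minimal sketch, under the assumption that word alignments are given as (source index, target index) pairs and that entity tags are plain labels without a BIO scheme. The function name, the tie-breaking rule (the first aligned tag wins), and the example sentence are hypothetical and are not taken from the paper.

```python
def project_entities(source_tags, alignments, target_length):
    """Copy entity tags from source tokens to their aligned target tokens.

    source_tags   -- one tag per source token, "O" meaning "no entity"
    alignments    -- iterable of (source_index, target_index) pairs
    target_length -- number of tokens in the target sentence
    """
    target_tags = ["O"] * target_length
    for src_idx, tgt_idx in alignments:
        tag = source_tags[src_idx]
        # Assumed tie-break: keep the first entity tag a target token receives.
        if tag != "O" and target_tags[tgt_idx] == "O":
            target_tags[tgt_idx] = tag
    return target_tags


# Hypothetical example: an English sentence tagged by an off-the-shelf
# NER system, aligned to a Latin rendering of the same sentence.
english = ["Paul", "went", "to", "Rome"]
english_tags = ["PER", "O", "O", "LOC"]
latin = ["Paulus", "Romam", "iit"]
alignments = [(0, 0), (1, 2), (3, 1)]

latin_tags = project_entities(english_tags, alignments, len(latin))
print(list(zip(latin, latin_tags)))
```

Unaligned target tokens simply stay untagged in this sketch, which is one reason projected annotations tend to be noisier than manual ones.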