Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities

Mika Hämäläinen, Emily Öhman, Yuri Bizzoni, So Miyagawa, Khalid Alnajjar (Editors)


Anthology ID:
2025.nlp4dh-1
Month:
May
Year:
2025
Address:
Albuquerque, USA
Venues:
NLP4DH | WS
Publisher:
Association for Computational Linguistics
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.nlp4dh-1/
ISBN:
979-8-89176-234-3
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.nlp4dh-1.pdf

Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities
Mika Hämäläinen | Emily Öhman | Yuri Bizzoni | So Miyagawa | Khalid Alnajjar

A Comparative Analysis of Word Segmentation, Part-of-Speech Tagging, and Named Entity Recognition for Historical Chinese Sources, 1900-1950
Zhao Fang | Liang-Chun Wu | Xuening Kong | Spencer Dean Stewart

This paper compares large language models (LLMs) and traditional natural language processing (NLP) tools for performing word segmentation, part-of-speech (POS) tagging, and named entity recognition (NER) on Chinese texts from 1900 to 1950. Historical Chinese documents pose challenges for text analysis due to their logographic script, the absence of natural word boundaries, and significant linguistic changes. Using a sample dataset from the Shanghai Library Republican Journal corpus, traditional tools such as Jieba and spaCy are compared to LLMs, including GPT-4o, Claude 3.5, and the GLM series. The results show that LLMs outperform traditional methods in all metrics, albeit at considerably higher computational costs, highlighting a trade-off between accuracy and efficiency. Additionally, LLMs better handle genre-specific challenges such as poetry and temporal variations (i.e., pre-1920 versus post-1920 texts), demonstrating that their contextual learning capabilities can advance NLP approaches to historical texts by reducing the need for domain-specific training data.
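
As a point of reference for the comparison above, word segmentation output is typically scored by converting token lists to character spans and computing F1 against a gold segmentation. The sketch below illustrates that scoring step only; it is not the authors' pipeline, and the strings and token splits are invented.

def to_spans(tokens):
    """Convert a token list to a set of (start, end) character spans."""
    spans, pos = set(), 0
    for tok in tokens:
        spans.add((pos, pos + len(tok)))
        pos += len(tok)
    return spans

def segmentation_f1(gold_tokens, pred_tokens):
    """Span-level F1 between a gold and a predicted segmentation."""
    gold, pred = to_spans(gold_tokens), to_spans(pred_tokens)
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Invented example: gold vs. a system's segmentation of the same string.
print(segmentation_f1(["上海", "图书馆"], ["上海", "图", "书馆"]))  # 0.4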

Analyzing register variation in web texts through automatic segmentation
Erik Henriksson | Saara Hellström | Veronika Laippala

This study introduces a novel method for analyzing register variation in web texts through classification-based register segmentation. While traditional text-linguistic register analysis treats web documents as single units, we present a recursive binary segmentation approach that automatically identifies register shifts within web documents without labeled segment data, using a ModernBERT classifier fine-tuned on full web documents. Manual evaluation shows our approach to be reliable, and our experimental results reveal that register segmentation leads to more accurate register classification, helps models learn more distinct register categories, and produces text units with more consistent linguistic characteristics. The approach offers new insights into document-internal register variation in online discourse.
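
As a rough illustration of recursive binary segmentation, the sketch below assumes a register classifier predict(text) -> label (standing in for the paper's fine-tuned ModernBERT) and simplifies split-point selection to the sentence midpoint; the toy keyword classifier exists only to make the example runnable.

def segment(sentences, predict, min_sents=2):
    """Recursively split a text wherever its two halves receive
    different register labels; otherwise keep it as one unit."""
    whole = " ".join(sentences)
    if len(sentences) < 2 * min_sents:
        return [(whole, predict(whole))]
    mid = len(sentences) // 2
    left, right = sentences[:mid], sentences[mid:]
    if predict(" ".join(left)) == predict(" ".join(right)):
        return [(whole, predict(whole))]  # no register shift detected
    return segment(left, predict, min_sents) + segment(right, predict, min_sents)

# Toy stand-in classifier and document:
toy = lambda text: "recipe" if "tablespoon" in text else "news"
doc = ["The council met on Tuesday.", "Votes were counted.", "A budget passed.",
       "Add one tablespoon of oil.", "Stir well.", "Serve warm."]
print(segment(doc, toy))  # two segments with different register labels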

Analyzing Large Language Models’ pastiche ability: a case study on a 20th century Romanian author
Anca Dinu | Andra-Maria Florescu | Liviu Dinu

This study evaluated the ability of several Large Language Models (LLMs) to pastiche the literary style of the 20th-century Romanian author Mateiu Caragiale, by continuing one of his novels left unfinished upon his death. We assembled a database of novels consisting of six texts by Mateiu Caragiale, including his unfinished one, six texts by Radu Albala, including a continuation of Mateiu’s novel, and six LLM-generated novels that try to pastiche it. We compared the LLM-generated texts with the continuation by Radu Albala, using various methods. We automatically evaluated the pastiches by standard metrics such as ROUGE, BLEU, and METEOR. We performed stylometric analysis, clustering, authorship attribution, and a manual analysis. Both the computational and the manual analysis of the pastiches indicated that LLMs are able to produce pastiches of fairly high quality, without matching the professional writer’s performance. The study also showed that ML techniques outperformed the more recent DL ones in both the clustering and the authorship attribution tasks, probably because the dataset consists of only a few archaic literary texts in Romanian. In addition, linguistically informed features were shown to be competitive compared to automatically extracted features.

RAG-Enhanced Neural Machine Translation of Ancient Egyptian Text: A Case Study of THOTH AI
So Miyagawa

This paper demonstrates how Retrieval-Augmented Generation (RAG) significantly improves translation accuracy for Middle Egyptian, a historically rich but low-resource language. We integrate a vectorized Coptic-Egyptian lexicon and morphological database into a specialized tool called THOTH AI. By supplying domain-specific linguistic knowledge to Large Language Models (LLMs) like Claude 3.5 Sonnet, our system yields translations that are more contextually grounded and semantically precise. We compare THOTH AI against various mainstream models, including Gemini 2.0, DeepSeek R1, and GPT variants, evaluating performance with BLEU, SacreBLEU, METEOR, ROUGE, and chrF. Experimental results on the coronation decree of Thutmose I (18th Dynasty) show that THOTH AI’s RAG approach provides the most accurate translations, highlighting the critical value of domain knowledge in natural language processing for ancient, specialized corpora. Furthermore, we discuss how our method benefits e-learning, digital humanities, and language revitalization efforts, bridging the gap between purely data-driven approaches and expert-driven resources in historical linguistics.

Restructuring and visualising dialect dictionary data: Report on Erzya and Moksha materials
Jack Rueter | Niko Partanen

There are a number of Uralic dialect dictionaries based on fieldwork documentation of individual minority languages from the pre-Soviet era. The first of these published by the Finno-Ugrian Society features the Mordvin languages, Erzya and Moksha. In this article, we describe the possibility of reusing XML dialect dictionary collection-point and phonetic-variant data for visualizing informative linguistic isoglosses with the R programming language’s Shiny web application framework. We provide a description of the ‘H. Paasonen Mordvin Dictionary’, which will hopefully provide the reader with a better perspective of what data and challenges might present themselves in minority-language dialect dictionaries. We then describe how we processed our data and present our conclusions, followed by a more extensive section on limitations. The conclusions state that only some of the data should be rendered with an R Shiny web application, whereas other data might be better rendered by other applications. Our limitations section calls for extending the dialect dictionary database towards a more concise description of the language forms.

Podcast Outcasts: Understanding Rumble’s Podcast Dynamics
Utkucan Balci | Jay Patel | Berkan Balci | Jeremy Blackburn

The rising popularity of podcasts as an emerging medium opens new avenues for digital humanities research, particularly when examining video-based media on alternative platforms. We present a novel data analysis pipeline for analyzing over 13K podcast videos (526 days of video content) from Rumble and YouTube that integrates advanced speech-to-text transcription, transformer-based topic modeling, and contrastive visual learning. We uncover the interplay between spoken rhetoric and visual elements in shaping political bias. Our findings reveal a distinct right-wing orientation in Rumble’s podcasts, contrasting with YouTube’s more diverse and apolitical content. By merging computational techniques with comparative analysis, our study advances digital humanities by demonstrating how large-scale multimodal analysis can decode ideological narratives in emerging media formats.

I only read it for the plot! Maturity Ratings Affect Fanfiction Style and Community Engagement
Mia Jacobsen | Ross Kristensen-McLachlan

We consider the textual profiles of different fanfiction maturity ratings, how they vary across fan groups, and how this relates to reader engagement metrics. Previous studies have shown that fanfiction writing is motivated by a combination of admiration for and frustration with the fan object. These findings emerge when looking at fanfiction as a whole, as well as when it is divided into subgroups, also called fandoms. However, maturity ratings are used to indicate the intended audience of the fanfiction, as well as whether the story includes mature themes and explicit scenes. Since these ratings can be used to filter readers and writers, they can also be seen as a proxy for different reader/writer motivations and desires. We find that explicit fanfiction in particular has a distinct textual profile when compared to other maturity ratings. These findings thus nuance our understanding of reader/writer motivations in fanfiction communities and also highlight the influence of community norms and fan behavior more generally on these cultural products.

The AI Co-Ethnographer: How Far Can Automation Take Qualitative Research?
Fabian Retkowski | Andreas Sudmann | Alexander Waibel

Qualitative research often involves labor-intensive processes that are difficult to scale while preserving analytical depth. This paper introduces The AI Co-Ethnographer (AICoE), a novel end-to-end pipeline developed for qualitative research and designed to move beyond the limitations of simply automating code assignments, offering a more integrated approach. AICoE organizes the entire process, encompassing open coding, code consolidation, code application, and even pattern discovery, leading to a comprehensive analysis of qualitative data.

Irony Detection in Hebrew Documents: A Novel Dataset and an Evaluation of Neural Classification Methods
Avi Shmidman | Elda Weizman | Avishay Gerczuk

This paper focuses on the use of single words in quotation marks in Hebrew, which may or may not be an indication of irony. Because no annotated dataset yet exists for such cases, we annotate a new dataset consisting of over 4,000 cases of words within quotation marks from Hebrew newspapers. On the basis of this dataset, we train and evaluate a series of seven BERT-based classifiers for irony detection, identifying the features and configurations that most effectively contribute to the irony detection task. We release this novel dataset to the NLP community to promote future research and benchmarking regarding irony detection in Hebrew.

Masks and Mimicry: Strategic Obfuscation and Impersonation Attacks on Authorship Verification
Kenneth Alperin | Rohan Leekha | Adaku Uchendu | Trang Nguyen | Srilakshmi Medarametla | Carlos Levya Capote | Seth Aycock | Charlie Dagli

The increasing use of Artificial Intelligence (AI) technologies, such as Large Language Models (LLMs), has led to nontrivial improvements in various tasks, including accurate authorship identification of documents. However, while LLMs improve such defense techniques, they also simultaneously provide a vehicle for malicious actors to launch new attack vectors. To combat this security risk, we evaluate the adversarial robustness of authorship models (specifically an authorship verification model) to potent LLM-based attacks. These attacks include untargeted methods (authorship obfuscation) and targeted methods (authorship impersonation). For these attacks, the objective is to mask or mimic the writing style of an author, respectively, while preserving the original texts’ semantics. Thus, we perturb an accurate authorship verification model, and achieve maximum attack success rates of 92% and 78% for obfuscation and impersonation attacks, respectively.

Song Lyrics Adaptations: Computational Interpretation of the Pentathlon Principle
Barbora Štěpánková | Rudolf Rosa

Songs are an integral part of human culture, and they often resonate the most when we can sing them in our native language. However, translating song lyrics presents a unique challenge: maintaining singability, naturalness, and semantic fidelity. In this work, we computationally interpret Low’s Pentathlon Principle of singable translations to be able to properly measure the quality of adapted lyrics, breaking it down into five measurable metrics that reflect the key aspects of singable translations. Building on this foundation, we introduce a text-to-text song lyrics translation system based on generative large language models, designed to meet the Pentathlon Principle’s criteria, without relying on melodies or bilingual training data. We experiment on the English-Czech language pair: we collect a dataset of English-to-Czech bilingual song lyrics and identify the desirable values of the five Pentathlon Principle metrics based on the values achieved by human translators. Through detailed human assessment of automatically generated lyric translations, we confirm the appropriateness of the proposed metrics as well as the general validity of the Pentathlon Principle, with some insights into the variation in people’s individual preferences. All code and data are available at https://github.com/stepankovab/Computational-Interpretation-of-the-Pentathlon-Principle.
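
To make the idea of “measurable metrics” concrete, here is a toy sketch of one rhythm-style measure: line-by-line syllable-count agreement between source and translation, using a crude vowel-group heuristic. The paper's five Pentathlon metrics are more elaborate, and the example lines below are our own, not from the dataset.

import re

def syllables(line):
    """Crude syllable count: runs of (English/Czech) vowel letters."""
    return len(re.findall(r"[aeiouyáéěíóúůý]+", line.lower()))

def rhythm_fit(src_lines, tgt_lines):
    """Fraction of line pairs whose syllable counts match exactly."""
    pairs = list(zip(src_lines, tgt_lines))
    return sum(syllables(s) == syllables(t) for s, t in pairs) / len(pairs)

print(rhythm_fit(["Yesterday, all my troubles seemed so far away"],
                 ["Včerejšek, mé strasti zdály se tak daleko"]))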

MITRA-zh-eval: Using a Buddhist Chinese Language Evaluation Dataset to Assess Machine Translation and Evaluation Metrics
Sebastian Nehrdich | Avery Chen | Marcus Bingenheimer | Lu Huang | Rouying Tang | Xiang Wei | Leijie Zhu | Kurt Keutzer

With the advent of large language models, machine translation (MT) has become a widely used, but little understood, tool for accessing historical and multilingual texts. While models like GPT, Claude, and Deepseek increasingly enable translation of low-resource and ancient languages, critical questions remain about their evaluation, optimal model selection, and the value of domain-specific training and retrieval-augmented generation setups. This study introduces a comprehensive evaluation dataset for Buddhist Chinese to English translation, comprising 2,662 bilingual data points from 32 texts that have been selected to represent the full breadth of the Chinese Buddhist canon. We evaluate various computational metrics of translation quality (BLEU, chrF, BLEURT, GEMBA) against expert annotations from five domain specialists who rated 182 machine-generated translations. Our analysis reveals that LLM-based GEMBA scoring shows the strongest correlation with human judgment, significantly outperforming traditional metrics. We then benchmark commercial models (GPT-4 Turbo, Claude 3.5, Gemini), open-source models (Gemma 2, Deepseek-r1), and a domain-specialized model (Gemma 2 Mitra) using GEMBA. Our results demonstrate that domain-specific training enables open-weights models to achieve competitive performance with commercial systems, while also showing that retrieval-augmented generation (RAG) significantly improves translation quality for the best-performing commercial models.
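
The meta-evaluation step described above, checking which automatic metric best tracks expert judgment, reduces to a rank correlation; a minimal sketch with invented scores follows (the paper's data comprise 182 expert-rated translations).

from scipy.stats import spearmanr

human = [4, 2, 5, 3, 1, 4, 2]            # expert ratings per translation
gemba = [78, 40, 91, 60, 22, 80, 45]     # hypothetical GEMBA scores
bleu  = [0.31, 0.18, 0.35, 0.30, 0.12, 0.29, 0.25]  # hypothetical BLEU

for name, scores in [("GEMBA", gemba), ("BLEU", bleu)]:
    rho, p = spearmanr(human, scores)
    print(f"{name}: Spearman rho = {rho:.2f} (p = {p:.3f})")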

Effects of Publicity and Complexity in Reader Polarization
Yuri Bizzoni | Pascale Feldkamp | Kristoffer Nielbo

We investigate how Goodreads rating distributions reflect variations in audience reception across literary works. By examining a large-scale dataset of novels, we analyze whether metrics such as the entropy or standard deviation of rating distributions correlate with textual features – including perplexity, nominal ratio, and syntactic complexity. These metrics reveal a disagreement continuum: more complex texts – i.e., more cognitively demanding books, with a more canon-like textual profile – generate polarized reader responses, while mainstream works produce more uniform reactions. We compare evaluation patterns across canonical and non-canonical works, bestsellers, and prize-winners, finding that textual complexity drives rating polarization even when controlling for publicity effects. Our findings demonstrate that linguistically unpredictable texts, particularly those with higher nominal density and dependency distance, generate divergent reader evaluations. This challenges conventional literary success metrics and suggests that the shape of rating distributions offers valuable insights beyond average scores. We hope our approach establishes a productive framework for understanding how literary features influence reception and how disagreement metrics can enhance our understanding of public literary judgment.
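
As a concrete illustration of a disagreement metric, the sketch below computes the Shannon entropy of two invented 1-5 star rating distributions; a peaked distribution (consensus) yields low entropy, a split one (polarization) high entropy.

import math

def rating_entropy(counts):
    """Shannon entropy (bits) of a star-rating count distribution."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

consensus = [5, 20, 300, 900, 200]      # most readers agree on ~4 stars
polarized = [400, 150, 100, 150, 400]   # readers split between 1 and 5
print(round(rating_entropy(consensus), 2), round(rating_entropy(polarized), 2))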

PsyTEx: A Knowledge-Guided Approach to Refining Text for Psychological Analysis
Avanti Bhandarkar | Ronald Wilson | Anushka Swarup | Gregory Webster | Damon Woodard

LLMs are increasingly applied for tasks requiring deep interpretive abilities and psychological insights, such as identity profiling, mental health diagnostics, personalized content curation, and human resource management. However, their performance in these tasks remains inconsistent, as these characteristics are not explicitly perceptible in the text. To address this challenge, this paper introduces a novel protocol called the “Psychological Text Extraction and Refinement Framework (PsyTEx)” that leverages LLMs to isolate and amplify psychologically informative segments and to evaluate LLM proficiency in interpreting complex psychological constructs from text. Using personality recognition as a case study, our extensive evaluation of five SOTA LLMs across two personality models (Big Five and Dark Triad) and two assessment levels (detection and prediction) highlights significant limitations in LLMs’ ability to accurately interpret psychological traits. However, our findings show that LLMs, when used within the PsyTEx protocol, can effectively extract relevant information that closely aligns with psychological expectations, offering a structured approach to support future advancements in modeling, taxonomy construction, and text-based psychological evaluations.

Advances and Challenges in the Automatic Identification of Indirect Quotations in Scholarly Texts and Literary Works
Frederik Arnold | Robert Jäschke | Philip Kraut

Literary scholars commonly refer to the interpreted literary work using various types of quotations, the two main categories being direct and indirect quotations. In this work, we focus on the automatic identification of two subtypes of indirect quotations: paraphrases and summaries. Our contributions are twofold. First, we present a dataset of scholarly works with annotations of text spans which summarize or paraphrase the interpreted drama, along with the source of each quotation. Second, we present a two-step approach to solve the task at hand. Since we found the process of annotating large training corpora very time-consuming, we leverage GPT-generated summaries to generate training data for our approach.

Assessing Crowdsourced Annotations with LLMs: Linguistic Certainty as a Proxy for Trustworthiness
Tianyi Li | Divya Sree | Tatiana Ringenberg

Human-annotated data is fundamental for training machine learning models, yet crowdsourced annotations often contain noise and bias. In this paper, we investigate the feasibility of employing large language models (LLMs), specifically GPT-4, as evaluators of crowdsourced annotations using a zero-shot prompting strategy. We introduce a certainty-based approach that leverages linguistic cues, categorized into five levels (Absolute, High, Moderate, Low, Uncertain) based on Rubin’s framework, to assess the trustworthiness of LLM-generated evaluations. Using the MAVEN dataset as a case study, we compare GPT-4 evaluations against human evaluations and observe that the alignment between LLM and human judgments is strongly correlated with response certainty. Our results indicate that LLMs can effectively serve as a preliminary filter to flag potentially erroneous annotations for further expert review.

The evolution of relative clauses in the IcePaHC treebank
Anton Ingason | Johanna Mechler

We examine how the elements that introduce relative clauses, namely relative complementizers and relative pronouns, evolve over the history of Icelandic using the phrase structure analysis of the IcePaHC treebank. The rate of these elements changes over time and, in the case of relative pronouns, is subject to effects of genre and the type of gap in the relative clause in question. Our paper is a digital humanities study of historical linguistics which would not be possible without a parsed corpus that spans all centuries involved in the change. We relate our findings to studies on the Constant Rate Effect by analyzing these effects in detail.

On Psychology of AI – Does Primacy Effect Affect ChatGPT and Other LLMs?
Mika Hämäläinen

We study the primacy effect in three commercial LLMs: ChatGPT, Gemini and Claude. We do this by repurposing the famous experiment Asch (1946) conducted using human subjects. The experiment is simple: given two candidates with equivalent descriptions, which one is preferred if one description lists positive adjectives before negative ones and the other lists negative adjectives followed by positive ones? We test this in two experiments. In one experiment, LLMs are given both candidates simultaneously in the same prompt, and in another experiment, LLMs are given both candidates separately. We test all the models with 200 candidate pairs. We found that, in the first experiment, ChatGPT preferred the candidate with positive adjectives listed first, while Gemini preferred both equally often. Claude refused to make a choice. In the second experiment, ChatGPT and Claude were most likely to rank both candidates equally. In the cases where they did not give an equal rating, both showed a clear preference for a candidate that had negative adjectives listed first. Gemini was most likely to prefer a candidate with negative adjectives listed first.

The Literary Canons of Large-Language Models: An Exploration of the Frequency of Novel and Author Generations Across Gender, Race and Ethnicity, and Nationality
Paulina Toro Isaza | Nalani Kopp

Large language models (LLMs) are an emerging site for computational literary and cultural analysis. While such research has focused on applying LLMs to the analysis of literary text passages, the probabilistic mechanism these models use for text generation also lends them to the study of literary and cultural trends. Indeed, we can imagine LLMs as constructing their own “literary canons” by encoding particular authors and book titles with high probability distributions around relevant words and text. This paper explores the frequency with which certain literary titles and authors are generated by a selection of popular proprietary and open-source models and compares it to existing conceptions of the literary canon. It investigates the diversity of author mentions across gender, ethnicity, and nationality, as well as LLMs’ ability to accurately report such characteristics. We demonstrate that the literary canons of popular large language models are generally aligned with the Western literary canon in that they slightly prioritize male authors and overwhelmingly prioritize White American and British authors.

Moral reckoning: How reliable are dictionary-based methods for examining morality in text?
Ines Rehbein | Lilly Brauner | Florian Ertz | Ines Reinig | Simone Ponzetto

Due to their availability and ease of use, dictionary-based measures of moral values are a popular tool for text-based analyses of morality that examine human attitudes and behaviour across populations and cultures. In this paper, we revisit the construct validity of different dictionary-based measures of morality in text that have been proposed in the literature. We discuss conceptual challenges for text-based measures of morality and present an annotation experiment where we create a new dataset with human annotations of moral rhetoric in German political manifestos. We compare the results of our human annotations with different measures of moral values, showing that none of them is able to capture the trends observed by trained human coders. Our findings have far-reaching implications for the application of moral dictionaries in the digital humanities.
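
For readers unfamiliar with the measures being revisited, dictionary-based moral scoring typically reduces to counting token matches against category word lists; a toy sketch follows, with invented word lists standing in for, e.g., the Moral Foundations Dictionary.

from collections import Counter

MFD_TOY = {  # invented stand-in for a moral-foundations lexicon
    "care": {"protect", "care", "safety", "harm"},
    "fairness": {"fair", "equal", "justice", "rights"},
}

def moral_scores(text):
    """Per-category match rate of tokens against the lexicon."""
    tokens = text.lower().split()
    counts = Counter()
    for tok in tokens:
        for category, words in MFD_TOY.items():
            counts[category] += tok in words
    return {cat: counts[cat] / len(tokens) for cat in MFD_TOY}

print(moral_scores("We will protect equal rights and ensure justice for all"))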

Bootstrapping AI: Interdisciplinary Approaches to Assessing OCR Quality in English-Language Historical Documents
Samuel Backer | Louis Hyman

New LLM-based OCR and post-OCR correction methods promise to transform computational historical research, yet their efficacy remains contested. We compare multiple correction approaches, including methods for “bootstrapping” fine-tuning with LLM-generated data, and measure their effect on downstream tasks. Our results suggest that standard OCR metrics often underestimate performance gains for historical research, underscoring the need for discipline-driven evaluations that can better reflect the needs of computational humanists.

Poetry in RAGs: Modern Greek interwar poetry generation using RAG and contrastive training
Stergios Chatzikyriakidis | Anastasia Natsina

In this paper, we discuss Modern Greek poetry generation in the style of lesser-known Greek poets of the interwar period. The paper proposes the use of Retrieval-Augmented Generation (RAG) to automatically generate poetry using Large Language Models (LLMs). A corpus of Greek interwar poetry is used, and prompts exemplifying each poet’s style with respect to a theme are created. These are then fed to an LLM. The results are compared to pure LLM generation, and expert evaluators score the poems across a number of parameters. Objective metrics such as vocabulary density, average words per sentence, and a readability index are also used to assess the performance of the models. RAG-assisted models show potential in enhancing poetry generation across a number of parameters. The base LLM models appear quite consistent across a number of categories, while the RAG model that additionally uses contrastive training shows the worst performance of the three.

Using Multimodal Models for Informative Classification of Ambiguous Tweets in Crisis Response
Sumiko Teng | Emily Öhman

Social media platforms like X provide real-time information during crises but often include noisy, ambiguous data, complicating analysis. This study examines the effectiveness of multimodal models, particularly a cross-attention-based approach, in classifying tweets about the California wildfires as “informative” or “uninformative,” leveraging both text and image modalities. Using a dataset containing both ambiguous and unambiguous tweets, models were evaluated for their ability to handle real-world noisy data. Results show that the multimodal model outperforms unimodal counterparts, especially for ambiguous tweets, demonstrating its resilience and ability to integrate complementary modalities. These findings highlight the potential of multimodal approaches to enhance humanitarian response efforts by reducing information overload.

Transferring Extreme Subword Style Using Ngram Model-Based Logit Scaling
Craig Messner | Tom Lippincott

We present an ngram model-based logit scaling technique that effectively transfers extreme subword stylistic variation to large language models at inference time. We demonstrate its efficacy by tracking the perplexity of generated text with respect to the ngram interpolated and original versions of an evaluation model. Minimizing the former measure while the latter approaches the perplexity of a text produced by a target author or character lets us select a sufficient degree of adaptation while retaining fluency.
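
A minimal sketch of the general mechanism, not the paper's exact formulation: add an ngram style model's per-token log-probabilities, weighted by a coefficient gamma, to the LLM's logits before sampling. The vocabulary size, gamma, and all distributions below are illustrative.

import numpy as np

def scale_logits(llm_logits, ngram_logprobs, gamma=0.5):
    """Interpolate LLM logits with style-ngram log-probabilities."""
    return llm_logits + gamma * ngram_logprobs

rng = np.random.default_rng(0)
vocab = 8
llm_logits = rng.normal(size=vocab)                     # stand-in LLM logits
ngram_logprobs = np.log(rng.dirichlet(np.ones(vocab)))  # toy ngram distribution

scaled = scale_logits(llm_logits, ngram_logprobs)
probs = np.exp(scaled - scaled.max())  # softmax with stability shift
probs /= probs.sum()
print("next token:", rng.choice(vocab, p=probs), probs.round(3))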

Evaluating Large Language Models for Narrative Topic Labeling
Andrew Piper | Sophie Wu

This paper evaluates the effectiveness of large language models (LLMs) for labeling topics in narrative texts, comparing performance across fiction and news genres. Building on prior studies in factual documents, we extend the evaluation to narrative contexts where story content is central. Using a ranked voting system with 200 crowdworkers, we assess participants’ preferences of topic labels by comparing multiple LLM outputs with human annotations. Our findings indicate minimal inter-model variation, with LLMs performing on par with human readers in news and outperforming humans in fiction. We conclude with a case study using a set of 25,000 narrative passages from novels illustrating the analytical value of LLM topic labels compared to traditional methods. The results highlight the significant promise of LLMs for topic labeling of narrative texts.

Beyond Cairo: Sa’idi Egyptian Arabic Literary Corpus Construction and Analysis
Mai Mohamed Eida | Nizar Habash

Egyptian Arabic (EA) NLP resources have mainly focused on Cairene Egyptian Arabic (CEA), leaving sub-dialects like Sa’idi Egyptian Arabic (SEA) underrepresented. This paper introduces the first SEA corpus – an open-source, 4-million-word literary dataset of a dialect spoken by ~30 million Egyptians. To validate its representation, we analyze SEA-specific linguistic features from dialectal surveys, confirming a higher prevalence in our corpus compared to existing EA datasets. Our findings offer insights into SEA’s orthographic representation in morphology, phonology, and lexicon, incorporating CODA* guidelines for normalization.

Advancing Sentiment Analysis in Tamil-English Code-Mixed Texts: Challenges and Transformer-Based Solutions
Mikhail Krasitskii | Olga Kolesnikova | Liliana Chanona Hernandez | Grigori Sidorov | Alexander Gelbukh

This study examines sentiment analysis in Tamil-English code-mixed texts using advanced transformer-based architectures. The unique linguistic challenges, including mixed grammar, orthographic variability, and phonetic inconsistencies, are addressed. Data limitations and annotation gaps are discussed, highlighting the need for larger datasets. The performance of models such as XLM-RoBERTa, mT5, IndicBERT, and RemBERT is evaluated, with insights into their optimization for low-resource, code-mixed environments.

Language use of political parties over time: Stylistic Fronting in the Icelandic Gigaword Corpus
Johanna Mechler | Lilja Björk Stefánsdóttir | Anton Ingason

Political speech is an active area of investigation, and the ongoing ERC project Explaining Individual Lifespan Change (EILisCh) expands on some of the previous findings in this area. Previous work has found that political speech can differ based on party membership in a temporally static setting, and it has also been uncovered that individual politicians can change their linguistic behavior over time. In this paper, we pursue a novel topic in this area: the evolution of the language use of entire political parties over time. We focus on Icelandic political parties and their use of Stylistic Fronting from 1999 to 2021, with a particular emphasis on the years around the financial crisis of 2008 and the subsequent years. Our results show that parties in a position of power typically speak more formally, using more Stylistic Fronting, but that at the same time there are some exceptions to this pattern. We highlight the significance of relying on a large speech corpus when applying a high-definition approach to linguistic analyses across time.

From Causal Parrots to Causal Prophets? Towards Sound Causal Reasoning with Large Language Models
Rahul Babu Shrestha | Simon Malberg | Georg Groh

Causal reasoning is a fundamental property of human and machine intelligence. While large language models (LLMs) excel in many natural language tasks, their ability to infer causal relationships beyond memorized associations is debated. This study systematically evaluates recent LLMs’ causal reasoning across three levels of Pearl’s Ladder of Causation—associational, interventional, and counterfactual—as well as commonsensical, anti-commonsensical, and nonsensical causal structures, using the CLadder dataset. We further explore the effectiveness of prompting techniques, including chain of thought (CoT), self-consistency (SC), and causal chain of thought (CausalCoT), in enhancing causal reasoning, and propose two new techniques: causal tree of thoughts (CausalToT) and causal program of thoughts (CausalPoT). While larger models tend to outperform smaller ones and are generally more robust against perturbations, our results indicate that all tested LLMs still have difficulties, especially with counterfactual reasoning. However, our CausalToT and CausalPoT significantly improve performance over existing prompting techniques, suggesting that hybrid approaches combining LLMs with formal reasoning frameworks can mitigate these limitations. Our findings contribute to understanding LLMs’ reasoning capacities and outline promising strategies for improving their ability to reason causally as humans would. We release our code and data.

Modern Models, Medieval Texts: A POS Tagging Study of Old Occitan
Matthias Schöffel | Marinus Wiedner | Esteban Garces Arias | Paula Ruppert | Christian Heumann | Matthias Aßenmacher

Large language models (LLMs) have demonstrated remarkable capabilities in natural language processing, yet their effectiveness in handling historical languages remains largely unexplored. This study examines the performance of open-source LLMs in part-of-speech (POS) tagging for Old Occitan, a historical language characterized by non-standardized orthography and significant diachronic variation. Through comparative analysis of two distinct corpora—hagiographical and medical texts—we evaluate how current models handle the inherent challenges of processing a low-resource historical language. Our findings demonstrate critical limitations in LLM performance when confronted with extreme orthographic and syntactic variability. We provide detailed error analysis and specific recommendations for improving model performance in historical language processing. This research advances our understanding of LLM capabilities in challenging linguistic contexts while offering practical insights for both computational linguistics and historical language studies.

A Data-driven Investigation of Euphemistic Language: Comparing the usage of “slave” and “servant” in 19th century US newspapers
Jaihyun Park | Ryan Cordell

Warning: This paper contains examples of offensive language targeting marginalized populations. This study investigates the usage of “slave” and “servant” in 19th-century US newspapers using computational methods. While both terms were used to refer to enslaved African Americans, they were used in distinct ways. In the Chronicling America corpus, we accounted for possible OCR errors by using FastText embeddings and excluded text reprints to take into account the text reprint culture of the 19th century. Word2vec embeddings were used to find semantically close words to “slave” and “servant”, and log-odds ratios were calculated to identify over-represented discourse words in the Southern and Northern newspapers. We found that “slave” is associated with socio-economic, legal, and administrative words; “servant”, by contrast, is linked to religious words in Northern newspapers, while Southern newspapers associate “servant” with domestic and familial words. We further found that slave discourse words from Southern newspapers are also prevalent in Northern newspapers, while servant discourse words from each side are prevalent only in their own region. This study contributes to the understanding of how newspapers created different discourses around enslaved African Americans in the 19th-century US.
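
The log-odds step can be sketched as follows, with invented word counts; the paper's analysis runs over full regional vocabularies, and common variants (e.g., Monroe et al.'s informative Dirichlet prior) additionally use a background corpus.

import math
from collections import Counter

south = Counter({"family": 40, "house": 35, "faithful": 20, "church": 10})
north = Counter({"church": 45, "god": 30, "faithful": 15, "house": 12})

def log_odds(word, a, b, alpha=0.5):
    """Smoothed log-odds ratio of `word` in corpus a versus corpus b."""
    na, nb = sum(a.values()), sum(b.values())
    fa, fb = a[word] + alpha, b[word] + alpha
    return math.log(fa / (na - fa)) - math.log(fb / (nb - fb))

for w in ["family", "church", "faithful"]:
    print(w, round(log_odds(w, south, north), 2))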

It’s about What and How you say it: A Corpus with Stance and Sentiment Annotation for COVID-19 Vaccines Posts on X/Twitter by Brazilian Political Elites
Lorena Barberia | Pedro Schmalz | Norton Trevisan Roman | Belinda Lombard | Tatiane Moraes de Sousa

This paper details the development of a corpus of posts in Brazilian Portuguese published by Brazilian political elites on X (formerly Twitter) regarding COVID-19 vaccines. The corpus consists of 9,045 posts annotated for relevance, stance, and sentiment towards COVID-19 vaccines and vaccination during the first three years of the COVID-19 pandemic (2020-2022). Nine annotators, working in three groups, classified relevance, stance, and sentiment in messages posted between 2020 and 2022 by local political elites. The annotators underwent extensive training, and weekly meetings were conducted to ensure intra-group annotation consistency. The analysis revealed fair to moderate inter-annotator agreement (average Krippendorff’s alpha of 0.94 for relevance, 0.67 for sentiment, and 0.70 for stance). This work makes four significant contributions to the literature. First, it addresses the scarcity of corpora in Brazilian Portuguese, particularly on COVID-19 or vaccines in general. Second, it provides a reliable annotation scheme for sentiment and stance classification, distinguishing the two tasks and thereby improving classification precision. Third, it offers a corpus annotated with stance and sentiment according to this scheme, demonstrating how these tasks differ and how conflating them may lead to inconsistencies in corpus construction — a recurring issue in NLP research beyond studies focusing on vaccines. Fourth, this annotated corpus may serve as a gold standard for fine-tuning and evaluating supervised machine learning models for relevance, sentiment, and stance analysis of X posts in similar domains.

A Bit of This, a Bit of That: Building a Genre and Topic Annotated Dataset of Historical Newspaper Articles with Soft Labels and Confidence Scores
Karin Stahel | Irenie How | Lauren Millar | Luis Paterson | Daniel Steel | Kaspar Middendorf

Digitised historical newspaper collections are becoming increasingly accessible, yet their scale and diverse content still present challenges for researchers interested in specific article types or topics. In a step towards developing models to address these challenges, we have created a dataset of articles from New Zealand’s Papers Past open data annotated with multiple genre and topic labels and annotator confidence scores. Our annotation framework aligns with the perspectivist approach to machine learning, acknowledging the subjective nature of the task and embracing the hybridity and uncertainty of genres. In this paper, we describe our sampling and annotation methods and the resulting dataset of 7,036 articles from 106 New Zealand newspapers spanning the period 1839-1903. This dataset will be used to develop interpretable classification models that enable fine-grained exploration and discovery of articles in Papers Past newspapers based on common aspects of form, function, and topic. The complete dataset, including un-aggregated annotations and supporting documentation, will eventually be openly released to facilitate further research.

Development of Old Irish Lexical Resources, and Two Universal Dependencies Treebanks for Diplomatically Edited Old Irish Text
Adrian Doyle | John McCrae

The quantity and variety of Old Irish text which survives in contemporary manuscripts, those dating from the Old Irish period, is quite small by comparison to what is available for Modern Irish, not to mention better-resourced modern languages. As no native speakers have existed for more than a millennium, no more text will ever be created by native speakers. For these reasons, text surviving in contemporary sources is particularly valuable. Ideally, all such text would be annotated using a single, common standard to ensure compatibility. At present, discrete Old Irish text repositories make use of incompatible annotation styles, few of which are utilised by text resources for other languages. This limits the potential for using text from more than any one resource simultaneously in NLP applications, or as a basis for creating further resources. This paper describes the production of the first Old Irish text resources to be designed specifically to ensure lexical compatibility and interoperability.

Augmented Close Reading for Classical Latin using BERT for Intertextual Exploration
Ashley Gong | Katy Gero | Mark Schiefsky

Intertextuality, the connection between texts, is a critical literary concept for analyzing classical Latin works. Given the emergence of AI in digital humanities, this paper presents Intertext.AI, a novel interface that leverages Latin BERT (Bamman and Burns 2020), a BERT model trained on classical Latin texts, and contextually rich visualizations to help classicists find potential intertextual connections. Intertext.AI identified over 80% of attested allusions from excerpts of Lucan's Pharsalia, demonstrating the system's technical efficacy. Our findings from a user study with 19 participants also suggest that Intertext.AI fosters intertextual discovery and interpretation more easily than other tools. While participants did not identify significantly different types or quantities of connections when using Intertext.AI or other tools, they overall found finding and justifying potential intertextuality easier with Intertext.AI, reported higher confidence in their observations from Intertext.AI, and preferred having access to it during the search process.

An evaluation of Named Entity Recognition tools for detecting person names in philosophical text
Ruben Weijers | Jelke Bloem

For philosophers, mentions of the names of other philosophers and scientists are an important indicator of relevance and influence. However, they don’t always come in neat citations, especially in older works. We evaluate various approaches to named entity recognition for person names in 20th century, English-language philosophical texts. We use part of a digitized corpus of the works of W.V. Quine, manually annotated for person names, to compare the performance of several systems: the rule-based edhiphy, spaCy’s CNN-based system, FLAIR’s BiLSTM-based system, and SpanBERT, ERNIE-v2 and ModernBERT’s transformer-based approaches. We also experiment with enhancing the smaller models with domain-specific embedding vectors. We find that both spaCy and FLAIR outperform transformer-based models, perhaps due to the small dataset sizes involved.

Testing Language Creativity of Large Language Models and Humans
Anca Dinu | Andra-Maria Florescu

Since the advent of Large Language Models (LLMs), the interest in and need for a better understanding of artificial creativity have increased. This paper aims to design and administer an integrated language creativity test, including multiple tasks and criteria, targeting both LLMs and humans, for a direct comparison. Language creativity refers to how one uses natural language in novel and unusual ways, bending lexico-grammatical and semantic norms through literary devices or by creating new words. The results show a slightly better performance of LLMs compared to humans. We analyzed the response dataset with computational methods like sentiment analysis, clustering, and binary classification, for a more in-depth understanding. We also manually inspected part of the answers, which revealed that the LLMs mastered figurative speech, while humans responded more pragmatically.

Strategies for political-statement segmentation and labelling in unstructured text
Dmitry Nikolaev | Sean Papay

Analysis of parliamentary speeches and political-party manifestos has become an integral area of computational study of political texts. While speeches have been overwhelmingly analysed using unsupervised methods, a large corpus of manifestos with by-statement political-stance labels has been created by the participants of the MARPOR project. It has been recently shown that these labels can be predicted by a neural model; however, the current approach relies on provided statement boundaries, limiting out-of-domain applicability. In this work, we propose and test a range of unified split-and-label frameworks—based on linear-chain CRFs, fine-tuned text-to-text models, and the combination of in-context learning with constrained decoding—that can be used to jointly segment and classify statements from raw textual data. We show that our approaches achieve competitive accuracy when applied to raw text of political manifestos, and then demonstrate the research potential of our method by applying it to the records of the UK House of Commons and tracing the political trajectories of four major parties in the last three decades.
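
One way to realize such a unified split-and-label encoding, sketched below with two invented statements and MARPOR-style category codes used purely illustratively: give each token a BIO tag whose type is the statement's stance label, so a single sequence labeller (e.g., a linear-chain CRF) segments and classifies in one pass.

statements = [("We will cut taxes for working families", "per402"),
              ("and protect the environment", "per501")]

tokens, tags = [], []
for text, code in statements:
    words = text.split()
    tokens += words
    # B- marks a statement boundary; I- continues the same statement.
    tags += [f"B-{code}"] + [f"I-{code}"] * (len(words) - 1)

for tok, tag in zip(tokens, tags):
    print(f"{tok:<12}{tag}")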

Mining the Past: A Comparative Study of Classical and Neural Topic Models on Historical Newspaper Archives
Keerthana Murugaraj | Salima Lamsiyah | Marten During | Martin Theobald

Analyzing historical discourse in large-scale newspaper archives requires scalable and interpretable methods to uncover hidden themes. This study systematically evaluates topic modeling approaches for newspaper articles from 1955 to 2018, comparing probabilistic LDA, matrix-factorization-based NMF, and neural models such as Top2Vec and BERTopic across various preprocessing strategies. We benchmark these methods on topic coherence, diversity, scalability, and interpretability. While LDA is commonly used in historical text analysis, our findings demonstrate that BERTopic, leveraging contextual embeddings, consistently outperforms classical models in all tested aspects, making it a more robust choice for large-scale textual corpora. Additionally, we highlight the trade-offs between preprocessing strategies and model performance, emphasizing the importance of tailored pipeline design. These insights advance the field of historical NLP, offering concrete guidance for historians and computational social scientists in selecting the most effective topic-modeling approach for analyzing digitized archives. Our code will be publicly available on GitHub.
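
For orientation, a minimal BERTopic setup of the kind benchmarked here might look as follows; the documents are placeholders, and the paper's actual preprocessing pipelines and parameter choices differ.

from bertopic import BERTopic

# Placeholder documents; in the study these are digitized newspaper articles.
docs = [f"placeholder newspaper article number {i}" for i in range(1000)]

topic_model = BERTopic(language="multilingual", min_topic_size=20)
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())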

A Comparative Analysis of Ethical and Safety Gaps in LLMs using Relative Danger Coefficient
Yehor Tereshchenko | Mika Hämäläinen

Artificial Intelligence (AI) and Large Language Models (LLMs) have rapidly evolved in recent years, showcasing remarkable capabilities in natural language understanding and generation. However, these advancements also raise critical ethical questions regarding safety, potential misuse, discrimination, and overall societal impact. This article provides a comparative analysis of the ethical performance of various AI models, including the brand-new DeepSeek-V3 (R1 with reasoning and without), various GPT variants (4o, 3.5 Turbo, 4 Turbo, o1/o3 mini), and Gemini (1.5 Flash, 2.0 Flash, and 2.0 Flash Exp), and highlights the need for robust human oversight, especially in high-stakes situations. Furthermore, we present a new metric for quantifying harm in LLMs called the Relative Danger Coefficient (RDC).

Threefold model for AI Readiness: A Case Study with Finnish Healthcare SMEs
Mohammed Alnajjar | Khalid Alnajjar | Mika Hämäläinen

This study examines AI adoption among Finnish healthcare SMEs through semi-structured interviews with six health-tech companies. We identify three AI engagement categories: AI-curious (exploring AI), AI-embracing (integrating AI), and AI-catering (providing AI solutions). Our proposed threefold model highlights key adoption barriers, including regulatory complexities, technical expertise gaps, and financial constraints. While SMEs recognize AI’s potential, most remain in early adoption stages. We provide actionable recommendations to accelerate AI integration, focusing on regulatory reforms, talent development, and inter-company collaboration, offering valuable insights for healthcare organizations, policymakers, and researchers.

AI Assistant for Socioeconomic Empowerment Using Federated Learning
Nahed Abdelgaber | Labiba Jahan | Nino Castellano | Joshua Oltmanns | Mehak Gupta | Jia Zhang | Akshay Pednekar | Ashish Basavaraju | Ian Velazquez | Zerui Ma

Socioeconomic status (SES) reflects an individual’s standing in society, from a holistic set of factors including income, education level, and occupation. Identifying individuals in low-SES groups is crucial to ensuring they receive necessary support. However, many individuals may be hesitant to disclose their SES directly. This study introduces a federated learning-powered framework capable of verifying individuals’ SES levels through the analysis of their communications described in natural language. We propose to study language usage patterns among individuals from different SES groups using clustering and topic modeling techniques. An empirical study leveraging life narrative interviews demonstrates the effectiveness of our proposed approach.

Team Conversational AI: Introducing Effervesce
Erjon Skenderi | Salla-Maaria Laaksonen | Jukka Huhtamäki

Group conversational AI, especially within digital workspaces, could potentially play a crucial role in enhancing organizational communication. This paper introduces Effervesce, a Large Language Model (LLM) powered group conversational bot integrated into a multi-user Slack environment. Unlike conventional conversational AI applications that are designed for one-to-one interactions, our bot addresses the challenges of facilitating multi-actor conversations. We first evaluated multiple open-source LLMs on a dataset of 1.6k group conversation messages. We then fine-tuned the best performing model using a Parameter Efficient Fine-Tuning technique to better align Effervesce with multi-actor conversation settings. Evaluation through workshops with 40 participants indicates positive impacts on communication dynamics, although areas for further improvement were identified. Our findings highlight the potential of Effervesce in enhancing group communication, with future work aimed at refining the bot’s capabilities based on user feedback.

Mapping Hymns and Organizing Concepts in the Rigveda: Quantitatively Connecting the Vedic Suktas
Venkatesh Bollineni | Igor Crk | Eren Gultepe

Accessing and gaining insight into the Rigveda poses a non-trivial challenge due to its extremely ancient Sanskrit language, poetic structure, and large volume of text. Using NLP techniques, this study identified topics and semantic connections of hymns within the Rigveda that were corroborated by seven well-known groupings of hymns. The 1,028 suktas (hymns) from the modern English translation of the Rigveda by Jamison and Brereton were preprocessed, and sukta-level embeddings were obtained using (i) a novel adaptation of LSA, presented herein, (ii) SBERT, and (iii) Doc2Vec. Following a UMAP dimension reduction of the vectors, the network of suktas was formed using k-nearest neighbours. Community detection of topics in the sukta networks was then performed with the Louvain, Leiden, and label propagation methods, and the statistical significance of the formed topics was determined using an appropriate null distribution. Only the novel adaptation of LSA combined with the Leiden method detected sukta topic networks that were significant (z = 2.726, p < .01), with a modularity score of 0.944. Of the seven well-known sukta groupings analyzed (e.g., creation, funeral, water), the LSA-derived network was successful in all seven cases, while Doc2Vec was not significant and failed to detect the relevant suktas. SBERT detected four of the well-known groupings as separate groups, but mistakenly combined three of them into a single mixed group; moreover, the SBERT network was not statistically significant.
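
The network-construction and community-detection steps can be sketched roughly as below, using random stand-in embeddings and networkx's Louvain implementation; the paper reduces LSA/SBERT/Doc2Vec embeddings with UMAP, also compares Leiden and label propagation, and tests significance against a null distribution.

import numpy as np
import networkx as nx
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
emb = rng.normal(size=(100, 50))  # random stand-in for sukta embeddings

k = 8
nn = NearestNeighbors(n_neighbors=k + 1).fit(emb)
_, idx = nn.kneighbors(emb)

G = nx.Graph()
for i, nbrs in enumerate(idx):
    for j in nbrs[1:]:  # skip the self-neighbour in position 0
        G.add_edge(i, int(j))

communities = nx.community.louvain_communities(G, seed=1)
print(len(communities), "communities; modularity =",
      round(nx.community.modularity(G, communities), 3))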

EduPo: Progress and Challenges of Automated Analysis and Generation of Czech Poetry
Rudolf Rosa | David Mareček | Tomáš Musil | Michal Chudoba | Jakub Landsperský

This paper explores automated analysis and generation of Czech poetry. We review existing tools, datasets, and methodologies while considering the unique characteristics of the Czech language and its poetic tradition. Our approach builds upon available resources wherever possible, yet requires the development of additional components to address existing gaps. We present and evaluate preliminary experiments, highlighting key challenges and potential directions for future research.

A City of Millions: Mapping Literary Social Networks At Scale
Sil Hamilton | Rebecca Hicke | David Mimno | Matthew Wilkens

We release 70,509 high-quality social networks extracted from multilingual fiction and nonfiction narratives. We additionally provide metadata for ~30,000 of these texts (73% nonfiction and 27% fiction) written between 1800 and 1999 in 58 languages. This dataset provides information on historical social worlds at an unprecedented scale, including data for 2,510,021 individuals in 2,805,482 pair-wise relationships annotated for affinity and relationship type. We achieve this scale by automating previously manual methods of extracting social networks; specifically, we adapt an existing annotation task as a language model prompt, ensuring consistency at scale with the use of structured output. This dataset serves as a unique resource for humanities and social science research by providing data on cognitive models of social realities.

VLG-BERT: Towards Better Interpretability in LLMs through Visual and Linguistic Grounding
Toufik Mechouma | Ismail Biskri | Serge Robert

We present VLG-BERT, a novel LLM conceived to improve the encoding of language meaning. VLG-BERT provides deeper insights into meaning encoding in Large Language Models (LLMs) by focusing on linguistic and real-world semantics. It uses syntactic dependencies as a form of ground truth to supervise the learning of word representations. VLG-BERT incorporates visual latent representations from pre-trained vision models and their corresponding labels. A vocabulary of 10k tokens corresponding to so-called concrete words is built by extending the set of ImageNet labels. The extension is based on synonyms, hyponyms, and hypernyms from WordNet. A lookup table for this vocabulary is thus used to initialize the embedding matrix during training, rather than random initialization. This multimodal grounding provides a stronger semantic foundation for encoding the meaning of words. The architecture aligns seamlessly with foundational theories from across the cognitive sciences, and the integration of visual and linguistic grounding makes VLG-BERT consistent with many cognitive theories. Our approach contributes to the ongoing effort to create models that bridge the gap between language and vision, making them more aligned with how humans understand and interpret the world. Experiments on text classification show excellent results compared to BERT Base.

Historical Ink: Exploring Large Language Models for Irony Detection in 19th-Century Spanish
Kevin Cohen | Laura Manrique-Gómez | Ruben Manrique

This study explores the use of large language models (LLMs) to enhance datasets and improve irony detection in 19th-century Latin American newspapers. Two strategies were employed to evaluate the efficacy of BERT and GPT models in capturing the subtly nuanced nature of irony, through both multi-class and binary classification tasks. First, we implemented dataset enhancements focused on enriching emotional and contextual cues; however, these showed limited impact on historical language analysis. The second strategy, a semi-automated annotation process, effectively addressed class imbalance and augmented the dataset with high-quality annotations. Despite the challenges posed by the complexity of irony, this work contributes to the advancement of sentiment analysis through two key contributions: introducing a new historical Spanish dataset tagged for sentiment analysis and irony detection, and proposing a semi-automated annotation methodology in which human expertise is crucial for refining LLM results, enriched by incorporating historical and cultural contexts as core features.

Insights into developing analytical categorization schemes: three problem types related to annotation agreement
Pihla Toivanen | Eetu Mäkelä | Antti Kanner

Coding themes, frames, opinions, and other attributes is widely used in the social sciences, and it also forms a basis for building supervised text classifiers. Coding content requires substantial resources, and lately this process has been employed particularly for annotating training sets for machine learning models. Although objectivity is not always the purpose of coding, uniform coding helps in building machine learning models. Usually, machine learning models are built by first defining an annotation scheme, which contains definitions of categories and instructions for coding. Multiple aspects are known to affect annotation results, such as the domain of annotation, the number of annotators, and the number of categories. In this article, we present three further problem types that our case study shows to be related to annotation agreement: negated presence of a category, low proportional presence of relevant content, and implicit presence of a category. These problems should be resolved at the level of scheme definition in all schemes. To extract our problem categories, we focus on a media research case with extensive data on both the annotation process and its results.

A Comprehensive Evaluation of Cognitive Biases in LLMs
Simon Malberg | Roman Poletukhin | Carolin Schuster | Georg Groh Groh

We present a large-scale evaluation of 30 cognitive biases in 20 state-of-the-art large language models (LLMs) under various decision-making scenarios. Our contributions include a novel general-purpose test framework for reliable and large-scale generation of tests for LLMs, a benchmark dataset with 30,000 tests for detecting cognitive biases in LLMs, and a comprehensive assessment of the biases found in the 20 evaluated LLMs. Our work confirms and broadens previous findings suggesting the presence of cognitive biases in LLMs by reporting evidence of all 30 tested biases in at least some of the 20 LLMs. We publish our framework code and dataset to encourage future research on cognitive biases in LLMs: https://github.com/simonmalberg/cognitive-biases-in-llms.

AI with Emotions: Exploring Emotional Expressions in Large Language Models
Shin-nosuke Ishikawa | Atsushi Yoshino

The human-level performance of Large Language Models (LLMs) across various tasks has raised expectations for the potential of artificial intelligence (AI) to possess emotions someday. To explore the capability of current LLMs to express emotions in their outputs, we conducted an experiment using several LLMs (OpenAI GPT, Google Gemini, Meta Llama3, and Cohere Command R+) to role-play as agents answering questions with specified emotional states. We defined the emotional states using Russell’s Circumplex model, a well-established framework that characterizes emotions along the sleepy-activated (arousal) and pleasure-displeasure (valence) axes. We chose this model for its simplicity, utilizing two continuous parameters, which allows for better controllability in applications involving continuous changes in emotional states. The responses generated were evaluated using a sentiment analysis model, independent of the LLMs, trained on the GoEmotions dataset. The evaluation showed that the emotional states of the generated answers were consistent with the specifications, demonstrating the LLMs’ capability for emotional expression. This indicates the potential of LLM-based agents to convey specified emotional states in their interactions.

Fearful Falcons and Angry Llamas: Emotion Category Annotations of Arguments by Humans and LLMs
Lynn Greschner | Roman Klinger

Arguments evoke emotions, influencing the effect of the argument itself. Not only the emotional intensity but also the category influences the argument’s effects, for instance, the willingness to adapt stances. While binary emotionality has been studied in argumentative texts, there is no work on discrete emotion categories (e.g., ‘anger’) in such data. To fill this gap, we crowdsource subjective annotations of emotion categories in a German argument corpus and evaluate automatic LLM-based labeling methods. Specifically, we compare three prompting strategies (zero-shot, one-shot, chain-of-thought) on three large instruction-tuned language models (Falcon-7b-instruct, Llama-3.1-8B-instruct, GPT-4o-mini). We further vary the definition of the output space to be binary (is there emotionality in the argument?), closed-domain (which emotion from a given label set is in the argument?), or open-domain (which emotion is in the argument?). We find that emotion categories enhance the prediction of emotionality in arguments, emphasizing the need for discrete emotion annotations in arguments. Across all prompt settings and models, automatic predictions show a high recall but low precision for predicting anger and fear, indicating a strong bias toward negative emotions.
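
A zero-shot, closed-domain labeling call of the kind compared here might be sketched as follows; the model name is one of those evaluated in the paper, but the prompt wording and label set below are illustrative, not the authors' exact setup.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
LABELS = ["anger", "disgust", "fear", "joy", "sadness", "surprise", "none"]

def label_emotion(argument: str) -> str:
    """Ask the model for one emotion label from a closed label set."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{"role": "user",
                   "content": f"Which emotion from {LABELS} does this argument "
                              f"express? Answer with a single word.\n\n{argument}"}],
    )
    return response.choices[0].message.content.strip().lower()

print(label_emotion("Nobody should be forced to pay for others' mistakes!"))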

HateImgPrompts: Mitigating Generation of Images Spreading Hate Speech
Vineet Kumar Khullar | Venkatesh Velugubantla | Bhanu Prakash Reddy Rella | Mohan Krishna Mannava | Msvpj Sathvik

The emergence of artificial intelligence has proven beneficial to numerous organizations, particularly in its various applications for social welfare. One notable application lies in AI-driven image generation tools, which produce images based on provided prompts. While this technology holds potential for constructive use, it also carries the risk of being exploited for malicious purposes, such as propagating hate. To address this, we propose a novel dataset, “HateImgPrompts”. We have benchmarked the dataset with the latest models, including GPT-3.5 and LLAMA 2. The dataset consists of 9,467 prompts, and the accuracy of the classifier after fine-tuning on the dataset is around 81%.