Workshop on Natural Language Processing for Digital Humanities (2026)


Proceedings of the 6th International Conference on Natural Language Processing for the Digital Humanities

Optical Character Recognition (OCR) is a critical but error-prone stage in digital humanities text pipelines. While OCR correction improves usability for downstream NLP tasks, common workflows often overwrite intermediate decisions, obscuring how textual transformations affect scholarly interpretation. We present a provenance-aware framework for OCR-corrected humanities corpora that records correction lineage at the span level, including edit type, correction source, confidence, and revision status. Using a pilot corpus of historical texts, we compare downstream named entity extraction across raw OCR, fully corrected text, and provenance-filtered corrections. Our results show that correction pathways can substantially alter extracted entities and document-level interpretations, while provenance signals help identify unstable outputs and prioritize human review. We argue that provenance should be treated as a first-class analytical layer in NLP for digital humanities, supporting reproducibility, source criticism, and uncertainty-aware interpretation.
The "law of conformity," the finding that frequent words are semantically stable, has been treated as a broad regularity of language change. We show it does not hold for Korean. Using diachronic word embeddings trained on historical corpora spanning 500 years (15th–20th centuries), we find a robust positive correlation between frequency and semantic shift: high-frequency Korean words change more, not less. The pattern survives six robustness controls and is validated against an English replication. Partial correlation analysis reveals that the role of polysemy in mediating the frequency–change relationship is not fixed but depends on time resolution and corpus homogeneity. We connect the reversal to frequency-driven reductive processes, including grammaticalization, semantic bleaching, and domain shift, that are especially productive in Korean. The frequency–change relationship is not a fixed regularity but varies with language typology and analytical conditions.
This study proposes a quantitative framework for profiling LLM dispositions as stable, model-specific regularities in output under repeated, controlled elicitation. Using a structured narrative constraint-selection task administered across six frontier models and three instruction types, we operationalize disposition through two dimensions: "consistency", measured as cross-replication selection overlap via Jaccard similarity, and "diversity", measured as dispersion across options via the inverse Simpson index. We further introduce Narrative Landscape, a PCA-based visualization that maps each model’s selection profile into a shared space for direct comparison. Results reveal a clear rigidity–exploration spectrum across model families and show that instruction types shift the geometry of selection spaces even when scalar metrics appear similar, indicating that comparable scores can mask qualitatively distinct selection topologies.
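The two disposition metrics named in the preceding abstract can be illustrated with a minimal sketch. The function names, the averaging over all pairs of replications, and the toy option labels are assumptions made for illustration; the abstract does not specify how the authors aggregate across runs.

```python
from collections import Counter
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two selections (treated as sets)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_jaccard(runs):
    """Consistency: average Jaccard overlap over all pairs of replications."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two replications")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def inverse_simpson(selections):
    """Diversity: 1 / sum(p_i^2) over the observed option frequencies."""
    counts = Counter(selections)
    total = sum(counts.values())
    return 1.0 / sum((c / total) ** 2 for c in counts.values())


# Hypothetical replications of one model on one prompt (toy labels).
runs = [
    {"betrayal", "exile", "storm"},
    {"betrayal", "exile", "feast"},
    {"betrayal", "storm", "feast"},
]
consistency = mean_pairwise_jaccard(runs)
diversity = inverse_simpson([opt for run in runs for opt in run])
```

Under these assumptions, a model that always picks the same options would score a consistency of 1.0, and its diversity would equal the number of options it picks when each is chosen equally often.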
We present a new publicly available corpus of 100,502 movie reviews from Kazakhstan collected from kino.kz, spanning 2001–2025 and covering 4,943 unique titles. The dataset is multilingual, consisting mainly of Russian reviews alongside Kazakh and code-switched texts. Reviews are manually annotated for language and sentiment polarity, and 11,309 reviews additionally contain explicit user-provided ratings. We define two sentiment tasks—three-way polarity classification and five-class score classification—and benchmark classical BoW/TF–IDF baselines against multilingual transformer models (mBERT, XLM-RoBERTa, RemBERT). Experimental results show that transformer models consistently outperform classical baselines on polarity classification, while score classification remains challenging under leakage-controlled evaluation due to severe class imbalance and subtle distinctions between adjacent rating levels.
The Kṛṣṇa Yajurveda survives in multiple recensions that share substantial ritual content, yet the degree and distribution of textual overlap across recensions have never been quantified systematically. This paper presents a computational analysis of text reuse across three recensions—the Maitrāyaṇī Saṃhitā (MS), the Kāṭhaka Saṃhitā (KS), and the Taittirīya Saṃhitā (TS)—for two ritual sections (Agnyupasthāna and Punarādhāna), using ICoMa (Intertextuality Collation Machine), a new web-based multi-algorithm collation tool. Five independent similarity algorithms consistently rank MS–KS as the most closely related pair, corroborating the philological consensus. Crucially, the two ritual sections exhibit strikingly different reuse profiles: Punarādhāna shows near-identical MS–KS overlap (up to 93.5%) with sharp divergence from TS, while Agnyupasthāna displays moderate, broadly distributed similarity across all three pairs. These contrasting patterns provide quantitative evidence that different ritual categories followed distinct paths of textual transmission within the Yajurvedic tradition. ICoMa and the experimental data are freely available.
Ancient and endangered languages pose a unique challenge for NLP: their datasets are inherently scarce, difficult to expand, and built from formulaic corpora—making data-quality issues especially consequential yet rarely audited. Motivated by the need to understand what current NMT can realistically achieve for such languages, we investigate hieroglyphic-to-German translation, where a recent study reported 61.5 BLEU using fine-tuned M2M-100. Our reproduction yields only 37.0 BLEU with the released model. Investigating this gap, we find 32% of test targets appear identically in training (16/50; 50% under 8-gram overlap at 70% threshold). This contamination inflates scores dramatically: contaminated samples achieve up to 83.8 BLEU / 0.924 COMET-22 versus 30.9–39.2 BLEU / 0.622–0.676 COMET-22 on clean samples across five model configurations spanning two architectures. Document-level decontamination reduces contaminated BLEU by only 4.6 points because 8/16 targets persist via other source documents—target-level deduplication is required. We release a decontaminated 34-sample test set and establish corrected baselines (30.9–39.2 BLEU), providing a realistic assessment of NMT capability for this endangered writing system.
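A target-level contamination check of the kind described in the preceding abstract might look like the sketch below. It assumes whitespace-tokenized word n-grams and flags a test target when it matches a training target exactly or when at least `threshold` of its n-grams occur in the training targets. The abstract does not state the tokenization or the exact overlap definition, so both are assumptions, and the German strings are invented toy data.

```python
def word_ngrams(text, n):
    """Set of word n-grams in a whitespace-tokenized string."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(test_target, train_targets, n=8):
    """Share of the test target's n-grams that also occur in training targets.

    Returns None when the test target is shorter than n tokens.
    """
    test_grams = word_ngrams(test_target, n)
    if not test_grams:
        return None
    train_grams = set()
    for target in train_targets:
        train_grams |= word_ngrams(target, n)
    return len(test_grams & train_grams) / len(test_grams)


def is_contaminated(test_target, train_targets, n=8, threshold=0.7):
    """Flag exact duplicates and near-duplicates above the overlap threshold."""
    if test_target in train_targets:
        return True
    fraction = overlap_fraction(test_target, train_targets, n)
    return fraction is not None and fraction >= threshold


# Toy data; n=3 is used only because the example sentences are short.
train = ["der König gibt ein Opfer dem Osiris", "ich bin gekommen"]
near_duplicate = "der König gibt ein Opfer dem Amun"
unrelated = "sein Name ist groß im Himmel"
```

Checking targets rather than source documents matters here: as the abstract notes, the same target can persist through other source documents even after document-level decontamination.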
Many Tang-poetry emotion studies still rely on coarse labels (e.g., positive/negative), while recent LLM-based attempts face a practical problem: one-word emotion outputs are highly sensitive to prompt wording. When labels shift with phrasing, historical interpretation becomes hard to reproduce and hard to trust. Focusing on Tang poetry around the An Lushan Rebellion (安史之乱), we propose a fine-grained sentence-level workflow centered on emotion embeddings: we use continuous hidden-state vectors, run automatic clustering, and then consolidate labels for interpretation. On the same 3,198 emotional sentences, one-word outputs show only 50.3% A/B exact agreement, while embedding-based clustering remains stable and well distributed (Hnorm=0.989; 20/20 active clusters). On 7,195 labeled sentences, a char-based baseline reaches 0.446 micro-F1 and 0.395 macro-F1. This multi-stage label-construction path supports historically grounded findings, including the emotional turning point around 762, and also reveals layered patterns that are less visible in coarse setups. These results suggest that stable representation is a prerequisite for turning computational outputs into credible evidence for humanities interpretation.
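The cluster-balance figure reported in the preceding abstract (Hnorm) is presumably a normalized entropy. A minimal sketch follows, assuming Shannon entropy of the cluster-size distribution divided by the logarithm of the number of clusters; the abstract does not spell out the formula, and the function names are illustrative.

```python
import math
from collections import Counter


def normalized_entropy(labels, n_clusters):
    """Shannon entropy of cluster assignments, scaled to [0, 1] by log(n_clusters)."""
    if n_clusters < 2:
        raise ValueError("normalized entropy needs at least two clusters")
    counts = Counter(labels)
    total = len(labels)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(n_clusters)


def active_clusters(labels):
    """Number of clusters that received at least one item."""
    return len(set(labels))


# Toy assignments: four clusters, perfectly balanced versus fully collapsed.
balanced = [0, 1, 2, 3] * 5
collapsed = [0] * 20
```

A value near 1 means items are spread almost evenly over the clusters, while a value near 0 means nearly everything falls into one cluster.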
Large language models (LLMs) can convincingly imitate human writing styles, yet it remains unclear how much stylistic information is encoded in embeddings from any language model and retained after LLM rewriting. We investigate these questions in French, using a controlled literary dataset to quantify the effect of stylistic variation via changes in embedding dispersion. We observe that embeddings reliably capture authorial stylistic features and that these signals persist after rewriting, while also exhibiting LLM-specific patterns. These analytical results offer promising directions for authorship imitation detection in the era of language models.
This research explores the intersection of cultural heritage and Generative AI (Gen-AI), examining AI-generated historical image reconstructions as a potential tool for visualising multiple perspectives in heritage interpretation. In critical heritage studies, the concept of multivocality or polyvocality advocates for representing diverse, often underrepresented, perspectives in how heritage is understood and communicated. We evaluated three prominent AI image generation models across three heritage test cases. A total of 13 user prompts generated 39 images, which underwent both linguistic analysis of intermediate prompt transformations and systematic visual assessment by heritage experts for historical accuracy and cultural sensitivity. The findings revealed both strengths and limitations of the models. While the models produced visually compelling outputs and, in some cases, meaningfully distinct depictions across perspectives, they also exhibited representation imbalances, neutralisation and amplification tendencies, inconsistencies in human portrayal, and misinterpretations introduced during the linguistic transformation of user inputs. Based on these findings, we propose initial guidelines for structured prompt construction that target the specific failure patterns identified. The research suggests that generative AI could serve as a supplementary tool, not a definitive historical source, for exploring multivocal heritage interpretation, particularly in museum and visitor engagement contexts, provided it is used critically and in conjunction with expert input.
Languages change over time. Computational models can be trained to recognize such changes, enabling them to estimate the publication date of texts. Despite recent advancements in Large Language Models (LLMs), their performance on automatic dating of texts, also known as Temporal Text Classification (TTC), has not been explored. This study provides the first systematic evaluation of leading proprietary (Claude 3.5, GPT-4o, Gemini 1.5) and open-source (LLaMA 3.2, Gemma 2, Mistral, Nemotron 4) LLMs on TTC using three historical corpora, two in English and one in Portuguese. We test zero-shot prompting, few-shot prompting, and fine-tuning settings. Our results indicate that proprietary models perform well, especially with few-shot prompting. They also indicate that fine-tuning substantially improves open-source models but that they still fail to match the performance delivered by proprietary LLMs.
Cross-lingual detection of intertextuality and translation in Latin and Ancient Greek through computational approaches is of great interest for classical studies. While several systems exist for parallel sentence detection, based on general multilingual or specific models for Latin–Ancient Greek, they have not been compared against each other. Therefore, we present a synthetic benchmark to evaluate the performance of language models regarding cross-lingual Ancient Greek and Latin parallel sentence mining. We first compare six language models to encode sentences and then further improve the cross-lingual alignment through post-processing, fine-tuning, and knowledge distillation. We find that the whitening transformation in combination with knowledge distillation provides excellent results. Specifically, SPhilBERTa, a trilingual language model for Ancient Greek and Latin, benefits the most from the improvements and achieves a substantial mining score of 97.6 on our benchmark.
The Aramaic proclitic *dalet*, widely used in historical Hebrew texts, serves two distinct grammatical functions: as a subordinating conjunction and as a possessive preposition. Because these functions are orthographically identical and no annotated resources exist for this task, large-scale computational analysis of their usage has previously been infeasible. In this paper we introduce a new BERT model for historical Hebrew in which all prefixes are segmented and encoded as independent tokens. This representation allows the model to evaluate proclitics directly and provides a probe-based unsupervised method for determining the grammatical role of the *dalet* clitic using masked language modeling predictions. We evaluate the approach on a manually annotated dataset drawn from historical Hebrew literature spanning multiple regions and historical periods, achieving an average F1 score of over 0.89. Applying the method to a corpus of more than 300 million words of historical Hebrew texts, we conduct large-scale stylistic analyses of the choice between the Aramaic *dalet* and available Hebrew alternatives. The results reveal geographic and diachronic trends and identify distinct stylistic clusters within the corpus. The prefix-segmented model and annotated dataset are released for unrestricted use.
Prior research on cultural markets has relied on genre labels to distinguish products, overlooking the specific content features that differentiate films within the same genre. We address this gap using tropes as building blocks of narrative structure. From a dataset of 30k tropes across 18k films (TVTropes.org), we identify 29 narrative patterns via community detection and characterize each film by two measures: coherence (how concentrated its tropes are within a few patterns) and spanning distance (how far apart the patterns it combines are). Regression analyses show that coherence improves both audience evaluations and attention, while spanning distance increases evaluations but reduces attention. These findings extend category-spanning theory from genre labels to the internal narrative composition of films, demonstrating how stories are constructed and shape audience responses.
Nearly all studies on web registers (online text varieties associated with characteristic social contexts and linguistic features) use full documents as the unit of analysis. However, web documents often contain sections in different registers. A cooking blog, for instance, may combine personal storytelling, recipe instructions, user comments, and promotional text within a single URL. This internal variation raises doubts about the validity of document-level register labeling. In this paper, we propose an LLM-based approach that identifies register-homogeneous segments within documents and apply it to a 10,000-document English sample from HPLT 3.0. We show that segmentation addresses persistent problems in register analysis, including low inter-annotator agreement and category fuzziness. Strikingly, it also reveals that most web documents contain more than one register, making register mixing the norm rather than the exception on the web.
Identifying intertextual parallels is central to philology, traditionally requiring labor-intensive manual analysis. While digitized historical corpora enable automated approaches using semantic sentence embeddings, training such models requires large annotated datasets, which are scarce for low-resource languages. We address this challenge by introducing a scalable automatic annotation pipeline for training semantic embedding models for Classical Tibetan. Our method combines unsupervised contrastive bootstrapping with iterative pair mining, generating silver-standard similarity labels through two complementary annotation strategies: (1) an ensemble of embedding models and rerankers, and (2) an LLM-as-a-judge committee using best–worst scaling. When combined with a domain-specific, gold-standard annotated dataset for sequential fine-tuning, the resulting text-similarity model achieves a state-of-the-art Spearman correlation of 0.864 on the STS task. This enables effective semantic search in Classical Tibetan and provides a framework for automatic supervision in low-resource languages used in digital humanities. We will make our code, dataset, and trained model publicly available upon publication.
Historical corpora for Tagalog remain limited, particularly texts produced during the Martial Law period under the dictatorship of Ferdinand Marcos Sr. (1972–1986). Much of this material remains undigitized, restricting computational analysis of a significant period in Philippine political history. To support research on historical Tagalog texts, we introduce PHMartialLawNER, a gold-standard named entity recognition corpus constructed from newspapers and underground publications of the Martial Law era. The corpus includes approximately 13k extracted sentence segments (362,000 tokens), consolidated into 8k annotated text spans through a semi-automatic pipeline with manual validation. The reliability of the annotation is measured using Cohen’s 𝜅, reaching 0.86 on all tokens and 0.72 on annotated tokens, with a pairwise F1-score of 0.74. The schema defines historically relevant entity categories including Person (Individual, Collective), Organization (Political, Government, Other), Event (Local, International), Production (Media, Government, Doctrine), as well as Time, Numerical Statistics, Location, and Object entities, specifically identifying weapon artifacts. We establish baseline performance using GLiNER variants, calamanCy models, and transformer-based architectures under zero-shot and few-shot settings. The PHMartialLawNER corpus will be publicly released to support Tagalog NLP, historical text processing, and digital humanities research.
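The agreement statistic reported in the preceding abstract, Cohen's κ, can be computed from two annotators' token labels as in the sketch below. The distinction between "all tokens" and "annotated tokens" is modeled here by dropping tokens that both annotators left unlabeled (`O`). That filtering rule is an assumption, since the abstract does not define it, and the labels are toy data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two parallel label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        counts_a[label] * counts_b[label] for label in set(counts_a) | set(counts_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def annotated_only(labels_a, labels_b, outside="O"):
    """Keep tokens that at least one annotator marked as an entity (assumed rule)."""
    kept = [(a, b) for a, b in zip(labels_a, labels_b) if a != outside or b != outside]
    return [a for a, _ in kept], [b for _, b in kept]


# Toy token-level annotations from two hypothetical annotators.
ann_a = ["PER", "PER", "O", "O", "LOC", "O"]
ann_b = ["PER", "O", "O", "O", "LOC", "O"]
kappa_all = cohens_kappa(ann_a, ann_b)
kappa_annotated = cohens_kappa(*annotated_only(ann_a, ann_b))
```

In this toy case κ is lower on the annotated subset than on all tokens, because the many easy `O`/`O` agreements are removed; the same direction appears in the reported 0.86 versus 0.72.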
Literary translation requires balancing target-language fluency with faithfulness to the source. Recent large language models (LLMs) often produce fluent translations, but it remains unclear whether fluency corresponds to semantic preservation in literary text. We examine this relationship using 130,486 translated paragraphs from 106 novels in 16 source languages, including human, Google Translate, and TranslateGemma translations. Fluency is measured as original-likeness with a translationese classifier trained on paragraph part-of-speech n-grams, and faithfulness with the automatic translation evaluation metric COMET-KIWI. We control for paragraph length and find a consistent negative correlation between fluency and faithfulness. The pattern appears for both human and Google Translate, but is weaker and often non-significant for TranslateGemma. These results show that segment length matters for automatic evaluation and suggest a tradeoff between fluency and faithfulness in literary translation.
We investigate narrative agency in human–LLM creative co-writing, asking who drives story development in turn-based collaboration. Using a new corpus of human–LLM co-written stories, we apply sentiment and semantic modeling to quantify affective alignment and semantic novelty in turn-taking, and directional measures to assess which agent shapes narrative progression. Our results show asymmetric influence: human turns introduce greater semantic novelty and are more likely to shape subsequent developments, whereas LLM contributions predominantly elaborate on human-introduced elements. At the sentiment level, alignment is also asymmetric, but more bidirectional: LLMs exhibit stronger turn-level emotional adaptation than humans, but both agents track each other’s emotional valence and LLMs show an independent tendency to more positive emotional baselines. These findings indicate a complementary division of labor in human–LLM co-writing, where humans drive narrative innovation and direction, while LLMs act as adaptive amplifiers that sustain coherence and elaborate emerging narratives.
AI-driven language technologies are increasingly used in hiring, but they may encode and reproduce harmful social stereotypes. Prior work often studies bias mitigation methods in isolation and outside realistic application settings. We examine the combined effects of data-level and model-level debiasing in a hiring-related context, using Norwegian-language academic bios and a proxy STEM/non-STEM classification task. Specifically, we study masking sensitive information, GenWriter-based rewrites, and adversarial debiasing. We evaluate these interventions using downstream task performance, group fairness metrics, intrinsic bias tests based on WEAT, and measures of gender leakage from hidden representations. We find that combining masking, GenWriter rewrites, and adversarial debiasing substantially reduces gender leakage while maintaining or improving downstream performance. However, effects on fairness gaps and intrinsic bias are mixed, underscoring the need for downstream, context-sensitive evaluation of bias mitigation methods in hiring-related NLP.
While digitized corpora have transformed the study of intellectual transmission, current methods rely heavily on lexical text reuse detection, capturing verbatim quotations but fundamentally missing paraphrases and complex implicit engagement. This paper evaluates semantic search in 18th-century intellectual history through the reception of John Locke’s foundational work. Using expert annotation grounded in a semantic taxonomy, we examine whether an off-the-shelf semantic search pipeline can surface meaning-level correspondences overlooked by lexical methods. Our results demonstrate that semantic search retrieves substantially more implicit receptions than lexical baselines. However, linguistic diagnostics also reveal a “lexical gatekeeping” effect, where retrieval remains partially constrained by surface vocabulary overlap. These findings highlight both the potential and the limitations of semantic retrieval for analyzing the circulation of ideas in large historical corpora. The data is available at https://github.com/COMHIS/locke-sim-data.
How did the thematic repertoire of early English-language science fiction change as the genre consolidated between 1818 and 1930? Using a corpus of 238 public-domain texts, we apply temporally binned latent Dirichlet allocation (LDA), comparing models with and without Authorless preprocessing (which probabilistically downweights author-specific vocabulary). Cross-period topic alignments exceed a permutation null baseline, indicating continuity in topic structure over time. Full-corpus LDA can produce comparable per-topic quality, but only temporal binning enables diachronic alignment; within the binned setting, Authorless reduces author concentration and modestly increases the share of thematic topics without materially reducing coherence. Four high-continuity topic chains – centered on mobility, affect, planetary scale, and scientific knowledge – suggest a shift from earlier romantic and speculative concerns toward more consolidated technoscientific forms. These chains generate interpretable hypotheses about the literary history of early science fiction, and the workflow supports diachronic analysis in small, author-skewed corpora.
This study investigates whether a high-quality, 19-label named entity recogniser for medieval Latin charters can be constructed using only a few hundred annotated sentences. The authors introduce "semantic scaffolding," an innovation that utilizes richly descriptive English label phrases as prompts to activate latent multilingual knowledge within the model. This is paired with a custom span-based architecture utilizing XLM-RoBERTa-large, 4-head attention pooling to handle long property descriptions, and a hybrid loss system including Asymmetric Focal-Dice and InfoNCE contrastive terms. Results demonstrate that semantic scaffolding enables fine-tuned GLiNER to reach 80.8% overlap F1, while the custom architecture achieves 83.4% overlap F1 using only 298 training sentences. Significantly, the paper provides an empirical demonstration that domain-specific pre-training on medieval Latin offers no performance advantage once task-specific fine-tuning is applied. While the model excels at frequent categories like PER (95.7% F1) and LOC (93.5% F1), challenges persist for rare, position-dependent legal categories such as LEG (53.1% F1) and TRANS (52.6% F1).
MotherBoard’s Mother Tongue is a computational linguistics and artistic research project that explores a Large Language Model’s (LLM) vocal production of glossolalia. Glossolalia, colloquially known as ‘speaking in tongues,’ consists of the human production of seemingly unintelligible utterances. It is, by its nature, difficult to annotate accurately with linguistic features relevant for natural language. The glossolalia-producing system demonstrated here consists of the interaction of 1) a ‘nonsense’ linguistic corpus 2) a micro-controller based environmental data stream and 3) a fine-tuned LLM. While discussing some philosophical and artistic considerations of machinic glossolalia, we also address some methodological considerations for Natural Language Processing (NLP). Using the artistic project as a case study, we argue that machinic glossolalia presents a ‘stress test’ that could inform both creative redirections of NLP methods and the definitions held by the subfield.
This paper addresses a practical problem in computational literary history: retrieving adventure novels from a large digitized collection of French fiction where genre metadata are sparse and unreliable. We begin with supervised genre modeling based on a historically situated seed list of 101 adventure novels drawn from literary scholarship. We compare several classifiers and representations, and validate them against 364 independently labeled adventure novels from the Chapitres corpus. The best-performing model, HistGradientBoosting on mean paragraph embeddings, achieves strong external recall (81%) despite the small training set. We then apply this model to the 12,176-novel Fictions littéraires de Gallica archive and refine the resulting candidate corpus through a graph-based post-processing step over a k-nearest-neighbor similarity graph. On the Chapitres benchmark, this graph correction produces negligible changes in retrieval performance, indicating that it is not a generally superior classifier. On Gallica, however, it yields a more cohesive and homogeneous candidate corpus and surfaces interpretable correction cases, including missed canonical adventure novels and excluded borderline texts. We therefore argue that graph-based correction is best understood not as a replacement for supervised classification, but as a heuristic for refining large, noisy archival corpora where exhaustive manual annotation is impossible.
This study examines narratives in which students describe challenges they faced in higher education due to low socioeconomic status (SES) backgrounds and the strategies they used to overcome them. Using computational text analysis, we operationalize three educational theories (Paulo Freire’s Critical Pedagogy, Urie Bronfenbrenner’s Ecological Systems Theory, and Pierre Bourdieu’s Theory of Capital and Habitus) to analyze patterns in these narratives. To strengthen the theory-to-method connection, we incorporate temporal timeline extraction, identifying ordered event sequences and tracking how challenges and forms of capital evolve across a student’s posting history. This temporal lens links theoretical categories (barriers, supports, forms of capital) to when they occur, highlighting moments for timely interventions. By combining theory-driven features with temporal analysis, we evaluate the explanatory capacity of each framework and demonstrate how computational methods can quantitatively examine qualitative lived experience at scale, supporting interdisciplinary research on equity in education.
The diachronic evolution from Latin to the Romance languages involved a restructuring of the grammatical gender system from a tripartite configuration (masculine, feminine, neuter) to a bipartite one (masculine, feminine). In this work, we introduce an interpretable deep learning framework to investigate this phenomenon at both lexical and contextual levels. First, we show that conventional tokenization strategies are insufficiently robust for this low-resource historical setting, and that our proposed tokenizer improves performance over these baselines. At the lexical level, we evaluate the contribution of morphological features to gender prediction. At the contextual level, we quantify the contributions of different part-of-speech categories to grammatical gender prediction. Together, these analyses characterize the distribution of gender information between the lemma and its sentential context. We make our codebase, datasets, and results publicly available.
Part-of-speech (POS) tagging for Medieval Romance languages remains challenging due to orthographic variation, morphological complexity, and limited annotated resources. This paper presents a systematic empirical evaluation of large language models (LLMs) for POS tagging across three medieval varieties: Medieval Occitan, Medieval Catalan, and Medieval French. We compare traditional rule-based and statistical taggers with modern open-source LLMs under zero-shot prompting, few-shot prompting, monolingual fine-tuning, and cross-lingual transfer learning settings. Experiments on historically grounded datasets show that LLM-based approaches consistently outperform traditional taggers, with fine-tuning and multilingual training yielding the largest improvements. In particular, cross-lingual transfer learning substantially benefits under-resourced varieties, while targeted bilingual training can outperform broader multilingual configurations for specific target languages. The results highlight the importance of linguistic proximity and dataset characteristics when designing transfer strategies for historical NLP. These findings provide empirical insights into the applicability of modern neural methods to medieval text processing and provide practical guidance for deploying LLM-based POS tagging pipelines in digital humanities research. All code, models, and processed datasets are released for reproducibility.
This paper introduces a computational framework for evaluating structural properties of the undeciphered Indus script. The study uses a corpus of 6,579 inscriptions. The analytical approach combines unsupervised visual clustering of sign morphology, entropy-based sequence analysis, Kullback-Leibler divergence comparison, and neural sequence modeling (BiLSTM). The results indicate directional asymmetry and structured combinatorial patterns in sign sequences. We conclude that the Indus sign sequences exhibit statistical properties consistent with structured symbolic systems and not easily explained by random generation.
We present the result of preliminary explorations of using the topology of embedded manifolds as a semantic invariant. Our main question is whether the topology of large embedded corpora is invariant in the following two senses. First, one might reasonably expect that the same corpus in two languages would give topologically equivalent embeddings. Second, one might reasonably expect that the same corpus embedded by two different embedding models might give topologically equivalent embeddings. In the paper we will justify these intuitions and give preliminary results indicating that they are, to some extent, justified.
Automated Essay Scoring (AES) is shifting from feature-engineering to LLMs, yet current training-free approaches struggle with calibration, often exhibiting a "middle-score bias" that fails to distinguish between exceptional and weak writing. In this work, we introduce MADRAG (Multi-Agent Debate with Retrieval-Augmented Generation), a training-free framework designed to achieve the reliability of supervised models without the need for labeled training data. MADRAG decomposes the scoring process into a multi-agent interaction: an Advocate highlights essay strengths, a Skeptic critiques weaknesses, and a Judge synthesizes these arguments to assign a score. Crucially, we augment the Judge with a RAG mechanism that retrieves rubric-aligned exemplar essays spanning the full score range, grounding the debate in concrete evidence. Evaluating our approach on the ASAP dataset for analytic trait scoring, we demonstrate that MADRAG significantly outperforms existing prompt-based LLM baselines and achieves performance competitive with state-of-the-art supervised models.
Research on online cultural production shows that platforms act as mediators that can heavily shape textual form. Yet empirical work is often platform-bounded, making it difficult to assess whether observed stylistic regularities are genre signals or platform artefacts. We address this question through a cross-platform design focused on creepypasta, a digital-born horror genre circulating across heterogeneous infrastructures. Using a corpus of 23,000 English-language stories published from 2007 to 2024 on Reddit’s r/NoSleep and the Creepypasta Fandom wiki, we compare stylistic profiles across platforms and relate them to differences in rule regimes and moderation practices, established through qualitative extraction and close reading of platform guidelines. Across readability indices, lexical diversity measures, syntactic proxies, and a cross-fit feature-based model, we find that platform membership leaves only a narrow stylistic imprint, largely reducible to a single architectural rule: r/NoSleep’s mandatory first-person narration. Beyond this constraint, differences are modest and fail to form coherent platform-specific stylistic signatures. This helps us define what is stylistically common in creepypastas, and understand what the genre is to its writers beyond the topics it deals with or the platform it is written on.
While large language models excel at factual adaptation, their ability to internalize nuanced philosophical frameworks under severe data constraints remains underexplored. We investigate this by specializing small LLMs on micro-datasets of foundational Stoic texts using preference optimization (ORPO, AlphaPO). Evaluated via a multi-model critic bank, our results show that just 300 high-fidelity examples can induce strong alignment with inward-facing Stoic virtues, closely approaching few-shot prompting while freeing the context window. Critically, however, all models, including few-shot baselines, exhibit a persistent failure on Stoicism’s outward-facing cosmopolitan duties, pointing to a representational limitation of small models that micro-dataset adaptation alone cannot overcome.
This study proposes and tests loudness standard deviation (SD) of fictional sound events as an acoustically grounded proxy for detecting explicit content in romance fiction. Working with a subcorpus of novels from the Harlequin Men Made in America series, we annotated scenes for character and ambient sound with loudness levels. We additionally annotated the scenes on a ternary severity scale for two content advisory categories drawn from the PG-story taxonomy, Sex & Nudity and Violence & Scariness (CITATION), and tested whether within-scene loudness SD of character and ambient sound correlates with either category. The analyses reveal that erotic scenes are acoustically marked by significantly higher variability in character-produced sounds, reflecting the dynamic range from whispered dialogue to vocalized arousal, while no significant correlation was found between high ambient sound loudness SD and scenes of elevated Violence & Scariness.
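To make the within-scene loudness SD measure in the abstract above concrete, here is a minimal sketch that computes it per scene and averages it by severity level. The numeric loudness scale, the example scenes, the field names, and the choice of population SD are assumptions for illustration only; the abstract does not specify them.

```python
from collections import defaultdict
from statistics import mean, pstdev


def loudness_sd(levels):
    """Population SD of the loudness levels annotated within one scene."""
    # A scene with fewer than two sound events has no measurable variability.
    return pstdev(levels) if len(levels) > 1 else 0.0


def mean_sd_by_severity(scenes):
    """Average within-scene loudness SD for each severity level."""
    grouped = defaultdict(list)
    for scene in scenes:
        grouped[scene["severity"]].append(loudness_sd(scene["character_loudness"]))
    return {severity: mean(values) for severity, values in sorted(grouped.items())}


# Hypothetical annotations: severity on a ternary scale (0, 1, 2) and
# character-produced sound events on an assumed ordinal loudness scale.
scenes = [
    {"severity": 0, "character_loudness": [3, 3, 3]},
    {"severity": 0, "character_loudness": [2, 3]},
    {"severity": 1, "character_loudness": [2, 4, 3]},
    {"severity": 2, "character_loudness": [1, 5, 2, 5]},
]

summary = mean_sd_by_severity(scenes)
for severity, value in summary.items():
    print(f"severity {severity}: mean loudness SD = {value:.2f}")
```

A per-severity summary like this only describes the data; testing the reported association would additionally require a correlation or significance test over the per-scene values.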
This study introduces and analyzes a novel authorship attribution case: the children’s stories published by Oscar and Constance Wilde. We analyzed the corpus of stories with both supervised (SVM with string kernel) and unsupervised (Hierarchical Clustering via Rank Distance) methods and found a strong stylistic similarity between the story "The Selfish Giant" published by Oscar Wilde and the stylometric profile of Constance Wilde. Starting from this baseline, we also explored the capabilities of LLMs in authorship attribution via perplexity. Our findings suggest that the story "The Selfish Giant" might be the result of a collaboration between Oscar and Constance Wilde. Moreover, our results point to the distinct stylistic fingerprints of the two authors with regard to the rest of the corpus, confirming that their respective styles are separable despite shared genre and period.
This work investigates the extent to which open-source Large Language Models (LLMs) can improve accessibility of unstructured historical documents by performing abstractive summarization and fine-grained Named Entity Recognition (NER) for role classification and violation types. We evaluate open-source LLMs in zero-shot settings and apply these tasks to witness testimonies collected by the South African Truth and Reconciliation Commission (TRC), which archived a large body of text documenting human rights violations during apartheid. Despite their historical significance, these texts are difficult to access due to their length, lack of standardized structure, and the absence of systematic indexing. Open-source LLMs show strong performance in summarization, with most models surpassing non-LLM baselines (maximum BERTScore 0.77), while NER performance remains limited (maximum F1-score 0.61). Results suggest a trade-off in which stylistic fluency is prioritized over factual precision. A two-stage pipeline, summarization followed by NER on LLM summaries, leads to measurable improvements.
This paper introduces Perspectives, an interactive extension of a qualitative data analysis tool suite developed at our university, designed to empower Digital Humanities (DH) scholars to explore and organize large, unstructured document collections. Perspectives implements a flexible, aspect-focused document clustering pipeline with human-in-the-loop refinement capabilities. We showcase how this process can be initially steered by defining analytical lenses through document rewriting prompts and instruction-based embeddings, and further aligned with user intent through tools for refining clusters and mechanisms for fine-tuning the embedding model. The demonstration highlights a typical workflow, illustrating how DH researchers can leverage Perspectives’s interactive document map to uncover topics, sentiments, or other relevant categories, thereby gaining insights and preparing their data for subsequent in-depth analysis.