Vera Danilova

2026

Proceedings of the 1st Workshop on Linguistic Analysis for Health (HeaLing 2026)
Vera Danilova | Murathan Kurfalı | Ylva Söderfeldt | Julia Reed | Andrew Burchell

An Enhanced Training-Free Pipeline for Entity Recognition and Linking: A Low-Resource Case Study – 20-th Century Historical Medical Texts
Phu-Vinh Nguyen | Vera Danilova
Proceedings of the 1st Workshop on Linguistic Analysis for Health (HeaLing 2026)

Entity linking in biomedicine typically relies on large annotated corpora and supervised methods, which often fail in out-of-distribution settings. Historical medical texts are rich in biomedical terms but pose unique challenges: terminology has changed, some concepts are obsolete, and stylistic differences from modern journals prevent off-the-shelf models fine-tuned on contemporary datasets from aligning historical terms with current ontologies. Training-free methods based on LLMs offer a solution by linking historical terms to modern concepts and inferring their meaning from context. In this paper, we evaluate a state-of-the-art training-free entity linking method on historical medical texts and propose an improved pipeline—end-to-end entity extraction and linking with confidence estimation. We also assess performance on modern benchmarks to check whether the gains generalize to other domains and show their superior performance in most cases. We report an analysis of the findings. The code and curated dataset for historical medical entity linking are available on GitHub.

2025

Classifying Textual Genre in Historical Magazines (1875-1990)
Vera Danilova | Ylva Söderfeldt
Proceedings of the 9th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2025)

Historical magazines are a valuable resource for understanding the past, offering insights into everyday life, culture, and evolving social attitudes. They often feature diverse layouts and genres. Short stories, guides, announcements, and promotions can all appear side by side on the same page. Without grouping these documents by genre, term counts and topic models may lead to incorrect interpretations. This study takes a step towards addressing this issue by focusing on genre classification within a digitized collection of European medical magazines in Swedish and German. We explore 2 scenarios: 1) leveraging the available web genre datasets for zero-shot genre prediction, 2) semi-supervised learning over the few-shot setup. This paper offers the first experimental insights in this direction. We find that 1) with a custom genre scheme tailored to historical dataset characteristics it is possible to effectively utilize categories from web genre datasets for cross-domain and cross-lingual zero-shot prediction, 2) semi-supervised training gives considerable advantages over few-shot for all models, particularly for the historical multilingual BERT.

Post-OCR Correction of Historical German Periodicals using LLMs
Vera Danilova | Gijs Aangenendt
Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025)

Optical Character Recognition (OCR) is critical for accurate access to historical corpora, providing a foundation for processing pipelines and the reliable interpretation of historical texts. Despite advances, the quality of OCR in historical documents remains limited, often requiring post-OCR correction to address residual errors. Building on recent progress with instruction-tuned Llama 2 models applied to English historical newspapers, we examine the potential of German Llama 2 and Mistral models for post-OCR correction of German medical historical periodicals. We perform instruction tuning using two configurations of training data, augmenting our small annotated dataset with two German datasets from the same time period. The results demonstrate that German Mistral enhances the raw OCR output, achieving a lower average word error rate (WER). However, the average character error rate (CER) either decreases or remains unchanged across all models considered. We perform an analysis of performance within the error groups and provide an interpretation of the results.

2024

Relation between Cross-Genre and Cross-Topic Transfer in Dependency Parsing
Vera Danilova | Sara Stymne
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Matching genre in training and test data has been shown to improve dependency parsing. However, it is not clear whether the used methods capture only the genre feature. We hypothesize that successful transfer may also depend on topic similarity. Using topic modelling, we assess whether cross-genre transfer in dependency parsing is stable with respect to topic distribution. We show that LAS scores in cross-genre transfer within and across treebanks typically align with topic distances. This indicates that topic is an important explanatory factor for genre transfer.
