Raven Adam
2025
Bidirectional Topic Matching: Quantifying Thematic Intersections Between Climate Change and Climate Mitigation News Corpora Through Topic Modelling
Raven Adam | Marie Kogler
Proceedings of the 2nd Workshop on Natural Language Processing Meets Climate Change (ClimateNLP 2025)
Bidirectional Topic Matching (BTM) is a novel method for cross-corpus topic modeling that quantifies thematic overlap and divergence between corpora. BTM is a flexible framework that can incorporate various topic modeling approaches, including BERTopic, Top2Vec, and Latent Dirichlet Allocation (LDA). It employs a dual-model approach, training separate topic models for each corpus and applying them reciprocally to enable comprehensive cross-corpus comparisons. This methodology facilitates the identification of shared themes and unique topics, providing nuanced insights into thematic relationships. A case study on climate news articles illustrates BTM’s utility by analyzing two distinct corpora: news coverage on climate change and articles focused on climate mitigation. The results reveal significant thematic overlaps and divergences, shedding light on how these two aspects of climate discourse are framed in the media.
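The dual-model idea described in the abstract is easy to sketch. Below is a minimal illustration using scikit-learn's LDA (one of the backends the abstract names): a shared vocabulary, one model per corpus, and reciprocal application of each model to the other corpus's documents. The toy corpora, topic counts, and the dominant-topic co-occurrence rule are illustrative assumptions, not the paper's exact matching criterion.

```python
# A minimal sketch of BTM's dual-model idea, using scikit-learn's LDA
# (one of the backends the abstract mentions). The tiny example corpora,
# the topic count, and the dominant-topic co-occurrence matching below
# are illustrative assumptions, not the paper's exact matching criterion.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus_change = [
    "rising sea levels threaten coastal cities",
    "heatwaves and droughts intensify under global warming",
    "melting glaciers accelerate sea level rise",
    "extreme weather events grow more frequent and severe",
]
corpus_mitigation = [
    "solar and wind power replace coal fired plants",
    "governments subsidize renewable energy adoption",
    "carbon taxes aim to cut industrial emissions",
    "electric vehicles reduce transport emissions",
]

# Shared vectorizer so both models operate over the same vocabulary.
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(corpus_change + corpus_mitigation)
dtm_a, dtm_b = dtm[: len(corpus_change)], dtm[len(corpus_change):]

n_topics = 2  # kept tiny for the toy corpora
model_a = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(dtm_a)
model_b = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(dtm_b)

def cross_apply(model_self, model_other, dtm_part):
    """Apply both models to the same documents and count how often each
    'native' dominant topic co-occurs with each 'foreign' dominant topic."""
    native = model_self.transform(dtm_part).argmax(axis=1)
    foreign = model_other.transform(dtm_part).argmax(axis=1)
    counts = np.zeros((model_self.n_components, model_other.n_components), dtype=int)
    for i, j in zip(native, foreign):
        counts[i, j] += 1
    return counts

# Reciprocal application: each corpus is scored by its own model and by
# the other corpus's model. Strong co-occurrence in both directions
# suggests a shared theme; a topic with no strong counterpart suggests
# a corpus-specific theme.
overlap_a = cross_apply(model_a, model_b, dtm_a)
overlap_b = cross_apply(model_b, model_a, dtm_b)
print("A-docs co-occurrence:\n", overlap_a)
print("B-docs co-occurrence:\n", overlap_b)
```

Swapping LDA for BERTopic or Top2Vec only changes the fit/transform calls; the reciprocal-application logic stays the same, which reflects the backend flexibility the abstract claims.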
2024
Extracting position titles from unstructured historical job advertisements
Klara Venglarova | Raven Adam | Georg Vogeler
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities
This paper explores the automated extraction of job titles from unstructured historical job advertisements, using a corpus of digitized German-language newspapers from 1850 to 1950. The study addresses the challenges of working with unstructured, OCR-processed historical data, in contrast to contemporary approaches, which typically rely on structured, born-digital datasets for this text type. We compare four extraction methods: a dictionary-based approach, a rule-based approach, a named entity recognition (NER) model, and a text-generation method. The NER approach, trained on manually annotated data, achieved the highest F1 score (0.944 for a transformer model trained on GPU, 0.884 for a model trained on CPU), demonstrating its flexibility and its ability to correctly identify job titles. The text-generation approach performed similarly (0.920). However, the rule-based (0.69) and dictionary-based (0.632) methods also reached relatively high F1 scores while offering the advantage of not requiring extensive labeling of training data. The results highlight the complexities of extracting meaningful job titles from historical texts, with implications for further research into labor market trends and occupational history.
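The best-performing NER approach amounts to a token-classification inference step, which can be sketched as below. The checkpoint path and the JOBTITLE label are hypothetical placeholders for a transformer model fine-tuned on the manually annotated advertisements; the abstract does not name a released model.

```python
# A minimal sketch of the NER-style extraction step, assuming a transformer
# token-classification model fine-tuned on annotated job advertisements.
# "path/to/job-title-ner-model" and the "JOBTITLE" label are hypothetical
# placeholders, not a released checkpoint from the paper.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="path/to/job-title-ner-model",  # placeholder checkpoint
    aggregation_strategy="simple",  # merge sub-word tokens into entity spans
)

ad_text = "Tüchtiger Buchhalter für sofort gesucht. Offerte an die Expedition."
entities = ner(ad_text)
job_titles = [e["word"] for e in entities if e["entity_group"] == "JOBTITLE"]
print(job_titles)  # expected: ["Buchhalter"]
```

A dictionary or rule-based pass over the same text would skip the model entirely, trading recall on unseen titles for zero annotation cost, which matches the trade-off the abstract reports.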