2025
Fair Play in the Newsroom: Actor-Based Filtering Gender Discrimination in Text Corpora
Stefanie Urchs | Veronika Thurner | Matthias Aßenmacher | Christian Heumann | Stephanie Thiemichen
Proceedings of the 5th Workshop on Evaluation and Comparison of NLP Systems
Language corpora are the foundation of most natural language processing research, yet they often reproduce structural inequalities. One such inequality is gender discrimination in how actors are represented, which can distort analyses and perpetuate discriminatory outcomes. This paper introduces a user-centric, actor-level pipeline for detecting and mitigating gender discrimination in large-scale text corpora. By combining discourse-aware analysis with metrics for sentiment, syntactic agency, and quotation styles, our method enables both fine-grained auditing and exclusion-based balancing. Applied to the taz2024full corpus of German newspaper articles (1980–2024), the pipeline yields a more gender-balanced dataset while preserving core dynamics of the source material. Our findings show that structural asymmetries can be reduced through systematic filtering, though subtler biases in sentiment and framing remain. We release the tools and reports to support further research in discourse-based fairness auditing and equitable corpus construction.
taz2024full: Analysing German Newspapers for Gender Bias and Discrimination across Decades
Stefanie Urchs | Veronika Thurner | Matthias Aßenmacher | Christian Heumann | Stephanie Thiemichen
Findings of the Association for Computational Linguistics: ACL 2025
Open-access corpora are essential for advancing natural language processing (NLP) and computational social science (CSS). However, large-scale resources for German remain limited, restricting research on linguistic trends and societal issues such as gender bias. We present taz2024full, the largest publicly available corpus of German newspaper articles to date, comprising over 1.8 million texts from taz, spanning 1980 to 2024. As a demonstration of the corpus’s utility for bias and discrimination research, we analyse gender representation across four decades of reporting. We find a consistent overrepresentation of men, but also a gradual shift toward more balanced coverage in recent years. Using a scalable, structured analysis pipeline, we provide a foundation for studying actor mentions, sentiment, and linguistic framing in German journalistic texts. The corpus supports a wide range of applications, from diachronic language analysis to critical media studies, and is freely available to foster inclusive and reproducible research in German-language NLP.
2024
Detecting Gender Discrimination on Actor Level Using Linguistic Discourse Analysis
Stefanie Urchs | Veronika Thurner | Matthias Aßenmacher | Christian Heumann | Stephanie Thiemichen
Proceedings of the 5th Workshop on Gender Bias in Natural Language Processing (GeBNLP)
With the usage of tremendous amounts of text data for training powerful large language models such as ChatGPT, the issue of analysing and securing data quality has become more pressing than ever. Any biases, stereotypes and discriminatory patterns that exist in the training data can be reproduced, reinforced or broadly disseminated by the models in production. Therefore, it is crucial to carefully select and monitor the text data that is used as input to train the model. Due to the vast amount of training data, this process needs to be (at least partially) automated. In this work, we introduce a novel approach for automatically detecting gender discrimination in text data on the actor level based on linguistic discourse analysis. Specifically, we combine existing information extraction (IE) techniques to partly automate the qualitative research done in linguistic discourse analysis. We focus on two important steps: Identifying the respective person-named-entity (an actor) and all forms it is referred to (Nomination), and detecting the characteristics it is ascribed (Predication). As a proof of concept, we integrate these two steps into a pipeline for automated text analysis. The separate building blocks of the pipeline could be flexibly adapted, extended, and scaled for bigger datasets to accommodate a wide range of usage scenarios and specific ML tasks or help social scientists with analysis tasks. We showcase and evaluate our approach on several real and simulated exemplary texts.