Fair Play in the Newsroom: Actor-Based Filtering Gender Discrimination in Text Corpora

Stefanie Urchs, Veronika Thurner, Matthias Aßenmacher, Christian Heumann, Stephanie Thiemichen


Abstract
Language corpora are the foundation of most natural language processing research, yet they often reproduce structural inequalities. One such inequality is gender discrimination in how actors are represented, which can distort analyses and perpetuate discriminatory outcomes. This paper introduces a user-centric, actor-level pipeline for detecting and mitigating gender discrimination in large-scale text corpora. By combining discourse-aware analysis with metrics for sentiment, syntactic agency, and quotation styles, our method enables both fine-grained auditing and exclusion-based balancing. Applied to the taz2024full corpus of German newspaper articles (1980–2024), the pipeline yields a more gender-balanced dataset while preserving core dynamics of the source material. Our findings show that structural asymmetries can be reduced through systematic filtering, though subtler biases in sentiment and framing remain. We release the tools and reports to support further research in discourse-based fairness auditing and equitable corpus construction.
Anthology ID:
2025.eval4nlp-1.5
Volume:
Proceedings of the 5th Workshop on Evaluation and Comparison of NLP Systems
Month:
December
Year:
2025
Address:
Mumbai, India
Editors:
Mousumi Akter, Tahiya Chowdhury, Steffen Eger, Christoph Leiter, Juri Opitz, Erion Çano
Venues:
Eval4NLP | WS
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
55–65
Language:
URL:
https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.eval4nlp-1.5/
DOI:
Bibkey:
Cite (ACL):
Stefanie Urchs, Veronika Thurner, Matthias Aßenmacher, Christian Heumann, and Stephanie Thiemichen. 2025. Fair Play in the Newsroom: Actor-Based Filtering Gender Discrimination in Text Corpora. In Proceedings of the 5th Workshop on Evaluation and Comparison of NLP Systems, pages 55–65, Mumbai, India. Association for Computational Linguistics.
Cite (Informal):
Fair Play in the Newsroom: Actor-Based Filtering Gender Discrimination in Text Corpora (Urchs et al., Eval4NLP 2025)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.eval4nlp-1.5.pdf