Active Learning for Corpus Refinement: Cost-Effective Preprocessing to Improve Validity of Applied Quantitative Text Analysis

Jakob Steglich, Stephan Poppe


Abstract
Quantitative text analysis relies on high-quality corpora, but keyword-based collection often retrieves irrelevant material, undermining validity. We show that active learning with a transformer-based classifier can iteratively refine corpora by excluding irrelevant documents, prompting researchers to clarify inclusion criteria and address edge cases. Applied to German newspaper articles on depression and schizophrenia, this approach improves construct validity and reduces labeling effort. The document relevance classifiers reached an F1-score of 0.8 with just 100–150 labeled snippets, with further gains from tuning, outperforming both random sampling and a weakly supervised sampling baseline. Filtering non-medical articles further had little effect on downstream depression stigmatization measures but increased schizophrenia stigmatization. Active learning thus enables efficient corpus validation and clearer concept boundaries with minimal preprocessing. The source code is publicly available at https://github.com/jakobstgl/active-learning-corpus-refinement.
Anthology ID:
2026.eacl-srw.70
Volume:
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 4: Student Research Workshop)
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Selene Baez Santamaria, Sai Ashish Somayajula, Atsuki Yamaguchi
Venue:
EACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
952–966
Language:
URL:
https://preview.aclanthology.org/ingest-eacl/2026.eacl-srw.70/
DOI:
Bibkey:
Cite (ACL):
Jakob Steglich and Stephan Poppe. 2026. Active Learning for Corpus Refinement: Cost-Effective Preprocessing to Improve Validity of Applied Quantitative Text Analysis. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 4: Student Research Workshop), pages 952–966, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
Active Learning for Corpus Refinement: Cost-Effective Preprocessing to Improve Validity of Applied Quantitative Text Analysis (Steglich & Poppe, EACL 2026)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingest-eacl/2026.eacl-srw.70.pdf