Bootstrapping a Sentence-Level Corpus Quality Classifier for Web Text using Active Learning

Maximilian Bley, Thomas Eckart, Christopher Schröder


Abstract
The quality of training data is an essential factor for training large language models (LLMs) as it directly impacts their performance. While high-quality data is crucial for training competitive LLMs, existing preprocessing pipelines still partly rely on rules, which are computationally cheap but also inherently limited to simpler patterns. Model-based filtering on the other hand, is more flexible and can detect finer-grained patterns and semantics, but often requires substantial amounts of labeled data. While there are existing models for common problems (such as toxicity classification), this is often only the case for resource-rich languages and well-studied problems—leaving gaps in coverage for other languages, problems, or combinations thereof. In this work, we investigate the feasibility of model-based preprocessing despite the absence of labeled data. We use active learning to bootstrap a sentence-level multi-label classifier that detects textual problems of traditional text cleaning approaches. With only 498 examples, the final classifier reaches macro- and micro-F1 scores of 0.80 and 0.84, making it suitable for practical use. Moreover, we find that it captured subtle errors compared to a rule-based baseline. We publish the training code, a labeled corpus quality classification dataset, and the resulting classifier.
Anthology ID:
2025.globalnlp-1.12
Volume:
Proceedings of the Workshop on Beyond English: Natural Language Processing for all Languages in an Era of Large Language Models
Month:
September
Year:
2025
Address:
Varna, Bulgaria
Editors:
Sudhansu Bala Das, Pruthwik Mishra, Alok Singh, Shamsuddeen Hassan Muhammad, Asif Ekbal, Uday Kumar Das
Venues:
GlobalNLP | WS
SIG:
Publisher:
INCOMA Ltd., Shoumen, BULGARIA
Note:
Pages:
98–109
Language:
URL:
https://preview.aclanthology.org/corrections-2026-01/2025.globalnlp-1.12/
DOI:
Bibkey:
Cite (ACL):
Maximilian Bley, Thomas Eckart, and Christopher Schröder. 2025. Bootstrapping a Sentence-Level Corpus Quality Classifier for Web Text using Active Learning. In Proceedings of the Workshop on Beyond English: Natural Language Processing for all Languages in an Era of Large Language Models, pages 98–109, Varna, Bulgaria. INCOMA Ltd., Shoumen, BULGARIA.
Cite (Informal):
Bootstrapping a Sentence-Level Corpus Quality Classifier for Web Text using Active Learning (Bley et al., GlobalNLP 2025)
Copy Citation:
PDF:
https://preview.aclanthology.org/corrections-2026-01/2025.globalnlp-1.12.pdf