Do NOT Classify and Count: Hybrid Attribute Control Success Evaluation

Felix Matthias Saaro, Pius von Däniken, Mark Cieliebak, Jan Milan Deriu


Abstract
Evaluating attribute control success in controllable text generation and related generation tasks typically relies on pretrained classifiers. We show that this widely used classify-and-count approach yields biased and inconsistent results, with estimates varying significantly across classifiers. We frame control success estimation as a quantification task and apply a hybrid Bayesian method that combines classifier predictions with a small number of human labels for calibration. To test our approach, we collected a two-modality test dataset consisting of 600 human-rated samples and 60,000 automatically rated samples. Our experiments show that our approach produces robust estimates of control success across both text and text-to-image generation tasks, offering a principled alternative to current evaluation practices.
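To see why raw classify-and-count is biased and how a few human labels can correct it, here is a minimal, purely illustrative sketch. It does NOT reproduce the paper's hybrid Bayesian method; instead it uses the classic "adjusted classify-and-count" correction from the quantification literature, with hypothetical error rates (`TPR`, `FPR`) and sample sizes loosely mirroring the paper's dataset (600 human labels, 60,000 automatic ratings).

```python
# Illustrative sketch (NOT the paper's exact Bayesian method): why naive
# classify-and-count is biased, and how a small human-labeled calibration
# set corrects it via the standard adjusted classify-and-count estimator.
import random

random.seed(0)

TRUE_PREVALENCE = 0.7   # hypothetical true control-success rate
TPR, FPR = 0.85, 0.20   # hypothetical classifier error rates (unknown to CC)

def classifier_predict(label: int) -> int:
    """Simulate a noisy attribute classifier with fixed TPR/FPR."""
    p = TPR if label == 1 else FPR
    return 1 if random.random() < p else 0

# Large pool of generated samples, automatically rated by the classifier.
truth = [1 if random.random() < TRUE_PREVALENCE else 0 for _ in range(60000)]
preds = [classifier_predict(t) for t in truth]

# Naive classify-and-count: biased whenever TPR < 1 or FPR > 0,
# since E[CC] = prevalence * TPR + (1 - prevalence) * FPR.
cc = sum(preds) / len(preds)

# A small human-labeled calibration set estimates TPR and FPR.
calib = list(zip(truth[:600], preds[:600]))  # 600 human labels
n_pos = sum(t for t, _ in calib)
tpr_hat = sum(p for t, p in calib if t == 1) / max(1, n_pos)
fpr_hat = sum(p for t, p in calib if t == 0) / max(1, len(calib) - n_pos)

# Adjusted classify-and-count: invert the misclassification process.
acc = (cc - fpr_hat) / (tpr_hat - fpr_hat)
acc = min(1.0, max(0.0, acc))  # clip to a valid proportion

print(f"true prevalence:       {TRUE_PREVALENCE:.3f}")
print(f"classify-and-count:    {cc:.3f}")
print(f"adjusted (calibrated): {acc:.3f}")
```

With these assumed error rates, the naive estimate converges to roughly 0.7 x 0.85 + 0.3 x 0.20 = 0.655 rather than the true 0.7, while the calibrated estimate recovers the true prevalence up to sampling noise in the 600-label calibration set.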
Anthology ID:
2026.eacl-long.48
Volume:
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Vera Demberg, Kentaro Inui, Lluís Marquez
Venue:
EACL
Publisher:
Association for Computational Linguistics
Pages:
1101–1114
URL:
https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.48/
Cite (ACL):
Felix Matthias Saaro, Pius von Däniken, Mark Cieliebak, and Jan Milan Deriu. 2026. Do NOT Classify and Count: Hybrid Attribute Control Success Evaluation. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1101–1114, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
Do NOT Classify and Count: Hybrid Attribute Control Success Evaluation (Saaro et al., EACL 2026)
PDF:
https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.48.pdf