Evaluating Text Style Transfer: A Nine-language Benchmark for Text Detoxification
Vitaly Protasov, Nikolay Babakov, Daryna Dementieva, Alexander Panchenko
Abstract
Despite notable advances in large language models (LLMs), reliable evaluation of text generation tasks such as text style transfer (TST) remains an open challenge. Existing research has shown that automatic metrics often correlate poorly with human judgments (Dementieva et al., 2024; Pauli et al., 2025), limiting our ability to assess model performance accurately. Furthermore, most prior work has focused on English, while the evaluation of multilingual TST systems, particularly for text detoxification, remains largely underexplored. In this paper, we present the first comprehensive multilingual benchmarking study of evaluation metrics for text detoxification across nine languages: Arabic, Amharic, Chinese, English, German, Hindi, Russian, Spanish, and Ukrainian. Drawing inspiration from machine translation evaluation, we compare neural automatic metrics with LLM-as-a-judge approaches, together with experiments on task-specific fine-tuned models. Our analysis reveals that the proposed metrics achieve significantly higher correlation with human judgments than baseline approaches. We also provide actionable insights and practical guidelines for building robust and reliable multilingual evaluation pipelines for text detoxification and related TST tasks.
- Anthology ID:
- 2026.lrec-main.358
- Volume:
- Proceedings of the Fifteenth Language Resources and Evaluation Conference
- Month:
- May
- Year:
- 2026
- Address:
- Palma de Mallorca, Spain
- Editors:
- Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
- Venue:
- LREC
- Publisher:
- ELRA Language Resource Association
- Pages:
- 4560–4574
- URL:
- https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.358/
- Cite (ACL):
- Vitaly Protasov, Nikolay Babakov, Daryna Dementieva, and Alexander Panchenko. 2026. Evaluating Text Style Transfer: A Nine-language Benchmark for Text Detoxification. International Conference on Language Resources and Evaluation, main:4560–4574.
- Cite (Informal):
- Evaluating Text Style Transfer: A Nine-language Benchmark for Text Detoxification (Protasov et al., LREC 2026)
- PDF:
- https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.358.pdf