Evaluating Text Style Transfer: A Nine-language Benchmark for Text Detoxification

Vitaly Protasov, Nikolay Babakov, Daryna Dementieva, Alexander Panchenko


Abstract
Despite notable advances in large language models (LLMs), reliable evaluation of text generation tasks such as text style transfer (TST) remains an open challenge. Existing research has shown that automatic metrics often correlate poorly with human judgments (Dementieva et al., 2024; Pauli et al., 2025), limiting our ability to assess model performance accurately. Furthermore, most prior work has focused primarily on English, while the evaluation of multilingual TST systems, particularly for text detoxification, remains largely underexplored. In this paper, we present the first comprehensive multilingual benchmarking study of evaluation metrics for text detoxification evaluation across nine languages: Arabic, Amharic, Chinese, English, German, Hindi, Russian, Spanish, Ukrainian. Drawing inspiration from machine translation evaluation, we compare neural-based automatic metrics with LLM-as-a-judge approaches together with experiments on task-specific fine-tuned models. Our analysis reveals that the proposed metrics achieve significantly higher correlation with human judgments compared to baseline approaches. We also provide actionable insights and practical guidelines for building robust and reliable multilingual evaluation pipelines for text detoxification and related TST tasks.
Anthology ID:
2026.lrec-main.358
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
SIG:
Publisher:
ELRA Language Resource Association
Note:
Pages:
4560–4574
Language:
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.358/
DOI:
Bibkey:
Cite (ACL):
Vitaly Protasov, Nikolay Babakov, Daryna Dementieva, and Alexander Panchenko. 2026. Evaluating Text Style Transfer: A Nine-language Benchmark for Text Detoxification. International Conference on Language Resources and Evaluation, main:4560–4574.
Cite (Informal):
Evaluating Text Style Transfer: A Nine-language Benchmark for Text Detoxification (Protasov et al., LREC 2026)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.358.pdf