Towards Reliable Evaluation of Emotional Text Generation in LLMs: Human vs. Automatic Metrics

Sadegh Jafari, Els Lefever, Veronique Hoste


Abstract
Evaluating emotion generation in large language models (LLMs) remains a challenging problem due to the subjective nature of emotions and the lack of reliable automatic evaluation metrics. In this paper, we introduce a robust and extensible benchmark for systematically assessing automatic metrics in emotion generation tasks. The benchmark currently includes 13 automatic evaluation metrics and five state-of-the-art LLMs, and can be easily extended without requiring additional human annotations. Through a correlation analysis with human evaluations on a carefully curated annotated subset, we identify the emotion recognition score (ERS) metric, computed with gpt-5-nano in a one-shot setting, as the most reliable automatic evaluator, achieving a correlation exceeding 0.99. Interestingly, despite relying on the same underlying LLM, the emotion absolute score (EAS) metric shows a negative correlation, demonstrating that LLM strength alone does not guarantee that an automatic metric aligns with human judgment. We also provide lightweight, non-LLM-based alternatives, R2_m and R3_m, in the emotion analogy score (EAnS) metric family, suitable for low-resource settings where large models are not accessible. A comprehensive per-class emotion analysis further highlights the strengths and weaknesses of the evaluated models. Overall, our results offer a practical and scalable framework for benchmarking emotion generation evaluation metrics and pave the way for more reliable, fair, and interpretable emotional language evaluation.
Anthology ID:
2026.lrec-main.222
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
ELRA Language Resources Association
Pages:
2836–2847
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.222/
Cite (ACL):
Sadegh Jafari, Els Lefever, and Veronique Hoste. 2026. Towards Reliable Evaluation of Emotional Text Generation in LLMs: Human vs. Automatic Metrics. In Proceedings of the Fifteenth Language Resources and Evaluation Conference, pages 2836–2847, Palma de Mallorca, Spain.
Cite (Informal):
Towards Reliable Evaluation of Emotional Text Generation in LLMs: Human vs. Automatic Metrics (Jafari et al., LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.222.pdf