@inproceedings{jafari-etal-2026-towards,
    title = "Towards Reliable Evaluation of Emotional Text Generation in {LLM}s: Human vs. Automatic Metrics",
    author = "Jafari, Sadegh and
      Lefever, Els and
      Hoste, Veronique",
    editor = "Piperidis, Stelios and
      Bel, N{\'u}ria and
      van den Heuvel, Henk and
      Ide, Nancy and
      Krek, Simon and
      Toral, Antonio",
    booktitle = "Proceedings of the International Conference on Language Resources and Evaluation ({LREC} 2026)",
    month = may,
    year = "2026",
    address = "Palma de Mallorca, Spain",
    publisher = "ELRA Language Resource Association",
    url = "https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.222/",
    pages = "2836--2847",
    abstract = "Evaluating emotion generation in large language models (LLMs) remains a challenging problem due to the subjective nature of emotions and the lack of reliable automatic evaluation metrics. In this paper, we introduce a robust and extensible benchmark for systematically assessing automatic metrics in emotion generation tasks. The benchmark currently includes 13 automatic evaluation metrics and five state-of-the-art LLMs, and can be easily extended without requiring additional human annotations. Through a correlation analysis with human evaluations on a carefully curated annotated subset, we identify the emotion recognition score (ERS) metric, computed with gpt-5-nano in an oneshot setting, as the most reliable automatic evaluator, achieving a correlation exceeding 0.99. Interestingly, despite relying on the same underlying LLM, the emotion absolute score (EAS) metric shows a negative correlation, demonstrating that LLM strength alone does not guarantee automatic metric alignment with human judgment. We also provide lightweight, non-LLM-based alternatives, R2{\_}m and R3{\_}m, in the emotion analogy score (EAnS) metric family, suitable for low-resource settings where large models are not accessible. A comprehensive per-class emotion analysis further highlights the strengths and weaknesses of the evaluated models. Overall, our results offer a practical and scalable framework for benchmarking emotion generation evaluation metrics and pave the way for more reliable, fair, and interpretable emotional language evaluation."
}
@comment{Markdown (Informal)}
@comment{
[Towards Reliable Evaluation of Emotional Text Generation in LLMs: Human vs. Automatic Metrics](https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.222/) (Jafari et al., LREC 2026)
ACL
}