Who Generates More Empathetic Responses—Humans or LLMs? A Comparative Evaluation with Human and LLM Judges

Anuradha Welivita; Fawzia Zeitoun; Pearl Pu

Abstract

This paper compares the empathetic quality of responses generated by humans and large language models (LLMs). We evaluate four LLMs that were widely used at the time of the study (GPT-4, LLaMA-2-70B-Chat, Gemini-1.0-Pro, and Mixtral-8×7B-Instruct) against a human baseline in a large-scale between-subjects study. A total of 1,000 human participants evaluated the empathetic quality of human- and LLM-generated responses to 2,000 dialogue prompts spanning 32 positive and negative emotions. To complement human judgments, we also employed an LLM-as-judge (GPT-4o-mini) to assess the same responses. Across emotions and evaluators, LLM-generated responses were rated as significantly more empathetic than human-written responses. We also observed that both human judges and the LLM-as-judge tended to rate responses generated by their own group more favorably, indicating self-favoring tendencies. These findings highlight both the strong performance of contemporary LLMs in empathetic responding and the need to interpret human- and LLM-based evaluations with care.

Anthology ID: 2026.conll-main.21
Volume: Proceedings of the 30th Conference on Computational Natural Language Learning
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Claire Bonial, Yevgeni Berzak
Venues: CoNLL | WS
Publisher: Association for Computational Linguistics
Pages: 358–381
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.conll-main.21/
Cite (ACL): Anuradha Welivita, Fawzia Zeitoun, and Pearl Pu. 2026. Who Generates More Empathetic Responses—Humans or LLMs? A Comparative Evaluation with Human and LLM Judges. In Proceedings of the 30th Conference on Computational Natural Language Learning, pages 358–381, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): Who Generates More Empathetic Responses—Humans or LLMs? A Comparative Evaluation with Human and LLM Judges (Welivita et al., CoNLL 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.conll-main.21.pdf
