Fawzia Zeitoun

2026

Who Generates More Empathetic Responses—Humans or LLMs? A Comparative Evaluation with Human and LLM Judges
Anuradha Welivita | Fawzia Zeitoun | Pearl Pu
Proceedings of the 30th Conference on Computational Natural Language Learning

This paper compares the empathetic quality of responses generated by humans and large language models (LLMs). We evaluate four LLMs that were widely used at the time of study—GPT-4, LLaMA-2-70B-Chat, Gemini-1.0-Pro, and Mixtral-8×7B-Instruct—against a human baseline using a large-scale between-subjects study. A total of 1,000 human participants evaluated the empathetic quality of human- and LLM-generated responses to 2,000 dialogue prompts spanning 32 positive and negative emotions. To complement human judgments, we also employed an LLM-as-judge (GPT-4o-mini) to assess the same responses. Across emotions and evaluators, LLM-generated responses were rated as significantly more empathetic than human-written responses. We also observed that both human judges and the LLM-as-judge tended to rate responses generated by their own group more favorably, indicating self-favoring tendencies. These findings highlight both the strong performance of contemporary LLMs in empathetic responding and the need to interpret human- and LLM-based evaluations with care.

Co-authors

Pearl Pu 1
Anuradha Welivita 1

Venues

CoNLL 1
WS 1
