LLM-based NLG Evaluation: Current Status and Challenges

Mingqi Gao, Xinyu Hu, Xunjian Yin, Jie Ruan, Xiao Pu, Xiaojun Wan


Abstract
Evaluating natural language generation (NLG) is a vital but challenging problem in natural language processing. Traditional evaluation metrics, which mainly capture content overlap (e.g., n-gram overlap) between system outputs and references, are far from satisfactory, and in recent years large language models (LLMs) such as ChatGPT have demonstrated great potential for NLG evaluation. Various automatic evaluation methods based on LLMs have been proposed, including metrics derived from LLMs, prompting LLMs, fine-tuning LLMs, and human–LLM collaborative evaluation. In this survey, we first present a taxonomy of LLM-based NLG evaluation methods and discuss the pros and cons of each. We then discuss several open problems in this area and point out future research directions.
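As a concrete illustration of the "prompting LLMs" family of methods the survey covers, the sketch below asks a chat model to score a summary on a single quality dimension. This is a minimal example under assumptions, not the survey's own protocol: the openai Python client, the gpt-4o-mini model name, the prompt wording, and the 1–5 consistency scale are all illustrative choices.

    # Minimal sketch: prompting an LLM as an NLG evaluator.
    # Assumes the openai v1.x client and OPENAI_API_KEY in the environment.
    from openai import OpenAI

    client = OpenAI()

    source = "The city council approved the new transit budget on Monday."
    summary = "The council passed the transit budget."

    # Single-dimension rating prompt (wording is illustrative).
    prompt = (
        "Rate the following summary of the source text for consistency "
        "on a scale of 1 (worst) to 5 (best). Reply with the number only.\n\n"
        f"Source: {source}\n\nSummary: {summary}\n\nScore:"
    )

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; any chat model works here
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic decoding for more stable scores
    )

    score = int(response.choices[0].message.content.strip())
    print(score)  # e.g., 4

In practice, such prompt-based evaluators are often run with multiple samples or paired comparisons to reduce scoring variance, which is among the design questions the survey discusses.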
Anthology ID:
2025.cl-2.9
Volume:
Computational Linguistics, Volume 51, Issue 2 - June 2025
Month:
June
Year:
2025
Address:
Cambridge, MA
Venue:
CL
Publisher:
MIT Press
Pages:
661–687
URL:
https://preview.aclanthology.org/corrections-2025-07/2025.cl-2.9/
DOI:
10.1162/coli_a_00561
Cite (ACL):
Mingqi Gao, Xinyu Hu, Xunjian Yin, Jie Ruan, Xiao Pu, and Xiaojun Wan. 2025. LLM-based NLG Evaluation: Current Status and Challenges. Computational Linguistics, 51:661–687.
Cite (Informal):
LLM-based NLG Evaluation: Current Status and Challenges (Gao et al., CL 2025)
PDF:
https://preview.aclanthology.org/corrections-2025-07/2025.cl-2.9.pdf