Abstract
This paper compares the two most widely used techniques for evaluating generative tasks with large language models (LLMs): prompt-based evaluation and log-likelihood evaluation, as part of the Eval4NLP shared task. We focus on the summarization task and evaluate both small and large LLMs. We also study the impact of LLAMA and LLAMA 2 on summarization, using the same set of prompts and techniques, and use the Eval4NLP dataset for our comparison. This study provides evidence of the advantages of prompt-based evaluation techniques over log-likelihood-based techniques, especially for large models and models with stronger reasoning abilities.
- Anthology ID:
- 2023.eval4nlp-1.12
- Volume:
- Proceedings of the 4th Workshop on Evaluation and Comparison of NLP Systems
- Month:
- November
- Year:
- 2023
- Address:
- Bali, Indonesia
- Editors:
- Daniel Deutsch, Rotem Dror, Steffen Eger, Yang Gao, Christoph Leiter, Juri Opitz, Andreas Rücklé
- Venues:
- Eval4NLP | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 149–155
- URL:
- https://aclanthology.org/2023.eval4nlp-1.12
- DOI:
- 10.18653/v1/2023.eval4nlp-1.12
- Cite (ACL):
- Abhishek Pradhan and Ketan Todi. 2023. Understanding Large Language Model Based Metrics for Text Summarization. In Proceedings of the 4th Workshop on Evaluation and Comparison of NLP Systems, pages 149–155, Bali, Indonesia. Association for Computational Linguistics.
- Cite (Informal):
- Understanding Large Language Model Based Metrics for Text Summarization (Pradhan & Todi, Eval4NLP-WS 2023)
- PDF:
- https://preview.aclanthology.org/proper-vol2-ingestion/2023.eval4nlp-1.12.pdf
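The abstract contrasts two ways an LLM can serve as a summarization metric. A minimal sketch of the distinction is shown below; the helper names and toy values are hypothetical illustrations, not the paper's implementation, and the per-token log-probabilities and the model's textual rating would in practice come from a real LLM.

```python
# Hedged sketch contrasting the two metric styles compared in the paper.
# Both functions are hypothetical stand-ins: a real setup would obtain the
# inputs from an LLM conditioned on the source document and candidate summary.

def loglikelihood_score(token_logprobs):
    """Log-likelihood metric: average per-token log-probability the model
    assigns to the candidate summary; values closer to 0 mean the model
    finds the summary more likely."""
    return sum(token_logprobs) / len(token_logprobs)


def prompt_based_score(raw_model_output):
    """Prompt-based metric: the model is asked directly, e.g. 'Rate this
    summary on a scale of 1 to 5', and its free-text answer is parsed
    into a numeric score."""
    digits = [int(tok) for tok in raw_model_output.split() if tok.isdigit()]
    return digits[0] if digits else None


# Toy values standing in for real model outputs.
ll_score = loglikelihood_score([-0.2, -1.5, -0.7, -0.4])   # average logprob
pb_score = prompt_based_score("I would rate this summary 4 out of 5.")
```

The key practical difference the paper examines: the log-likelihood score only requires a single forward pass over the summary tokens, while the prompt-based score depends on the model following the rating instruction, which is why it benefits disproportionately from larger, stronger-reasoning models.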