Abstract
Existing benchmarks for summarization quality evaluation often lack diverse input scenarios, focus on narrowly defined dimensions (e.g., faithfulness), and rely on subjective, coarse-grained annotation schemes. To address these shortcomings, we create the UniSumEval benchmark, which extends the range of input contexts (e.g., domain, length) and provides fine-grained, multi-dimensional annotations. We use AI assistance in data creation, both to identify input texts likely to induce hallucination and to reduce the difficulty of the fine-grained annotation tasks for human annotators. With UniSumEval, we benchmark nine of the latest language models as summarizers, offering insights into their performance across varying input contexts and evaluation dimensions. Furthermore, we conduct a thorough comparison of state-of-the-art (SOTA) automated summary evaluators. Our benchmark data will be available at https://github.com/DISL-Lab/UniSumEval-v1.0.
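As a rough illustration of how fine-grained, multi-dimensional annotations of this kind might be consumed, here is a minimal Python sketch that aggregates per-dimension scores for each summarizer. The file name `unisumeval.json`, the record fields, and the dimension names other than faithfulness are assumptions for illustration, not the released data format; consult the repository above for the actual schema.

```python
import json
from collections import defaultdict
from statistics import mean

# Hypothetical schema: faithfulness is named in the abstract; the other
# dimension names and field names are placeholders. The real UniSumEval
# release (https://github.com/DISL-Lab/UniSumEval-v1.0) may differ.
DIMENSIONS = ["faithfulness", "completeness", "conciseness"]


def load_annotations(path):
    """Load a list of annotation records from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def mean_scores_by_model(records):
    """Average each evaluation dimension per summarizer model."""
    scores = defaultdict(lambda: defaultdict(list))
    for rec in records:
        for dim in DIMENSIONS:
            if dim in rec:
                scores[rec["summarizer"]][dim].append(rec[dim])
    return {
        model: {dim: mean(vals) for dim, vals in dims.items()}
        for model, dims in scores.items()
    }


if __name__ == "__main__":
    records = load_annotations("unisumeval.json")  # hypothetical file name
    for model, dims in mean_scores_by_model(records).items():
        print(model, dims)
```

A benchmark comparison of the kind the abstract describes would then reduce to running this aggregation once per summarizer output set and once per automated evaluator's predicted scores.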
- Anthology ID: 2024.findings-emnlp.227
- Volume: Findings of the Association for Computational Linguistics: EMNLP 2024
- Month: November
- Year: 2024
- Address: Miami, Florida, USA
- Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 3941–3960
- URL: https://preview.aclanthology.org/build-pipeline-with-new-library/2024.findings-emnlp.227/
- DOI: 10.18653/v1/2024.findings-emnlp.227
- Cite (ACL): Yuho Lee, Taewon Yun, Jason Cai, Hang Su, and Hwanjun Song. 2024. UniSumEval: Towards Unified, Fine-grained, Multi-dimensional Summarization Evaluation for LLMs. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 3941–3960, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal): UniSumEval: Towards Unified, Fine-grained, Multi-dimensional Summarization Evaluation for LLMs (Lee et al., Findings 2024)
- PDF: https://preview.aclanthology.org/build-pipeline-with-new-library/2024.findings-emnlp.227.pdf