Chenxin An
2024
L-Eval: Instituting Standardized Evaluation for Long Context Language Models
Chenxin An | Shansan Gong | Ming Zhong | Xingjian Zhao | Mukai Li | Jun Zhang | Lingpeng Kong | Xipeng Qiu
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recently, there has been growing interest in long-context scaling of large language models (LLMs). To facilitate research in this field, we propose L-Eval to institute a more standardized evaluation for Long-Context Language Models (LCLMs), addressing two key aspects: dataset construction and evaluation metrics. On the one hand, we build a new evaluation suite containing 20 sub-tasks, 508 long documents, and more than 2,000 human-labeled query-response pairs covering diverse task types, domains, and input lengths (3k to 200k tokens). On the other hand, we investigate the effectiveness of evaluation metrics for LCLMs and show that Length-instruction-enhanced (LIE) evaluation and LLM judges correlate better with human judgments. We conducted a comprehensive study of 4 popular commercial LLMs and 12 open-source counterparts using the L-Eval benchmark. Our empirical findings offer useful insights into the study of LCLMs and lay the groundwork for the development of a more principled evaluation of these models.
2022
CoLo: A Contrastive Learning Based Re-ranking Framework for One-Stage Summarization
Chenxin An | Ming Zhong | Zhiyong Wu | Qin Zhu | Xuanjing Huang | Xipeng Qiu
Proceedings of the 29th International Conference on Computational Linguistics
Traditional training paradigms for extractive and abstractive summarization systems typically use only token-level or sentence-level training objectives. However, the output summary is always evaluated at the summary level, which leads to an inconsistency between training and evaluation. In this paper, we propose CoLo, a Contrastive Learning based re-ranking framework for one-stage summarization. By modeling a contrastive objective, we show that the summarization model is able to directly generate summaries according to the summary-level score without additional modules and parameters. Extensive experiments demonstrate that CoLo boosts the extractive and abstractive results of one-stage systems on the CNN/DailyMail benchmark to 44.58 and 46.33 ROUGE-1, respectively, while preserving parameter efficiency and inference efficiency. Compared with state-of-the-art multi-stage systems, we save more than 100 GPU training hours and obtain a 3x to 8x speed-up during inference while maintaining comparable results.