Li Xiaoqing
Data is the lifeblood of large language models (LLMs). While the quantity of open-source data available for training LLMs is substantial, its integrity often falls short. For instance, the open-source chat version of Yi-1.5-9B scores 5.20 on AlignBench, while the Chinese Alpaca-GPT4 version scores 4.12. This discrepancy makes it challenging for developers to create models that excel in downstream tasks and instruction following. Therefore, it is essential to improve data integrity. Currently, there are two mainstream methods for enhancing data integrity: data filtering and data augmentation. Due to the labor-intensive and time-consuming nature of performing these tasks manually, some of these efforts are now being undertaken by LLMs, owing to their high alignment with human preferences. However, we have found that performing data filtering or data augmentation with LLMs has limited effectiveness in improving data integrity. In this work, we propose FiNE (Filtering and Improving Noisy data Elaborately), a method that performs refined filtering and improvement of training data with LLMs. Using the data obtained through our method to train Yi-1.5-9B, the performance gap on AlignBench between our model and the open-source chat version is reduced from 1.08 to 0.35. Additionally, on HalluQA, our model surpasses the open-source chat version by 8.45.
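The abstract does not give implementation details, so the following is only a minimal sketch of what LLM-driven filtering and improvement of noisy instruction-tuning data could look like. The `call_llm` callable, the prompts, the 1-5 scale, and the `keep_at`/`improve_at` thresholds are illustrative assumptions, not the FiNE procedure itself.

```python
# Minimal sketch (not the authors' implementation): LLM-judged filtering plus
# targeted rewriting of noisy instruction-tuning samples. `call_llm` stands in
# for any text-in/text-out LLM client.
import json
from typing import Callable, Dict, List

def score_sample(call_llm: Callable[[str], str], instruction: str, response: str) -> int:
    """Ask a judge LLM to rate sample quality on a 1-5 scale."""
    prompt = (
        "Rate the quality of this training sample from 1 (unusable) to 5 (excellent).\n"
        f"Instruction: {instruction}\nResponse: {response}\n"
        'Reply with JSON like {"score": 3, "issues": "..."}.'
    )
    try:
        return int(json.loads(call_llm(prompt))["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return 1  # treat unparseable judgments as low quality

def improve_sample(call_llm: Callable[[str], str], instruction: str, response: str) -> str:
    """Ask the LLM to rewrite a salvageable response instead of discarding it."""
    prompt = (
        "Rewrite the response so it fully and correctly follows the instruction. "
        "Return only the improved response.\n"
        f"Instruction: {instruction}\nResponse: {response}"
    )
    return call_llm(prompt).strip()

def filter_and_improve(call_llm: Callable[[str], str],
                       samples: List[Dict[str, str]],
                       keep_at: int = 4, improve_at: int = 2) -> List[Dict[str, str]]:
    """Keep high-quality samples, rewrite middling ones, drop the rest."""
    cleaned = []
    for s in samples:
        score = score_sample(call_llm, s["instruction"], s["response"])
        if score >= keep_at:
            cleaned.append(s)
        elif score >= improve_at:
            better = improve_sample(call_llm, s["instruction"], s["response"])
            cleaned.append({**s, "response": better})
    return cleaned
```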
With the rapid development of large language models (LLMs), due to their strong performance across various fields, LLM-based evaluation methods (LLM-as-a-Judge) have become widely used in natural language generation (NLG) evaluation. However, these methods encounter the following challenges: (1) distinguishing instruction-following ability, (2) being applicable across diverse NLG tasks, and (3) identifying low-quality outputs. To address these issues, we propose CAMIEval, a multidimensional comparative evaluation method based on instruction-following. Specifically, we define three fundamental dimensions of instruction-following: relevance, factuality, and adherence. Subsequently, we introduce a concrete Chain-of-Thoughts (ConcreteCoT) process to enhance the accuracy of evaluations. In addition, we train a “regrettable model”, RegretLM, to generate low-quality outputs, which helps the evaluator better identify the potential shortcomings of the candidate output by comparing low-quality outputs with reference outputs. Through this comparison, the evaluator can generate instruction-specific dimensions that complement the fundamental dimensions, forming a more comprehensive evaluation metric system. Experiments on two NLG evaluation benchmarks demonstrate that CAMIEval consistently outperforms existing methods in terms of correlation with human evaluations, providing a general and accurate framework for evaluating the outputs of LLMs.
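The abstract leaves the prompting details unspecified; a minimal sketch of the comparative, multidimensional judging step it describes might look like the following, assuming a generic `call_llm` text-in/text-out client and illustrative prompt wording. The deliberately low-quality output (produced by a model such as RegretLM in the paper) is passed in as `low_quality`; the scoring scale and JSON format are assumptions.

```python
# Minimal sketch (not the CAMIEval implementation): comparative, multidimensional
# LLM-as-a-judge scoring. The prompt, the 1-5 scale, and `call_llm` are
# illustrative assumptions.
import json
from typing import Callable, Dict

BASE_DIMENSIONS = ("relevance", "factuality", "adherence")

def judge(call_llm: Callable[[str], str], instruction: str,
          candidate: str, reference: str, low_quality: str) -> Dict[str, int]:
    """Score the candidate per dimension by contrasting it with a reference
    output and a deliberately low-quality output."""
    prompt = (
        "You are evaluating how well CANDIDATE follows the instruction.\n"
        f"Instruction: {instruction}\n"
        f"REFERENCE (good): {reference}\n"
        f"LOW-QUALITY (bad): {low_quality}\n"
        f"CANDIDATE: {candidate}\n"
        "First, reason step by step about concrete differences between CANDIDATE, "
        "REFERENCE, and LOW-QUALITY. Then propose up to two extra instruction-specific "
        "dimensions and rate CANDIDATE on every dimension from 1 to 5. "
        f"Return only JSON mapping the dimensions {list(BASE_DIMENSIONS)} "
        "plus your extra dimensions to integer scores."
    )
    raw = call_llm(prompt)
    try:
        scores = {k: int(v) for k, v in json.loads(raw).items()}
    except (json.JSONDecodeError, TypeError, ValueError, AttributeError):
        scores = {}
    # Always report the three fundamental dimensions, then any extra ones.
    result = {d: scores.get(d, 1) for d in BASE_DIMENSIONS}
    result.update({k: v for k, v in scores.items() if k not in BASE_DIMENSIONS})
    return result
```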