Chi Zhang


2025

Efficient Pretraining Data Selection for Language Models via Multi-Actor Collaboration
Tianyi Bai | Ling Yang | Zhen Hao Wong | Fupeng Sun | Xinlin Zhuang | Jiahui Peng | Chi Zhang | Lijun Wu | Qiu Jiantao | Wentao Zhang | Binhang Yuan | Conghui He
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Efficient data selection is crucial to accelerate the pretraining of language models (LMs). While various methods have been proposed to enhance data efficiency, limited research has addressed the inherent conflicts between these approaches to achieve optimal data selection for LM pretraining. To tackle this problem, we propose a multi-actor collaborative data selection mechanism. Each data selection method independently prioritizes data based on its specific criterion and updates its prioritization rules using the current state of the model, functioning as an independent actor for data selection. Additionally, a console is designed to adjust the impacts of different actors at various stages and dynamically integrate information from all actors throughout the LM pretraining process. We conduct extensive empirical studies to evaluate our multi-actor framework. The experimental results demonstrate that our approach significantly improves data efficiency, accelerates convergence in LM pretraining, and achieves an average relative performance gain of up to 10.5% across multiple language model benchmarks compared to state-of-the-art methods.
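
The abstract describes actors that each score data by their own criterion and a console that re-weights them as training progresses. Below is a minimal sketch of that weighted-scoring idea; the actor heuristics, the softmax re-weighting over recent per-actor losses, and all function names are illustrative assumptions, not the paper's algorithm.

```python
# Sketch: combine per-actor data scores with console-assigned weights.
# Actors and the weighting rule are hypothetical, for illustration only.
import numpy as np

def quality_score(batch):      # hypothetical actor: text-quality heuristic
    return np.random.rand(len(batch))

def domain_score(batch):       # hypothetical actor: domain-relevance heuristic
    return np.random.rand(len(batch))

ACTORS = [quality_score, domain_score]

def console_weights(actor_losses, temperature=1.0):
    """Give more weight to actors whose recent selections reduced loss more."""
    gains = -np.asarray(actor_losses)
    exp = np.exp((gains - gains.max()) / temperature)
    return exp / exp.sum()

def select(batch, actor_losses, k):
    weights = console_weights(actor_losses)
    scores = sum(w * actor(batch) for w, actor in zip(weights, ACTORS))
    top = np.argsort(scores)[-k:]           # keep the k highest-scoring examples
    return [batch[i] for i in top]

# Usage: pick 2 of 8 candidate documents given recent per-actor validation losses.
docs = [f"doc_{i}" for i in range(8)]
print(select(docs, actor_losses=[2.1, 1.8], k=2))
```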

Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models
Xinlin Zhuang | Jiahui Peng | Ren Ma | Yinfan Wang | Tianyi Bai | Xingjian Wei | Qiu Jiantao | Chi Zhang | Ying Qian | Conghui He
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The composition of pre-training datasets for large language models (LLMs) remains largely undisclosed, hindering transparency and efforts to optimize data quality—a critical driver of model performance. Current data selection methods, such as natural language quality assessments, diversity-based filters, and classifier-based approaches, are limited by single-dimensional evaluation or redundancy-focused strategies. To address these gaps, we propose four dimensions to evaluate data quality: professionalism, readability, reasoning, and cleanliness. We further introduce Meta-rater, a multi-dimensional data selection method that integrates these dimensions with existing quality metrics through learned optimal weightings. Meta-rater employs proxy models to train a regression model that predicts validation loss, enabling the identification of optimal combinations of quality scores. Experiments demonstrate that Meta-rater doubles convergence speed for 1.3B parameter models and improves downstream task performance by 3.23%, with advantages that scale to models as large as 7.2B parameters. Our work establishes that holistic, multi-dimensional quality integration significantly outperforms conventional single-dimension approaches, offering a scalable paradigm for enhancing pre-training efficiency and model capability. To advance future research, we release scripts, data, and models at https://github.com/opendatalab/Meta-rater.
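
As a rough illustration of the regression step described above (predicting proxy-model validation loss from multi-dimensional quality scores and using the learned weights to rate documents), here is a toy sketch; the synthetic data, the linear model, and the rating rule are assumptions, not the released Meta-rater pipeline.

```python
# Sketch: learn how quality dimensions relate to proxy-model validation loss,
# then rate new documents with the learned weights. Synthetic data only.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Each row: [professionalism, readability, reasoning, cleanliness] for one
# data mixture; y: validation loss of a small proxy model trained on it.
X = rng.random((64, 4))
y = 3.0 - X @ np.array([0.4, 0.2, 0.5, 0.3]) + 0.05 * rng.standard_normal(64)

reg = LinearRegression().fit(X, y)        # weights over the quality dimensions

def rate(doc_scores):
    """Lower predicted loss -> more valuable document (higher rating)."""
    return -reg.predict(np.atleast_2d(doc_scores))[0]

print("learned weights:", reg.coef_)
print("rating of a clean, well-reasoned doc:", rate([0.9, 0.8, 0.9, 0.95]))
```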

An Automatic Method to Estimate Correctness of RAG
Chi Zhang | Vivek V. Datla | Aditya Shrivastava | Alfy Samuel | Zhiqi Huang | Anoop Kumar | Daben Liu
Proceedings of the 31st International Conference on Computational Linguistics: Industry Track

In sectors where data quality is critical, such as finance and healthcare, it is crucial to have confidence not only in the outputs generated by retrieval-augmented generation (RAG) models but also in the process the model follows while arriving at the output. Existing methods, such as hallucination detection and input-output entailment measurements, fail to capture the model’s internal state during answer generation. This paper introduces a novel approach to predict the correctness of the generated answer by modeling the model’s uncertainty on quantified perturbations of the input. Extensive experiments across multiple large language models (LLMs) demonstrate that our approach quantifies RAG robustness by aligning predictions with ground truth with an average Mean Squared Error (MSE) of 0.002, while offering flexibility for diverse qualitative metrics.
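
A hedged sketch of the general idea of scoring answer correctness from the model's sensitivity to quantified input perturbations follows; `generate`, `perturb`, and the token-overlap agreement measure are hypothetical stand-ins, not the paper's implementation.

```python
# Sketch: answer the question over several perturbed contexts and use the
# answers' agreement as a stability/confidence signal. Placeholders only.
import random

def generate(question, context):
    """Placeholder for an LLM call that answers `question` given `context`."""
    return "paris" if "capital of france" in context.lower() else "unknown"

def perturb(context, drop_prob=0.2):
    """Randomly drop words to create a quantified perturbation of the input."""
    words = context.split()
    return " ".join(w for w in words if random.random() > drop_prob) or context

def agreement(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def correctness_score(question, context, n=8):
    base = generate(question, context)
    sims = [agreement(base, generate(question, perturb(context))) for _ in range(n)]
    return sum(sims) / n   # high stability under perturbation -> higher confidence

print(correctness_score("What is the capital of France?",
                        "The capital of France is Paris."))
```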

2024

Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data
Yanda Li | Chi Zhang | Gang Yu | Wanqi Yang | Zhibin Wang | Bin Fu | Guosheng Lin | Chunhua Shen | Ling Chen | Yunchao Wei
Findings of the Association for Computational Linguistics: ACL 2024

The remarkable multimodal capabilities demonstrated by OpenAI’s GPT-4 have sparked significant interest in the development of multimodal Large Language Models (LLMs). A primary research objective of such models is to align visual and textual modalities effectively while comprehending human instructions. Current methodologies often rely on annotations derived from benchmark datasets to construct image-dialogue datasets for training purposes, akin to instruction tuning in LLMs. However, these datasets often exhibit domain bias, potentially constraining the generative capabilities of the models. To mitigate these limitations, we propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning. This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models to yield a diverse and controllable dataset with varied image content. It not only provides greater flexibility compared to existing methodologies but also significantly enhances several model capabilities. Our research includes comprehensive experiments conducted on various datasets using the open-source LLaVA model as a testbed for our proposed pipeline. Our results underscore marked enhancements across more than ten commonly assessed capabilities.
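
The pipeline described above pairs an instruction-following LLM with a text-to-image model to synthesize matched images and dialogues. A simplified sketch under that reading is shown below; `chat_llm`, `text_to_image`, and the caption/Q&A format are hypothetical placeholders rather than the authors' tooling.

```python
# Sketch: have an LLM draft a scene caption plus a matching Q&A dialogue,
# then render the caption with a text-to-image model. Placeholders only.
from dataclasses import dataclass

@dataclass
class Sample:
    image_path: str
    dialogue: list          # [(question, answer), ...]

def chat_llm(prompt: str) -> str:
    """Placeholder for a ChatGPT-style call returning structured text."""
    return ("CAPTION: a red bicycle leaning against a brick wall\n"
            "Q: What object is in the image? A: A red bicycle.\n"
            "Q: Where is it placed? A: Against a brick wall.")

def text_to_image(caption: str, out_path: str) -> str:
    """Placeholder for a diffusion-model call that renders `caption`."""
    return out_path

def synthesize(topic: str, idx: int) -> Sample:
    raw = chat_llm(f"Write an image caption about {topic} and a short Q&A about it.")
    lines = raw.splitlines()
    caption = lines[0].removeprefix("CAPTION: ")
    dialogue = [tuple(part.strip() for part in l.split(" A: ")) for l in lines[1:]]
    image = text_to_image(caption, f"synth_{idx:05d}.png")
    return Sample(image_path=image, dialogue=dialogue)

print(synthesize("street scenes", 0))
```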

2022

Empathetic and Emotionally Positive Conversation Systems with an Emotion-specific Query-Response Memory
Zhiliang Tian | Yinliang Wang | Yiping Song | Chi Zhang | Dongkyu Lee | Yingxiu Zhao | Dongsheng Li | Nevin L. Zhang
Findings of the Association for Computational Linguistics: EMNLP 2022

Emotional conversation systems generate responses to input queries while considering the speaker’s emotions in a conversation. Existing emotional conversation systems output emotional responses according to either a given emotion or the user’s emotion reflected in the input queries. Following a given emotion may lead to an emotional drift between the given emotion and the conversation state, and following only the user’s emotion may aggravate the user’s negative feelings if the user is in a negative mood. In this paper, we propose to generate empathetic responses catering to the user’s emotions while leading the conversation to be emotionally positive. In particular, by abstracting the conversation corpus, we extract and store the different responding strategies for different user emotions and conversational topics into a memory. We encourage positive emotions in conversation via a sentiment evaluator. We model the memory outputs with a Gaussian mixture distribution and sample a final responding strategy from the distribution. The strategy acts as a condition to a transformer model to generate responses. The experiments verify that our model surpasses the baseline methods in appropriateness, diversity, and generating emotionally positive responses.
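
The abstract's strategy-sampling step (fit a Gaussian mixture over retrieved responding strategies and sample one as a condition for the decoder) could be sketched as follows; the memory contents and 2-D strategy embeddings are toy assumptions for illustration.

```python
# Sketch: retrieve stored strategies for the user's emotion, fit a GMM over
# their embeddings, and sample one vector to condition the response decoder.
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical memory: emotion -> embeddings of abstracted responding strategies.
memory = {
    "sad": np.array([[0.1, 0.9], [0.2, 0.8], [0.8, 0.7], [0.9, 0.6]]),
    "happy": np.array([[0.7, 0.2], [0.6, 0.1], [0.8, 0.3], [0.9, 0.2]]),
}

def sample_strategy(user_emotion: str, seed: int = 0) -> np.ndarray:
    """Model the retrieved strategies with a GMM and sample one condition vector."""
    strategies = memory[user_emotion]
    gmm = GaussianMixture(n_components=2, random_state=seed).fit(strategies)
    vec, _ = gmm.sample(1)
    return vec[0]            # would be fed to the generator as a condition

condition = sample_strategy("sad")
print("sampled strategy condition:", condition)
# A downstream transformer decoder would take `condition` as extra input,
# steering the reply toward an empathetic but emotionally positive strategy.
```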