Yaqian Zhou


2025

MetaAlign: Align Large Language Models with Diverse Preferences during Inference Time
Mozhi Zhang | Pengyu Wang | Chenkun Tan | Mianqiu Huang | Dong Zhang | Yaqian Zhou | Xipeng Qiu
Findings of the Association for Computational Linguistics: NAACL 2025

Large Language Models (LLMs) acquire extensive knowledge and remarkable abilities from vast text corpora, making them powerful tools for various applications. To make LLMs more usable, aligning them with human preferences is essential. Existing alignment techniques, such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), typically embed predefined preferences directly within the model's parameters. These methods, however, often result in a static alignment that cannot account for the diversity of human preferences in practical applications. In response to this challenge, we propose an effective method, MetaAlign, which aims to help LLMs dynamically align with various explicit or implicit preferences specified at inference time. Experimental results show that LLMs optimized on our meticulously constructed MetaAlign Dataset can effectively align with any preferences specified at the inference stage, validating the feasibility of MetaAlign. We hope that our work can provide some insights into the alignment of language models.
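The abstract does not spell out how preferences are injected at inference time; a minimal sketch of the general idea, assuming a generic chat-message layout (not MetaAlign's actual prompt template), might look like this:

```python
# A minimal sketch of inference-time preference conditioning: the preference
# travels with the request instead of being baked into the weights. The chat
# layout is generic, not MetaAlign's actual prompt template.

def build_messages(preference: str, user_query: str) -> list[dict]:
    return [
        {"role": "system", "content": f"Preference to follow: {preference}"},
        {"role": "user", "content": user_query},
    ]

messages = build_messages(
    preference="Answer concisely and avoid technical jargon.",
    user_query="Explain how vaccines work.",
)
# `messages` can be passed to any chat LLM; a MetaAlign-trained model is
# expected to honor the stated preference at inference time.
```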

FiNE: Filtering and Improving Noisy Data Elaborately with Large Language Models
Junliang He | Ziyue Fan | Shaohui Kuang | Li Xiaoqing | Kai Song | Yaqian Zhou | Xipeng Qiu
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Data is the lifeblood of large language models (LLMs). While the quantity of open-source data available for training LLMs is substantial, its integrity often falls short. For instance, the open-source chat version of Yi-1.5-9B scores 5.20 on AlignBench, while the Chinese Alpaca-GPT4 version scores 4.12. This discrepancy makes it challenging for developers to create models that excel in downstream tasks and instruction following. Therefore, it is essential to improve data integrity. Currently, there are two mainstream methods for enhancing data integrity: data filtering and data augmentation. Due to the labor-intensive and time-consuming nature of performing these tasks manually, some of these efforts are now being undertaken by LLMs, owing to their high alignment with human preferences. However, we have found that performing data filtering or data augmentation with LLMs has limited effectiveness in improving data integrity. In this work, we propose FiNE (Filtering and Improving Noisy data Elaborately), a method that performs refined filtering and improvement of training data with LLMs. Using the data obtained through our method to train Yi-1.5-9B, the performance gap on AlignBench between our model and the open-source chat version is reduced from 1.08 to 0.35. Additionally, on HalluQA, our model surpasses the open-source chat version by 8.45.
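FiNE's actual prompts and thresholds are not given in the abstract; the following is only a rough sketch of the general LLM-based filter-then-improve loop it describes, where every function name, prompt, and scoring scale is an illustrative assumption:

```python
# Rough sketch of an LLM-based filter-then-improve loop. `llm` stands in for
# any chat-completion callable; the prompts, 1-5 scale, and threshold are
# illustrative assumptions, not FiNE's actual design.

def score_sample(llm, instruction: str, response: str) -> int:
    """Ask the LLM to rate a training pair from 1 (noisy) to 5 (clean)."""
    prompt = (
        "Rate the quality of this instruction-response pair from 1 to 5. "
        "Reply with the score only.\n"
        f"Instruction: {instruction}\nResponse: {response}"
    )
    return int(llm(prompt).strip()[0])

def improve_sample(llm, instruction: str, response: str) -> str:
    """Ask the LLM to rewrite a low-quality response."""
    prompt = (
        "Rewrite the response so it fully and correctly answers the "
        f"instruction.\nInstruction: {instruction}\nResponse: {response}"
    )
    return llm(prompt)

def refine_dataset(llm, samples, keep_threshold: int = 4):
    refined = []
    for instruction, response in samples:
        if score_sample(llm, instruction, response) >= keep_threshold:
            refined.append((instruction, response))  # clean enough: keep
        else:
            refined.append((instruction, improve_sample(llm, instruction, response)))
    return refined
```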

CAMIEval: Enhancing NLG Evaluation through Multidimensional Comparative Instruction-Following Analysis
Ziyue Fan | Junliang He | Li Xiaoqing | Shaohui Kuang | Kai Song | Yaqian Zhou | Xipeng Qiu
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

With the rapid development of large language models (LLMs) and their strong performance across various fields, LLM-based evaluation methods (LLM-as-a-Judge) have become widely used in natural language generation (NLG) evaluation. However, these methods encounter the following challenges: (1) distinguishing instruction-following ability, (2) being applicable across diverse NLG tasks, and (3) identifying low-quality outputs. To address these issues, we propose CAMIEval, a multidimensional comparative evaluation method based on instruction-following. Specifically, we define three fundamental dimensions of instruction-following: relevance, factuality, and adherence. Subsequently, we introduce a concrete Chain-of-Thought (ConcreteCoT) process to enhance the accuracy of evaluations. In addition, we train a "regrettable model", RegretLM, to generate low-quality outputs, which helps the evaluator better identify the potential shortcomings of a candidate output by comparing low-quality outputs with reference outputs. Through this comparison, the evaluator can generate instruction-specific dimensions that complement the fundamental dimensions, forming a more comprehensive evaluation metric system. Experiments on two NLG evaluation benchmarks demonstrate that CAMIEval consistently outperforms existing methods in terms of correlation with human evaluations, providing a general and accurate framework for evaluating the outputs of LLMs.
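As a rough illustration of the comparative setup the abstract describes (reference output versus a RegretLM-style low-quality output, plus the three fundamental dimensions), one might assemble a judge prompt like the sketch below; the exact wording, CoT steps, and scoring scale are assumptions:

```python
# Illustrative assembly of a comparative judge prompt; the dimension list
# follows the abstract, but the exact format, CoT steps, and scoring scale
# are assumptions, not CAMIEval's actual prompts.

FUNDAMENTAL_DIMENSIONS = ["relevance", "factuality", "adherence"]

def build_judge_prompt(instruction, candidate, reference, low_quality):
    dims = ", ".join(FUNDAMENTAL_DIMENSIONS)
    return (
        f"Instruction: {instruction}\n"
        f"Reference output: {reference}\n"
        f"Low-quality output: {low_quality}\n"  # e.g. from a RegretLM-style model
        f"Candidate output: {candidate}\n\n"
        "Step 1: Compare the low-quality output against the reference and "
        "list instruction-specific dimensions on which it falls short.\n"
        f"Step 2: Add the fundamental dimensions: {dims}.\n"
        "Step 3: Reason step by step, then score the candidate on every "
        "dimension from 1 to 5."
    )
```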

2024

Calibrating the Confidence of Large Language Models by Eliciting Fidelity
Mozhi Zhang | Mianqiu Huang | Rundong Shi | Linsen Guo | Chong Peng | Peng Yan | Yaqian Zhou | Xipeng Qiu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Large language models optimized with techniques like RLHF have achieved good alignment in being helpful and harmless. However, post-alignment, these language models often exhibit overconfidence: the expressed confidence is not accurately calibrated with their correctness rate. In this paper, we decompose language model confidence into the Uncertainty about the question and the Fidelity to the answer generated by the language model. Then, we propose a plug-and-play method, UF Calibration, to estimate the confidence of language models. Experiments with six RLHF-LMs on four MCQA datasets show that our method achieves good calibration performance. Moreover, we propose two novel metrics, IPR and CE, to evaluate the calibration of the model, and we conduct a detailed discussion on Truly Well-Calibrated Confidence for large language models. Our method could serve as a strong baseline, and we hope that this work will provide some insights into model confidence calibration.
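The abstract does not give the exact estimators, but one plausible reading of the uncertainty/fidelity decomposition can be sketched with sampling-based estimates; the estimators and combination rule below are assumptions, not necessarily the paper's UF Calibration:

```python
import math
from collections import Counter

# Sampling-based sketch of the uncertainty/fidelity decomposition. The
# estimators and the combination rule below are assumptions for illustration;
# the paper's actual UF Calibration procedure may differ.

def uf_confidence(sampled_answers, final_answer):
    counts = Counter(sampled_answers)
    n = len(sampled_answers)
    if len(counts) > 1:
        # Uncertainty: normalized entropy of the sampled answer distribution.
        probs = [c / n for c in counts.values()]
        uncertainty = -sum(p * math.log(p) for p in probs) / math.log(len(counts))
    else:
        uncertainty = 0.0  # all samples agree
    # Fidelity: how often the samples agree with the committed answer.
    fidelity = counts.get(final_answer, 0) / n
    return (1.0 - uncertainty) * fidelity

print(uf_confidence(["B", "B", "B", "A", "B"], "B"))  # confidence in [0, 1]
```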

2023

DUB: Discrete Unit Back-translation for Speech Translation
Dong Zhang | Rong Ye | Tom Ko | Mingxuan Wang | Yaqian Zhou
Findings of the Association for Computational Linguistics: ACL 2023

How can speech-to-text translation (ST) perform as well as machine translation (MT)? The key is to bridge the modality gap between speech and text so that useful MT techniques can be applied to ST. Recently, the approach of representing speech with unsupervised discrete units has yielded a new way to ease the modality problem. This motivates us to propose Discrete Unit Back-translation (DUB) to answer two questions: (1) Is it better to represent speech with discrete units than with continuous features in direct ST? (2) How much benefit can useful MT techniques bring to ST? With DUB, back-translation can successfully be applied to direct ST, obtaining an average boost of 5.5 BLEU on MuST-C En-De/Fr/Es. In the low-resource language scenario, our method achieves performance comparable to existing methods that rely on large-scale external data. Code and models are available at https://anonymous.4open.science/r/DUB/.
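A conceptual sketch of back-translation over discrete units, with both models stubbed out (in DUB these would be trained sequence-to-sequence models, and the units would come from unsupervised discretization of speech features):

```python
# Conceptual sketch of back-translation over discrete speech units. Both the
# backward text-to-unit model and the unit inventory are stubbed; in DUB they
# would be a trained sequence-to-sequence model and units derived from
# unsupervised discretization of speech features.

def backward_text_to_units(text: str) -> list[int]:
    """Stub standing in for a trained text-to-unit generator."""
    return [hash(tok) % 1000 for tok in text.split()]  # placeholder units

def augment_with_back_translation(mono_target_texts, parallel_unit_text_pairs):
    """Pair monolingual target-language text with synthetic source units."""
    pseudo = [(backward_text_to_units(t), t) for t in mono_target_texts]
    # The forward unit-to-text ST model then trains on real + pseudo pairs.
    return parallel_unit_text_pairs + pseudo
```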

SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities
Dong Zhang | Shimin Li | Xin Zhang | Jun Zhan | Pengyu Wang | Yaqian Zhou | Xipeng Qiu
Findings of the Association for Computational Linguistics: EMNLP 2023

Multi-modal large language models are regarded as a crucial step towards Artificial General Intelligence (AGI) and have garnered significant interest with the emergence of ChatGPT. However, current speech-language models typically adopt the cascade paradigm, preventing inter-modal knowledge transfer. In this paper, we propose SpeechGPT, a large language model with intrinsic cross-modal conversational abilities, capable of perceiving and generating multi-modal content. With discrete speech representations, we construct SpeechInstruct, the first large-scale cross-modal speech instruction dataset. Additionally, we employ a three-stage training strategy that includes modality-adaptation pre-training, cross-modal instruction fine-tuning, and chain-of-modality instruction fine-tuning. The experimental results demonstrate that SpeechGPT has an impressive capacity to follow cross-modal human instructions and highlight the potential of handling multiple modalities with one model. Code and models are available at https://github.com/0nutation/SpeechGPT. Demos are shown at https://0nutation.github.io/SpeechGPT.github.io/.
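One common way to realize "discrete speech representations" in an LLM is to add unit tokens to the vocabulary; the sketch below shows this pattern with a small stand-in model, where the base checkpoint, unit count, and token naming are assumptions rather than SpeechGPT's actual configuration:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch of folding discrete speech units into an LLM vocabulary. The base
# checkpoint ("gpt2"), the unit count (1000), and the token naming scheme are
# stand-ins, not SpeechGPT's actual configuration.

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

unit_tokens = [f"<unit_{i}>" for i in range(1000)]  # one token per speech unit
tokenizer.add_tokens(unit_tokens)
model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix

# An utterance becomes a plain token sequence the LLM can model alongside text:
speech_as_text = "".join(f"<unit_{u}>" for u in [17, 42, 42, 993])
input_ids = tokenizer(speech_as_text).input_ids
```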

2021

SENT: Sentence-level Distant Relation Extraction via Negative Training
Ruotian Ma | Tao Gui | Linyang Li | Qi Zhang | Xuanjing Huang | Yaqian Zhou
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Distant supervision for relation extraction provides uniform bag labels for each sentence inside a bag, while accurate sentence labels are important for downstream applications that need the exact relation type. Directly using bag labels for sentence-level training introduces much noise, severely degrading performance. In this work, we propose the use of negative training (NT), in which a model is trained on complementary labels, i.e., on the signal that "the instance does not belong to these complementary labels". Since the probability of selecting a true label as a complementary label is low, NT provides less noisy information. Furthermore, a model trained with NT is able to separate the noisy data from the training data. Based on NT, we propose SENT, a sentence-level framework for distant relation extraction. SENT not only filters the noisy data to construct a cleaner dataset, but also performs a re-labeling process that transforms the noisy data into useful training data, further benefiting the model's performance. Experimental results show significant improvements over previous methods on both sentence-level evaluation and denoising effect.
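A minimal PyTorch sketch of the NT loss described above, where the model minimizes the probability assigned to a randomly sampled complementary label instead of maximizing the probability of the (possibly noisy) bag label; the per-instance sampling scheme is an illustrative assumption:

```python
import torch
import torch.nn.functional as F

# Minimal PyTorch sketch of the NT loss: minimize the probability assigned to
# a sampled complementary label k (k != bag label) via -log(1 - p_k). The
# per-instance sampling scheme is an illustrative assumption.

def negative_training_loss(logits: torch.Tensor, bag_labels: torch.Tensor) -> torch.Tensor:
    num_classes = logits.size(-1)
    # Draw one complementary label per instance, never equal to the bag label.
    comp = torch.randint(0, num_classes, bag_labels.shape)
    comp = torch.where(comp == bag_labels, (comp + 1) % num_classes, comp)
    p_comp = F.softmax(logits, dim=-1).gather(1, comp.unsqueeze(1)).squeeze(1)
    return -torch.log(1.0 - p_comp + 1e-8).mean()  # push p(complementary) toward 0

logits = torch.randn(4, 10, requires_grad=True)  # 4 sentences, 10 relation types
loss = negative_training_loss(logits, torch.tensor([3, 3, 7, 0]))
loss.backward()
```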

TextFlint: Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing
Xiao Wang | Qin Liu | Tao Gui | Qi Zhang | Yicheng Zou | Xin Zhou | Jiacheng Ye | Yongxin Zhang | Rui Zheng | Zexiong Pang | Qinzhuo Wu | Zhengyan Li | Chong Zhang | Ruotian Ma | Zichu Fei | Ruijian Cai | Jun Zhao | Xingwu Hu | Zhiheng Yan | Yiding Tan | Yuan Hu | Qiyuan Bian | Zhihua Liu | Shan Qin | Bolin Zhu | Xiaoyu Xing | Jinlan Fu | Yue Zhang | Minlong Peng | Xiaoqing Zheng | Yaqian Zhou | Zhongyu Wei | Xipeng Qiu | Xuanjing Huang
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations

TextFlint is a multilingual robustness evaluation toolkit for NLP tasks that incorporates universal text transformations, task-specific transformations, adversarial attacks, subpopulations, and their combinations to provide comprehensive robustness analyses. This enables practitioners to automatically evaluate their models from various aspects or to customize their evaluations as desired with just a few lines of code. TextFlint also generates complete analytical reports as well as targeted augmented data to address the model's robustness shortcomings. To guarantee acceptability, all the text transformations are linguistically grounded, and all the transformed data selected (up to 100,000 texts) scored highly under human evaluation. To validate the utility, we performed large-scale empirical evaluations (over 67,000 evaluations) on state-of-the-art deep learning models, classic supervised methods, and real-world systems. The toolkit is available at https://github.com/textflint, with all the evaluation results demonstrated at textflint.io.
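This is emphatically not TextFlint's API (see the repository for real usage), but the transformation-then-evaluate pattern it automates can be illustrated with a toy universal transformation:

```python
import random

# Toy stand-in for a universal text transformation plus the evaluate-on-both
# pattern; this is NOT TextFlint's API (see the GitHub repository for that).

def swap_adjacent_chars(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Randomly swap adjacent characters to simulate typo-style noise."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

original = "The quick brown fox jumps over the lazy dog."
perturbed = swap_adjacent_chars(original)
# A robustness report would compare model predictions on `original` vs
# `perturbed` and aggregate the accuracy drop across transformations.
```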

Iterative GNN-based Decoder for Question Generation
Zichu Fei | Qi Zhang | Yaqian Zhou
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Natural question generation (QG) aims to generate questions from a passage, where the generated questions should be answerable from the passage. Most models with state-of-the-art performance model the previously generated text at each decoding step. However, they (1) ignore the rich structural information hidden in the previously generated text and (2) ignore the impact of copied words on the passage. We observe that information in previously generated words serves as auxiliary information for subsequent generation. To address these problems, we design the Iterative Graph Network-based Decoder (IGND) to model the previous generation using a graph neural network at each decoding step. Moreover, our graph model captures dependency relations in the passage that boost the generation. Experimental results demonstrate that our model outperforms state-of-the-art models on sentence-level QG tasks on the SQuAD and MARCO datasets.

A Relation-Oriented Clustering Method for Open Relation Extraction
Jun Zhao | Tao Gui | Qi Zhang | Yaqian Zhou
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Clustering-based unsupervised relation discovery has gradually become one of the important approaches to open relation extraction (OpenRE). However, high-dimensional vectors can encode complex linguistic information, so the derived clusters cannot be guaranteed to explicitly align with relational semantic classes. In this work, we propose a relation-oriented clustering model and use it to identify novel relations in unlabeled data. Specifically, to enable the model to learn to cluster relational data, our method leverages readily available labeled data of predefined relations to learn a relation-oriented representation. We minimize the distance between instances with the same relation by gathering the instances towards their corresponding relation centroids to form a cluster structure, so that the learned representation is cluster-friendly. To reduce the clustering bias towards predefined classes, we optimize the model by minimizing a joint objective on both labeled and unlabeled data. Experimental results show that our method reduces the error rate by 29.2% and 15.7% on two datasets, respectively, compared with current SOTA methods.
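A minimal sketch of the centroid-gathering objective described above, treating centroids as fixed targets within a batch; the dimensions and the omitted joint term on unlabeled data are illustrative assumptions:

```python
import torch

# Minimal sketch of the centroid-gathering objective: pull each labeled
# instance's representation toward its relation centroid. Centroids are
# treated as fixed targets within the batch; the joint term on unlabeled
# data is omitted, and all dimensions are illustrative.

def centroid_pull_loss(reps: torch.Tensor, labels: torch.Tensor, num_relations: int):
    centroids = torch.zeros(num_relations, reps.size(-1))
    for r in range(num_relations):
        mask = labels == r
        if mask.any():
            centroids[r] = reps[mask].mean(dim=0).detach()  # fixed target
    # Squared distance of each instance to its own relation centroid.
    return ((reps - centroids[labels]) ** 2).sum(dim=-1).mean()

reps = torch.randn(8, 64, requires_grad=True)  # 8 instances, 64-dim features
loss = centroid_pull_loss(reps, torch.tensor([0, 0, 1, 1, 2, 2, 3, 3]), num_relations=4)
loss.backward()
```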

2018

Automatic Essay Scoring Incorporating Rating Schema via Reinforcement Learning
Yucheng Wang | Zhongyu Wei | Yaqian Zhou | Xuanjing Huang
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Automatic essay scoring (AES) is the task of assigning grades to essays without human intervention. Existing systems for AES are typically trained to predict the score of each essay individually, without considering the rating schema. To address this issue, we propose a reinforcement learning framework for essay scoring that incorporates quadratic weighted kappa as guidance to optimize the scoring system. Experimental results on benchmark datasets show the effectiveness of our framework.
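Quadratic weighted kappa itself is a standard, well-defined agreement metric; a straightforward NumPy implementation (the RL policy and reward shaping are not reproduced here) is:

```python
import numpy as np

# Quadratic weighted kappa (QWK), the agreement metric the framework uses as
# guidance. This is the standard formulation; the RL policy and reward
# shaping details are not reproduced here.

def quadratic_weighted_kappa(y_true, y_pred, min_rating, max_rating):
    n = max_rating - min_rating + 1
    O = np.zeros((n, n))                        # observed rating co-occurrences
    for t, p in zip(y_true, y_pred):
        O[t - min_rating, p - min_rating] += 1
    hist_t = O.sum(axis=1)
    hist_p = O.sum(axis=0)
    E = np.outer(hist_t, hist_p) / len(y_true)  # expected under independence
    i, j = np.indices((n, n))
    W = (i - j) ** 2 / (n - 1) ** 2             # quadratic disagreement weights
    return 1.0 - (W * O).sum() / (W * E).sum()

print(quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 3], 1, 4))  # 0.875
```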

2016

Generating Abbreviations for Chinese Named Entities Using Recurrent Neural Network with Dynamic Dictionary
Qi Zhang | Jin Qian | Ya Guo | Yaqian Zhou | Xuanjing Huang
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

Modelling Interaction of Sentence Pair with Coupled-LSTMs
Pengfei Liu | Xipeng Qiu | Yaqian Zhou | Jifan Chen | Xuanjing Huang
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

2015

Transition-based Dependency Parsing Using Two Heterogeneous Gated Recursive Neural Networks
Xinchi Chen | Yaqian Zhou | Chenxi Zhu | Xipeng Qiu | Xuanjing Huang
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

2014

A Generative Model for Identifying Target Companies of Microblogs
Yeyun Gong | Yaqian Zhou | Ya Guo | Qi Zhang | Xuanjing Huang
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

2013

Detecting Spammers in Community Question Answering
Zhuoye Ding | Yeyun Gong | Yaqian Zhou | Qi Zhang | Xuanjing Huang
Proceedings of the Sixth International Joint Conference on Natural Language Processing

2010

Joint Training and Decoding Using Virtual Nodes for Cascaded Segmentation and Tagging Tasks
Xian Qian | Qi Zhang | Yaqian Zhou | Xuanjing Huang | Lide Wu
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

2008

CRF-based Hybrid Model for Word Segmentation, NER and even POS Tagging
Zhiting Xu | Xian Qian | Yuejie Zhang | Yaqian Zhou
Proceedings of the Sixth SIGHAN Workshop on Chinese Language Processing

2005

Answering Definition Questions Using Web Knowledge Bases
Zhushuo Zhang | Yaqian Zhou | Xuanjing Huang | Lide Wu
Second International Joint Conference on Natural Language Processing: Full Papers

Transformation Based Chinese Entity Detection and Tracking
Yaqian Zhou | Changning Huang | Jianfeng Gao | Lide Wu
Companion Volume to the Proceedings of Conference including Posters/Demos and tutorial abstracts

2003

A Fast Algorithm for Feature Selection in Conditional Maximum Entropy Modeling
Yaqian Zhou | Fuliang Weng | Lide Wu | Hauke Schmidt
Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing