Che Liu


2024

Parrot: Enhancing Multi-Turn Instruction Following for Large Language Models
Yuchong Sun | Che Liu | Kun Zhou | Jinwen Huang | Ruihua Song | Xin Zhao | Fuzheng Zhang | Di Zhang | Kun Gai
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Humans often interact with large language models (LLMs) over multiple turns to obtain desired answers or additional information. However, most existing studies overlook the multi-turn instruction-following ability of LLMs in terms of training datasets, training methods, and evaluation benchmarks. In this paper, we introduce Parrot, a solution aimed at enhancing multi-turn instruction following for LLMs. First, we introduce an efficient yet effective method for collecting multi-turn instructions that feature human-like queries, such as anaphora and ellipsis. Second, we propose a context-aware preference optimization strategy to further enhance LLMs on complex queries in multi-turn interactions. Moreover, to quantitatively evaluate LLMs in multi-turn instruction following, we manually build a multi-turn benchmark derived from existing ones. Extensive experiments show that Parrot improves current LLMs by up to 7.2% in multi-turn instruction following. Our dataset and code will be open-sourced to facilitate future research.
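The abstract does not detail the context-aware preference optimization, so below is a minimal sketch of one plausible reading: a DPO-style objective in which the response log-probabilities are conditioned on the full multi-turn dialogue context. The function name, the beta hyperparameter, and the exact loss form are assumptions for illustration, not Parrot's actual implementation.

import torch.nn.functional as F

def context_aware_preference_loss(policy_chosen_logp, policy_rejected_logp,
                                  ref_chosen_logp, ref_rejected_logp,
                                  beta=0.1):
    # All log-probabilities (tensors of shape (batch,)) score a final-turn
    # response conditioned on the full multi-turn dialogue context, so the
    # preference signal reflects context-dependent queries such as anaphora.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Bradley-Terry / DPO form: push the chosen response above the rejected one.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()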

DialogBench: Evaluating LLMs as Human-like Dialogue Systems
Jiao Ou | Junda Lu | Che Liu | Yihong Tang | Fuzheng Zhang | Di Zhang | Kun Gai
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Large language models (LLMs) have achieved remarkable breakthroughs in new dialogue capabilities by leveraging instruction tuning, which has refreshed human impressions of dialogue systems. The long-standing goal of dialogue systems is to be human-like enough to establish long-term connections with users. Therefore, there is an urgent need to evaluate LLMs as human-like dialogue systems. In this paper, we propose DialogBench, a dialogue evaluation benchmark that contains 12 dialogue tasks to probe the capabilities that LLMs should have as human-like dialogue systems. Specifically, we prompt GPT-4 to generate evaluation instances for each task. We first design a basic prompt based on widely used design principles and further mitigate existing biases to generate higher-quality evaluation instances. Our extensive tests of 26 LLMs on the English and Chinese versions of DialogBench show that instruction tuning improves the human likeness of LLMs to a certain extent, but most LLMs still have much room for improvement as human-like dialogue systems. Interestingly, the results also show that positioning LLMs as AI assistants can cause instruction tuning to weaken their perception of human emotions and their mastery of information about human daily life.
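As a rough illustration of the instance-generation step, the sketch below prompts GPT-4 for one evaluation instance of a given task via the OpenAI Python client. The prompt template here is hypothetical; the real DialogBench prompts encode the design principles and bias mitigations described above.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical template; the paper's actual prompts are more elaborate.
PROMPT = ("Generate a multiple-choice evaluation instance for the dialogue "
          "task '{task}': a dialogue context, four candidate responses, and "
          "the index of the correct response.")

def generate_instance(task: str) -> str:
    # One GPT-4 call per evaluation instance, as the abstract describes.
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": PROMPT.format(task=task)}],
    )
    return response.choices[0].message.content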

2022

Dial2vec: Self-Guided Contrastive Learning of Unsupervised Dialogue Embeddings
Che Liu | Rui Wang | Junfeng Jiang | Yongbin Li | Fei Huang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

In this paper, we introduce the task of learning unsupervised dialogue embeddings. Trivial approaches, such as combining pre-trained word or sentence embeddings and encoding through pre-trained language models (PLMs), have been shown to be feasible for this task. However, these approaches typically ignore the conversational interactions between interlocutors, resulting in poor performance. To address this issue, we propose a self-guided contrastive learning approach named dial2vec. Dial2vec treats a dialogue as an information exchange process. It captures the interaction patterns between interlocutors and leverages them to guide the learning of the embedding corresponding to each interlocutor. The dialogue embedding is then obtained by aggregating the embeddings of all interlocutors. To verify our approach, we establish a comprehensive benchmark consisting of six widely used dialogue datasets. We consider three evaluation tasks: domain categorization, semantic relatedness, and dialogue retrieval. Dial2vec achieves average absolute improvements of 8.7, 9.0, and 13.8 points in purity, Spearman's correlation, and mean average precision (MAP), respectively, over the strongest baseline on the three tasks. Further analysis shows that dial2vec obtains informative and discriminative embeddings for both interlocutors under the guidance of the conversational interactions, and achieves the best performance when aggregating them through an interlocutor-level pooling strategy. All code and data are publicly available at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/dial2vec.
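A minimal sketch of the self-guided idea, assuming per-turn embeddings from a frozen PLM and two interlocutors; the attention-based guidance and the naming below are illustrative simplifications, not dial2vec's exact architecture.

import torch

def dialogue_embedding(turn_embs: torch.Tensor,
                       speaker_ids: torch.Tensor) -> torch.Tensor:
    # turn_embs: (num_turns, dim) turn embeddings from a frozen PLM.
    # speaker_ids: (num_turns,) with values 0 or 1 (both speakers present).
    per_speaker = []
    for spk in (0, 1):
        own = turn_embs[speaker_ids == spk]    # this interlocutor's turns
        other = turn_embs[speaker_ids != spk]  # the other interlocutor's turns
        # Interaction patterns: similarity of own turns to the other's turns
        # guides how much cross-speaker context flows into each embedding.
        attn = torch.softmax(own @ other.T / own.size(-1) ** 0.5, dim=-1)
        guided = attn @ other
        per_speaker.append((own + guided).mean(dim=0))
    # Interlocutor-level pooling: aggregate the two interlocutor embeddings.
    return torch.stack(per_speaker).mean(dim=0)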

2021

DialogueCSE: Dialogue-based Contrastive Learning of Sentence Embeddings
Che Liu | Rui Wang | Jinghua Liu | Jian Sun | Fei Huang | Luo Si
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Learning sentence embeddings from dialogues has drawn increasing attention due to its low annotation cost and high domain adaptability. Conventional approaches employ a siamese network for this task, obtaining sentence embeddings by modeling context-response semantic relevance with a feed-forward network applied on top of the sentence encoders. However, since semantic textual similarity is commonly measured with element-wise distance metrics (e.g., cosine and L2 distance), such an architecture yields a large gap between training and evaluation. In this paper, we propose DialogueCSE, a dialogue-based contrastive learning approach to tackle this issue. DialogueCSE first introduces a novel matching-guided embedding (MGE) mechanism, which generates a context-aware embedding for each candidate response embedding (i.e., the context-free embedding) under the guidance of the multi-turn context-response matching matrices. It then pairs each context-aware embedding with its corresponding context-free embedding and minimizes the contrastive loss across all pairs. We evaluate our model on three multi-turn dialogue datasets: the Microsoft Dialogue Corpus, the Jing Dong Dialogue Corpus, and the E-commerce Dialogue Corpus. Evaluation results show that our approach significantly outperforms the baselines across all three datasets in terms of MAP and Spearman's correlation, demonstrating its effectiveness. Further quantitative experiments show that our approach performs better when leveraging more dialogue context and remains robust when less training data is provided.
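The pairing-and-contrast step can be sketched as an InfoNCE-style loss with in-batch negatives; the MGE mechanism that produces the context-aware embeddings is omitted here, and the temperature and exact loss form are assumptions rather than the paper's specification.

import torch
import torch.nn.functional as F

def pair_contrastive_loss(context_aware: torch.Tensor,
                          context_free: torch.Tensor,
                          temperature: float = 0.05) -> torch.Tensor:
    # Both inputs: (batch, dim). Row i of context_aware is the MGE output for
    # the response whose context-free embedding is row i of context_free.
    a = F.normalize(context_aware, dim=-1)
    b = F.normalize(context_free, dim=-1)
    logits = a @ b.T / temperature          # pairwise cosine similarities
    labels = torch.arange(a.size(0), device=a.device)
    # Each context-aware embedding is matched to its own context-free pair;
    # the other in-batch pairs serve as negatives.
    return F.cross_entropy(logits, labels)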