Xiang Zhou


2023

pdf
GrIPS: Gradient-free, Edit-based Instruction Search for Prompting Large Language Models
Archiki Prasad | Peter Hase | Xiang Zhou | Mohit Bansal
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Providing natural language instructions in prompts is a useful new paradigm for improving task performance of large language models in a zero-shot setting. Recent work has aimed to improve such prompts via manual rewriting or gradient-based tuning. However, manual rewriting is time-consuming and requires subjective interpretation, while gradient-based tuning can be extremely computationally demanding for large models and may not be feasible for API-based models. In this work, we introduce Gradient-free Instructional Prompt Search (GrIPS), a gradient-free, edit-based search approach for improving task instructions for large language models. GrIPS takes in instructions designed for humans and automatically returns an improved, edited prompt, while allowing for API-based tuning. With InstructGPT models, GrIPS improves the average task performance by up to 4.30 percentage points on eight classification tasks from the Natural Instructions dataset (with similar improvements for OPT, BLOOM, and FLAN-T5). We see improvements for both instruction-only prompts and instruction + k-shot examples prompts. Notably, GrIPS outperforms manual rewriting and purely example-based prompts while controlling for the available compute and data budget. Further, performance of GrIPS is comparable to select gradient-based tuning approaches. Qualitatively, we show our edits can simplify instructions and at times make them incoherent but nonetheless improve accuracy.

2022

pdf
Masked Part-Of-Speech Model: Does Modeling Long Context Help Unsupervised POS-tagging?
Xiang Zhou | Shiyue Zhang | Mohit Bansal
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Previous Part-Of-Speech (POS) induction models usually assume certain independence assumptions (e.g., Markov, unidirectional, local dependency) that do not hold in real languages. For example, the subject-verb agreement can be both long-term and bidirectional. To facilitate flexible dependency modeling, we propose a Masked Part-of-Speech Model (MPoSM), inspired by the recent success of Masked Language Models (MLM). MPoSM can model arbitrary tag dependency and perform POS induction through the objective of masked POS reconstruction. We achieve competitive results on both the English Penn WSJ dataset as well as the universal treebank containing 10 diverse languages. Though modeling the long-term dependency should ideally help this task, our ablation study shows mixed trends in different languages. To better understand this phenomenon, we design a novel synthetic experiment that can specifically diagnose the model’s ability to learn tag agreement. Surprisingly, we find that even strong baselines fail to solve this problem consistently in a very simplified setting: the agreement between adjacent words. Nonetheless, MPoSM achieves overall better performance. Lastly, we conduct a detailed error analysis to shed light on other remaining challenges.

pdf
Distributed NLI: Learning to Predict Human Opinion Distributions for Language Reasoning
Xiang Zhou | Yixin Nie | Mohit Bansal
Findings of the Association for Computational Linguistics: ACL 2022

We introduce distributed NLI, a new NLU task with a goal to predict the distribution of human judgements for natural language inference. We show that by applying additional distribution estimation methods, namely, Monte Carlo (MC) Dropout, Deep Ensemble, Re-Calibration, and Distribution Distillation, models can capture human judgement distribution more effectively than the softmax baseline. We show that MC Dropout is able to achieve decent performance without any distribution annotations while Re-Calibration can give further improvements with extra distribution annotations, suggesting the value of multiple annotations for one example in modeling the distribution of human judgements. Despite these improvements, the best results are still far below the estimated human upper-bound, indicating that predicting the distribution of human judgements is still an open, challenging problem with a large room for improvements. We showcase the common errors for MC Dropout and Re-Calibration. Finally, we give guidelines on the usage of these methods with different levels of data availability and encourage future work on modeling the human opinion distribution for language reasoning.

pdf
Improving the Adversarial Robustness of NLP Models by Information Bottleneck
Cenyuan Zhang | Xiang Zhou | Yixin Wan | Xiaoqing Zheng | Kai-Wei Chang | Cho-Jui Hsieh
Findings of the Association for Computational Linguistics: ACL 2022

Existing studies have demonstrated that adversarial examples can be directly attributed to the presence of non-robust features, which are highly predictive, but can be easily manipulated by adversaries to fool NLP models. In this study, we explore the feasibility of capturing task-specific robust features, while eliminating the non-robust ones by using the information bottleneck theory. Through extensive experiments, we show that the models trained with our information bottleneck-based method are able to achieve a significant improvement in robust accuracy, exceeding performances of all the previously reported defense methods while suffering almost no performance drop in clean accuracy on SST-2, AGNEWS and IMDB datasets.

pdf
Mutual Exclusivity Training and Primitive Augmentation to Induce Compositionality
Yichen Jiang | Xiang Zhou | Mohit Bansal
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Recent datasets expose the lack of the systematic generalization ability in standard sequence-to-sequence models. In this work, we analyze this behavior of seq2seq models and identify two contributing factors: a lack of mutual exclusivity bias (one target sequence can only be mapped to one source sequence), and the tendency to memorize whole examples rather than separating structures from contents. We propose two techniques to address these two issues respectively: Mutual Exclusivity Training that prevents the model from producing seen generations when facing novel examples via an unlikelihood-based loss, and prim2primX data augmentation that automatically diversifies the arguments of every syntactic function to prevent memorizing and provide a compositional inductive bias without exposing test-set data. Combining these two techniques, we show substantial empirical improvements using standard sequence-to-sequence models (LSTMs and Transformers) on two widely-used compositionality datasets: SCAN and COGS. Finally, we provide analysis characterizing the improvements as well as the remaining challenges, and provide detailed ablations of our method.

2021

pdf
Hidden Biases in Unreliable News Detection Datasets
Xiang Zhou | Heba Elfardy | Christos Christodoulopoulos | Thomas Butler | Mohit Bansal
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Automatic unreliable news detection is a research problem with great potential impact. Recently, several papers have shown promising results on large-scale news datasets with models that only use the article itself without resorting to any fact-checking mechanism or retrieving any supporting evidence. In this work, we take a closer look at these datasets. While they all provide valuable resources for future research, we observe a number of problems that may lead to results that do not generalize in more realistic settings. Specifically, we show that selection bias during data collection leads to undesired artifacts in the datasets. In addition, while most systems train and predict at the level of individual articles, overlapping article sources in the training and evaluation data can provide a strong confounding factor that models can exploit. In the presence of this confounding factor, the models can achieve good performance by directly memorizing the site-label mapping instead of modeling the real task of unreliable news detection. We observed a significant drop (>10%) in accuracy for all models tested in a clean split with no train/test source overlap. Using the observations and experimental results, we provide practical suggestions on how to create more reliable datasets for the unreliable news detection task. We suggest future dataset creation include a simple model as a difficulty/bias probe and future model development use a clean non-overlapping site and date split.

pdf
面向法律文本的实体关系联合抽取算法(Joint Entity and Relation Extraction for Legal Texts)
Wenhui Song (宋文辉) | Xiang Zhou (周翔) | Ping Yang (杨萍) | Yuanyuan Sun (孙媛媛) | Liang Yang (杨亮) | Hongfei Lin (林鸿飞)
Proceedings of the 20th Chinese National Conference on Computational Linguistics

法律文本中包含的丰富信息可以通过结构化的实体关系三元组进行表示,便于法律知识的存储和查询。传统的流水线方法在自动抽取三元组时执行了大量冗余计算,造成了误差传播。而现有的联合学习方法无法适用于有大量重叠关系的法律文本,也并未关注语法结构信息对文本表示的增强,因此本文提出一种面向法律文本的实体关系联合抽取模型。该模型首先通过ON-LSTM注入语法信息,然后引入多头注意力机制分解重叠关系。相较于流水线和其他联合学习方法本文模型抽取效果最佳,在涉毒类法律文本数据集上抽取结果的F1值达到78.7%。

2020

pdf
Towards Robustifying NLI Models Against Lexical Dataset Biases
Xiang Zhou | Mohit Bansal
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

While deep learning models are making fast progress on the task of Natural Language Inference, recent studies have also shown that these models achieve high accuracy by exploiting several dataset biases, and without deep understanding of the language semantics. Using contradiction-word bias and word-overlapping bias as our two bias examples, this paper explores both data-level and model-level debiasing methods to robustify models against lexical dataset biases. First, we debias the dataset through data augmentation and enhancement, but show that the model bias cannot be fully removed via this method. Next, we also compare two ways of directly debiasing the model without knowing what the dataset biases are in advance. The first approach aims to remove the label bias at the embedding level. The second approach employs a bag-of-words sub-model to capture the features that are likely to exploit the bias and prevents the original model from learning these biased features by forcing orthogonality between these two sub-models. We performed evaluations on new balanced datasets extracted from the original MNLI dataset as well as the NLI stress tests, and show that the orthogonality approach is better at debiasing the model while maintaining competitive overall accuracy.

pdf
The Curse of Performance Instability in Analysis Datasets: Consequences, Source, and Suggestions
Xiang Zhou | Yixin Nie | Hao Tan | Mohit Bansal
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We find that the performance of state-of-the-art models on Natural Language Inference (NLI) and Reading Comprehension (RC) analysis/stress sets can be highly unstable. This raises three questions: (1) How will the instability affect the reliability of the conclusions drawn based on these analysis sets? (2) Where does this instability come from? (3) How should we handle this instability and what are some potential solutions? For the first question, we conduct a thorough empirical study over analysis sets and find that in addition to the unstable final performance, the instability exists all along the training curve. We also observe lower-than-expected correlations between the analysis validation set and standard validation set, questioning the effectiveness of the current model-selection routine. Next, to answer the second question, we give both theoretical explanations and empirical evidence regarding the source of the instability, demonstrating that the instability mainly comes from high inter-example correlations within analysis sets. Finally, for the third question, we discuss an initial attempt to mitigate the instability and suggest guidelines for future work such as reporting the decomposed variance for more interpretable results and fair comparison across models.

pdf
What Can We Learn from Collective Human Opinions on Natural Language Inference Data?
Yixin Nie | Xiang Zhou | Mohit Bansal
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Despite the subjective nature of many NLP tasks, most NLU evaluations have focused on using the majority label with presumably high agreement as the ground truth. Less attention has been paid to the distribution of human opinions. We collect ChaosNLI, a dataset with a total of 464,500 annotations to study Collective HumAn OpinionS in oft-used NLI evaluation sets. This dataset is created by collecting 100 annotations per example for 3,113 examples in SNLI and MNLI and 1,532 examples in αNLI. Analysis reveals that: (1) high human disagreement exists in a noticeable amount of examples in these datasets; (2) the state-of-the-art models lack the ability to recover the distribution over human labels; (3) models achieve near-perfect accuracy on the subset of data with a high level of human agreement, whereas they can barely beat a random guess on the data with low levels of human agreement, which compose most of the common errors made by state-of-the-art models on the evaluation sets. This questions the validity of improving model performance on old metrics for the low-agreement part of evaluation datasets. Hence, we argue for a detailed examination of human agreement in future data collection efforts, and evaluating model outputs against the distribution over collective human opinions.

2017

pdf
On-line Dialogue Policy Learning with Companion Teaching
Lu Chen | Runzhe Yang | Cheng Chang | Zihao Ye | Xiang Zhou | Kai Yu
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

On-line dialogue policy learning is the key for building evolvable conversational agent in real world scenarios. Poor initial policy can easily lead to bad user experience and consequently fail to attract sufficient users for policy training. A novel framework, companion teaching, is proposed to include a human teacher in the dialogue policy training loop to address the cold start problem. Here, dialogue policy is trained using not only user’s reward, but also teacher’s example action as well as estimated immediate reward at turn level. Simulation experiments showed that, with small number of human teaching dialogues, the proposed approach can effectively improve user experience at the beginning and smoothly lead to good performance with more user interaction data.

pdf
Affordable On-line Dialogue Policy Learning
Cheng Chang | Runzhe Yang | Lu Chen | Xiang Zhou | Kai Yu
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

The key to building an evolvable dialogue system in real-world scenarios is to ensure an affordable on-line dialogue policy learning, which requires the on-line learning process to be safe, efficient and economical. But in reality, due to the scarcity of real interaction data, the dialogue system usually grows slowly. Besides, the poor initial dialogue policy easily leads to bad user experience and incurs a failure of attracting users to contribute training data, so that the learning process is unsustainable. To accurately depict this, two quantitative metrics are proposed to assess safety and efficiency issues. For solving the unsustainable learning problem, we proposed a complete companion teaching framework incorporating the guidance from the human teacher. Since the human teaching is expensive, we compared various teaching schemes answering the question how and when to teach, to economically utilize teaching budget, so that make the online learning process affordable.

pdf
Agent-Aware Dropout DQN for Safe and Efficient On-line Dialogue Policy Learning
Lu Chen | Xiang Zhou | Cheng Chang | Runzhe Yang | Kai Yu
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Hand-crafted rules and reinforcement learning (RL) are two popular choices to obtain dialogue policy. The rule-based policy is often reliable within predefined scope but not self-adaptable, whereas RL is evolvable with data but often suffers from a bad initial performance. We employ a companion learning framework to integrate the two approaches for on-line dialogue policy learning, in which a pre-defined rule-based policy acts as a “teacher” and guides a data-driven RL system by giving example actions as well as additional rewards. A novel agent-aware dropout Deep Q-Network (AAD-DQN) is proposed to address the problem of when to consult the teacher and how to learn from the teacher’s experiences. AAD-DQN, as a data-driven student policy, provides (1) two separate experience memories for student and teacher, (2) an uncertainty estimated by dropout to control the timing of consultation and learning. Simulation experiments showed that the proposed approach can significantly improve both safetyand efficiency of on-line policy optimization compared to other companion learning approaches as well as supervised pre-training using static dialogue corpus.