Chen Zhao


2024

Large Language Models Help Humans Verify Truthfulness – Except When They Are Convincingly Wrong
Chenglei Si | Navita Goyal | Tongshuang Wu | Chen Zhao | Shi Feng | Hal Daumé III | Jordan Boyd-Graber
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Large Language Models (LLMs) are increasingly used for accessing information on the web. Their truthfulness and factuality are thus of great interest. To help users make the right decisions about the information they get, LLMs should not only provide information but also help users fact-check it. We conduct human experiments with 80 crowdworkers to compare language models with search engines (information retrieval systems) at facilitating fact-checking. We prompt LLMs to validate a given claim and provide corresponding explanations. Users reading LLM explanations are significantly more efficient than those using search engines while achieving similar accuracy. However, they over-rely on the LLMs when the explanation is wrong. To reduce over-reliance on LLMs, we ask LLMs to provide contrastive information: explain both why the claim could be true and why it could be false, and then present both sides of the explanation to users. This contrastive explanation mitigates users’ over-reliance on LLMs, but does not significantly outperform search engines. Further, showing both search engine results and LLM explanations offers no complementary benefits compared to search engines alone. Taken together, our study highlights that natural language explanations by LLMs may not be a reliable replacement for reading the retrieved passages, especially in high-stakes settings where over-relying on wrong AI explanations could lead to critical consequences.
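
A minimal sketch of the contrastive-explanation setup described above, assuming a generic chat-completion backend; `call_llm` and the prompt wording are placeholders rather than the prompts used in the study:

```python
# Contrastive explanation sketch: ask the LLM to argue each side of a claim,
# then show both explanations to the user instead of a single verdict.
# `call_llm` is a placeholder for any chat-completion API.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM API of your choice")

def contrastive_explanations(claim: str) -> dict:
    supporting = call_llm(
        f"Claim: {claim}\nExplain why this claim could be TRUE, citing the strongest evidence."
    )
    refuting = call_llm(
        f"Claim: {claim}\nExplain why this claim could be FALSE, citing the strongest evidence."
    )
    # Presenting both sides encourages users to weigh evidence rather than defer to one answer.
    return {"supporting": supporting, "refuting": refuting}
```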

2023

Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023)
Burcu Can | Maximilian Mozes | Samuel Cahyawijaya | Naomi Saphra | Nora Kassner | Shauli Ravfogel | Abhilasha Ravichander | Chen Zhao | Isabelle Augenstein | Anna Rogers | Kyunghyun Cho | Edward Grefenstette | Lena Voita
Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023)

On the Relation between Sensitivity and Accuracy in In-Context Learning
Yanda Chen | Chen Zhao | Zhou Yu | Kathleen McKeown | He He
Findings of the Association for Computational Linguistics: EMNLP 2023

In-context learning (ICL) suffers from oversensitivity to the prompt, making it unreliable in real-world scenarios. We study the sensitivity of ICL with respect to multiple perturbation types. First, we find that label bias obscures the true sensitivity, and therefore prior work may have significantly underestimated ICL sensitivity. Second, we observe a strong negative correlation between ICL sensitivity and accuracy: predictions sensitive to perturbations are less likely to be correct. Motivated by these findings, we propose SenSel, a few-shot selective prediction method that abstains from sensitive predictions. Experiments on ten classification datasets show that SenSel consistently outperforms two commonly used confidence-based and entropy-based baselines on abstention decisions.
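
A minimal sketch of sensitivity-based abstention in the spirit of SenSel, under the assumption that sensitivity is measured by disagreement across perturbed prompts; `predict` and the perturbation scheme are placeholders, and the paper’s selection rule may differ:

```python
from collections import Counter

def sensitive_selective_predict(predict, prompt_variants):
    """Abstain when the prediction is sensitive to prompt perturbations.

    `predict` maps a prompt string to a label; `prompt_variants` are perturbed
    versions of the same prompt (e.g., reordered or re-sampled demonstrations).
    """
    preds = [predict(p) for p in prompt_variants]
    label, votes = Counter(preds).most_common(1)[0]
    if votes < len(preds):      # at least one perturbation flips the label: sensitive
        return None             # abstain
    return label
```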

Getting MoRE out of Mixture of Language Model Reasoning Experts
Chenglei Si | Weijia Shi | Chen Zhao | Luke Zettlemoyer | Jordan Boyd-Graber
Findings of the Association for Computational Linguistics: EMNLP 2023

While recent large language models (LLMs) improve on various question answering (QA) datasets, it remains difficult for a single model to generalize across question types that require distinct reasoning abilities. We provide empirical evidence that state-of-the-art LLMs suffer from poor generalizability on reasoning types beyond those seen in the prompt. To remedy this, we propose a Mixture-of-Reasoning-Experts (MoRE) framework that ensembles diverse specialized language models. We specialize the backbone language model with prompts optimized for different reasoning categories, including factual, multihop, mathematical, and commonsense reasoning. Our key insight is to leverage agreement among the specialized experts to select the best answer for each question, or to abstain from answering. This gives MoRE higher accuracy than any single specialized model on a collection of 12 QA datasets from four reasoning types. Beyond generalizability, the interpretable design of MoRE improves selective question answering results compared to baselines without incorporating inter-expert agreement. This framework is also more interpretable and useful to human consumers of QA outputs. Our human study confirms that presenting expert predictions and the answer selection process helps annotators more accurately calibrate when to trust the system’s output. We release all code and data to facilitate future work.
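
A simplified sketch of agreement-based answer selection in the spirit of MoRE, assuming each expert is a callable that maps a question to an answer string; the paper’s actual selector and abstention rule may differ:

```python
from collections import Counter

def more_style_answer(question, experts, min_agreement=0.5):
    """Ensemble of specialized QA experts: return the answer most experts agree on,
    or abstain (return None) when agreement is too low to be trusted."""
    answers = [expert(question) for expert in experts]
    best, votes = Counter(answers).most_common(1)[0]
    agreement = votes / len(answers)
    return best if agreement >= min_agreement else None
```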

RobuT: A Systematic Study of Table QA Robustness Against Human-Annotated Adversarial Perturbations
Yilun Zhao | Chen Zhao | Linyong Nan | Zhenting Qi | Wenlin Zhang | Xiangru Tang | Boyu Mi | Dragomir Radev
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Despite significant progress in question answering over tabular data (Table QA), it remains unclear whether, and to what extent, existing Table QA models are robust to task-specific perturbations, e.g., replacing key question entities or shuffling table columns. To systematically study the robustness of Table QA models, we propose a benchmark called RobuT, which builds upon existing Table QA datasets (WTQ, WikiSQL-Weak, and SQA) and includes human-annotated adversarial perturbations of table headers, table content, and questions. Our results indicate that both state-of-the-art Table QA models and large language models (e.g., GPT-3) with few-shot learning falter on these adversarial sets. We propose to address this problem by using large language models to generate adversarial examples for training, which significantly improves the robustness of Table QA models.
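
As an illustration of one perturbation type named above (shuffling table columns), here is a toy sketch; RobuT’s perturbations themselves are human-annotated, so this only shows the kind of invariance being tested:

```python
import random

def shuffle_columns(table):
    """Column-shuffling perturbation for a table given as {"header": [...], "rows": [[...], ...]}.
    A robust Table QA model should return the same answer before and after shuffling."""
    order = list(range(len(table["header"])))
    random.shuffle(order)
    return {
        "header": [table["header"][i] for i in order],
        "rows": [[row[i] for i in order] for row in table["rows"]],
    }

table = {"header": ["Year", "Champion"], "rows": [[2021, "Team A"], [2022, "Team B"]]}
perturbed = shuffle_columns(table)
```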

Retrieval-Augmented Chain-of-Thought in Semi-structured Domains
Vaibhav Mavi | Abulhair Saparov | Chen Zhao
Proceedings of the Natural Legal Language Processing Workshop 2023

Applying existing question answering (QA) systems to specialized domains like law and finance presents challenges that necessitate domain expertise. Although large language models (LLMs) have shown impressive language comprehension and in-context learning capabilities, their inability to handle very long inputs/contexts is well known. Tasks specific to these domains need significant background knowledge, leading to contexts that can often exceed the maximum length that existing LLMs can process. This study explores leveraging the semi-structured nature of legal and financial data to efficiently retrieve relevant context, enabling the use of LLMs for domain-specialized QA. The resulting system outperforms contemporary models and also provides useful explanations for the answers, encouraging the integration of LLMs into legal and financial NLP systems for future research.
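
A minimal sketch of using document structure to retrieve a compact context before prompting, assuming the semi-structured document is available as a {section_title: text} mapping; the lexical-overlap scorer and prompt wording are stand-ins, not the paper’s retriever:

```python
def retrieve_sections(question, sections, k=3):
    """Pick the k sections whose title and text share the most tokens with the question."""
    q_tokens = set(question.lower().split())
    def overlap(item):
        title, text = item
        return len(q_tokens & set((title + " " + text).lower().split()))
    top = sorted(sections.items(), key=overlap, reverse=True)[:k]
    return "\n\n".join(f"Section: {title}\n{text}" for title, text in top)

def build_prompt(question, sections):
    # Only the retrieved sections go into the prompt, keeping it within the LLM's context limit.
    return f"{retrieve_sections(question, sections)}\n\nQuestion: {question}\nLet's think step by step."
```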

2022

Bridging the Generalization Gap in Text-to-SQL Parsing with Schema Expansion
Chen Zhao | Yu Su | Adam Pauls | Emmanouil Antonios Platanios
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Text-to-SQL parsers map natural language questions to programs that are executable over tables to generate answers, and are typically evaluated on large-scale datasets like Spider (Yu et al., 2018). We argue that existing benchmarks fail to capture a certain out-of-domain generalization problem that is of significant practical importance: matching domain-specific phrases to composite operations over columns. To study this problem, we first propose a synthetic dataset along with a re-purposed train/test split of the Squall dataset (Shi et al., 2020) as new benchmarks to quantify domain generalization over column operations, and find that existing state-of-the-art parsers struggle on these benchmarks. We propose to address this problem by incorporating prior domain knowledge through preprocessing of table schemas, and design a method that consists of two components: schema expansion and schema pruning. This method can be easily applied to multiple existing base parsers, and we show that it significantly outperforms baseline parsers on this domain generalization problem, boosting the underlying parsers’ overall performance by up to 13.8% relative accuracy gain (5.1% absolute) on the new Squall data split.
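
A toy sketch of the schema-expansion idea, assuming a schema given as a {column_name: type} mapping; the composite operations shown (pairwise sums and differences of numeric columns) are illustrative, not the paper’s exact operator set:

```python
from itertools import combinations

def expand_schema(columns):
    """Add composite columns derived from pairs of numeric columns, so that a
    domain-specific phrase (e.g., "margin") can match a single expanded column
    instead of an unseen composite operation."""
    numeric = [name for name, ctype in columns.items() if ctype == "number"]
    expanded = dict(columns)
    for a, b in combinations(numeric, 2):
        expanded[f"{a}_minus_{b}"] = "number"
        expanded[f"{a}_plus_{b}"] = "number"
    return expanded

schema = {"team": "text", "points_for": "number", "points_against": "number"}
print(expand_schema(schema))  # includes points_for_minus_points_against, etc.
```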

Re-Examining Calibration: The Case of Question Answering
Chenglei Si | Chen Zhao | Sewon Min | Jordan Boyd-Graber
Findings of the Association for Computational Linguistics: EMNLP 2022

For users to trust model predictions, they need to understand model outputs, particularly their confidence; calibration aims to adjust (calibrate) models’ confidence to match expected accuracy. We argue that the traditional calibration evaluation does not promote effective calibration: for example, it can encourage always assigning a mediocre confidence score to all predictions, which does not help users distinguish correct predictions from wrong ones. Building on those observations, we propose a new calibration metric, MacroCE, that better captures whether the model assigns low confidence to wrong predictions and high confidence to correct predictions. Focusing on the practical application of open-domain question answering, we examine conventional calibration methods applied to the widely used retriever-reader pipeline, none of which brings significant gains under our new MacroCE metric. Toward better calibration, we propose a new calibration method (ConsCal) that uses not just final model predictions but whether multiple model checkpoints make consistent predictions. Altogether, we provide an alternative view of calibration along with a new metric, a re-evaluation of existing calibration methods on our metric, and a more effective calibration method.
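
A minimal sketch of a macro-averaged calibration error in the spirit of MacroCE as described above (wrong predictions should get low confidence, correct ones high confidence); the exact definition in the paper may differ:

```python
def macro_calibration_error(confidences, correct):
    """Average (1 - confidence) over correct predictions and confidence over wrong
    predictions, then macro-average the two, so uniform "mediocre" confidence is
    not rewarded the way it can be under bin-based ECE."""
    errs_correct = [1.0 - c for c, ok in zip(confidences, correct) if ok]
    errs_wrong = [c for c, ok in zip(confidences, correct) if not ok]
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return 0.5 * (mean(errs_correct) + mean(errs_wrong))

print(macro_calibration_error([0.9, 0.2, 0.8, 0.7], [True, False, True, False]))  # 0.3
```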

2021

Multi-Step Reasoning Over Unstructured Text with Beam Dense Retrieval
Chen Zhao | Chenyan Xiong | Jordan Boyd-Graber | Hal Daumé III
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Complex question answering often requires finding a reasoning chain that consists of multiple evidence pieces. Current approaches incorporate the strengths of structured knowledge and unstructured text, assuming the text corpus is semi-structured. Building on dense retrieval methods, we propose a new multi-step retrieval approach (BeamDR) that iteratively forms an evidence chain through beam search in dense representations. When evaluated on multi-hop question answering, BeamDR is competitive with state-of-the-art systems, without using any semi-structured information. Through query composition in dense space, BeamDR captures the implicit relationships between evidence pieces in the reasoning chain. The code is available at https://github.com/henryzhao5852/BeamDR.
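
A minimal sketch of beam search over dense retrieval as described above, assuming an `encode` function that maps text to a vector and a precomputed passage matrix; the query-composition format and scoring are simplifications of BeamDR:

```python
import numpy as np

def beam_dense_retrieval(question, encode, passages, passage_vecs, hops=2, beam=4, topk=8):
    """At each hop, compose the question with the evidence chain so far, re-encode it,
    retrieve the nearest passages, and keep the highest-scoring chains.
    `passage_vecs` is an (N, d) array aligned with the list `passages`."""
    chains = [([], 0.0)]                                   # (evidence chain, cumulative score)
    for _ in range(hops):
        candidates = []
        for chain, score in chains:
            q_vec = encode(" [SEP] ".join([question] + chain))   # query composition in dense space
            sims = passage_vecs @ q_vec
            for idx in np.argsort(-sims)[:topk]:
                if passages[idx] not in chain:
                    candidates.append((chain + [passages[idx]], score + float(sims[idx])))
        chains = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam]
    return chains                                          # ranked candidate evidence chains
```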

Distantly-Supervised Dense Retrieval Enables Open-Domain Question Answering without Evidence Annotation
Chen Zhao | Chenyan Xiong | Jordan Boyd-Graber | Hal Daumé III
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Open-domain question answering answers a question based on evidence retrieved from a large corpus. State-of-the-art neural approaches require intermediate evidence annotations for training. However, such intermediate annotations are expensive, and methods that rely on them cannot transfer to the more common setting, where only question–answer pairs are available. This paper investigates whether models can learn to find evidence from a large corpus, with only distant supervision from answer labels for model training, thereby incurring no additional annotation cost. We introduce a novel approach (DistDR) that iteratively improves over a weak retriever by alternately finding evidence from the up-to-date model and encouraging the model to learn the most likely evidence. Without using any evidence labels, DistDR is on par with fully-supervised state-of-the-art methods on both multi-hop and single-hop QA benchmarks. Our analysis confirms that DistDR finds more accurate evidence over iterations, which leads to model improvements. The code is available at https://github.com/henryzhao5852/DistDR.
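
A high-level sketch of the iterative distant-supervision loop described above; `retriever.retrieve` and `retriever.train` are placeholder interfaces, and the answer-string match is the simplest possible distant-supervision signal:

```python
def distantly_supervised_training(retriever, corpus, qa_pairs, rounds=3):
    """In each round, retrieve with the current model, keep retrieved passages that
    contain the gold answer as pseudo-positive evidence, and retrain on them."""
    for _ in range(rounds):
        pseudo_positives = []
        for question, answer in qa_pairs:
            for passage in retriever.retrieve(question, corpus, topk=20):
                if answer.lower() in passage.lower():      # distant supervision from the answer label
                    pseudo_positives.append((question, passage))
                    break
        retriever.train(pseudo_positives)                  # the model learns its most likely evidence
    return retriever
```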

What’s in a Name? Answer Equivalence For Open-Domain Question Answering
Chenglei Si | Chen Zhao | Jordan Boyd-Graber
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

A flaw in QA evaluation is that annotations often provide only one gold answer. Thus, model predictions that are semantically equivalent to the answer but superficially different are considered incorrect. This work explores mining alias entities from knowledge bases and using them as additional gold answers (i.e., equivalent answers). We incorporate these answers in two settings: evaluation with additional answers and model training with equivalent answers. We analyse three QA benchmarks: Natural Questions, TriviaQA, and SQuAD. Answer expansion increases the exact match score on all datasets for evaluation, and incorporating equivalent answers in training helps models on real-world datasets. We ensure the additional answers are valid through a human post hoc evaluation.
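
A minimal sketch of alias-aware exact match as described above, using SQuAD-style answer normalization; the alias source (a knowledge base) is assumed to be given as a dictionary:

```python
import re
import string

def normalize(text):
    """Standard answer normalization: lowercase, strip punctuation and articles."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers, aliases):
    """Credit a prediction that matches any gold answer or any alias of a gold answer."""
    valid = set(gold_answers) | {a for g in gold_answers for a in aliases.get(g, [])}
    return any(normalize(prediction) == normalize(ans) for ans in valid)

print(exact_match("NYC", ["New York City"], {"New York City": ["NYC", "New York"]}))  # True
```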

2020

On the Potential of Lexico-logical Alignments for Semantic Parsing to SQL Queries
Tianze Shi | Chen Zhao | Jordan Boyd-Graber | Hal Daumé III | Lillian Lee
Findings of the Association for Computational Linguistics: EMNLP 2020

Large-scale semantic parsing datasets annotated with logical forms have enabled major advances in supervised approaches. But can richer supervision help even more? To explore the utility of fine-grained, lexical-level supervision, we introduce SQUALL, a dataset that enriches 11,276 WikiTableQuestions English-language questions with manually created SQL equivalents plus alignments between SQL and question fragments. Our annotation enables new training possibilities for encoder-decoder models, including approaches from machine translation previously precluded by the absence of alignments. We propose and test two methods: (1) supervised attention; (2) adopting an auxiliary objective of disambiguating references in the input queries to table columns. In 5-fold cross-validation, these strategies improve over strong baselines by 4.4% execution accuracy. Oracle experiments suggest that annotated alignments can support further accuracy gains of up to 23.9%.
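
A toy sketch of the supervised attention idea named in method (1), assuming the decoder’s attention over question tokens is available as a probability vector and the annotation gives which question tokens align with the SQL token being generated; the paper’s exact loss formulation may differ:

```python
import numpy as np

def supervised_attention_loss(attention, aligned_positions):
    """Cross-entropy between the decoder's attention distribution over question tokens
    and a uniform target over the annotated aligned span."""
    target = np.zeros_like(attention)
    target[aligned_positions] = 1.0 / len(aligned_positions)
    return float(-np.sum(target * np.log(attention + 1e-9)))

attn = np.array([0.10, 0.70, 0.15, 0.05])   # attention while generating one SQL token
print(supervised_attention_loss(attn, aligned_positions=[1]))  # low loss: attention matches the alignment
```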

2018

A dataset and baselines for sequential open-domain question answering
Ahmed Elgohary | Chen Zhao | Jordan Boyd-Graber
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Previous work on question-answering systems mainly focuses on answering individual questions, assuming they are independent and devoid of context. Instead, we investigate sequential question answering, in which multiple related questions are asked in turn. We present QBLink, a new dataset of fully human-authored questions. We extend existing strong question answering frameworks to incorporate previous questions, improving overall accuracy in the open-domain setting. The dataset is publicly available at http://sequential.qanta.org.
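
A minimal sketch of conditioning on question history, assuming the simplest strategy of prepending recent questions to the current one before retrieval and reading; the paper’s framework extensions are more involved:

```python
def contextualize(question, previous_questions, max_history=2):
    """Form a self-contained query by prepending the most recent previous questions."""
    history = previous_questions[-max_history:]
    return " ".join(history + [question])

print(contextualize("Which year did he win it?",
                    ["Who discovered penicillin?", "What prize did he receive for it?"]))
```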

2016

Analyzing Time Series Changes of Correlation between Market Share and Concerns on Companies measured through Search Engine Suggests
Takakazu Imada | Yusuke Inoue | Lei Chen | Syunya Doi | Tian Nie | Chen Zhao | Takehito Utsuro | Yasuhide Kawada
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper proposes how to utilize a search engine to predict market share. Given a specific product domain, we compare the rates of concern among Web searchers for several companies that supply products in that domain, where concern is measured through search engine suggests. We then analyze whether these concern rates correlate with actual market share and show that certain correlations exist. Finally, we propose how to predict the market share of a specific product genre based on these concern rates.
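
A toy illustration of the correlation analysis described above; the numbers are hypothetical, and in practice the concern rates would be derived from counts of search engine suggests per company:

```python
from statistics import correlation  # Pearson correlation (Python 3.10+)

# Hypothetical per-company figures, for illustration only.
concern_rate = [0.35, 0.24, 0.21, 0.12, 0.08]   # share of suggest-based concern per company
market_share = [0.38, 0.22, 0.19, 0.13, 0.08]   # actual market share
print(correlation(concern_rate, market_share))  # close to 1.0: concerns track market share
```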