Ziyi Liu


2023

pdf
Are Machine Rationales (Not) Useful to Humans? Measuring and Improving Human Utility of Free-text Rationales
Brihi Joshi | Ziyi Liu | Sahana Ramnath | Aaron Chan | Zhewei Tong | Shaoliang Nie | Qifan Wang | Yejin Choi | Xiang Ren
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Among the remarkable emergent capabilities of large language models (LMs) is free-text rationalization; beyond certain scale, large LMs are capable of generating seemingly useful rationalizations, which in turn, can dramatically enhance their performances on leaderboards. This phenomenon raises a question: can machine generated rationales also be useful for humans, especially when lay humans try to answer questions based on those machine rationales? We observe that human utility of existing rationales is far from satisfactory and expensive to estimate with human studies. Existing metrics like task performance of the LM generating the rationales or similarity between generated and gold rationales are not good indicators of their human utility. While we observe that certain properties of rationales like conciseness and novelty are correlated with their human utility, estimating them without human involvement is challenging. We show that, by estimating a rationale’s helpfulness in answering similar unseen instances, we can measure its human utility to a better extent. We also translate this finding into an automated score, Gen-U, that we propose, which can help improve LMs’ ability to generate rationales with better human utility, while maintaining most of its task performance. Lastly, we release all code and collected data with this project.

pdf
XMD: An End-to-End Framework for Interactive Explanation-Based Debugging of NLP Models
Dong-Ho Lee | Akshen Kadakia | Brihi Joshi | Aaron Chan | Ziyi Liu | Kiran Narahari | Takashi Shibuya | Ryosuke Mitani | Toshiyuki Sekiya | Jay Pujara | Xiang Ren
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)

NLP models are susceptible to learning spurious biases (i.e., bugs) that work on some datasets but do not properly reflect the underlying task. Explanation-based model debugging aims to resolve spurious biases by showing human users explanations of model behavior, asking users to give feedback on the behavior, thenusing the feedback to update the model. While existing model debugging methods have shown promise, their prototype-level implementations provide limited practical utility. Thus, we propose XMD: the first open-source, end-to-end framework for explanation-based model debugging. Given task- or instance-level explanations,users can flexibly provide various forms of feedback via an intuitive, web-based UI. After receiving user feedback, XMD automatically updates the model in real time, by regularizing the model so that its explanationsalign with the user feedback. The new model can then be easily deployed into real-world applications via Hugging Face. Using XMD, we can improve the model’s OOD performance on text classification tasks by up to 18%.

2022

pdf
AiM: Taking Answers in Mind to Correct Chinese Cloze Tests in Educational Applications
Yusen Zhang | Zhongli Li | Qingyu Zhou | Ziyi Liu | Chao Li | Mina Ma | Yunbo Cao | Hongzhi Liu
Proceedings of the 29th International Conference on Computational Linguistics

To automatically correct handwritten assignments, the traditional approach is to use an OCR model to recognize characters and compare them to answers. The OCR model easily gets confused on recognizing handwritten Chinese characters, and the textual information of the answers is missing during the model inference. However, teachers always have these answers in mind to review and correct assignments. In this paper, we focus on the Chinese cloze tests correction and propose a multimodal approach(named AiM). The encoded representations of answers interact with the visual information of students’ handwriting. Instead of predicting ‘right’ or ‘wrong’, we perform the sequence labeling on the answer text to infer which answer character differs from the handwritten content in a fine-grained way. We take samples of OCR datasets as the positive samples for this task, and develop a negative sample augmentation method to scale up the training data. Experimental results show that AiM outperforms OCR-based methods by a large margin. Extensive studies demonstrate the effectiveness of our multimodal approach.

pdf
ER-Test: Evaluating Explanation Regularization Methods for Language Models
Brihi Joshi | Aaron Chan | Ziyi Liu | Shaoliang Nie | Maziar Sanjabi | Hamed Firooz | Xiang Ren
Findings of the Association for Computational Linguistics: EMNLP 2022

By explaining how humans would solve a given task, human rationales can provide strong learning signal for neural language models (NLMs). Explanation regularization (ER) aims to improve NLM generalization by pushing the NLM’s machine rationales (Which input tokens did the NLM focus on?) to align with human rationales (Which input tokens would humans focus on). Though prior works primarily study ER via in-distribution (ID) evaluation, out-of-distribution (OOD) generalization is often more critical in real-world scenarios, yet ER’s effect on OOD generalization has been underexplored.In this paper, we introduce ER-Test, a framework for evaluating ER models’ OOD generalization along three dimensions: unseen datasets, contrast set tests, and functional tests. Using ER-Test, we comprehensively analyze how ER models’ OOD generalization varies with the rationale alignment criterion (loss function), human rationale type (instance-level v/s task-level), number and choice of rationale-annotated instances, and time budget for rationale annotation. Across two tasks and six datasets, we show that ER has little impact on ID performance but yields large OOD performance gains, with the best ER criterion being task-dependent. Also, ER can improve OOD performance even with task-level or few human rationales. Finally, we find that rationale annotation is more time-efficient than label annotation for improving OOD performance. Our results with ER-Test help demonstrate ER’s utility and establish best practices for using ER effectively.

2020

pdf
Detecting Foodborne Illness Complaints in Multiple Languages Using English Annotations Only
Ziyi Liu | Giannis Karamanolakis | Daniel Hsu | Luis Gravano
Proceedings of the 11th International Workshop on Health Text Mining and Information Analysis

Health departments have been deploying text classification systems for the early detection of foodborne illness complaints in social media documents such as Yelp restaurant reviews. Current systems have been successfully applied for documents in English and, as a result, a promising direction is to increase coverage and recall by considering documents in additional languages, such as Spanish or Chinese. Training previous systems for more languages, however, would be expensive, as it would require the manual annotation of many documents for each new target language. To address this challenge, we consider cross-lingual learning and train multilingual classifiers using only the annotations for English-language reviews. Recent zero-shot approaches based on pre-trained multi-lingual BERT (mBERT) have been shown to effectively align languages for aspects such as sentiment. Interestingly, we show that those approaches are less effective for capturing the nuances of foodborne illness, our public health application of interest. To improve performance without extra annotations, we create artificial training documents in the target language through machine translation and train mBERT jointly for the source (English) and target language. Furthermore, we show that translating labeled documents to multiple languages leads to additional performance improvements for some target languages. We demonstrate the benefits of our approach through extensive experiments with Yelp restaurant reviews in seven languages. Our classifiers identify foodborne illness complaints in multilingual reviews from the Yelp Challenge dataset, which highlights the potential of our general approach for deployment in health departments.