Morteza Ziyadi


2023

INVITE: a Testbed of Automatically Generated Invalid Questions to Evaluate Large Language Models for Hallucinations
Anil Ramakrishna | Rahul Gupta | Jens Lehmann | Morteza Ziyadi
Findings of the Association for Computational Linguistics: EMNLP 2023

Recent advancements in large language models (LLMs) have enabled them to hold free-form conversations over multiple turns, but they exhibit a tendency to make unfounded and incorrect statements, commonly known as hallucinations. In particular, LLMs hallucinate frequently when given invalid questions, i.e., ones with incorrect assumptions. The most common approach to evaluating LLMs for hallucinations is to test them on question answering (QA) test sets such as TruthfulQA. However, LLMs are increasingly pretrained on massive text corpora scraped from the Internet, which may expose these test sets to the model during training and eventually lead to an overestimation of model performance on them. In this work, we present an alternative framework to address this risk and to foster further research towards making LLMs robust against invalid questions. We name our framework INVITE: a testbed of automatically generated INValId questions to evaluaTE large language models for hallucinations. In each instantiation, our framework creates a fresh batch of invalid questions by distorting valid facts, replacing their subjects or objects with similar entities. We evaluate several state-of-the-art LLMs against a test set generated by our framework and highlight its capacity to trigger hallucinations in these models.
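The distortion step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the facts, question templates, the `SIMILAR_ENTITIES` lookup, and the function names are all hypothetical, chosen only to show how swapping a subject or object for a similar entity yields a question with a false assumption.

```python
import random

# Hypothetical valid facts as (subject, question template, object) triples.
# These examples are illustrative and are not taken from the paper.
VALID_FACTS = [
    ("Paris", "In which year did {subject} become the capital of {object}?", "France"),
    ("Jane Austen", "When did {subject} write {object}?", "Pride and Prejudice"),
]

# Hypothetical lookup of same-type entities that could plausibly be confused
# with the original one (an assumption; a real system would need a knowledge base).
SIMILAR_ENTITIES = {
    "Paris": ["Lyon", "Marseille"],
    "France": ["Belgium", "Spain"],
    "Jane Austen": ["Charlotte Bronte", "George Eliot"],
    "Pride and Prejudice": ["Jane Eyre", "Middlemarch"],
}


def make_invalid_question(fact, rng):
    """Distort one valid fact by swapping either its subject or its object."""
    subject, template, obj = fact
    if rng.random() < 0.5:
        subject = rng.choice(SIMILAR_ENTITIES[subject])
    else:
        obj = rng.choice(SIMILAR_ENTITIES[obj])
    return template.format(subject=subject, object=obj)


def generate_batch(facts, seed):
    """Create one batch of invalid questions; a new seed gives a fresh batch."""
    rng = random.Random(seed)
    return [make_invalid_question(fact, rng) for fact in facts]


if __name__ == "__main__":
    for question in generate_batch(VALID_FACTS, seed=0):
        print(question)
```

Changing the seed produces a different batch from the same facts, which mirrors the idea of generating fresh test questions in each instantiation rather than reusing a fixed, possibly leaked test set.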

Entity Contrastive Learning in a Large-Scale Virtual Assistant System
Jonathan Rubin | Jason Crowley | George Leung | Morteza Ziyadi | Maria Minakova
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track)

Conversational agents are typically made up of domain classifiers (DC) and intent classifiers (IC), which identify the general subject an utterance belongs to and the specific action a user wishes to achieve, respectively. In addition, named entity recognition (NER) performs per-token labeling to identify specific entities of interest in a spoken utterance. We investigate improving joint IC and NER models using entity contrastive learning, which attempts to cluster similar entities together in a learned representation space. We compare a full virtual assistant system trained using entity contrastive learning to a production baseline system that does not use contrastive learning. We present both offline results, using retrospective test sets, and live online results from an A/B test that compared the two systems. In both the offline and online settings, entity contrastive training improved overall performance against production baselines. Furthermore, we provide a detailed analysis of learned entity embeddings, including both qualitative analysis via dimensionality-reduced visualizations and quantitative analysis by computing alignment and uniformity metrics. We show that entity contrastive learning improves alignment metrics and produces well-formed embedding clusters in representation space.

2022

Reasoning Like Program Executors
Xinyu Pi | Qian Liu | Bei Chen | Morteza Ziyadi | Zeqi Lin | Qiang Fu | Yan Gao | Jian-Guang Lou | Weizhu Chen
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Reasoning over natural language is a long-standing goal for the research community. However, studies have shown that existing language models are inadequate at reasoning. To address the issue, we present POET, a novel reasoning pre-training paradigm. By pre-training language models on programs and their execution results, POET empowers language models to harvest the reasoning knowledge possessed by program executors via a data-driven approach. POET is conceptually simple and can be instantiated by different kinds of program executors. In this paper, we showcase two simple instances, POET-Math and POET-Logic, in addition to a complex instance, POET-SQL. Experimental results on six benchmarks demonstrate that POET can significantly boost model performance in natural language reasoning, such as numerical reasoning, logical reasoning, and multi-hop reasoning. POET opens a new gate on reasoning-enhancement pre-training, and we hope our analysis will shed light on future research on reasoning like program executors.

Improving Large-Scale Conversational Assistants using Model Interpretation based Training Sample Selection
Stefan Schroedl | Manoj Kumar | Kiana Hajebi | Morteza Ziyadi | Sriram Venkatapathy | Anil Ramakrishna | Rahul Gupta | Pradeep Natarajan
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track

This paper presents an approach to identify samples from live traffic where the customer implicitly communicated satisfaction with Alexa’s responses, by leveraging interpretations of model behavior. Such customer signals are noisy, and adding a large number of samples from live traffic to the training set makes re-training infeasible. Our work addresses these challenges by identifying a small number of samples that grow the training set by ~0.05% while producing statistically significant improvements in both offline and online tests.