Xi Victoria Lin

U of Washington, Meta


2024

pdf
FOLIO: Natural Language Reasoning with First-Order Logic
Simeng Han | Hailey Schoelkopf | Yilun Zhao | Zhenting Qi | Martin Riddell | Wenfei Zhou | James Coady | David Peng | Yujie Qiao | Luke Benson | Lucy Sun | Alexander Wardle-Solano | Hannah Szabó | Ekaterina Zubova | Matthew Burtell | Jonathan Fan | Yixin Liu | Brian Wong | Malcolm Sailor | Ansong Ni | Linyong Nan | Jungo Kasai | Tao Yu | Rui Zhang | Alexander Fabbri | Wojciech Maciej Kryscinski | Semih Yavuz | Ye Liu | Xi Victoria Lin | Shafiq Joty | Yingbo Zhou | Caiming Xiong | Rex Ying | Arman Cohan | Dragomir Radev
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Large language models (LLMs) have achieved remarkable performance on a variety of natural language understanding tasks. However, existing benchmarks are inadequate in measuring the complex logical reasoning capabilities of a model. We present FOLIO, a human-annotated, logically complex and diverse dataset for reasoning in natural language (NL), equipped with first-order logic (FOL) annotations. FOLIO consists of 1,430 examples (unique conclusions), each paired with one of 487 sets of premises used to deductively reason for the validity of each conclusion. The logical correctness of the premises and conclusions is ensured by their FOL annotations, which are automatically verified by an FOL inference engine. In addition to the main NL reasoning task, NL-FOL pairs in FOLIO constitute a new NL-FOL translation dataset. Our experiments on FOLIO systematically evaluate the FOL reasoning ability of supervised fine-tuning on medium-sized language models. For both NL reasoning and NL-FOL translation, we benchmark multiple state-of-the-art language models. Our results show that a subset of FOLIO remains a challenge for one of the most capable Large Language Model (LLM) publicly available, GPT-4.

2023

pdf
Training Trajectories of Language Models Across Scales
Mengzhou Xia | Mikel Artetxe | Chunting Zhou | Xi Victoria Lin | Ramakanth Pasunuru | Danqi Chen | Luke Zettlemoyer | Veselin Stoyanov
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Scaling up language models has led to unprecedented performance gains, but little is understood about how the training dynamics change as models get larger. How do language models of different sizes learn during pre-training? Why do larger language models demonstrate more desirable behaviors? In this paper, we analyze the intermediate training checkpoints of differently sized OPT models (Zhang et al., 2022)—from 125M to 175B parameters—on next-token prediction, sequence-level generation and downstream tasks. We find that 1) at a given perplexity and independent of model sizes, a similar subset of training tokens see the most significant reduction in loss, with the rest stagnating or showing double-descent behavior (Nakkiran et al., 2020); 2) early in training, all models learn to reduce the perplexity of grammatical sequences that contain hallucinations, with small models halting at this suboptimal distribution and larger ones eventually learning to assign these sequences lower probabilities; and 3) perplexity is a strong predictor of in-context learning performance on 74 multiple-choice tasks from BIG-Bench, and this holds independent of the model size. Together, these results show that perplexity is more predictive of model behaviors than model size or training computation.

pdf
Reimagining Retrieval Augmented Language Models for Answering Queries
Wang-Chiew Tan | Yuliang Li | Pedro Rodriguez | Richard James | Xi Victoria Lin | Alon Halevy | Wen-tau Yih
Findings of the Association for Computational Linguistics: ACL 2023

We present a reality check on large language models and inspect the promise of retrieval-augmented language models in comparison. Such language models are semi-parametric, where models integrate model parameters and knowledge from external data sources to make their predictions, as opposed to the parametric nature of vanilla large language models. We give initial experimental findings that semi-parametric architectures can be enhanced with views, a query analyzer/planner, and provenance to make a significantly more powerful system for question answering in terms of accuracy and efficiency, and potentially for other NLP tasks.

2022

pdf
Few-shot Learning with Multilingual Generative Language Models
Xi Victoria Lin | Todor Mihaylov | Mikel Artetxe | Tianlu Wang | Shuohui Chen | Daniel Simig | Myle Ott | Naman Goyal | Shruti Bhosale | Jingfei Du | Ramakanth Pasunuru | Sam Shleifer | Punit Singh Koura | Vishrav Chaudhary | Brian O’Horo | Jeff Wang | Luke Zettlemoyer | Zornitsa Kozareva | Mona Diab | Veselin Stoyanov | Xian Li
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Large-scale generative language models such as GPT-3 are competitive few-shot learners. While these models are known to be able to jointly represent many different languages, their training data is dominated by English, potentially limiting their cross-lingual generalization. In this work, we train multilingual generative language models on a corpus covering a diverse set of languages, and study their few- and zero-shot learning capabilities in a wide range of tasks. Our largest model with 7.5 billion parameters sets new state of the art in few-shot learning in more than 20 representative languages, outperforming GPT-3 of comparable size in multilingual commonsense reasoning (with +7.4% absolute accuracy improvement in 0-shot settings and +9.4% in 4-shot settings) and natural language inference (+5.4% in each of 0-shot and 4-shot settings). On the FLORES-101 machine translation benchmark, our model outperforms GPT-3 on 171 out of 182 directions with 32 training examples, while surpassing the official supervised baseline in 45 directions. We conduct an in-depth analysis of different multilingual prompting approaches, showing in particular that strong few-shot learning performance across languages can be achieved via cross-lingual transfer through both templates and demonstration examples.


Efficient Large Scale Language Modeling with Mixtures of Experts
Mikel Artetxe | Shruti Bhosale | Naman Goyal | Todor Mihaylov | Myle Ott | Sam Shleifer | Xi Victoria Lin | Jingfei Du | Srinivasan Iyer | Ramakanth Pasunuru | Giridharan Anantharaman | Xian Li | Shuohui Chen | Halil Akin | Mandeep Baines | Louis Martin | Xing Zhou | Punit Singh Koura | Brian O’Horo | Jeffrey Wang | Luke Zettlemoyer | Mona Diab | Zornitsa Kozareva | Veselin Stoyanov
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Mixture of Experts layers (MoEs) enable efficient scaling of language models through conditional computation. This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings: in- and out-of-domain language modeling, zero- and few-shot priming, and full-shot fine-tuning. With the exception of fine-tuning, we find MoEs to be substantially more compute efficient. At more modest training budgets, MoEs can match the performance of dense models using ~4 times less compute. This gap narrows at scale, but our largest MoE model (1.1T parameters) consistently outperforms a compute-equivalent dense model (6.7B parameters). Overall, this performance gap varies greatly across tasks and domains, suggesting that MoE and dense models generalize differently in ways that are worthy of future study. We make our code and models publicly available for research use.

pdf bib
FeTaQA: Free-form Table Question Answering
Linyong Nan | Chiachun Hsieh | Ziming Mao | Xi Victoria Lin | Neha Verma | Rui Zhang | Wojciech Kryściński | Hailey Schoelkopf | Riley Kong | Xiangru Tang | Mutethia Mutuma | Ben Rosand | Isabel Trindade | Renusree Bandaru | Jacob Cunningham | Caiming Xiong | Dragomir Radev | Dragomir Radev
Transactions of the Association for Computational Linguistics, Volume 10

Existing table question answering datasets contain abundant factual questions that primarily evaluate a QA system’s comprehension of query and tabular data. However, restricted by their short-form answers, these datasets fail to include question–answer interactions that represent more advanced and naturally occurring information needs: questions that ask for reasoning and integration of information pieces retrieved from a structured knowledge source. To complement the existing datasets and to reveal the challenging nature of the table-based question answering task, we introduce FeTaQA, a new dataset with 10K Wikipedia-based table, question, free-form answer, supporting table cells pairs. FeTaQA is collected from noteworthy descriptions of Wikipedia tables that contain information people tend to seek; generation of these descriptions requires advanced processing that humans perform on a daily basis: Understand the question and table, retrieve, integrate, infer, and conduct text planning and surface realization to generate an answer. We provide two benchmark methods for the proposed task: a pipeline method based on semantic parsing-based QA systems and an end-to-end method based on large pretrained text generation models, and show that FeTaQA poses a challenge for both methods.

2021

pdf bib
Stage-wise Fine-tuning for Graph-to-Text Generation
Qingyun Wang | Semih Yavuz | Xi Victoria Lin | Heng Ji | Nazneen Rajani
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop

Graph-to-text generation has benefited from pre-trained language models (PLMs) in achieving better performance than structured graph encoders. However, they fail to fully utilize the structure information of the input graph. In this paper, we aim to further improve the performance of the pre-trained language model by proposing a structured graph-to-text model with a two-step fine-tuning mechanism which first fine-tunes model on Wikipedia before adapting to the graph-to-text generation. In addition to using the traditional token and position embeddings to encode the knowledge graph (KG), we propose a novel tree-level embedding method to capture the inter-dependency structures of the input graph. This new approach has significantly improved the performance of all text generation metrics for the English WebNLG 2017 dataset.

pdf
Testing Cross-Database Semantic Parsers With Canonical Utterances
Heather Lent | Semih Yavuz | Tao Yu | Tong Niu | Yingbo Zhou | Dragomir Radev | Xi Victoria Lin
Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems

The benchmark performance of cross-database semantic parsing has climbed steadily in recent years, catalyzed by the wide adoption of pre-trained language models. Yet existing work have shown that state-of-the-art cross-database semantic parsers struggle to generalize to novel user utterances, databases and query structures. To obtain transparent details on the strengths and limitation of these models, we propose a diagnostic testing approach based on controlled synthesis of canonical natural language and SQL pairs. Inspired by the CheckList, we characterize a set of essential capabilities for cross-database semantic parsing models, and detailed the method for synthesizing the corresponding test data. We evaluated a variety of high performing models using the proposed approach, and identified several non-obvious weaknesses across models (e.g. unable to correctly select many columns). Our dataset and code are released as a test suite at http://github.com/hclent/BehaviorCheckingSemPar.

pdf
DART: Open-Domain Structured Data Record to Text Generation
Linyong Nan | Dragomir Radev | Rui Zhang | Amrit Rau | Abhinand Sivaprasad | Chiachun Hsieh | Xiangru Tang | Aadit Vyas | Neha Verma | Pranav Krishna | Yangxiaokang Liu | Nadia Irwanto | Jessica Pan | Faiaz Rahman | Ahmad Zaidi | Mutethia Mutuma | Yasin Tarabar | Ankit Gupta | Tao Yu | Yi Chern Tan | Xi Victoria Lin | Caiming Xiong | Richard Socher | Nazneen Fatema Rajani
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We present DART, an open domain structured DAta Record to Text generation dataset with over 82k instances (DARTs). Data-to-text annotations can be a costly process, especially when dealing with tables which are the major source of structured data and contain nontrivial structures. To this end, we propose a procedure of extracting semantic triples from tables that encodes their structures by exploiting the semantic dependencies among table headers and the table title. Our dataset construction framework effectively merged heterogeneous sources from open domain semantic parsing and spoken dialogue systems by utilizing techniques including tree ontology annotation, question-answer pair to declarative sentence conversion, and predicate unification, all with minimum post-editing. We present systematic evaluation on DART as well as new state-of-the-art results on WebNLG 2017 to show that DART (1) poses new challenges to existing data-to-text datasets and (2) facilitates out-of-domain generalization. Our data and code can be found at https://github.com/Yale-LILY/dart.

pdf
Learning to Synthesize Data for Semantic Parsing
Bailin Wang | Wenpeng Yin | Xi Victoria Lin | Caiming Xiong
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Synthesizing data for semantic parsing has gained increasing attention recently. However, most methods require handcrafted (high-precision) rules in their generative process, hindering the exploration of diverse unseen data. In this work, we propose a generative model which features a (non-neural) PCFG that models the composition of programs (e.g., SQL), and a BART-based translation model that maps a program to an utterance. Due to the simplicity of PCFG and pre-trained BART, our generative model can be efficiently learned from existing data at hand. Moreover, explicitly modeling compositions using PCFG leads to better exploration of unseen programs, thus generate more diverse data. We evaluate our method in both in-domain and out-of-domain settings of text-to-SQL parsing on the standard benchmarks of GeoQuery and Spider, respectively. Our empirical results show that the synthesized data generated from our model can substantially help a semantic parser achieve better compositional and domain generalization.

pdf bib
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations
Avi Sil | Xi Victoria Lin
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations

2020

pdf
Double-Hard Debias: Tailoring Word Embeddings for Gender Bias Mitigation
Tianlu Wang | Xi Victoria Lin | Nazneen Fatema Rajani | Bryan McCann | Vicente Ordonez | Caiming Xiong
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Word embeddings derived from human-generated corpora inherit strong gender bias which can be further amplified by downstream models. Some commonly adopted debiasing approaches, including the seminal Hard Debias algorithm, apply post-processing procedures that project pre-trained word embeddings into a subspace orthogonal to an inferred gender subspace. We discover that semantic-agnostic corpus regularities such as word frequency captured by the word embeddings negatively impact the performance of these algorithms. We propose a simple but effective technique, Double Hard Debias, which purifies the word embeddings against such corpus regularities prior to inferring and removing the gender subspace. Experiments on three bias mitigation benchmarks show that our approach preserves the distributional semantics of the pre-trained word embeddings while reducing gender bias to a significantly larger degree than prior approaches.

pdf
Photon: A Robust Cross-Domain Text-to-SQL System
Jichuan Zeng | Xi Victoria Lin | Steven C.H. Hoi | Richard Socher | Caiming Xiong | Michael Lyu | Irwin King
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

Natural language interfaces to databases(NLIDB) democratize end user access to relational data. Due to fundamental differences between natural language communication and programming, it is common for end users to issue questions that are ambiguous to the system or fall outside the semantic scope of its underlying query language. We present PHOTON, a robust, modular, cross-domain NLIDB that can flag natural language input to which a SQL mapping cannot be immediately determined. PHOTON consists of a strong neural semantic parser (63.2% structure accuracy on the Spider dev benchmark), a human-in-the-loop question corrector, a SQL executor and a response generator. The question corrector isa discriminative neural sequence editor which detects confusion span(s) in the input question and suggests rephrasing until a translatable input is given by the user or a maximum number of iterations are conducted. Experiments on simulated data show that the proposed method effectively improves the robustness of text-to-SQL system against untranslatable user input. The live demo of our system is available at http://www.naturalsql.com

pdf bib
Proceedings of the First Workshop on Interactive and Executable Semantic Parsing
Ben Bogin | Srinivasan Iyer | Xi Victoria Lin | Dragomir Radev | Alane Suhr | Panupong | Caiming Xiong | Pengcheng Yin | Tao Yu | Rui Zhang | Victor Zhong
Proceedings of the First Workshop on Interactive and Executable Semantic Parsing

pdf
ColloQL: Robust Text-to-SQL Over Search Queries
Karthik Radhakrishnan | Arvind Srikantan | Xi Victoria Lin
Proceedings of the First Workshop on Interactive and Executable Semantic Parsing

Translating natural language utterances to executable queries is a helpful technique in making the vast amount of data stored in relational databases accessible to a wider range of non-tech-savvy end users. Prior work in this area has largely focused on textual input that is linguistically correct and semantically unambiguous. However, real-world user queries are often succinct, colloquial, and noisy, resembling the input of a search engine. In this work, we introduce data augmentation techniques and a sampling-based content-aware BERT model (ColloQL) to achieve robust text-to-SQL modeling over natural language search (NLS) questions. Due to the lack of evaluation data, we curate a new dataset of NLS questions and demonstrate the efficacy of our approach. ColloQL’s superior performance extends to well-formed text, achieving an 84.9% (logical) and 90.7% (execution) accuracy on the WikiSQL dataset, making it, to the best of our knowledge, the highest performing model that does not use execution guided decoding.

pdf
Bridging Textual and Tabular Data for Cross-Domain Text-to-SQL Semantic Parsing
Xi Victoria Lin | Richard Socher | Caiming Xiong
Findings of the Association for Computational Linguistics: EMNLP 2020

We present BRIDGE, a powerful sequential architecture for modeling dependencies between natural language questions and relational databases in cross-DB semantic parsing. BRIDGE represents the question and DB schema in a tagged sequence where a subset of the fields are augmented with cell values mentioned in the question. The hybrid sequence is encoded by BERT with minimal subsequent layers and the text-DB contextualization is realized via the fine-tuned deep attention in BERT. Combined with a pointer-generator decoder with schema-consistency driven search space pruning, BRIDGE attained state-of-the-art performance on the well-studied Spider benchmark (65.5% dev, 59.2% test), despite being much simpler than most recently proposed models for this task. Our analysis shows that BRIDGE effectively captures the desired cross-modal dependencies and has the potential to generalize to more text-DB related tasks. Our model implementation is available at https://github.com/salesforce/TabularSemanticParsing.

2019

pdf
SParC: Cross-Domain Semantic Parsing in Context
Tao Yu | Rui Zhang | Michihiro Yasunaga | Yi Chern Tan | Xi Victoria Lin | Suyi Li | Heyang Er | Irene Li | Bo Pang | Tao Chen | Emily Ji | Shreya Dixit | David Proctor | Sungrok Shim | Jonathan Kraft | Vincent Zhang | Caiming Xiong | Richard Socher | Dragomir Radev
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We present SParC, a dataset for cross-domainSemanticParsing inContext that consists of 4,298 coherent question sequences (12k+ individual questions annotated with SQL queries). It is obtained from controlled user interactions with 200 complex databases over 138 domains. We provide an in-depth analysis of SParC and show that it introduces new challenges compared to existing datasets. SParC demonstrates complex contextual dependencies, (2) has greater semantic diversity, and (3) requires generalization to unseen domains due to its cross-domain nature and the unseen databases at test time. We experiment with two state-of-the-art text-to-SQL models adapted to the context-dependent, cross-domain setup. The best model obtains an exact match accuracy of 20.2% over all questions and less than10% over all interaction sequences, indicating that the cross-domain setting and the con-textual phenomena of the dataset present significant challenges for future research. The dataset, baselines, and leaderboard are released at https://yale-lily.github.io/sparc.

pdf
CoSQL: A Conversational Text-to-SQL Challenge Towards Cross-Domain Natural Language Interfaces to Databases
Tao Yu | Rui Zhang | Heyang Er | Suyi Li | Eric Xue | Bo Pang | Xi Victoria Lin | Yi Chern Tan | Tianze Shi | Zihan Li | Youxuan Jiang | Michihiro Yasunaga | Sungrok Shim | Tao Chen | Alexander Fabbri | Zifan Li | Luyao Chen | Yuwen Zhang | Shreya Dixit | Vincent Zhang | Caiming Xiong | Richard Socher | Walter Lasecki | Dragomir Radev
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

We present CoSQL, a corpus for building cross-domain, general-purpose database (DB) querying dialogue systems. It consists of 30k+ turns plus 10k+ annotated SQL queries, obtained from a Wizard-of-Oz (WOZ) collection of 3k dialogues querying 200 complex DBs spanning 138 domains. Each dialogue simulates a real-world DB query scenario with a crowd worker as a user exploring the DB and a SQL expert retrieving answers with SQL, clarifying ambiguous questions, or otherwise informing of unanswerable questions. When user questions are answerable by SQL, the expert describes the SQL and execution results to the user, hence maintaining a natural interaction flow. CoSQL introduces new challenges compared to existing task-oriented dialogue datasets: (1) the dialogue states are grounded in SQL, a domain-independent executable representation, instead of domain-specific slot value pairs, and (2) because testing is done on unseen databases, success requires generalizing to new domains. CoSQL includes three tasks: SQL-grounded dialogue state tracking, response generation from query results, and user dialogue act prediction. We evaluate a set of strong baselines for each task and show that CoSQL presents significant challenges for future research. The dataset, baselines, and leaderboard will be released at https://yale-lily.github.io/cosql.

pdf
Editing-Based SQL Query Generation for Cross-Domain Context-Dependent Questions
Rui Zhang | Tao Yu | Heyang Er | Sungrok Shim | Eric Xue | Xi Victoria Lin | Tianze Shi | Caiming Xiong | Richard Socher | Dragomir Radev
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

We focus on the cross-domain context-dependent text-to-SQL generation task. Based on the observation that adjacent natural language questions are often linguistically dependent and their corresponding SQL queries tend to overlap, we utilize the interaction history by editing the previous predicted query to improve the generation quality. Our editing mechanism views SQL as sequences and reuses generation results at the token level in a simple manner. It is flexible to change individual tokens and robust to error propagation. Furthermore, to deal with complex table structures in different domains, we employ an utterance-table encoder and a table-aware decoder to incorporate the context of the user utterance and the table schema. We evaluate our approach on the SParC dataset and demonstrate the benefit of editing compared with the state-of-the-art baselines which generate SQL from scratch. Our code is available at https://github.com/ryanzhumich/sparc_atis_pytorch.

2018

pdf
NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System
Xi Victoria Lin | Chenglong Wang | Luke Zettlemoyer | Michael D. Ernst
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf
Multi-Hop Knowledge Graph Reasoning with Reward Shaping
Xi Victoria Lin | Richard Socher | Caiming Xiong
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Multi-hop reasoning is an effective approach for query answering (QA) over incomplete knowledge graphs (KGs). The problem can be formulated in a reinforcement learning (RL) setup, where a policy-based agent sequentially extends its inference path until it reaches a target. However, in an incomplete KG environment, the agent receives low-quality rewards corrupted by false negatives in the training data, which harms generalization at test time. Furthermore, since no golden action sequence is used for training, the agent can be misled by spurious search trajectories that incidentally lead to the correct answer. We propose two modeling advances to address both issues: (1) we reduce the impact of false negative supervision by adopting a pretrained one-hop embedding model to estimate the reward of unobserved facts; (2) we counter the sensitivity to spurious paths of on-policy RL by forcing the agent to explore a diverse set of paths using randomly generated edge masks. Our approach significantly improves over existing path-based KGQA models on several benchmark datasets and is comparable or better than embedding-based models.

2016

pdf
Compositional Learning of Embeddings for Relation Paths in Knowledge Base and Text
Kristina Toutanova | Xi Victoria Lin | Wen-tau Yih | Hoifung Poon | Chris Quirk
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Search
Co-authors