Chiachun Hsieh


2022

pdf bib
FeTaQA: Free-form Table Question Answering
Linyong Nan | Chiachun Hsieh | Ziming Mao | Xi Victoria Lin | Neha Verma | Rui Zhang | Wojciech Kryściński | Hailey Schoelkopf | Riley Kong | Xiangru Tang | Mutethia Mutuma | Ben Rosand | Isabel Trindade | Renusree Bandaru | Jacob Cunningham | Caiming Xiong | Dragomir Radev | Dragomir Radev
Transactions of the Association for Computational Linguistics, Volume 10

Existing table question answering datasets contain abundant factual questions that primarily evaluate a QA system’s comprehension of query and tabular data. However, restricted by their short-form answers, these datasets fail to include question–answer interactions that represent more advanced and naturally occurring information needs: questions that ask for reasoning and integration of information pieces retrieved from a structured knowledge source. To complement the existing datasets and to reveal the challenging nature of the table-based question answering task, we introduce FeTaQA, a new dataset with 10K Wikipedia-based table, question, free-form answer, supporting table cells pairs. FeTaQA is collected from noteworthy descriptions of Wikipedia tables that contain information people tend to seek; generation of these descriptions requires advanced processing that humans perform on a daily basis: Understand the question and table, retrieve, integrate, infer, and conduct text planning and surface realization to generate an answer. We provide two benchmark methods for the proposed task: a pipeline method based on semantic parsing-based QA systems and an end-to-end method based on large pretrained text generation models, and show that FeTaQA poses a challenge for both methods.

2021

pdf
DART: Open-Domain Structured Data Record to Text Generation
Linyong Nan | Dragomir Radev | Rui Zhang | Amrit Rau | Abhinand Sivaprasad | Chiachun Hsieh | Xiangru Tang | Aadit Vyas | Neha Verma | Pranav Krishna | Yangxiaokang Liu | Nadia Irwanto | Jessica Pan | Faiaz Rahman | Ahmad Zaidi | Mutethia Mutuma | Yasin Tarabar | Ankit Gupta | Tao Yu | Yi Chern Tan | Xi Victoria Lin | Caiming Xiong | Richard Socher | Nazneen Fatema Rajani
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We present DART, an open domain structured DAta Record to Text generation dataset with over 82k instances (DARTs). Data-to-text annotations can be a costly process, especially when dealing with tables which are the major source of structured data and contain nontrivial structures. To this end, we propose a procedure of extracting semantic triples from tables that encodes their structures by exploiting the semantic dependencies among table headers and the table title. Our dataset construction framework effectively merged heterogeneous sources from open domain semantic parsing and spoken dialogue systems by utilizing techniques including tree ontology annotation, question-answer pair to declarative sentence conversion, and predicate unification, all with minimum post-editing. We present systematic evaluation on DART as well as new state-of-the-art results on WebNLG 2017 to show that DART (1) poses new challenges to existing data-to-text datasets and (2) facilitates out-of-domain generalization. Our data and code can be found at https://github.com/Yale-LILY/dart.