HiTab: A Hierarchical Table Dataset for Question Answering and Natural Language Generation

Zhoujun Cheng; Haoyu Dong; Zhiruo Wang; Ran Jia; Jiaqi Guo; Yan Gao; Shi Han; Jian-Guang Lou; Dongmei Zhang

doi:10.18653/v1/2022.acl-long.78

Abstract

Tables are often created with hierarchies, but existing work on table reasoning mainly focuses on flat tables and neglects hierarchical ones. Hierarchical tables challenge numerical reasoning through complex hierarchical indexing, as well as through implicit relationships of calculation and semantics. We present a new dataset, HiTab, to study question answering (QA) and natural language generation (NLG) over hierarchical tables. HiTab is a cross-domain dataset constructed from a wealth of statistical reports and Wikipedia pages, and has three unique characteristics: (1) nearly all tables are hierarchical; (2) QA pairs are not proposed by annotators from scratch, but are revised from real and meaningful sentences authored by analysts; and (3) fine-grained annotations of quantity and entity alignment are provided to reveal complex numerical reasoning in statistical reports. Experiments suggest that HiTab presents a strong challenge for existing baselines and is a valuable benchmark for future research. Targeting hierarchical structure, we devise a hierarchy-aware logical form for symbolic reasoning over tables, which proves highly effective. Targeting table reasoning, we leverage entity and quantity alignment to explore partially supervised training in QA and conditional generation in NLG, which largely reduces spurious predictions in QA and produces better descriptions in NLG.
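To make "hierarchical indexing" concrete, the following is a minimal Python sketch of a toy hierarchical table in which each cell is addressed by a path of row headers and a path of column headers. It is an illustration only: the data layout, the `select` and `subtotal` helpers, and all header names and numbers are assumptions invented for this example, not HiTab's actual data format or the paper's logical form.

```python
# Toy hierarchical table (hypothetical data, not taken from HiTab).
# Each key is (row_header_path, column_header_path); each value is a cell.
cells = {
    (("Europe", "France"), ("2020", "Revenue")): 10.0,
    (("Europe", "France"), ("2020", "Cost")): 4.0,
    (("Europe", "Germany"), ("2020", "Revenue")): 12.0,
    (("Europe", "Germany"), ("2020", "Cost")): 5.0,
    (("Asia", "Japan"), ("2020", "Revenue")): 8.0,
    (("Asia", "Japan"), ("2020", "Cost")): 3.0,
}


def select(table, row_prefix=(), col_prefix=()):
    """Return the cells whose header paths start with the given prefixes."""
    return {
        key: value
        for key, value in table.items()
        if key[0][:len(row_prefix)] == row_prefix
        and key[1][:len(col_prefix)] == col_prefix
    }


def subtotal(table, row_prefix, col_path):
    """Sum the cells under one row header for a given column header path."""
    return sum(select(table, row_prefix, col_path).values())


# A question such as "What was the 2020 revenue of Europe?" requires
# resolving a non-leaf row header and aggregating over its children.
europe_revenue = subtotal(cells, ("Europe",), ("2020", "Revenue"))
```

In a flat table the same question would be a single cell lookup; here the answer depends on the header hierarchy and on an implicit summation, which is the kind of reasoning the abstract describes.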

Anthology ID: 2022.acl-long.78
Volume: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: May
Year: 2022
Address: Dublin, Ireland
Editors: Smaranda Muresan, Preslav Nakov, Aline Villavicencio
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 1094–1110
URL: https://aclanthology.org/2022.acl-long.78
DOI: 10.18653/v1/2022.acl-long.78
Cite (ACL): Zhoujun Cheng, Haoyu Dong, Zhiruo Wang, Ran Jia, Jiaqi Guo, Yan Gao, Shi Han, Jian-Guang Lou, and Dongmei Zhang. 2022. HiTab: A Hierarchical Table Dataset for Question Answering and Natural Language Generation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1094–1110, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal): HiTab: A Hierarchical Table Dataset for Question Answering and Natural Language Generation (Cheng et al., ACL 2022)
PDF: https://preview.aclanthology.org/nschneid-patch-1/2022.acl-long.78.pdf
Software: 2022.acl-long.78.software.zip
Code: microsoft/hitab
Data: FinQA, TAT-QA, ToTTo, WikiSQL
