Abstract
Recent years have witnessed the burgeoning of pretrained language models (LMs) for text-based natural language (NL) understanding tasks. Such models are typically trained on free-form NL text, hence may not be suitable for tasks like semantic parsing over structured data, which require reasoning over both free-form NL questions and structured tabular data (e.g., database tables). In this paper we present TaBERT, a pretrained LM that jointly learns representations for NL sentences and (semi-)structured tables. TaBERT is trained on a large corpus of 26 million tables and their English contexts. In experiments, neural semantic parsers using TaBERT as feature representation layers achieve new best results on the challenging weakly-supervised semantic parsing benchmark WikiTableQuestions, while performing competitively on the text-to-SQL dataset Spider.
- Anthology ID: 2020.acl-main.745
- Volume: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
- Month: July
- Year: 2020
- Address: Online
- Venue: ACL
- Publisher: Association for Computational Linguistics
- Pages: 8413–8426
- URL: https://aclanthology.org/2020.acl-main.745
- DOI: 10.18653/v1/2020.acl-main.745
- Cite (ACL): Pengcheng Yin, Graham Neubig, Wen-tau Yih, and Sebastian Riedel. 2020. TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8413–8426, Online. Association for Computational Linguistics.
- Cite (Informal): TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data (Yin et al., ACL 2020)
- PDF: https://preview.aclanthology.org/starsem-semeval-split/2020.acl-main.745.pdf
- Code: facebookresearch/tabert (see the usage sketch below)
- Data: WikiTableQuestions
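The abstract describes using TaBERT as a feature representation layer that jointly encodes an NL utterance and a table for a downstream semantic parser. The snippet below is a minimal sketch of that usage, adapted from the linked facebookresearch/tabert repository's documented interface as I recall it (the `table_bert` package with `TableBertModel`, `Table`, `Column`, and `model.encode`); the checkpoint path, example table, and question are placeholders, and the exact API should be verified against the current code.

```python
# Minimal sketch: encode an NL question together with a table using a
# pretrained TaBERT checkpoint, then feed the resulting vectors to a parser.
# Assumes the `table_bert` package from facebookresearch/tabert is installed;
# the checkpoint path and example data below are hypothetical placeholders.
from table_bert import TableBertModel, Table, Column

model = TableBertModel.from_pretrained('path/to/tabert_base_k3/model.bin')

# A (semi-)structured table: column names, column types, and content rows.
table = Table(
    id='List of countries by GDP (PPP)',
    header=[
        Column('Nation', 'text', sample_value='United States'),
        Column('Gross Domestic Product', 'real', sample_value='21,439,453'),
    ],
    data=[
        ['United States', '21,439,453'],
        ['China', '27,308,857'],
    ],
).tokenize(model.tokenizer)

# The NL utterance is tokenized with the model's own (BERT-style) tokenizer.
context = 'show me countries ranked by GDP'

# encode() takes batched inputs and returns contextual representations for the
# utterance tokens and for each column, which a semantic parser consumes as features.
context_encoding, column_encoding, info_dict = model.encode(
    contexts=[model.tokenizer.tokenize(context)],
    tables=[table],
)

print(context_encoding.shape)  # (batch_size, context_length, hidden_dim)
print(column_encoding.shape)   # (batch_size, num_columns, hidden_dim)
```

In the paper's experiments, these utterance and column encodings replace the input representation layers of existing neural semantic parsers for WikiTableQuestions and Spider; the sketch above only illustrates the encoding step, not a full parser.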