Abstract
Large pre-trained language models for textual data have an unconstrained output space; at each decoding step, they can produce any of 10,000s of sub-word tokens. When fine-tuned to target constrained formal languages like SQL, these models often generate invalid code, rendering it unusable. We propose PICARD (code available at https://github.com/ElementAI/picard), a method for constraining auto-regressive decoders of language models through incremental parsing. PICARD helps to find valid output sequences by rejecting inadmissible tokens at each decoding step. On the challenging Spider and CoSQL text-to-SQL translation tasks, we show that PICARD transforms fine-tuned T5 models with passable performance into state-of-the-art solutions.
- Anthology ID:
- 2021.emnlp-main.779
- Volume:
- Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
- Month:
- November
- Year:
- 2021
- Address:
- Online and Punta Cana, Dominican Republic
- Editors:
- Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 9895–9901
- URL:
- https://aclanthology.org/2021.emnlp-main.779
- DOI:
- 10.18653/v1/2021.emnlp-main.779
- Cite (ACL):
- Torsten Scholak, Nathan Schucher, and Dzmitry Bahdanau. 2021. PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 9895–9901, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Cite (Informal):
- PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models (Scholak et al., EMNLP 2021)
- PDF:
- https://preview.aclanthology.org/ingest-acl-2023-videos/2021.emnlp-main.779.pdf
- Code:
- ElementAI/picard
- Data:
- CoSQL, Spider-Realistic
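
The abstract describes rejecting inadmissible tokens at each decoding step by parsing the output incrementally. The following is a minimal sketch of that idea, not the PICARD implementation: the toy grammar (`select <column> from <table>`), the schema, the vocabulary, and the names `is_valid_prefix`, `constrained_greedy_decode`, and `toy_scores` are all hypothetical, and greedy decoding is used here only to keep the example short.

```python
from typing import Callable, Dict, List

# Hypothetical toy schema and end-of-sequence marker (assumptions, not from the paper).
COLUMNS = {"name", "age"}
TABLES = {"singer"}
EOS = "<eos>"

# The toy grammar: select <column> from <table> <eos>
EXPECTED = [{"select"}, COLUMNS, {"from"}, TABLES, {EOS}]


def is_valid_prefix(tokens: List[str]) -> bool:
    """Incremental check: can this prefix still be extended to a valid query?"""
    if len(tokens) > len(EXPECTED):
        return False
    return all(tok in allowed for tok, allowed in zip(tokens, EXPECTED))


def constrained_greedy_decode(
    score_fn: Callable[[List[str]], Dict[str, float]],
    max_steps: int = 10,
) -> List[str]:
    """Greedy decoding that drops any token the incremental check rejects."""
    prefix: List[str] = []
    for _ in range(max_steps):
        scores = score_fn(prefix)
        # Keep only tokens whose addition leaves the prefix parseable.
        admissible = {
            tok: score
            for tok, score in scores.items()
            if is_valid_prefix(prefix + [tok])
        }
        if not admissible:
            break  # no admissible continuation; stop with what we have
        best = max(admissible, key=admissible.get)
        prefix.append(best)
        if best == EOS:
            break
    return prefix


def toy_scores(prefix: List[str]) -> Dict[str, float]:
    """Stand-in for a fine-tuned model; it deliberately prefers invalid tokens."""
    return {
        "select": 0.5,
        "nme": 0.9,      # misspelled column, scored highest
        "name": 0.8,
        "age": 0.1,
        "from": 0.5,
        "singers": 0.9,  # table that is not in the schema
        "singer": 0.7,
        EOS: 0.2,
    }


if __name__ == "__main__":
    print(" ".join(constrained_greedy_decode(toy_scores)))
    # prints: select name from singer <eos>
```

Without the filter, this toy scorer would pick `nme` at every step. With it, the highest-scoring admissible token is chosen instead, so the output stays within the grammar and the schema.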