Abstract
Current supervised parsers are limited by the size of their labelled training data, making improving them with unlabelled data an important goal. We show how a state-of-the-art CCG parser can be enhanced by predicting lexical categories using unsupervised vector-space embeddings of words. The use of word embeddings enables our model to better generalize from the labelled data, and allows us to accurately assign lexical categories without depending on a POS-tagger. Our approach leads to substantial improvements in dependency parsing results over the standard supervised CCG parser when evaluated on Wall Street Journal (0.8%), Wikipedia (1.8%) and biomedical (3.4%) text. We compare the performance of two recently proposed approaches for classification using a wide variety of word embeddings. We also give a detailed error analysis demonstrating where using embeddings outperforms traditional feature sets, and showing how including POS features can decrease accuracy.
- Anthology ID: Q14-1026
- Volume: Transactions of the Association for Computational Linguistics, Volume 2
- Year: 2014
- Address: Cambridge, MA
- Venue: TACL
- Publisher: MIT Press
- Pages: 327–338
- URL: https://aclanthology.org/Q14-1026
- DOI: 10.1162/tacl_a_00186
- Cite (ACL): Mike Lewis and Mark Steedman. 2014. Improved CCG Parsing with Semi-supervised Supertagging. Transactions of the Association for Computational Linguistics, 2:327–338.
- Cite (Informal): Improved CCG Parsing with Semi-supervised Supertagging (Lewis & Steedman, TACL 2014)
- PDF: https://preview.aclanthology.org/nodalida-main-page/Q14-1026.pdf
- Data: Penn Treebank