Abstract
To approximately parse an unfamiliar language, it helps to have a treebank of a similar language. But what if the closest available treebank still has the wrong word order? We show how to (stochastically) permute the constituents of an existing dependency treebank so that its surface part-of-speech statistics approximately match those of the target language. The parameters of the permutation model can be evaluated for quality by dynamic programming and tuned by gradient descent (up to a local optimum). This optimization procedure yields trees for a new artificial language that resembles the target language. We show that delexicalized parsers for the target language can be successfully trained using such “made to order” artificial languages.
- Anthology ID:
- D18-1163
- Volume:
- Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
- Month:
- October-November
- Year:
- 2018
- Address:
- Brussels, Belgium
- Editors:
- Ellen Riloff, David Chiang, Julia Hockenmaier, Jun’ichi Tsujii
- Venue:
- EMNLP
- SIG:
- SIGDAT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 1325–1337
- URL:
- https://aclanthology.org/D18-1163
- DOI:
- 10.18653/v1/D18-1163
- Cite (ACL):
- Dingquan Wang and Jason Eisner. 2018. Synthetic Data Made to Order: The Case of Parsing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1325–1337, Brussels, Belgium. Association for Computational Linguistics.
- Cite (Informal):
- Synthetic Data Made to Order: The Case of Parsing (Wang & Eisner, EMNLP 2018)
- PDF:
- https://preview.aclanthology.org/teach-a-man-to-fish/D18-1163.pdf
- Code
- wddabc/ordersynthetic
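
The abstract above describes permuting the dependents of each head in a dependency treebank so that the surface part-of-speech statistics of the permuted trees approximately match the target language. The toy Python sketch below illustrates only that general idea: it stochastically reorders each head with its dependents and scores the resulting POS sequence under a target-language POS bigram model, keeping the best sample. All names, the bigram scoring, and the sampling loop are illustrative assumptions; the paper's actual method uses a parameterized permutation model evaluated by dynamic programming and tuned by gradient descent (see the linked code repository).

```python
import math
import random
from collections import defaultdict

def pos_bigram_logprob(pos_seq, bigram_logprobs):
    """Score a POS sequence under a (pre-estimated) target-language bigram model."""
    total, prev = 0.0, "<s>"
    for pos in pos_seq + ["</s>"]:
        total += bigram_logprobs[(prev, pos)]
        prev = pos
    return total

def linearize(tree, head, order_fn):
    """Recursively linearize a dependency tree.

    tree: dict mapping each head index to a list of dependent indices.
    head: index of the current subtree root.
    order_fn: decides the relative order of a head and its dependents.
    """
    ordered = order_fn(head, tree.get(head, []))  # permutation of [head] + dependents
    result = []
    for node in ordered:
        if node == head:
            result.append(head)
        else:
            result.extend(linearize(tree, node, order_fn))
    return result

def random_order(head, children):
    """Stochastically permute a head together with its dependents (toy stand-in
    for a learned permutation model)."""
    nodes = [head] + list(children)
    random.shuffle(nodes)
    return nodes

# Toy example: a 4-word sentence with POS tags and a head -> dependents map.
pos_tags = {0: "DET", 1: "NOUN", 2: "VERB", 3: "NOUN"}  # keyed by word index
tree = {2: [1, 3], 1: [0]}                              # VERB heads both NOUNs

# A tiny hypothetical "target language" POS bigram model (log-probabilities),
# here preferring verb-initial order; unseen bigrams get a small default.
bigram_logprobs = defaultdict(lambda: math.log(0.05))
bigram_logprobs[("<s>", "VERB")] = math.log(0.4)
bigram_logprobs[("VERB", "NOUN")] = math.log(0.4)

# Sample several permutations of the tree and keep the one whose surface POS
# sequence best matches the target-language model.
best = None
for _ in range(20):
    order = linearize(tree, 2, random_order)
    score = pos_bigram_logprob([pos_tags[i] for i in order], bigram_logprobs)
    if best is None or score > best[0]:
        best = (score, order)

print("best permutation (word indices):", best[1], "log-prob:", best[0])
```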