Unsupervised Acquisition of Comprehensive Multiword Lexicons using Competition in an n-gram Lattice

Julian Brooke, Jan Šnajder, Timothy Baldwin


Abstract
We present a new model for acquiring comprehensive multiword lexicons from large corpora based on competition among n-gram candidates. In contrast to the standard approach of simple ranking by association measure, in our model n-grams are arranged in a lattice structure based on subsumption and overlap relationships, with nodes inhibiting other nodes in their vicinity when they are selected as a lexical item. We show how the configuration of such a lattice can be optimized tractably, and demonstrate using annotations of sampled n-grams that our method consistently outperforms alternatives by at least 0.05 F-score across several corpora and languages.
Anthology ID:
Q17-1032
Volume:
Transactions of the Association for Computational Linguistics, Volume 5
Year:
2017
Address:
Cambridge, MA
Editors:
Lillian Lee, Mark Johnson, Kristina Toutanova
Venue:
TACL
Publisher:
MIT Press
Pages:
455–470
URL:
https://aclanthology.org/Q17-1032
DOI:
10.1162/tacl_a_00073
Cite (ACL):
Julian Brooke, Jan Šnajder, and Timothy Baldwin. 2017. Unsupervised Acquisition of Comprehensive Multiword Lexicons using Competition in an n-gram Lattice. Transactions of the Association for Computational Linguistics, 5:455–470.
Cite (Informal):
Unsupervised Acquisition of Comprehensive Multiword Lexicons using Competition in an n-gram Lattice (Brooke et al., TACL 2017)
PDF:
https://preview.aclanthology.org/nschneid-patch-4/Q17-1032.pdf
Video:
https://preview.aclanthology.org/nschneid-patch-4/Q17-1032.mp4