Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees

Ehsan Shareghi; Matthias Petri; Gholamreza Haffari; Trevor Cohn

doi:10.1162/tacl_a_00112

Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees

Ehsan Shareghi, Matthias Petri, Gholamreza Haffari, Trevor Cohn

Abstract

Efficient methods for storing and querying are critical for scaling high-order m-gram language models to large corpora. We propose a language model based on compressed suffix trees, a representation that is highly compact and can be easily held in memory, while supporting queries needed in computing language model probabilities on-the-fly. We present several optimisations which improve query runtimes up to 2500×, despite only incurring a modest increase in construction time and memory usage. For large corpora and high Markov orders, our method is highly competitive with the state-of-the-art KenLM package. It imposes much lower memory requirements, often by orders of magnitude, and has runtimes that are either similar (for training) or comparable (for querying).

Anthology ID:: Q16-1034
Volume:: Transactions of the Association for Computational Linguistics, Volume 4
Month:
Year:: 2016
Address:: Cambridge, MA
Editors:: Lillian Lee, Mark Johnson, Kristina Toutanova
Venue:: TACL
SIG:
Publisher:: MIT Press
Note:
Pages:: 477–490
Language:
URL:: https://aclanthology.org/Q16-1034
DOI:: 10.1162/tacl_a_00112
Bibkey:
Cite (ACL):: Ehsan Shareghi, Matthias Petri, Gholamreza Haffari, and Trevor Cohn. 2016. Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees. Transactions of the Association for Computational Linguistics, 4:477–490.
Cite (Informal):: Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees (Shareghi et al., TACL 2016)
Copy Citation:
PDF:: https://preview.aclanthology.org/naacl24-info/Q16-1034.pdf
Video:: https://preview.aclanthology.org/naacl24-info/Q16-1034.mp4
Code: eehsan/cstlm

PDF Search Code Video