Abstract
Recurrent neural network language models (RNNLMs) form a valuable foundation for many NLP systems, but training them can be computationally expensive and may take days on a large corpus. We explore a technique that uses large-corpus n-gram statistics as a regularizer for training a neural network LM on a smaller corpus. In experiments with the Billion Word and WikiText corpora, we show that the technique is effective and more time-efficient than simply training on a larger sequential corpus. We also introduce new strategies for selecting the most informative n-grams, and show that these boost efficiency.
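To make the regularization idea concrete, here is a minimal sketch of one way such a loss could be assembled: the usual cross-entropy objective on the small training corpus plus a weighted KL term that pulls the model's next-word distribution, conditioned on selected n-gram contexts, toward conditional probabilities estimated from large-corpus n-gram counts. This is an illustrative PyTorch sketch only; the function name `ngram_regularized_loss`, the weight `lam`, and the tensor shapes are assumptions, not the paper's exact formulation.

```python
# Illustrative sketch (assumes PyTorch); the paper's actual loss may differ.
import torch
import torch.nn.functional as F

def ngram_regularized_loss(logits, targets, context_logits, ngram_probs, lam=0.1):
    """logits: (batch, vocab) LM predictions at small-corpus positions.
    targets: (batch,) gold next-word ids for those positions.
    context_logits: (m, vocab) LM predictions conditioned on selected n-gram contexts.
    ngram_probs: (m, vocab) next-word distributions estimated from large-corpus n-gram counts.
    lam: regularizer weight (hypothetical hyperparameter)."""
    # Standard language-model loss on the small training corpus.
    ce = F.cross_entropy(logits, targets)
    # KL term: push the model's conditional distributions toward the
    # large-corpus n-gram statistics for the sampled contexts.
    log_q = F.log_softmax(context_logits, dim=-1)
    kl = F.kl_div(log_q, ngram_probs, reduction="batchmean")
    return ce + lam * kl
```

Under this framing, the n-gram selection strategies mentioned in the abstract would correspond to choosing which contexts populate `context_logits` and `ngram_probs` at each step.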
- Anthology ID:
- N19-1330
- Volume:
- Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
- Month:
- June
- Year:
- 2019
- Address:
- Minneapolis, Minnesota
- Editors:
- Jill Burstein, Christy Doran, Thamar Solorio
- Venue:
- NAACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 3268–3273
- URL:
- https://aclanthology.org/N19-1330
- DOI:
- 10.18653/v1/N19-1330
- Cite (ACL):
- Yiben Yang, Ji-Ping Wang, and Doug Downey. 2019. Using Large Corpus N-gram Statistics to Improve Recurrent Neural Language Models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3268–3273, Minneapolis, Minnesota. Association for Computational Linguistics.
- Cite (Informal):
- Using Large Corpus N-gram Statistics to Improve Recurrent Neural Language Models (Yang et al., NAACL 2019)
- PDF:
- https://preview.aclanthology.org/naacl24-info/N19-1330.pdf