Using Large Corpus N-gram Statistics to Improve Recurrent Neural Language Models

Yiben Yang, Ji-Ping Wang, Doug Downey


Abstract
Recurrent neural network language models (RNNLMs) form a valuable foundation for many NLP systems, but training them can be computationally expensive and may take days on a large corpus. We explore a technique that uses large-corpus n-gram statistics as a regularizer for training a neural network LM on a smaller corpus. In experiments with the Billion-Word and Wikitext corpora, we show that the technique is effective and more time-efficient than simply training on a larger sequential corpus. We also introduce new strategies for selecting the most informative n-grams and show that these boost efficiency.
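The abstract does not spell out the form of the regularizer, so the following is only an illustrative sketch of the general idea, not the paper's formulation. It assumes the penalty is a KL divergence between a next-word distribution estimated from large-corpus n-gram counts and the model's predicted distribution for the same context, added to the usual cross-entropy loss on the smaller corpus. The function names (`ngram_penalty`, `regularized_loss`) and the `weight` parameter are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def cross_entropy(logits, target):
    """Standard LM loss for one position: -log p(target | context)."""
    return -log_softmax(logits)[target]

def ngram_penalty(logits, ngram_probs):
    """Assumed penalty: KL(q || p), where q is the next-word distribution
    from large-corpus n-gram counts and p is the model's prediction."""
    logp = log_softmax(logits)
    return sum(q * (math.log(q) - lp)
               for q, lp in zip(ngram_probs, logp) if q > 0.0)

def regularized_loss(seq_batch, ngram_batch, weight=0.1):
    """Mean cross-entropy on small-corpus positions plus a weighted
    mean n-gram penalty on sampled n-gram contexts.

    seq_batch:   list of (logits, target_index) pairs
    ngram_batch: list of (logits, ngram_probs) pairs
    weight:      hypothetical regularization strength
    """
    ce = sum(cross_entropy(l, t) for l, t in seq_batch) / len(seq_batch)
    reg = sum(ngram_penalty(l, q) for l, q in ngram_batch) / len(ngram_batch)
    return ce + weight * reg

# Toy three-word vocabulary: q stands in for large-corpus n-gram statistics.
q = [0.7, 0.2, 0.1]
matched = [math.log(p) for p in q]   # model already agrees with q
uniform = [0.0, 0.0, 0.0]            # model has learned nothing yet
loss = regularized_loss([(uniform, 0)], [(uniform, q)], weight=0.5)
```

Under these assumptions the penalty vanishes when the model's prediction matches the n-gram distribution and grows as the two diverge, which is what lets statistics from a large corpus guide training without iterating over that corpus sequentially.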
Anthology ID:
N19-1330
Volume:
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
Month:
June
Year:
2019
Address:
Minneapolis, Minnesota
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
3268–3273
URL:
https://aclanthology.org/N19-1330
DOI:
10.18653/v1/N19-1330
Cite (ACL):
Yiben Yang, Ji-Ping Wang, and Doug Downey. 2019. Using Large Corpus N-gram Statistics to Improve Recurrent Neural Language Models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3268–3273, Minneapolis, Minnesota. Association for Computational Linguistics.
Cite (Informal):
Using Large Corpus N-gram Statistics to Improve Recurrent Neural Language Models (Yang et al., NAACL 2019)
PDF:
https://preview.aclanthology.org/auto-file-uploads/N19-1330.pdf