Abstract
Recurrent neural network language models (RNNLMs) form a valuable foundation for many NLP systems, but training them can be computationally expensive and may take days on a large corpus. We explore a technique that uses large-corpus n-gram statistics as a regularizer for training a neural network LM on a smaller corpus. In experiments with the Billion Word and WikiText corpora, we show that the technique is effective and more time-efficient than simply training on a larger sequential corpus. We also introduce new strategies for selecting the most informative n-grams, and show that these boost efficiency.
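To make the regularization idea concrete, here is a minimal sketch of one way such a loss could be assembled: the usual cross-entropy objective on the small training corpus plus a weighted KL term that pulls the model's next-word distribution, conditioned on selected n-gram contexts, toward conditional probabilities estimated from large-corpus n-gram counts. This is an illustrative PyTorch sketch only; the function name `ngram_regularized_loss`, the weight `lam`, and the tensor shapes are assumptions, not the paper's exact formulation.

```python
# Illustrative sketch (assumes PyTorch); the paper's actual loss may differ.
import torch
import torch.nn.functional as F

def ngram_regularized_loss(logits, targets, context_logits, ngram_probs, lam=0.1):
    """logits: (batch, vocab) LM predictions at small-corpus positions.
    targets: (batch,) gold next-word ids for those positions.
    context_logits: (m, vocab) LM predictions conditioned on selected n-gram contexts.
    ngram_probs: (m, vocab) next-word distributions estimated from large-corpus n-gram counts.
    lam: regularizer weight (hypothetical hyperparameter)."""
    # Standard language-model loss on the small training corpus.
    ce = F.cross_entropy(logits, targets)
    # KL term: push the model's conditional distributions toward the
    # large-corpus n-gram statistics for the sampled contexts.
    log_q = F.log_softmax(context_logits, dim=-1)
    kl = F.kl_div(log_q, ngram_probs, reduction="batchmean")
    return ce + lam * kl
```

Under this framing, the n-gram selection strategies mentioned in the abstract would correspond to choosing which contexts populate `context_logits` and `ngram_probs` at each step.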
- Anthology ID:
- N19-1330
- Volume:
- Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
- Month:
- June
- Year:
- 2019
- Address:
- Minneapolis, Minnesota
- Editors:
- Jill Burstein, Christy Doran, Thamar Solorio
- Venue:
- NAACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 3268–3273
- URL:
- https://aclanthology.org/N19-1330
- DOI:
- 10.18653/v1/N19-1330
- Cite (ACL):
- Yiben Yang, Ji-Ping Wang, and Doug Downey. 2019. Using Large Corpus N-gram Statistics to Improve Recurrent Neural Language Models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3268–3273, Minneapolis, Minnesota. Association for Computational Linguistics.
- Cite (Informal):
- Using Large Corpus N-gram Statistics to Improve Recurrent Neural Language Models (Yang et al., NAACL 2019)
- PDF:
- https://preview.aclanthology.org/naacl24-info/N19-1330.pdf