Abstract
Highly regularized LSTMs achieve impressive results on several benchmark datasets in language modeling. We propose a new regularization method based on decoding the last token in the context using the predicted distribution of the next token. This biases the model towards retaining more contextual information, in turn improving its ability to predict the next token. With negligible overhead in the number of parameters and training time, our Past Decode Regularization (PDR) method improves perplexity on the Penn Treebank dataset by up to 1.8 points and by up to 2.3 points on the WikiText-2 dataset, over strong regularized baselines using a single softmax. With a mixture-of-softmax model, we show gains of up to 1.0 perplexity points on these datasets. In addition, our method achieves 1.169 bits-per-character on the Penn Treebank Character dataset for character-level language modeling.
- Anthology ID: P19-1142
- Volume: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
- Month: July
- Year: 2019
- Address: Florence, Italy
- Venue: ACL
- Publisher: Association for Computational Linguistics
- Pages: 1468–1476
- URL: https://aclanthology.org/P19-1142
- DOI: 10.18653/v1/P19-1142
- Cite (ACL): Siddhartha Brahma. 2019. Improved Language Modeling by Decoding the Past. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1468–1476, Florence, Italy. Association for Computational Linguistics.
- Cite (Informal): Improved Language Modeling by Decoding the Past (Brahma, ACL 2019)
- PDF: https://preview.aclanthology.org/ingestion-script-update/P19-1142.pdf
- Data: Penn Treebank, WikiText-2
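
To make the past-decoding idea from the abstract concrete, the following is a minimal PyTorch-style sketch of an auxiliary loss in the spirit of PDR. It assumes a simple formulation in which the predicted next-token distribution is mapped back through the (tied) embedding matrix and a separate linear head tries to recover the last token of the context; the class name `PDRLanguageModel`, the `past_decoder` head, and the coefficient `lambda_pdr` are illustrative choices, not the paper's exact architecture or hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PDRLanguageModel(nn.Module):
    """LSTM language model with a PDR-style auxiliary loss (illustrative sketch):
    the predicted next-token distribution is also used to decode the previous
    token in the context, encouraging the model to retain contextual information."""

    def __init__(self, vocab_size, emb_dim=400, lambda_pdr=0.001):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, emb_dim, batch_first=True)
        self.decoder = nn.Linear(emb_dim, vocab_size)
        self.decoder.weight = self.embedding.weight         # weight tying, common in these baselines
        self.past_decoder = nn.Linear(emb_dim, vocab_size)   # hypothetical past-decoding head
        self.lambda_pdr = lambda_pdr                         # hypothetical regularization weight

    def forward(self, tokens, targets):
        # tokens:  (batch, seq) context tokens; targets: (batch, seq) next tokens.
        out, _ = self.lstm(self.embedding(tokens))           # (batch, seq, emb_dim)
        logits = self.decoder(out)                           # next-token logits
        lm_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                  targets.reshape(-1))

        # PDR-style term: map the predicted next-token distribution back into
        # embedding space and try to decode the *last token of the context*.
        next_dist = F.softmax(logits, dim=-1)                # (batch, seq, vocab)
        decoded_ctx = next_dist @ self.embedding.weight      # (batch, seq, emb_dim)
        past_logits = self.past_decoder(decoded_ctx)
        pdr_loss = F.cross_entropy(past_logits.reshape(-1, past_logits.size(-1)),
                                   tokens.reshape(-1))

        return lm_loss + self.lambda_pdr * pdr_loss


# Usage sketch: a batch of 8 sequences of length 35 over a 10k-word vocabulary.
model = PDRLanguageModel(vocab_size=10000)
tokens = torch.randint(0, 10000, (8, 35))
targets = torch.randint(0, 10000, (8, 35))
loss = model(tokens, targets)
loss.backward()
```

The auxiliary cross-entropy pushes the next-token distribution to remain informative about the token just seen, which is the bias toward retaining contextual information that the abstract describes; the standard language-modeling loss is otherwise unchanged.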