Abstract
Embedding and projection matrices are commonly used in neural language models (NLM) as well as in other sequence processing networks that operate on large vocabularies. We examine such matrices in fine-tuned language models and observe that an NLM learns word vectors whose norms are related to word frequencies. We show that initializing the weight norms with scaled log word counts, together with other techniques, yields lower perplexities in early epochs of training. We also introduce a weight norm regularization loss term, whose hyperparameters are tuned via a grid search. With this method, we are able to significantly improve perplexities on two word-level language modeling tasks (without dynamic evaluation): from 54.44 to 53.16 on Penn Treebank (PTB) and from 61.45 to 60.13 on WikiText-2 (WT2).
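The abstract describes two ideas: initializing embedding row norms from scaled log word counts, and regularizing those norms during training. The sketch below illustrates both in PyTorch; the exact scaling, clamping, and penalty form used in the paper are not given in the abstract, so `scale`, `alpha`, and the squared-error penalty here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumptions noted above): tie embedding row norms to
# scaled log word counts and penalize deviation from those targets.
import torch
import torch.nn as nn


def init_weight_norms(embedding: nn.Embedding, counts: torch.Tensor, scale: float = 1.0):
    """Rescale each embedding row so its L2 norm equals scale * log(count)."""
    with torch.no_grad():
        directions = embedding.weight / embedding.weight.norm(dim=1, keepdim=True)
        target_norms = scale * torch.log(counts.float().clamp(min=2.0))  # clamp avoids zero norms
        embedding.weight.copy_(directions * target_norms.unsqueeze(1))


def weight_norm_regularizer(embedding: nn.Embedding, counts: torch.Tensor,
                            scale: float = 1.0, alpha: float = 1e-4) -> torch.Tensor:
    """Penalty pulling embedding row norms toward the scaled log-count targets."""
    norms = embedding.weight.norm(dim=1)
    target_norms = scale * torch.log(counts.float().clamp(min=2.0))
    return alpha * ((norms - target_norms) ** 2).mean()


# Usage: initialize once, then add the penalty to the usual LM training loss.
vocab_size, emb_dim = 10000, 300
counts = torch.randint(1, 50000, (vocab_size,))   # stand-in for corpus word frequencies
emb = nn.Embedding(vocab_size, emb_dim)
init_weight_norms(emb, counts)
reg_loss = weight_norm_regularizer(emb, counts)   # add to the cross-entropy loss
```

In this sketch `scale` and `alpha` play the role of the hyperparameters that the paper tunes via grid search.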
- Anthology ID: W18-6310
- Volume: Proceedings of the Third Conference on Machine Translation: Research Papers
- Month: October
- Year: 2018
- Address: Brussels, Belgium
- Editors: Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Matt Post, Lucia Specia, Marco Turchi, Karin Verspoor
- Venue: WMT
- SIG: SIGMT
- Publisher: Association for Computational Linguistics
- Pages: 93–100
- URL: https://aclanthology.org/W18-6310
- DOI: 10.18653/v1/W18-6310
- Cite (ACL): Christian Herold, Yingbo Gao, and Hermann Ney. 2018. Improving Neural Language Models with Weight Norm Initialization and Regularization. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 93–100, Brussels, Belgium. Association for Computational Linguistics.
- Cite (Informal): Improving Neural Language Models with Weight Norm Initialization and Regularization (Herold et al., WMT 2018)
- PDF: https://preview.aclanthology.org/add_acl24_videos/W18-6310.pdf
- Data: WikiText-2