Does an LSTM forget more than a CNN? An empirical study of catastrophic forgetting in NLP

Gaurav Arora, Afshin Rahimi, Timothy Baldwin


Abstract
Catastrophic forgetting — whereby a model trained on one task is fine-tuned on a second and, in doing so, suffers a “catastrophic” drop in performance on the first task — is a hurdle in the development of better transfer learning techniques. Despite impressive progress in reducing catastrophic forgetting, we have limited understanding of how different architectures and hyper-parameters affect forgetting in a network. With this study, we aim to understand factors which cause forgetting during sequential training. Our primary finding is that CNNs forget less than LSTMs, and we show that max-pooling is the underlying operation which helps CNNs alleviate forgetting relative to LSTMs. We also find that curriculum learning, in the form of placing a hard task towards the end of the task sequence, reduces forgetting. Finally, we analyse the effect of fine-tuning contextual embeddings on catastrophic forgetting, and find that using the embeddings as a feature extractor is preferable to fine-tuning them in a continual learning setup.
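
The abstract's central comparison can be made concrete with a small experiment. Below is a minimal sketch (not the authors' code) of sequential training on two tasks, measuring forgetting as the drop in task-1 accuracy after training on task 2, for a max-pooled CNN classifier versus an LSTM classifier. The synthetic "trigger token" tasks, model sizes, and training schedule are all illustrative assumptions.

import torch
import torch.nn as nn

VOCAB, DIM, HID, T = 100, 32, 64, 20

class CNNClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, DIM)
        self.conv = nn.Conv1d(DIM, HID, kernel_size=3, padding=1)
        self.out = nn.Linear(HID, 2)

    def forward(self, x):
        h = self.conv(self.emb(x).transpose(1, 2)).relu()
        return self.out(h.max(dim=2).values)  # max-pooling over time

class LSTMClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, DIM)
        self.lstm = nn.LSTM(DIM, HID, batch_first=True)
        self.out = nn.Linear(HID, 2)

    def forward(self, x):
        _, (h, _) = self.lstm(self.emb(x))
        return self.out(h[-1])  # final hidden state

def make_task(trigger, n=2000):
    # Binary toy task (an assumption, not the paper's data):
    # does the trigger token occur anywhere in the sequence?
    x = torch.randint(0, VOCAB, (n, T))
    return x, (x == trigger).any(dim=1).long()

def train(model, x, y, steps=200):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.cross_entropy(model(x), y).backward()
        opt.step()

@torch.no_grad()
def accuracy(model, x, y):
    return (model(x).argmax(dim=1) == y).float().mean().item()

for model in (CNNClassifier(), LSTMClassifier()):
    x1, y1 = make_task(trigger=7)
    x2, y2 = make_task(trigger=42)
    train(model, x1, y1)
    before = accuracy(model, x1, y1)
    train(model, x2, y2)  # sequential fine-tuning on task 2
    after = accuracy(model, x1, y1)
    print(f"{type(model).__name__}: task-1 accuracy {before:.3f} -> {after:.3f}, "
          f"forgetting {before - after:.3f}")

The feature-extractor variant mentioned in the abstract corresponds, in this sketch, to freezing the (here randomly initialised) embedding layer before training on the second task, e.g. model.emb.weight.requires_grad = False, rather than fine-tuning it.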
Anthology ID:
U19-1011
Volume:
Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association
Month:
4–6 December
Year:
2019
Address:
Sydney, Australia
Venue:
ALTA
Publisher:
Australasian Language Technology Association
Pages:
77–86
URL:
https://aclanthology.org/U19-1011
Cite (ACL):
Gaurav Arora, Afshin Rahimi, and Timothy Baldwin. 2019. Does an LSTM forget more than a CNN? An empirical study of catastrophic forgetting in NLP. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, pages 77–86, Sydney, Australia. Australasian Language Technology Association.
Cite (Informal):
Does an LSTM forget more than a CNN? An empirical study of catastrophic forgetting in NLP (Arora et al., ALTA 2019)
PDF:
https://preview.aclanthology.org/auto-file-uploads/U19-1011.pdf