Abstract
Catastrophic forgetting, whereby a model trained on one task is fine-tuned on a second and, in doing so, suffers a “catastrophic” drop in performance on the first task, is a hurdle in the development of better transfer learning techniques. Despite impressive progress in reducing catastrophic forgetting, we have limited understanding of how different architectures and hyper-parameters affect forgetting in a network. With this study, we aim to understand the factors which cause forgetting during sequential training. Our primary finding is that CNNs forget less than LSTMs, and we show that max-pooling is the underlying operation which helps CNNs alleviate forgetting compared to LSTMs. We also find that curriculum learning, i.e. placing a hard task towards the end of the task sequence, reduces forgetting. Finally, we analyse the effect of fine-tuning contextual embeddings on catastrophic forgetting and find that using the embeddings as a feature extractor is preferable to fine-tuning them in a continual learning setup.
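To make the architectural comparison concrete, here is a minimal sketch (PyTorch assumed; this is an illustration, not the authors' code) of the two encoder families the abstract contrasts: a CNN classifier that max-pools over time, and an LSTM classifier that uses its final hidden state. The `freeze_embeddings` flag is a hypothetical stand-in for the feature-extractor setting that the abstract reports works better than fine-tuning the embeddings in a continual learning setup.

```python
# Sketch of the two encoders compared in the paper (PyTorch assumed).
# The max over the time dimension in CNNEncoder is the operation the
# abstract identifies as the main reason CNNs forget less than LSTMs.
import torch
import torch.nn as nn


class CNNEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, num_filters=128,
                 kernel_size=3, num_classes=2, freeze_embeddings=False):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # Frozen embeddings = feature-extractor mode (hypothetical flag).
        self.embedding.weight.requires_grad = not freeze_embeddings
        self.conv = nn.Conv1d(emb_dim, num_filters, kernel_size, padding=1)
        self.classifier = nn.Linear(num_filters, num_classes)

    def forward(self, token_ids):                     # (batch, seq_len)
        x = self.embedding(token_ids)                 # (batch, seq_len, emb_dim)
        x = torch.relu(self.conv(x.transpose(1, 2)))  # (batch, filters, seq_len)
        x, _ = x.max(dim=2)                           # max-pool over time
        return self.classifier(x)


class LSTMEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hidden_dim=128,
                 num_classes=2, freeze_embeddings=False):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.embedding.weight.requires_grad = not freeze_embeddings
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        x = self.embedding(token_ids)
        _, (h_n, _) = self.lstm(x)                    # final hidden state
        return self.classifier(h_n[-1])


if __name__ == "__main__":
    batch = torch.randint(0, 1000, (4, 20))           # toy batch of token ids
    print(CNNEncoder(1000)(batch).shape)              # torch.Size([4, 2])
    print(LSTMEncoder(1000)(batch).shape)             # torch.Size([4, 2])
```

In a sequential-training (continual learning) experiment of the kind the abstract describes, either encoder would be trained on one task and then fine-tuned on the next, with forgetting measured as the drop in accuracy on the earlier task.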
- Anthology ID: U19-1011
- Volume: Proceedings of the The 17th Annual Workshop of the Australasian Language Technology Association
- Month: 4–6 December
- Year: 2019
- Address: Sydney, Australia
- Venue: ALTA
- Publisher: Australasian Language Technology Association
- Pages: 77–86
- URL: https://aclanthology.org/U19-1011
- Cite (ACL): Gaurav Arora, Afshin Rahimi, and Timothy Baldwin. 2019. Does an LSTM forget more than a CNN? An empirical study of catastrophic forgetting in NLP. In Proceedings of the The 17th Annual Workshop of the Australasian Language Technology Association, pages 77–86, Sydney, Australia. Australasian Language Technology Association.
- Cite (Informal): Does an LSTM forget more than a CNN? An empirical study of catastrophic forgetting in NLP (Arora et al., ALTA 2019)
- PDF: https://preview.aclanthology.org/ingestion-script-update/U19-1011.pdf