Abstract
Language model pre-training, such as BERT, has achieved remarkable results in many NLP tasks. However, it is unclear why the pre-training-then-fine-tuning paradigm can improve performance and generalization capability across different tasks. In this paper, we propose to visualize loss landscapes and optimization trajectories of fine-tuning BERT on specific datasets. First, we find that pre-training reaches a good initial point across downstream tasks, which leads to wider optima and easier optimization compared with training from scratch. We also demonstrate that the fine-tuning procedure is robust to overfitting, even though BERT is highly over-parameterized for downstream tasks. Second, the visualization results indicate that fine-tuning BERT tends to generalize better because of the flat and wide optima, and the consistency between the training loss surface and the generalization error surface. Third, the lower layers of BERT are more invariant during fine-tuning, which suggests that the layers that are close to the input learn more transferable representations of language.
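The abstract's central tool is loss-landscape visualization between the pre-trained initialization and the fine-tuned solution. As a rough illustration of the one-dimensional variant of this idea only (a generic PyTorch sketch, not the authors' released code; the function name `interpolation_losses`, the `(inputs, labels)` batch format, and the loss callable are assumptions for this example), the snippet below evaluates the training loss along the straight line theta(alpha) = (1 - alpha) * theta_0 + alpha * theta_1 between two checkpoints of the same architecture.

```python
import copy
import torch


@torch.no_grad()
def interpolation_losses(model, theta_0, theta_1, data_loader, loss_fn, num_points=25):
    """Training loss along theta(alpha) = (1 - alpha) * theta_0 + alpha * theta_1.

    theta_0 / theta_1: state dicts with identical keys (e.g. the pre-trained
    initialization and a fine-tuned checkpoint of the same model).
    data_loader: assumed to yield (inputs, labels) batches that `model` maps to logits.
    """
    model = copy.deepcopy(model).eval()
    alphas = torch.linspace(0.0, 1.0, num_points)
    curve = []
    for alpha in alphas:
        # Blend floating-point tensors; keep integer buffers from theta_0 unchanged.
        blended = {
            k: (1 - alpha) * v + alpha * theta_1[k] if torch.is_floating_point(v) else v
            for k, v in theta_0.items()
        }
        model.load_state_dict(blended)
        total_loss, n = 0.0, 0
        for inputs, labels in data_loader:
            total_loss += loss_fn(model(inputs), labels).item() * labels.size(0)
            n += labels.size(0)
        curve.append(total_loss / n)
    return alphas.tolist(), curve
```

Plotting `curve` against `alphas` gives a one-dimensional slice of the loss surface between the two solutions; two-dimensional surfaces of the kind discussed in the paper extend the same recipe by sweeping a plane spanned by two directions instead of a single line.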
- Anthology ID:
- D19-1424
- Volume:
- Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
- Month:
- November
- Year:
- 2019
- Address:
- Hong Kong, China
- Editors:
- Kentaro Inui, Jing Jiang, Vincent Ng, Xiaojun Wan
- Venues:
- EMNLP | IJCNLP
- SIG:
- SIGDAT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 4143–4152
- URL:
- https://aclanthology.org/D19-1424
- DOI:
- 10.18653/v1/D19-1424
- Cite (ACL):
- Yaru Hao, Li Dong, Furu Wei, and Ke Xu. 2019. Visualizing and Understanding the Effectiveness of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4143–4152, Hong Kong, China. Association for Computational Linguistics.
- Cite (Informal):
- Visualizing and Understanding the Effectiveness of BERT (Hao et al., EMNLP-IJCNLP 2019)
- PDF:
- https://aclanthology.org/D19-1424.pdf
- Data:
- GLUE, MRPC, MultiNLI, SST, SST-2