Rewiring the Transformer with Depth-Wise LSTMs

Hongfei Xu, Yang Song, Qiuhui Liu, Josef van Genabith, Deyi Xiong


Abstract
Stacking non-linear layers allows deep neural networks to model complicated functions, and including residual connections in Transformer layers is beneficial for convergence and performance. However, residual connections may cause the model to “forget” distant layers and fail to fuse information from previous layers effectively. Selectively managing the representation aggregation of Transformer layers may therefore lead to better performance. In this paper, we present a Transformer with depth-wise LSTMs connecting cascading Transformer layers and sub-layers. We show that layer normalization and feed-forward computation within a Transformer layer can be absorbed into depth-wise LSTMs connecting pure Transformer attention layers. Our experiments with the 6-layer Transformer show significant BLEU improvements on both the WMT 14 English-German and English-French tasks and on the OPUS-100 many-to-many multilingual NMT task, and our deep Transformer experiments demonstrate the effectiveness of depth-wise LSTMs for the convergence and performance of deep Transformers.
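The core idea in the abstract — treating the stack of layers as a sequence along the depth axis, with an LSTM replacing residual connections between layers — can be illustrated with a minimal sketch. This is an assumption-laden reading of the abstract, not the authors' implementation: the module names, the use of a single shared `LSTMCell` across depth, and the choice of plain multi-head attention as the per-layer computation are all illustrative choices; the paper's actual gating and placement may differ.

```python
# Hedged sketch (assumptions, not the paper's code): an LSTM cell runs along
# the *depth* dimension, so each layer's attention output is one LSTM
# "timestep" input, replacing the usual residual addition. Layer norm and
# feed-forward roles are assumed to be absorbed by the LSTM gating, as the
# abstract describes for "pure Transformer attention layers".
import torch
import torch.nn as nn


class DepthWiseLSTMStack(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_layers=6):
        super().__init__()
        # Pure attention sub-layers, one per depth step.
        self.attn_layers = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        # One LSTM cell connects cascading layers along the depth axis
        # (sharing it across depth is an illustrative simplification).
        self.depth_lstm = nn.LSTMCell(d_model, d_model)

    def forward(self, x):
        bsz, seq_len, d_model = x.shape
        # Each token position carries its own depth-wise LSTM state.
        h = x.reshape(bsz * seq_len, d_model)
        c = torch.zeros_like(h)
        for attn in self.attn_layers:
            hid = h.view(bsz, seq_len, d_model)
            y, _ = attn(hid, hid, hid)  # self-attention over the sequence
            # Attention output is the LSTM input at this depth step; the
            # cell's gates decide how much of earlier layers to keep,
            # instead of a fixed residual sum.
            h, c = self.depth_lstm(y.reshape(bsz * seq_len, d_model), (h, c))
        return h.view(bsz, seq_len, d_model)
```

A plain residual stack would compute `h = h + sublayer(h)` at each depth; here the input, forget, and output gates of the cell perform that aggregation selectively, which is the "selectively managing the representation aggregation" the abstract motivates.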
Anthology ID:
2024.lrec-main.1231
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
14122–14133
URL:
https://aclanthology.org/2024.lrec-main.1231
Cite (ACL):
Hongfei Xu, Yang Song, Qiuhui Liu, Josef van Genabith, and Deyi Xiong. 2024. Rewiring the Transformer with Depth-Wise LSTMs. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 14122–14133, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Rewiring the Transformer with Depth-Wise LSTMs (Xu et al., LREC-COLING 2024)
PDF:
https://preview.aclanthology.org/nschneid-patch-4/2024.lrec-main.1231.pdf