Understanding and Enhancing Mamba-Transformer Hybrids for Memory Recall and Language Modeling

Hyunji Lee, Wenhao Yu, Hongming Zhang, Kaixin Ma, Jiyeon Kim, Dong Yu, Minjoon Seo


Abstract
Hybrid models that combine state space models (SSMs) with attention mechanisms have demonstrated strong performance by leveraging the efficiency of SSMs and the high recall ability of attention. However, the underlying reasons for these benefits remain insufficiently understood. In this work, we investigate hybrid architectures through the lens of memory utilization and overall performance, and propose a complementary method to further enhance their effectiveness. We focus in particular on the distinction between sequential and parallel integration of SSM and attention layers. Our analysis reveals that sequential hybrids perform better on shorter contexts, whereas parallel hybrids are more effective for longer contexts. Among various configurations, parallel hybrids that use a cross-attention layer to combine SSM and attention outputs perform best. We also introduce a data-centric approach to further improve model performance: continual training on datasets with paraphrases. This method strikes the best balance across a range of evaluation datasets, enhancing memory recall while preserving other capabilities. It generalizes well across different base models, including pure SSMs, and outperforms architectural modifications aimed at enhancing recall.
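As a rough illustration of the parallel configuration described in the abstract, the sketch below runs an SSM-style branch and a self-attention branch over the same input and fuses their outputs with a cross-attention layer. This is a minimal, hypothetical example, not the authors' implementation: the ParallelHybridBlock name, the dimensions, the GRU standing in for a real Mamba/SSM layer, and the choice of which branch supplies queries versus keys/values are all assumptions.

```python
import torch
import torch.nn as nn


class ParallelHybridBlock(nn.Module):
    """Illustrative parallel hybrid block (assumed design, not the paper's code):
    an SSM-style branch and a self-attention branch process the same input,
    and their outputs are fused with cross-attention."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # Stand-in for a Mamba/selective-SSM layer; a real implementation
        # would substitute an actual SSM block here.
        self.ssm_branch = nn.GRU(d_model, d_model, batch_first=True)
        self.attn_branch = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Cross-attention that fuses the two branch outputs.
        self.fuse = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        ssm_out, _ = self.ssm_branch(x)                              # (B, T, D)
        attn_out, _ = self.attn_branch(x, x, x, need_weights=False)  # (B, T, D)
        # One possible fusion: queries from the attention branch,
        # keys/values from the SSM branch.
        fused, _ = self.fuse(attn_out, ssm_out, ssm_out, need_weights=False)
        return self.norm(x + fused)                                  # residual + norm


if __name__ == "__main__":
    block = ParallelHybridBlock()
    tokens = torch.randn(2, 16, 256)   # (batch, seq_len, d_model)
    print(block(tokens).shape)         # torch.Size([2, 16, 256])
```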
Anthology ID:
2025.babylm-main.27
Volume:
Proceedings of the First BabyLM Workshop
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Lucas Charpentier, Leshem Choshen, Ryan Cotterell, Mustafa Omer Gul, Michael Y. Hu, Jing Liu, Jaap Jumelet, Tal Linzen, Aaron Mueller, Candace Ross, Raj Sanjay Shah, Alex Warstadt, Ethan Gotlieb Wilcox, Adina Williams
Venue:
BabyLM
Publisher:
Association for Computational Linguistics
Pages:
380–398
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.babylm-main.27/
Cite (ACL):
Hyunji Lee, Wenhao Yu, Hongming Zhang, Kaixin Ma, Jiyeon Kim, Dong Yu, and Minjoon Seo. 2025. Understanding and Enhancing Mamba-Transformer Hybrids for Memory Recall and Language Modeling. In Proceedings of the First BabyLM Workshop, pages 380–398, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Understanding and Enhancing Mamba-Transformer Hybrids for Memory Recall and Language Modeling (Lee et al., BabyLM 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.babylm-main.27.pdf