Chun Wei Chen
2026
An Exploration of Mamba for Speech Self-Supervised Models
Tzu-Quan Lin | Heng-Cheng Kuo | Tzu-Chieh Wei | Hsi-Chun Cheng | Chun Wei Chen | Hsien-Fu Hsiao | Yu Tsao | Hung-yi Lee
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
While Mamba has demonstrated strong performance in language modeling, its potential as a speech self-supervised learning (SSL) model remains underexplored, with prior studies limited to isolated tasks. To address this, we explore Mamba-based HuBERT models as alternatives to Transformer-based SSL architectures. Leveraging the linear-time selective state space model (SSM), these models enable fine-tuning on long-context automatic speech recognition (ASR) with significantly lower compute. Moreover, they show superior performance when fine-tuned for streaming ASR. Beyond fine-tuning, they achieve competitive performance on SUPERB probing benchmarks, particularly in causal settings. Our analysis shows that they yield higher-quality quantized representations and capture speaker-related features more distinctly than Transformer-based models. These findings highlight Mamba-based SSL as a promising and complementary direction for long-sequence modeling, real-time speech modeling, and speech unit extraction. The codebase is available at https://github.com/hckuo145/Mamba-based-HuBERT.
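To illustrate why a selective state space layer runs in linear time and suits causal, streaming use, here is a minimal sketch of a simplified, single-channel selective scan. It is not the authors' implementation or the optimized Mamba kernel. The step size and the B and C parameters depend on the current input (the "selective" part), A is assumed diagonal and negative, B uses a simplified Euler-style discretization, and the projection parameters `w_delta`, `b_delta`, `w_b`, and `w_c` are hypothetical scalars and lists chosen for illustration.

```python
import math
from typing import List


def softplus(z: float) -> float:
    # Numerically stable softplus; keeps the step size Delta_t positive.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def selective_scan(
    x: List[float],
    a: List[float],    # assumed diagonal of the continuous-time state matrix A (negative values)
    w_delta: float,    # hypothetical projection weight: x_t -> Delta_t
    b_delta: float,    # hypothetical projection bias for Delta_t
    w_b: List[float],  # hypothetical projection x_t -> B_t (one weight per state dimension)
    w_c: List[float],  # hypothetical projection x_t -> C_t (one weight per state dimension)
) -> List[float]:
    """Simplified single-channel selective SSM scan (illustrative sketch only)."""
    n = len(a)
    h = [0.0] * n  # fixed-size recurrent state, independent of sequence length
    ys: List[float] = []
    for x_t in x:  # a single left-to-right pass: O(L * N) time for L frames
        # Input-dependent ("selective") parameters for this time step.
        delta = softplus(w_delta * x_t + b_delta)
        b_t = [w * x_t for w in w_b]
        c_t = [w * x_t for w in w_c]
        for i in range(n):
            a_bar = math.exp(delta * a[i])  # zero-order-hold discretization of diagonal A
            b_bar = delta * b_t[i]          # simplified (Euler-style) discretization of B
            h[i] = a_bar * h[i] + b_bar * x_t
        # Output is a readout of the current state only, so it never uses future frames.
        ys.append(sum(c * s for c, s in zip(c_t, h)))
    return ys


if __name__ == "__main__":
    frames = [0.5, -1.0, 2.0, 0.25]
    print(selective_scan(frames, [-1.0, -0.5], 0.3, 0.1, [1.0, 0.5], [0.8, -0.2]))
```

Because each output depends only on the current state, which summarizes past frames, the same loop can consume audio features one frame at a time, whereas self-attention compares every frame with every other frame and grows quadratically with sequence length.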