Self-Supervised Contrastive Learning for Content-Centric Speech Representation
Lijinlong Lijinlong, Ling Dong, Wenjun Wang, Zhengtao Yu, Shengxiang Gao
Abstract
Self-supervised learning (SSL) speech models have achieved remarkable performance across a variety of tasks, and the learned representations are often general enough to transfer to multiple downstream tasks. However, these representations encode both speech content and paralinguistic information, which can be redundant for content-focused tasks, and decoupling this redundant information is challenging. To address this issue, we propose a Self-Supervised Contrastive Representation Learning method (SSCRL), which disentangles paralinguistic information from speech content by aligning representations of speech with similar content in the feature space, using self-supervised contrastive learning with pitch-perturbed and speaker-perturbed features. Experimental results demonstrate that the proposed method, fine-tuned on the LibriSpeech 100-hour dataset, achieves superior performance on all content-related tasks in the SUPERB benchmark, generally outperforming prior approaches.
- Anthology ID:
- 2025.ccl-1.61
- Volume:
- Proceedings of the 24th China National Conference on Computational Linguistics (CCL 2025)
- Month:
- August
- Year:
- 2025
- Address:
- Jinan, China
- Editors:
- Maosong Sun, Peiyong Duan, Zhiyuan Liu, Ruifeng Xu, Weiwei Sun
- Venue:
- CCL
- Publisher:
- Chinese Information Processing Society of China
- Pages:
- 807–817
- URL:
- https://preview.aclanthology.org/ingest-ccl/2025.ccl-1.61/
- Cite (ACL):
- Lijinlong Lijinlong, Ling Dong, Wenjun Wang, Zhengtao Yu, and Shengxiang Gao. 2025. Self-Supervised Contrastive Learning for Content-Centric Speech Representation. In Proceedings of the 24th China National Conference on Computational Linguistics (CCL 2025), pages 807–817, Jinan, China. Chinese Information Processing Society of China.
- Cite (Informal):
- Self-Supervised Contrastive Learning for Content-Centric Speech Representation (Lijinlong et al., CCL 2025)
- PDF:
- https://preview.aclanthology.org/ingest-ccl/2025.ccl-1.61.pdf
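As a rough illustration of the contrastive alignment the abstract describes, the sketch below computes a symmetric InfoNCE-style loss between a batch of "clean" utterance embeddings and their pitch/speaker-perturbed views: representations sharing the same content are pulled together, while other utterances in the batch serve as negatives. This is a generic contrastive objective written for illustration only; the function name, dimensions, and temperature are assumptions, not the authors' exact SSCRL formulation.

```python
import numpy as np

def info_nce_loss(z_clean, z_perturbed, temperature=0.1):
    """Symmetric InfoNCE over a batch (hypothetical sketch, not the paper's
    exact objective): each clean embedding is pulled toward its perturbed
    view and pushed away from the other utterances in the batch."""
    # L2-normalize so the dot product below is cosine similarity.
    z1 = z_clean / np.linalg.norm(z_clean, axis=1, keepdims=True)
    z2 = z_perturbed / np.linalg.norm(z_perturbed, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature           # (B, B) similarity matrix
    idx = np.arange(len(z1))                   # positives on the diagonal

    def xent(l):
        # Cross-entropy with the diagonal as the target class.
        l = l - l.max(axis=1, keepdims=True)   # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # Average both directions: clean->perturbed and perturbed->clean.
    return 0.5 * (xent(logits) + xent(logits.T))

# Toy batch: 4 utterances with 8-dim embeddings. A perturbed view that
# preserves content (small noise) should score a much lower loss than an
# unrelated batch.
rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))
aligned_loss = info_nce_loss(z, z + 0.05 * rng.normal(size=(4, 8)))
mismatched_loss = info_nce_loss(z, rng.normal(size=(4, 8)))
```

Under this objective, pitch- and speaker-perturbed views of the same utterance act as positives, so the encoder is pushed to keep only the information invariant to those perturbations, i.e. the content.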