Ziyang Ma
2026
FineLAP: Taming Heterogeneous Supervision for Fine-grained Language-Audio Pretraining
Xiquan Li | Xuenan Xu | Ziyang Ma | Wenxi Chen | Haolin He | Qiuqiang Kong | Xie Chen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Contrastively pretrained audio–language models (e.g., CLAP) excel at clip-level understanding but struggle with frame-level tasks. Existing extensions fail to exploit the varying granularity of real-world audio–text data, where massive clip-level textual descriptions coexist with limited frame-level annotations. This paper proposes **Fine**-grained **L**anguage-**A**udio **P**retraining (**FineLAP**), a novel training paradigm that advances both clip- and frame-level alignment in CLAP with heterogeneous data. FineLAP introduces a dual-stream sigmoid loss with a cluster-based sampling strategy to jointly learn from clip- and frame-level supervision. To capture both global semantics and local details, FineLAP uses a decoupled audio projector on top of a self-supervised encoder. To alleviate the scarcity of temporally annotated data, we present FineLAP-100k, a large-scale synthetic SED dataset constructed through a scalable curation pipeline. Extensive experiments demonstrate that FineLAP achieves SOTA performance across multiple audio understanding tasks, including retrieval, classification, sound event detection, and text-to-audio grounding. Ablation studies further show that coarse- and fine-grained alignment are mutually beneficial, providing insights for building better audio-language models (ALMs).
Evaluating the Expressive Appropriateness of Speech in Rich Contexts
Tianrui Wang | Ziyang Ma | Yizhou Peng | Haoyu Wang | Zhikang Niu | Zikang Huang | Yihao Wu | Yi-Wen Chao | Yu Jiang | Yuheng Lu | Guanrou Yang | Xuanchen Li | Hexin Liu | Chunyu Qiang | Cheng Gong | Yifan Yang | Tianchi Liu | Junyu Wang | Nana Hou | Meng Ge | Fuming You | Yang Wei | Zhongqian Sun | Hu Haifeng | Xiaobao Wang | Eng Siong Chng | Xie Chen | Longbiao Wang | Jianwu Dang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Evaluating expressive speech remains challenging, as existing methods mainly assess emotional intensity and overlook whether a speech sample is expressively appropriate for its contextual setting. This limitation hinders reliable evaluation of speech systems used in narrative-driven and interactive applications, such as audiobooks and conversational agents. We introduce CEAEval, a Context-rich framework for Evaluating Expressive Appropriateness in speech, which assesses whether a speech sample expressively aligns with the underlying communicative intent implied by its discourse-level narrative context. To support this task, we construct CEAEval-D, the first context-rich speech dataset with real human performances in Mandarin conversational speech, providing narrative descriptions together with fifteen dimensions of human annotations covering expressive attributes and expressive appropriateness. We further develop CEAEval-M, a model that integrates knowledge distillation, planner-based multi-model collaboration, adaptive audio attention bias, and reinforcement learning to perform context-rich expressive appropriateness evaluation. Experiments on a human-annotated test set demonstrate that CEAEval-M substantially outperforms existing speech evaluation and analysis systems.
Towards Fine-Grained and Multi-Granular Contrastive Language-Speech Pre-training
Yifan Yang | Bing Han | Hui Wang | Wei Wang | Ziyang Ma | Long Zhou | Zengrui Jin | Guanrou Yang | Tianrui Wang | Xu Tan | Xie Chen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Modeling fine-grained speaking styles remains challenging for language-speech representation pre-training, as existing speech-text models are typically trained with coarse captions or task-specific supervision, and scalable fine-grained style annotations are unavailable. We present FCaps, a large-scale dataset with fine-grained free-text style descriptions, encompassing 47k hours of speech and 19M fine-grained captions annotated via a novel end-to-end pipeline that directly grounds detailed captions in audio, thereby avoiding the error propagation caused by LLM-based rewriting in existing cascaded pipelines. Evaluations using LLM-as-a-judge demonstrate that our annotations surpass existing cascaded annotations in terms of correctness, coverage, and naturalness. Building on FCaps, we propose CLSP, a contrastive language-speech pre-trained model that integrates global and fine-grained supervision, enabling unified representations across multiple granularities. Extensive experiments demonstrate that CLSP learns fine-grained and multi-granular speech-text representations that perform reliably across global and fine-grained speech-text retrieval, zero-shot paralinguistic classification, and speech style similarity scoring, with strong alignment to human judgments. Code and dataset are publicly available at https://github.com/yfyeung/CLSP.
SAC: Neural Speech Codec with Semantic-Acoustic Dual-Stream Quantization
Wenxi Chen | Ruiqi Yan | Yushen Chen | Zhikang Niu | Ziyang Ma | Xiquan Li | Yuzhe Liang | Wenhanlin | Shunshun Yin | Ming Tao | Xinsheng Wang | Xie Chen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Speech codecs that convert continuous speech signals into discrete tokens have become essential for speech language models. However, existing codecs struggle to balance high-quality reconstruction with semantically rich representations, limiting their effectiveness in both generative and understanding tasks. In this work, we propose SAC, a neural speech codec with semantic-acoustic dual-stream quantization. By disentangling semantic and acoustic modeling into two dedicated streams, SAC enables each to be optimized for its respective role. Comprehensive evaluations show that SAC achieves strong reconstruction performance across diverse bitrates under both clean and noisy conditions, with particularly high scores on UTMOS and WER, indicating superior naturalness and intelligibility. Moreover, SAC substantially surpasses prior codecs in semantic representation, approaching the level of continuous self-supervised embeddings. When used as a tokenizer for LLM-based text-to-speech, SAC enables a single-stage autoregressive (AR) TTS model that clearly outperforms state-of-the-art AR systems. Our disentanglement analysis further validates the effectiveness of the dual-stream design, offering new potential for controllable speech generation.
Co-authors
- Xie Chen 4
- Wenxi Chen 2
- Xiquan Li 2
- Zhikang Niu 2
- Tianrui Wang 2
- Guanrou Yang 2
- Yifan Yang 2
- Yi-Wen Chao 1
- Yushen Chen 1
- Eng Siong Chng 1
- Jianwu Dang 1
- Meng Ge 1
- Cheng Gong 1
- Hu Haifeng 1
- Bing Han 1
- Haolin He 1
- Nana Hou 1
- Zikang Huang 1
- Yu Jiang 1
- Zengrui Jin 1
- Qiuqiang Kong 1
- Xuanchen Li 1
- Yuzhe Liang 1
- Hexin Liu 1
- Tianchi Liu 1
- Yuheng Lu 1
- Yizhou Peng 1
- Chunyu Qiang 1
- Zhongqian Sun 1
- Xu Tan 1
- Ming Tao 1
- Haoyu Wang 1
- Hui Wang 1
- Junyu Wang 1
- Longbiao Wang 1
- Wei Wang 1
- Xiaobao Wang 1
- Xinsheng Wang 1
- Yang Wei 1
- Wenhanlin 1
- Yihao Wu 1
- Xuenan Xu 1
- Ruiqi Yan 1
- Shunshun Yin 1
- Fuming You 1
- Long Zhou 1
Venues
- ACL 4