Zhikang Niu
2026
MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows
Xiquan Li | Junxi Liu | Yuzhe Liang | Zhikang Niu | Wenxi Chen | Xie Chen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Xiquan Li | Junxi Liu | Yuzhe Liang | Zhikang Niu | Wenxi Chen | Xie Chen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent years have witnessed remarkable progress in Text-to-Audio Generation (TTA), providing sound creators with powerful tools to transform inspirations into vivid audio. Yet despite these advances, current TTA systems often suffer from slow inference speed, which greatly hinders the efficiency and smoothness of audio creation. In this paper, we present MeanAudio, a fast and faithful text-to-audio generator capable of rendering realistic sound with only one function evaluation (1-NFE). MeanAudio leverages: (i) the MeanFlow objective with guided velocity target that significantly accelerates inference speed, (ii) an enhanced Flux-style transformer with dual text encoders for better semantic alignment and synthesis quality, and (iii) an efficient instantaneous-to-mean curriculum that speeds up convergence and enables training on consumer-grade GPUs. Through a comprehensive evaluation study, we demonstrate that MeanAudio achieves state-of-the-art performance in single-step audio generation. Specifically, it achieves a real-time factor (RTF) of 0.013 on a single NVIDIA RTX 3090, yielding a 100x speedup over SOTA diffusion-based TTA systems. Moreover, MeanAudio also shows strong performance in multi-step generation, enabling smooth transitions across successive synthesis steps.
Evaluating the Expressive Appropriateness of Speech in Rich Contexts
Tianrui Wang | Ziyang Ma | Yizhou Peng | Haoyu Wang | Zhikang Niu | Zikang Huang | Yihao Wu | Yi-Wen Chao | Yu Jiang | Yuheng Lu | Guanrou Yang | Xuanchen Li | Hexin Liu | Chunyu Qiang | Cheng Gong | Yifan Yang | Tianchi Liu | Junyu Wang | Nana Hou | Meng Ge | Fuming You | Yang Wei | Zhongqian Sun | Hu Haifeng | Xiaobao Wang | Eng Siong Chng | Xie Chen | Longbiao Wang | Jianwu Dang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Tianrui Wang | Ziyang Ma | Yizhou Peng | Haoyu Wang | Zhikang Niu | Zikang Huang | Yihao Wu | Yi-Wen Chao | Yu Jiang | Yuheng Lu | Guanrou Yang | Xuanchen Li | Hexin Liu | Chunyu Qiang | Cheng Gong | Yifan Yang | Tianchi Liu | Junyu Wang | Nana Hou | Meng Ge | Fuming You | Yang Wei | Zhongqian Sun | Hu Haifeng | Xiaobao Wang | Eng Siong Chng | Xie Chen | Longbiao Wang | Jianwu Dang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Evaluating expressive speech remains challenging, as existing methods mainly assess emotional intensity and overlook whether a speech sample is expressively appropriate for its contextual setting. This limitation hinders reliable evaluation of speech systems used in narrative-driven and interactive applications, such as audiobooks and conversational agents. We introduce CEAEval, a Context-rich framework for Evaluating Expressive Appropriateness in speech, which assesses whether a speech sample expressively aligns with the underlying communicative intent implied by its discourse-level narrative context. To support this task, we construct CEAEval-D, the first context-rich speech dataset with real human performances in Mandarin conversational speech, providing narrative descriptions together with fifteen dimensions of human annotations covering expressive attributes and expressive appropriateness. We further develop CEAEval-M, a model that integrates knowledge distillation, planner-based multi-model collaboration, adaptive audio attention bias, and reinforcement learning to perform context-rich expressive appropriateness evaluation. Experiments on a human-annotated test set demonstrate that CEAEval-M substantially outperforms existing speech evaluation and analysis systems.
SAC: Neural Speech Codec with Semantic-Acoustic Dual-Stream Quantization
Wenxi Chen | Ruiqi Yan | Yushen Chen | Zhikang Niu | Ziyang Ma | Xiquan Li | Yuzhe Liang | Wenhanlin | Shunshun Yin | Ming Tao | Xinsheng Wang | Xie Chen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Wenxi Chen | Ruiqi Yan | Yushen Chen | Zhikang Niu | Ziyang Ma | Xiquan Li | Yuzhe Liang | Wenhanlin | Shunshun Yin | Ming Tao | Xinsheng Wang | Xie Chen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Speech codecs that convert continuous speech signals into discrete tokens have become essential for speech language models. However, existing codecs struggle to balance high-quality reconstruction with semantically rich representations, limiting their effectiveness in both generative and understanding tasks. In this work, we propose SAC, a neural speech codec with semantic-acoustic dual-stream quantization. By disentangling semantic and acoustic modeling into two dedicated streams, SAC enables each to be optimized for its respective role. Comprehensive evaluations show that SAC achieves strong reconstruction performance across diverse bitrates under both clean and noisy conditions, with particularly high scores on UTMOS and WER, indicating superior naturalness and intelligibility. Moreover, SAC substantially surpasses prior codecs in semantic representation, approaching the level of continuous self-supervised embeddings. When used as a tokenizer for LLM-based text-to-speech, SAC enables a single-stage autoregressive (AR) TTS model that clearly outperforms state-of-the-art AR systems. Our disentanglement analysis further validates the effectiveness of the dual-stream design, offering new potential for controllable speech generation.
2025
URO-Bench: Towards Comprehensive Evaluation for End-to-End Spoken Dialogue Models
Ruiqi Yan | Xiquan Li | Wenxi Chen | Zhikang Niu | Chen Yang | Ziyang Ma | Kai Yu | Xie Chen
Findings of the Association for Computational Linguistics: EMNLP 2025
Ruiqi Yan | Xiquan Li | Wenxi Chen | Zhikang Niu | Chen Yang | Ziyang Ma | Kai Yu | Xie Chen
Findings of the Association for Computational Linguistics: EMNLP 2025
Recent advances in large language models (LLMs) have driven significant progress in end-to-end spoken dialogue models (SDMs). In contrast to text-based LLMs, the evaluation framework for SDMs should encompass both cognitive dimensions (e.g., logical reasoning, knowledge) and speech-related aspects (e.g., paralinguistic cues, audio quality). However, there is still a lack of comprehensive evaluations for SDMs in speech-to-speech (S2S) scenarios. To address this gap, we propose **URO-Bench**, an extensive benchmark for SDMs. Notably, URO-Bench is the first S2S benchmark that covers evaluations about multilingualism, multi-round dialogues, and paralinguistics. Our benchmark is divided into two difficulty levels: basic track and pro track, each comprising 20 test sets, evaluating the spoken dialogue model’s abilities in **U**nderstanding, **R**easoning, and **O**ral conversation. Evaluations on our proposed benchmark reveal that current open-source SDMs perform rather well in daily QA tasks, but lag behind their backbone LLMs in terms of instruction-following ability and also suffer from catastrophic forgetting. Their performance in advanced evaluations of paralinguistic information and audio understanding remains subpar, highlighting the need for further research in this direction. We hope that URO-Bench can facilitate the development of spoken dialogue models by providing a multifaceted evaluation of existing models and helping to track progress in this area.
SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training
Wenxi Chen | Ziyang Ma | Ruiqi Yan | Yuzhe Liang | Xiquan Li | Ruiyang Xu | Zhikang Niu | Yanqiao Zhu | Yifan Yang | Zhanxun Liu | Kai Yu | Yuxuan Hu | Jinyu Li | Yan Lu | Shujie Liu | Xie Chen
Findings of the Association for Computational Linguistics: ACL 2025
Wenxi Chen | Ziyang Ma | Ruiqi Yan | Yuzhe Liang | Xiquan Li | Ruiyang Xu | Zhikang Niu | Yanqiao Zhu | Yifan Yang | Zhanxun Liu | Kai Yu | Yuxuan Hu | Jinyu Li | Yan Lu | Shujie Liu | Xie Chen
Findings of the Association for Computational Linguistics: ACL 2025
Recent advancements highlight the potential of end-to-end real-time spoken dialogue systems, showcasing their low latency and high quality. In this paper, we introduce SLAM-Omni, a timbre-controllable, end-to-end voice interaction system with single-stage training. SLAM-Omni achieves zero-shot timbre control by modeling spoken language with semantic tokens and decoupling speaker information to a vocoder. By predicting grouped speech semantic tokens at each step, our method significantly reduces the sequence length of audio tokens, accelerating both training and inference. Additionally, we propose historical text prompting to compress dialogue history, facilitating efficient multi-round interactions. Comprehensive evaluations reveal that SLAM-Omni outperforms prior models of similar scale, requiring only 15 hours of training on 4 GPUs with limited data. Notably, it is the first spoken dialogue system to achieve competitive performance with a single-stage training approach, eliminating the need for pre-training on TTS or ASR tasks. Further experiments validate its multilingual and multi-turn dialogue capabilities on larger datasets.
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
Yushen Chen | Zhikang Niu | Ziyang Ma | Keqi Deng | Chunhui Wang | JianZhao JianZhao | Kai Yu | Xie Chen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Yushen Chen | Zhikang Niu | Ziyang Ma | Keqi Deng | Chunhui Wang | JianZhao JianZhao | Kai Yu | Xie Chen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on flow matching with Diffusion Transformer (DiT). Without requiring complex designs such as duration model, text encoder, and phoneme alignment, the text input is simply padded with filler tokens to the same length as input speech, and then the denoising is performed for speech generation, which was originally proved feasible by E2 TTS. However, the original design of E2 TTS makes it hard to follow due to its slow convergence and low robustness. To address these issues, we first model the input with ConvNeXt to refine the text representation, making it easy to align with the speech. We further propose an inference-time Sway Sampling strategy, which significantly improves our model’s performance and efficiency. This sampling strategy for flow step can be easily applied to existing flow matching based models without retraining. Our design allows faster training and achieves an inference RTF of 0.15, which is greatly improved compared to state-of-the-art diffusion-based TTS models. Trained on a public 100K hours multilingual dataset, our F5-TTS exhibits highly natural and expressive zero-shot ability, seamless code-switching capability, and speed control efficiency. We have released all codes and checkpoints to promote community development, at https://SWivid.github.io/F5-TTS/.
Search
Fix author
Co-authors
- Xie Chen 6
- Wenxi Chen 4
- Xiquan Li 4
- Yuzhe Liang 3
- Ziyang Ma 3
- Ruiqi Yan 3
- Kai Yu 3
- Yushen Chen 2
- Ziyang Ma 2
- Yifan Yang 2
- Yi-Wen Chao 1
- Eng Siong Chng 1
- Jianwu Dang 1
- Keqi Deng 1
- Meng Ge 1
- Cheng Gong 1
- Hu Haifeng 1
- Nana Hou 1
- Yuxuan Hu 1
- Zikang Huang 1
- JianZhao JianZhao 1
- Yu Jiang 1
- Jinyu Li 1
- Xuanchen Li 1
- Hexin Liu 1
- Junxi Liu 1
- Shujie Liu 1
- Tianchi Liu 1
- Zhanxun Liu 1
- Yan Lu 1
- Yuheng Lu 1
- Yizhou Peng 1
- Chunyu Qiang 1
- Zhongqian Sun 1
- Ming Tao 1
- Chunhui Wang 1
- Haoyu Wang 1
- Junyu Wang 1
- Longbiao Wang 1
- Tianrui Wang 1
- Xiaobao Wang 1
- Xinsheng Wang 1
- Yang Wei 1
- Wenhanlin 1
- Yihao Wu 1
- Ruiyang Xu 1
- Chen Yang 1
- Guanrou Yang 1
- Shunshun Yin 1
- Fuming You 1
- Yanqiao Zhu 1