Cong Han


2025

StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion
Yinghao Aaron Li | Xilin Jiang | Cong Han | Nima Mesgarani
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

The rapid development of large-scale text-to-speech (TTS) models has led to significant advancements in modeling diverse speaker prosody and voices. However, these models often face issues such as slow inference speeds, reliance on complex pre-trained neural codec representations, and difficulties in achieving naturalness and high similarity to reference speakers. To address these challenges, this work introduces StyleTTS-ZS, an efficient zero-shot TTS model that leverages distilled time-varying style diffusion to capture diverse speaker identities and prosodies. We propose a novel approach that represents human speech using input text and fixed-length time-varying discrete style codes to capture diverse prosodic variations, trained adversarially with multi-modal discriminators. A diffusion model is then built to sample this time-varying style code for efficient latent diffusion. Using classifier-free guidance, StyleTTS-ZS achieves high similarity to the reference speaker in the style diffusion process. Furthermore, to expedite sampling, the style diffusion model is distilled with perceptual loss using only 10k samples, maintaining speech quality and similarity while reducing inference time by 90%. Our model surpasses previous state-of-the-art large-scale zero-shot TTS models in both naturalness and similarity, offering 10-20× faster sampling, making it an attractive alternative for efficient large-scale zero-shot TTS systems. The audio demo, code, and models are available at https://styletts-zs.github.io/.
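
The classifier-free guidance step mentioned in the abstract can be illustrated with a minimal sketch: a denoiser is queried once with the reference-speaker condition and once with a "null" condition, and the two noise estimates are blended. Module names, dimensions, and the null-condition choice below are illustrative assumptions, not the authors' released code.

```python
# Sketch: classifier-free guidance for sampling a fixed-length style code
# conditioned on a reference-speaker embedding (illustrative, not official code).
import torch
import torch.nn as nn


class StyleDenoiser(nn.Module):
    """Toy denoiser: predicts noise for a style code given timestep and reference."""

    def __init__(self, style_dim: int = 128, ref_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(style_dim + ref_dim + 1, 512),
            nn.SiLU(),
            nn.Linear(512, style_dim),
        )

    def forward(self, x_t, t, ref):
        # x_t: (B, style_dim) noisy style code, t: (B, 1) timestep, ref: (B, ref_dim)
        return self.net(torch.cat([x_t, t, ref], dim=-1))


def guided_eps(model, x_t, t, ref, guidance_scale: float = 3.0):
    """Classifier-free guidance: blend conditional and unconditional noise estimates."""
    eps_cond = model(x_t, t, ref)                      # conditioned on reference speaker
    eps_uncond = model(x_t, t, torch.zeros_like(ref))  # null (dropped) condition
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)


if __name__ == "__main__":
    model = StyleDenoiser()
    x_t = torch.randn(2, 128)      # noisy style codes
    t = torch.full((2, 1), 0.5)    # diffusion timestep
    ref = torch.randn(2, 256)      # reference-speaker embeddings
    print(guided_eps(model, x_t, t, ref).shape)  # torch.Size([2, 128])
```

A larger guidance scale pushes the sampled style code closer to the reference speaker at the cost of diversity; distillation then compresses the iterative sampler so that few steps are needed at inference.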

2022

Improving Conversational Recommendation Systems’ Quality with Context-Aware Item Meta-Information
Bowen Yang | Cong Han | Yu Li | Lei Zuo | Zhou Yu
Findings of the Association for Computational Linguistics: NAACL 2022

A key challenge of Conversational Recommendation Systems (CRS) is to integrate the recommendation function and the dialog generation function smoothly. Previous works employ graph neural networks with external knowledge graphs (KG) to model individual recommendation items and integrate KGs with language models through attention mechanisms for response generation. Although these approaches have proven effective, there is still room for improvement. For example, KG-based approaches rely only on entity relations and bag-of-words features to recommend items and neglect the information in the conversational context. We propose to improve the usage of dialog context for both recommendation and response generation through an encoding architecture combined with the self-attention mechanism of transformers. In this paper, we propose a simple yet effective architecture comprising a pre-trained language model (PLM) and an item metadata encoder to better integrate recommendation and dialog generation. The proposed item encoder learns to map item metadata to embeddings that reflect the rich information of the item and can be matched with the dialog context. The PLM then consumes the context-aware item embeddings and dialog context to generate high-quality recommendations and responses. Experimental results on the benchmark dataset ReDial show that our model obtains state-of-the-art results on both recommendation and response generation tasks.
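
The matching between item-metadata embeddings and the dialog context described above can be sketched roughly as follows. The encoder, pooling strategy, dimensions, and scoring function here are illustrative assumptions for exposition, not the paper's released implementation.

```python
# Sketch: scoring candidate items by matching metadata embeddings against a
# pooled dialog-context vector (illustrative, not the paper's official code).
import torch
import torch.nn as nn


class ItemMetadataEncoder(nn.Module):
    """Encodes an item's metadata tokens into a single item embedding."""

    def __init__(self, vocab_size: int = 30000, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.proj = nn.Linear(hidden, hidden)

    def forward(self, meta_ids):
        # meta_ids: (num_items, meta_len) token ids of each item's metadata
        pooled = self.embed(meta_ids).mean(dim=1)   # simple mean pooling
        return self.proj(pooled)                    # (num_items, hidden)


def recommend(item_encoder, meta_ids, context_vec):
    """Return a distribution over candidate items given the dialog context."""
    item_emb = item_encoder(meta_ids)               # (num_items, hidden)
    scores = item_emb @ context_vec                 # dot-product matching
    return scores.softmax(dim=-1)


if __name__ == "__main__":
    enc = ItemMetadataEncoder()
    meta_ids = torch.randint(0, 30000, (5, 12))     # 5 candidate items, 12 metadata tokens each
    context_vec = torch.randn(256)                  # pooled dialog-context embedding (e.g., from a PLM)
    print(recommend(enc, meta_ids, context_vec))    # distribution over the 5 items
```

In the full model, the same context-aware item embeddings would also be fed to the PLM alongside the dialog history so that recommendation and response generation share one representation of the items.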