Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt

Yongqi Wang, Ruofan Hu, Rongjie Huang, Zhiqing Hong, Ruiqi Li, Wenrui Liu, Fuming You, Tao Jin, Zhou Zhao


Abstract
Recent singing-voice-synthesis (SVS) methods have achieved remarkable audio quality and naturalness, yet they lack the capability to control the style attributes of the synthesized singing explicitly. We propose Prompt-Singer, the first SVS method that enables attribute controlling on singer gender, vocal range and volume with natural language. We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation that enables text-conditioned vocal range control while keeping melodic accuracy. Furthermore, we explore various experiment settings, including different types of text representations, text encoder fine-tuning, and introducing speech data to alleviate data scarcity, aiming to facilitate further research. Experiments show that our model achieves favorable controlling ability and audio quality. Audio samples are available at http://prompt-singer.github.io .
Anthology ID:
2024.naacl-long.268
Original:
2024.naacl-long.268v1
Version 2:
2024.naacl-long.268v2
Volume:
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Kevin Duh, Helena Gomez, Steven Bethard
Venue:
NAACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
4780–4794
Language:
URL:
https://preview.aclanthology.org/sigedu-bea-out-of-sync-correction/2024.naacl-long.268/
DOI:
10.18653/v1/2024.naacl-long.268
Bibkey:
Cite (ACL):
Yongqi Wang, Ruofan Hu, Rongjie Huang, Zhiqing Hong, Ruiqi Li, Wenrui Liu, Fuming You, Tao Jin, and Zhou Zhao. 2024. Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 4780–4794, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt (Wang et al., NAACL 2024)
Copy Citation:
PDF:
https://preview.aclanthology.org/sigedu-bea-out-of-sync-correction/2024.naacl-long.268.pdf
Video:
 https://preview.aclanthology.org/sigedu-bea-out-of-sync-correction/2024.naacl-long.268.mp4