AlignSTS: Speech-to-Singing Conversion via Cross-Modal Alignment

Ruiqi Li; Rongjie Huang; Lichao Zhang; Jinglin Liu; Zhou Zhao

doi:10.18653/v1/2023.findings-acl.442

Abstract

The speech-to-singing (STS) voice conversion task aims to generate singing samples corresponding to speech recordings, but it faces a major challenge: the alignment between the target (singing) pitch contour and the source (speech) content is difficult to learn in a text-free situation. This paper proposes AlignSTS, an STS model based on explicit cross-modal alignment, which views speech variance such as pitch and content as different modalities. Inspired by how humans sing lyrics to a melody, AlignSTS: 1) adopts a novel rhythm adaptor to predict the target rhythm representation and bridge the modality gap between content and pitch, where the rhythm representation is computed in a simple yet effective way and is quantized into a discrete space; and 2) uses the predicted rhythm representation to re-align the content based on cross-attention and conducts a cross-modal fusion for re-synthesis. Extensive experiments show that AlignSTS achieves superior performance in terms of both objective and subjective metrics. Audio samples are available at https://alignsts.github.io.
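
The two steps named in the abstract (quantizing a rhythm representation into a discrete space, then re-aligning content with cross-attention) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the random features, the uniform binning in `quantize`, the `codebook` embedding table, and all sizes are assumptions chosen only to show the shape of the computation.

```python
import numpy as np

def quantize(values, num_bins):
    """Map continuous values in [0, 1] to integer bin indices in [0, num_bins - 1]."""
    values = np.clip(values, 0.0, 1.0)
    return np.minimum((values * num_bins).astype(int), num_bins - 1)

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query frame gets a weighted mix of value frames."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability for softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights

rng = np.random.default_rng(0)
num_bins, dim = 8, 16              # hypothetical sizes
speech_len, singing_len = 20, 50   # hypothetical frame counts

content = rng.normal(size=(speech_len, dim))   # stand-in for speech content features
rhythm = rng.uniform(size=singing_len)         # stand-in for a predicted rhythm curve
codebook = rng.normal(size=(num_bins, dim))    # hypothetical embedding table for rhythm bins

rhythm_ids = quantize(rhythm, num_bins)        # discrete rhythm tokens, one per target frame
rhythm_emb = codebook[rhythm_ids]              # shape (singing_len, dim)

# Rhythm embeddings act as queries over the speech content, so the output has
# one content vector per target (singing) frame.
aligned, weights = cross_attention(rhythm_emb, content, content)
```

Here `aligned` has one row per target frame, which is the sense in which the content is "re-aligned" to the singing timeline; a real system would learn the codebook and attention projections rather than draw them at random.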

Anthology ID: 2023.findings-acl.442
Volume: Findings of the Association for Computational Linguistics: ACL 2023
Month: July
Year: 2023
Address: Toronto, Canada
Editors: Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 7074–7088
URL: https://aclanthology.org/2023.findings-acl.442
DOI: 10.18653/v1/2023.findings-acl.442
Cite (ACL): Ruiqi Li, Rongjie Huang, Lichao Zhang, Jinglin Liu, and Zhou Zhao. 2023. AlignSTS: Speech-to-Singing Conversion via Cross-Modal Alignment. In Findings of the Association for Computational Linguistics: ACL 2023, pages 7074–7088, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal): AlignSTS: Speech-to-Singing Conversion via Cross-Modal Alignment (Li et al., Findings 2023)
PDF: https://preview.aclanthology.org/improve-issue-templates/2023.findings-acl.442.pdf
