STARS: A Unified Framework for Singing Transcription, Alignment, and Refined Style Annotation

Wenxiang Guo; Yu Zhang; Changhao Pan; Zhiyuan Zhu; Ruiqi Li; ZheTao Chen; Wenhao Xu; Fei Wu; Zhou Zhao

Abstract

Recent breakthroughs in singing voice synthesis (SVS) have heightened the demand for high-quality annotated datasets, yet manual annotation remains prohibitively labor- and resource-intensive. Existing automatic singing annotation (ASA) methods, however, primarily tackle isolated aspects of the annotation pipeline. To address this gap, we present STARS, which is, to our knowledge, the first unified framework that simultaneously addresses singing transcription, alignment, and refined style annotation. The framework delivers multi-level annotations comprising: (1) precise phoneme-audio alignment, (2) robust note transcription and temporal localization, (3) expressive vocal technique identification, and (4) global stylistic characterization, including emotion and pace. The proposed architecture processes acoustic features hierarchically across the frame, word, phoneme, note, and sentence levels, and its novel non-autoregressive local acoustic encoders enable structured hierarchical representation learning. Experiments show that the framework outperforms existing annotation approaches across multiple evaluation dimensions. Furthermore, applications in SVS training demonstrate that models trained on STARS-annotated data achieve significantly enhanced perceptual naturalness and precise style control. This work not only addresses critical scalability challenges in the creation of singing datasets but also opens new directions for controllable singing voice synthesis.

Anthology ID: 2025.findings-acl.781
Volume: Findings of the Association for Computational Linguistics: ACL 2025
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 15081–15093
URL: https://preview.aclanthology.org/display_plenaries/2025.findings-acl.781/
Cite (ACL): Wenxiang Guo, Yu Zhang, Changhao Pan, Zhiyuan Zhu, Ruiqi Li, ZheTao Chen, Wenhao Xu, Fei Wu, and Zhou Zhao. 2025. STARS: A Unified Framework for Singing Transcription, Alignment, and Refined Style Annotation. In Findings of the Association for Computational Linguistics: ACL 2025, pages 15081–15093, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): STARS: A Unified Framework for Singing Transcription, Alignment, and Refined Style Annotation (Guo et al., Findings 2025)
PDF: https://preview.aclanthology.org/display_plenaries/2025.findings-acl.781.pdf
