PyramidCodec: Hierarchical Codec for Long-form Music Generation in Audio Domain

Jianyi Chen, Zheqi Dai, Zhen Ye, Xu Tan, Qifeng Liu, Yike Guo, Wei Xue


Abstract
Generating well-structured long music compositions, spanning several minutes, remains a challenge due to inefficient representation and the lack of structured representation. In this paper, we propose PyramidCodec, a hierarchical discrete representation of audio, for long audio-domain music generation. Specifically, we employ residual vector quantization on different levels of features to obtain the hierarchical discrete representation. The highest level of features has the largest hop size, resulting in the most compact token sequence. The quantized higher-level representation is up-sampled and combined with lower-level features to apply residual vector quantization and obtain lower-level discrete representations. Furthermore, we design a hierarchical training strategy to ensure that the details are gradually added with more levels of tokens. By performing hierarchical tokenization, the overall token sequence represents information at various scales, facilitating long-context modeling in music and enabling the generation of well-structured compositions. The experimental results demonstrate that our proposed PyramidCodec achieves competitive performance in terms of reconstruction quality and token per second (TPS). By enabling ultra-long music modeling at the lowest level, the proposed approach facilitates training a language model that can generate well-structured long-form music for up to 3 minutes, whose quality is further demonstrated by subjective and objective evaluations. The samples can be found at https://pyramidcodec.github.io/.
Anthology ID:
2024.findings-emnlp.246
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
4253–4263
Language:
URL:
https://preview.aclanthology.org/fix-sig-urls/2024.findings-emnlp.246/
DOI:
10.18653/v1/2024.findings-emnlp.246
Bibkey:
Cite (ACL):
Jianyi Chen, Zheqi Dai, Zhen Ye, Xu Tan, Qifeng Liu, Yike Guo, and Wei Xue. 2024. PyramidCodec: Hierarchical Codec for Long-form Music Generation in Audio Domain. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 4253–4263, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
PyramidCodec: Hierarchical Codec for Long-form Music Generation in Audio Domain (Chen et al., Findings 2024)
Copy Citation:
PDF:
https://preview.aclanthology.org/fix-sig-urls/2024.findings-emnlp.246.pdf