CoT-VTM: Visual-to-Music Generation with Chain-of-Thought Reasoning

Xikang Guan, Zheng Gu, Jing Huo, Tianyu Ding, Yang Gao


Abstract
The application of visual-to-music generation (VTM) is rapidly growing. However, current VTM methods struggle with capturing the relationship between visuals and music in open-domain settings, mainly due to two challenges: the lack of large-scale, high-quality visual-music paired datasets and the absence of direct semantic correspondence between visuals and music. In this work, we propose CoT-VTM, a framework that distills Chain-of-Thought (CoT) reasoning to enable visual-to-music generation without paired data, while efficiently producing music aligned with visual content in open-domain settings. We first bridge the gap between visual, music, and text data using appropriate foundation models. Next, we identify key elements of the visual-music relationship and design a CoT prompt for visual-to-music mapping. To fully distill the reasoning of CoT, we incorporate latent information from intermediate reasoning steps as supervisory signals alongside visual and music supervision. Finally, we design a two-stage mapping distillation training process: the first stage uses discriminative MLP modules, while the second uses a generative embedding diffusion model (EDM). Our model achieves optimal performance on both image-to-music and video-to-music tasks. Project page: https://xxkkxxx.github.io/cot-vtm/
Anthology ID:
2025.findings-acl.647
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
12493–12510
URL:
https://preview.aclanthology.org/landing_page/2025.findings-acl.647/
Cite (ACL):
Xikang Guan, Zheng Gu, Jing Huo, Tianyu Ding, and Yang Gao. 2025. CoT-VTM: Visual-to-Music Generation with Chain-of-Thought Reasoning. In Findings of the Association for Computational Linguistics: ACL 2025, pages 12493–12510, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
CoT-VTM: Visual-to-Music Generation with Chain-of-Thought Reasoning (Guan et al., Findings 2025)
PDF:
https://preview.aclanthology.org/landing_page/2025.findings-acl.647.pdf