CoT-VTM: Visual-to-Music Generation with Chain-of-Thought Reasoning

Xikang Guan; Zheng Gu; Jing Huo; Tianyu Ding; Yang Gao (扬 高)

Abstract

The application of visual-to-music generation (VTM) is rapidly growing. However, current VTM methods struggle with capturing the relationship between visuals and music in open-domain settings, mainly due to two challenges: the lack of large-scale, high-quality visual-music paired datasets and the absence of direct semantic correspondence between visuals and music. In this work, we propose CoT-VTM, a framework that distills Chain-of-Thought (CoT) reasoning to enable visual-to-music generation without paired data, while efficiently producing music aligned with visual content in open-domain settings. We first bridge the gap between visual, music, and text data using appropriate foundation models. Next, we identify key elements of the visual-music relationship and design a CoT prompt for visual-to-music mapping. To fully distill the reasoning of CoT, we incorporate latent information from intermediate reasoning steps as supervisory signals alongside visual and music supervision. Finally, we design a two-stage mapping distillation training process: the first stage uses discriminative MLP modules, while the second uses a generative embedding diffusion model (EDM). Our model achieves optimal performance on both image-to-music and video-to-music tasks. Project page: https://xxkkxxx.github.io/cot-vtm/

Anthology ID: 2025.findings-acl.647
Volume: Findings of the Association for Computational Linguistics: ACL 2025
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 12493–12510
URL: https://preview.aclanthology.org/landing_page/2025.findings-acl.647/
Cite (ACL): Xikang Guan, Zheng Gu, Jing Huo, Tianyu Ding, and Yang Gao. 2025. CoT-VTM: Visual-to-Music Generation with Chain-of-Thought Reasoning. In Findings of the Association for Computational Linguistics: ACL 2025, pages 12493–12510, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): CoT-VTM: Visual-to-Music Generation with Chain-of-Thought Reasoning (Guan et al., Findings 2025)
PDF: https://preview.aclanthology.org/landing_page/2025.findings-acl.647.pdf
