InstructAny2Pix: Image Editing with Multi-Modal Prompts

Shufan Li, Harkanwar Singh, Aditya Grover


Abstract
Image editing has made incredible progress in recent years. The earliest work supported only caption-guided editing. More recently, free-form text instructions and reference images have been incorporated to allow more flexibility. However, existing methods still struggle with complicated editing instructions involving multiple objects or reference images. We present InstructAny2Pix, a novel image editing model that leverages a multi-modal LLM to execute complicated edit instructions. Compared with previous works, InstructAny2Pix extends the flexibility of edit instructions in three ways: first, it can perform complex instructions involving multiple object edits; second, it supports interleaving text instructions with multiple reference images; third, it supports audio and music inputs as part of edit prompts, unlocking many creative applications, such as album cover generation and music-inspired merchandise design. To evaluate the effectiveness of InstructAny2Pix, we propose two new benchmark datasets, MM-Inst and Dream-booth++, consisting of human-written multi-modal prompts. InstructAny2Pix outperforms baselines on these two proposed multi-modal benchmarks, as well as on conventional image editing benchmarks such as InstructPix2Pix.
Anthology ID:
2025.findings-naacl.36
Volume:
Findings of the Association for Computational Linguistics: NAACL 2025
Month:
April
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Luis Chiruzzo, Alan Ritter, Lu Wang
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
594–619
URL:
https://preview.aclanthology.org/corrections-2025-06/2025.findings-naacl.36/
DOI:
10.18653/v1/2025.findings-naacl.36
Cite (ACL):
Shufan Li, Harkanwar Singh, and Aditya Grover. 2025. InstructAny2Pix: Image Editing with Multi-Modal Prompts. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 594–619, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
InstructAny2Pix: Image Editing with Multi-Modal Prompts (Li et al., Findings 2025)
PDF:
https://preview.aclanthology.org/corrections-2025-06/2025.findings-naacl.36.pdf