InstructAny2Pix: Image Editing with Multi-Modal Prompts

Shufan Li; Harkanwar Singh; Aditya Grover

doi:10.18653/v1/2025.findings-naacl.36

Abstract

Image editing has made incredible progress in recent years. The earliest work supported only caption-guided editing. More recently, free-form text instructions and reference images have been incorporated to allow more flexibility. However, existing methods still struggle with complicated editing instructions involving multiple objects or reference images. We present InstructAny2Pix, a novel image editing model that leverages a multi-modal LLM to execute complicated edit instructions. Compared with previous works, InstructAny2Pix extends the flexibility of edit instructions in three ways: first, it can perform complex instructions involving multiple object edits; second, it supports interleaving text instructions with multiple reference images; third, it supports audio and music inputs as part of edit prompts, unlocking many creative applications, such as album cover generation and music-inspired merchandise design. To evaluate the effectiveness of InstructAny2Pix, we propose two new benchmark datasets, MM-Inst and Dream-booth++, consisting of human-written, multi-modal prompts. InstructAny2Pix outperforms baselines on these two proposed multi-modal benchmarks, as well as on conventional image editing benchmarks such as InstructPix2Pix.

Anthology ID: 2025.findings-naacl.36
Volume: Findings of the Association for Computational Linguistics: NAACL 2025
Month: April
Year: 2025
Address: Albuquerque, New Mexico
Editors: Luis Chiruzzo, Alan Ritter, Lu Wang
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 594–619
URL: https://preview.aclanthology.org/corrections-2025-06/2025.findings-naacl.36/
DOI: 10.18653/v1/2025.findings-naacl.36
Cite (ACL): Shufan Li, Harkanwar Singh, and Aditya Grover. 2025. InstructAny2Pix: Image Editing with Multi-Modal Prompts. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 594–619, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal): InstructAny2Pix: Image Editing with Multi-Modal Prompts (Li et al., Findings 2025)
PDF: https://preview.aclanthology.org/corrections-2025-06/2025.findings-naacl.36.pdf
