Overcoming Source Object Grounding for Semantic Image Editing

Yeonjoon Jung, Seungtaek Choi, Seung-won Hwang


Abstract
Recent diffusion models have demonstrated remarkable capabilities in text-to-image generation. However, their stochastic denoising process often causes semantic image editing (SIE) models to misapply textual instructions: models often leave the source object unchanged or erroneously alter the background. We refer to this challenge as source object grounding. To address it, we introduce R-SIE, a region-wise SIE framework. During inference, R-SIE models noise separately for distinct image regions, enabling precise control over the transformed areas. To reinforce this inference procedure, we devise an automatic pipeline that leverages bounding boxes to generate unambiguous training data. Additionally, we propose two region-focused metrics, CLIP-Region Class (CLIP-RC) and CLIP-Global Context (CLIP-GC), which independently assess how well the source object is edited and how well the background is preserved, respectively. Experimental results demonstrate that region-wise diffusion improves existing baselines, and our data generation pipeline further enhances these improvements.
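
The region-wise idea in the abstract can be made concrete with a short sketch. The snippet below is a hedged illustration, not the authors' R-SIE code (which is not reproduced on this page): it assumes a diffusers-style UNet and scheduler, a binary mask marking the source-object region, and precomputed text embeddings (edit_emb for the edit instruction, src_emb for the source caption, both hypothetical names), and simply composes the two noise predictions with the mask at every denoising step.

# Illustrative sketch of region-wise noise blending for semantic image
# editing. NOT the paper's R-SIE implementation; `unet`, `scheduler`,
# `edit_emb`, and `src_emb` are assumed stand-ins for a standard
# diffusers-style latent-diffusion setup.
import torch

@torch.no_grad()
def region_wise_denoise(unet, scheduler, x_t, mask, edit_emb, src_emb,
                        num_steps=50):
    """Reverse-diffusion loop that models noise separately per region.

    x_t      : noisy latent, shape (B, C, H, W)
    mask     : binary mask broadcastable to x_t, 1 inside the region to edit
    edit_emb : text embedding of the edit instruction
    src_emb  : text embedding describing the source image
    """
    scheduler.set_timesteps(num_steps)
    for t in scheduler.timesteps:
        # Predict noise under the edit instruction and under the source
        # description, then compose them region-wise: the edit prediction
        # drives the masked object region, while the source prediction
        # keeps the background on its reconstruction trajectory.
        eps_edit = unet(x_t, t, encoder_hidden_states=edit_emb).sample
        eps_src = unet(x_t, t, encoder_hidden_states=src_emb).sample
        eps = mask * eps_edit + (1.0 - mask) * eps_src
        x_t = scheduler.step(eps, t, x_t).prev_sample
    return x_t

Under this reading, the mask is what separates the two failure modes named in the abstract: an edit that ignores the masked region leaves the source object unchanged, while one that leaks outside it alters the background, which is exactly what the region-focused CLIP-RC and CLIP-GC metrics are designed to measure separately.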
Anthology ID:
2025.tacl-1.54
Volume:
Transactions of the Association for Computational Linguistics, Volume 13
Year:
2025
Address:
Cambridge, MA
Venue:
TACL
Publisher:
MIT Press
Pages:
1171–1185
URL:
https://preview.aclanthology.org/fix-opsupmap-display/2025.tacl-1.54/
DOI:
10.1162/tacl.a.34
Cite (ACL):
Yeonjoon Jung, Seungtaek Choi, and Seung-won Hwang. 2025. Overcoming Source Object Grounding for Semantic Image Editing. Transactions of the Association for Computational Linguistics, 13:1171–1185.
Cite (Informal):
Overcoming Source Object Grounding for Semantic Image Editing (Jung et al., TACL 2025)
PDF:
https://preview.aclanthology.org/fix-opsupmap-display/2025.tacl-1.54.pdf