Enhancing User-Controlled Text-to-Image Generation with Layout-Aware Personalization

Hongliang Luo, Wei Xi


Abstract
Recent diffusion-based models have advanced text-to-image synthesis, yet they struggle to preserve fine visual details and to provide precise spatial control in personalized content. We propose **LayoutFlex**, a novel framework that combines a Perspective-Adaptive Feature Extraction system with a Spatial Control Mechanism. Our approach captures fine-grained details via cross-modal representation learning and attention refinement, and it enables precise subject placement through coordinate-aware attention and region-constrained optimization. Experiments show that LayoutFlex outperforms prior methods in visual fidelity (DINO 10.8%) and spatial accuracy (AP 43.1±1.2 vs. 19.3). LayoutFlex supports both single- and multi-subject personalization, offering a solution for controllable and coherent image generation in creative and interactive applications.
Anthology ID:
2025.acl-long.1556
Volume:
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
32349–32364
URL:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1556/
Cite (ACL):
Hongliang Luo and Wei Xi. 2025. Enhancing User-Controlled Text-to-Image Generation with Layout-Aware Personalization. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 32349–32364, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Enhancing User-Controlled Text-to-Image Generation with Layout-Aware Personalization (Luo & Xi, ACL 2025)
PDF:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1556.pdf