Multimodal Large Language Models for Multi-Subject In-Context Image Generation

Yucheng Zhou, Dubing Chen, Huan Zheng, Jianbing Shen


Abstract
Recent advances in text-to-image (T2I) generation have enabled visually coherent image synthesis from text descriptions, but generating images containing multiple given subjects remains challenging. As the number of reference identities increases, existing methods often suffer from missing subjects and semantic drift. To address this problem, we propose MUSIC, the first MLLM specifically designed for MUlti-Subject In-Context image generation. To overcome data scarcity, we introduce an automatic and scalable data generation pipeline that eliminates the need for manual annotation. Furthermore, we enhance the model’s understanding of multi-subject semantic relationships through a vision chain-of-thought (CoT) mechanism that guides step-by-step reasoning from subject images to semantics and then to generation. To mitigate identity entanglement and manage visual complexity, we develop a novel semantics-driven spatial layout planning method and demonstrate its test-time scalability. By incorporating complex subject images during training, we improve the model’s capacity for chained reasoning. In addition, we curate MSIC, a new benchmark tailored for multi-subject in-context generation. Experimental results demonstrate that MUSIC significantly surpasses other methods in both multi- and single-subject scenarios.
Anthology ID:
2026.acl-long.1518
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
32880–32898
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1518/
Cite (ACL):
Yucheng Zhou, Dubing Chen, Huan Zheng, and Jianbing Shen. 2026. Multimodal Large Language Models for Multi-Subject In-Context Image Generation. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 32880–32898, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Multimodal Large Language Models for Multi-Subject In-Context Image Generation (Zhou et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1518.pdf
Checklist:
2026.acl-long.1518.checklist.pdf