UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision
Zhen Fang, Ruiyan Han, XinYu Sun, Yuchen Ma, Ziheng Wang, Yu Zeng, Zehui Chen, Lin Chen, Wenxuan Huang, Wei-Jie Xu, Yi Cao, Feng Zhao
Abstract
While Unified Multimodal Models (UMMs) have achieved remarkable success in cross-modal comprehension, a significant gap persists in their ability to leverage such internal knowledge for high-quality generation. We formalize this discrepancy as Conduction Aphasia, a phenomenon where models accurately interpret multimodal inputs but struggle to translate that understanding into faithful and controllable synthesis. To address this, we propose UniCorn, a simple yet elegant self-improvement framework that eliminates the need for external data or teacher supervision. By partitioning a single UMM into three collaborative roles: Proposer, Solver, and Judge, UniCorn generates high-quality interactions via self-play and employs cognitive pattern reconstruction to distill latent understanding into explicit generative signals. To validate the restoration of multimodal coherence, we introduce UniCycle, a cycle-consistency benchmark based on a Text to Image to Text reconstruction loop. Extensive experiments demonstrate that UniCorn achieves comprehensive and substantial improvements over the base model across six general image generation benchmarks. Notably, it achieves SOTA performance on TIIF(73.8), DPG(86.8), CompBench(88.5), and UniCycle while further delivering substantial gains of +5.0 on WISE and +6.5 on OneIG. These results highlight that our method significantly enhances T2I generation while maintaining robust comprehension, demonstrating the scalability of fully self-supervised refinement for unified multimodal intelligence.- Anthology ID:
- 2026.acl-long.621
- Volume:
- Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- ACL
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 13638–13669
- Language:
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.acl-long.621/
- DOI:
- Cite (ACL):
- Zhen Fang, Ruiyan Han, XinYu Sun, Yuchen Ma, Ziheng Wang, Yu Zeng, Zehui Chen, Lin Chen, Wenxuan Huang, Wei-Jie Xu, Yi Cao, and Feng Zhao. 2026. UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13638–13669, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision (Fang et al., ACL 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.acl-long.621.pdf