Cheng Yu
2026
Unified Thinker: A General Reasoning Core for Image Generation
Sashuai Zhou | Qiang Zhou | Jijin Hu | Hanqing Yang | Yue Cao | Junpeng Ma | Yinchao Ma | Jun Song | Tiezheng Ge | Cheng Yu | Bo Zheng | Zhou Zhao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Despite impressive progress in high-fidelity image synthesis, generative models still struggle with logic-intensive instruction following, exposing a persistent reasoning–execution gap. Meanwhile, closed-source systems (e.g., Nano Banana) have demonstrated strong reasoning-driven image generation, highlighting a substantial gap to current open-source models. We argue that closing this gap requires not merely better visual generators, but executable reasoning: decomposing high-level intents into grounded, verifiable plans that directly steer the generative process. To this end, we propose Unified Thinker, a task-agnostic reasoning architecture for general image generation, designed as a unified planning core that can plug into diverse generators and workflows. Unified Thinker decouples a dedicated Thinker from the image Generator, enabling modular upgrades of reasoning without retraining the entire generative model. We further introduce a two-stage training paradigm: we first build a structured planning interface for the Thinker, then apply reinforcement learning to ground its policy in pixel-level feedback, encouraging plans that optimize visual correctness over textual plausibility. Extensive experiments on text-to-image generation and image editing show that Unified Thinker substantially improves image reasoning and generation quality.
2025
Language Model Based Text-to-Audio Generation: Anti-Causally Aligned Collaborative Residual Transformers
Juncheng Wang | Chao Xu | Cheng Yu | Zhe Hu | Haoyu Xie | Guoqi Yu | Lei Shang | Shujun Wang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
While language models (LMs) paired with residual vector quantization (RVQ) tokenizers have shown promise in text-to-audio (T2A) generation, they still lag behind diffusion-based models by a non-trivial margin. We identify a critical dilemma underpinning this gap: incorporating more RVQ layers improves audio reconstruction fidelity but exceeds the generation capacity of conventional LMs. To address this, we first analyze RVQ dynamics and uncover two key limitations: 1) orthogonality of features across RVQ layers hinders effective LM training, and 2) descending semantic richness in tokens from deeper RVQ layers exacerbates exposure bias during autoregressive decoding. Based on these insights, we propose Siren, a novel LM-based framework that employs multiple isolated transformers with causal conditioning and anti-causal alignment via reinforcement learning. Extensive experiments demonstrate that Siren outperforms both existing LM-based and diffusion-based T2A systems, achieving state-of-the-art results. By bridging the representational strengths of LMs with the fidelity demands of audio synthesis, our approach repositions LMs as competitive contenders against diffusion models in T2A tasks. Moreover, by aligning audio representations with linguistic structures, Siren opens a promising pathway toward unified multi-modal generation frameworks.
2022
Cross-Utterance Conditioned VAE for Non-Autoregressive Text-to-Speech
Yang Li | Cheng Yu | Guangzhi Sun | Hua Jiang | Fanglei Sun | Weiqin Zu | Ying Wen | Yang Yang | Jun Wang
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Modelling prosody variation is critical for synthesizing natural and expressive speech in end-to-end text-to-speech (TTS) systems. In this paper, a cross-utterance conditional VAE (CUC-VAE) is proposed to estimate a posterior probability distribution of the latent prosody features for each phoneme by conditioning on acoustic features, speaker information, and text features obtained from both past and future sentences. At inference time, instead of the standard Gaussian distribution used by VAE, CUC-VAE samples from an utterance-specific prior distribution conditioned on cross-utterance information, so that the prosody features generated by the TTS system are related to the context, which is more similar to how humans naturally produce prosody. The performance of CUC-VAE is evaluated via a qualitative listening test for naturalness, intelligibility and quantitative measurements, including word error rates and the standard deviation of prosody attributes. Experimental results on LJ-Speech and LibriTTS data show that the proposed CUC-VAE TTS system improves naturalness and prosody diversity with clear margins.
2016
Pairwise FastText Classifier for Entity Disambiguation
Cheng Yu | Bing Chu | Rohit Ram | James Aichinger | Lizhen Qu | Hanna Suominen
Proceedings of the Australasian Language Technology Association Workshop 2016
2014
Co-authors
- James Aichinger 1
- Yue Cao 1
- Bing Chu 1
- Tiezheng Ge 1
- Jijin Hu 1
- Zhe Hu 1
- Hua Jiang 1
- Yang Li 1
- Chunhua Liu 1
- Junpeng Ma 1
- Yinchao Ma 1
- Lizhen Qu 1
- Qin Qu 1
- Rohit Ram 1
- Lei Shang 1
- Jun Song 1
- Fanglei Sun 1
- Guangzhi Sun 1
- Hanna Suominen 1
- Gongbo Tang 1
- Yue Tian 1
- Jun Wang 1
- Juncheng Wang 1
- Shujun Wang 1
- Ying Wen 1
- Haoyu Xie 1
- Chao Xu 1
- Hanqing Yang 1
- Yang Yang 1
- Jing Yi 1
- Dong Yu (于东) 1
- Guoqi Yu 1
- Zhou Zhao 1
- Bo Zheng 1
- Qiang Zhou (周强) 1
- Sashuai Zhou 1
- Weiqin Zu 1