Tiezheng Ge
2026
Unified Thinker: A General Reasoning Core for Image Generation
Sashuai Zhou | Qiang Zhou | Jijin Hu | Hanqing Yang | Yue Cao | Junpeng Ma | Yinchao Ma | Jun Song | Tiezheng Ge | Cheng Yu | Bo Zheng | Zhou Zhao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Despite impressive progress in high-fidelity image synthesis, generative models still struggle with logic-intensive instruction following, exposing a persistent reasoning–execution gap. Meanwhile, closed-source systems (e.g., Nano Banana) have demonstrated strong reasoning-driven image generation, highlighting a substantial gap to current open-source models. We argue that closing this gap requires not merely better visual generators, but executable reasoning: decomposing high-level intents into grounded, verifiable plans that directly steer the generative process. To this end, we propose Unified Thinker, a task-agnostic reasoning architecture for general image generation, designed as a unified planning core that can plug into diverse generators and workflows. Unified Thinker decouples a dedicated Thinker from the image Generator, enabling modular upgrades of reasoning without retraining the entire generative model. We further introduce a two-stage training paradigm: we first build a structured planning interface for the Thinker, then apply reinforcement learning to ground its policy in pixel-level feedback, encouraging plans that optimize visual correctness over textual plausibility. Extensive experiments on text-to-image generation and image editing show that Unified Thinker substantially improves image reasoning and generation quality.
From Competition to Synergy: Unlocking Reinforcement Learning for Subject-Driven Image Generation
Ziwei Huang | Ying Shu | Fanghao | Quanyu Long | Wenya Wang | Qiushi Guo | Tiezheng Ge | Leilei Gan
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Subject-driven image generation models face a fundamental trade-off between identity preservation (fidelity) and prompt adherence (editability). While online reinforcement learning (RL), specifically GRPO, offers a promising solution, we find that a naive application of GRPO leads to competitive degradation, as the simple linear aggregation of rewards with static weights causes conflicting gradient signals and a misalignment with the temporal dynamics of the diffusion process. To overcome these limitations, we propose Customized-GRPO, a novel framework featuring two key innovations: (i) Synergy-Aware Reward Shaping (SARS), a non-linear mechanism that explicitly penalizes conflicted reward signals and amplifies synergistic ones, providing a sharper and more decisive gradient; and (ii) Time-Aware Dynamic Weighting (TDW), which aligns the optimization pressure with the model’s temporal dynamics by prioritizing prompt-following in the early stages and identity preservation in the later ones. Extensive experiments demonstrate that our method significantly outperforms naive GRPO baselines, successfully mitigating competitive degradation. Our model achieves a superior balance, generating images that both preserve key identity features and accurately adhere to complex textual prompts.
2025
VC4VG: Optimizing Video Captions for Text-to-Video Generation
Yang Du | Zhuoran Lin | Kaiqiang Song | Biao Wang | Zhicheng Zheng | Tiezheng Ge | Bo Zheng | Qin Jin
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Recent advances in text-to-video (T2V) generation highlight the critical role of high-quality video-text pairs in training models capable of producing coherent and instruction-aligned videos. However, strategies for optimizing video captions specifically for T2V training remain underexplored. In this paper, we introduce VC4VG (Video Captioning for Video Generation), a comprehensive caption optimization framework tailored to the needs of T2V models. We begin by analyzing caption content from a T2V perspective, decomposing the essential elements required for video reconstruction into multiple dimensions, and proposing a principled caption design methodology. To support evaluation, we construct VC4VG-Bench, a new benchmark featuring fine-grained, multi-dimensional, and necessity-graded metrics aligned with T2V-specific requirements. Extensive T2V fine-tuning experiments demonstrate a strong correlation between improved caption quality and video generation performance, validating the effectiveness of our approach. We release all benchmark tools and code (https://github.com/qyr0403/VC4VG) to support further research.
Do not Abstain! Identify and Solve the Uncertainty
Jingyu Liu | JingquanPeng JingquanPeng | Xiaopeng Wu | Xubin Li | Tiezheng Ge | Bo Zheng | Yong Liu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Despite the widespread application of Large Language Models (LLMs) across various domains, they frequently exhibit overconfidence when encountering uncertain scenarios. Existing solutions, which primarily rely on evasive responses (e.g., “I don’t know”), overlook the opportunity to identify and address the uncertainty in order to generate more satisfactory responses. To systematically investigate and improve LLMs’ ability to recognize and address the source of uncertainty, we introduce ConfuseBench, a benchmark that mainly focuses on three types of uncertainty: document scarcity, limited capability, and query ambiguity. Experiments with ConfuseBench reveal that current LLMs struggle to accurately identify the root cause of uncertainty and resolve it. They tend to attribute uncertainty to query ambiguity while overlooking capability limitations, especially in the case of weaker models. To tackle this challenge, we first generate context-aware inquiries that highlight the confusing aspect of the original query. We then judge the source of uncertainty based on the uniqueness of the inquiry’s answer. Further, we use an on-policy training method, InteractDPO, to generate better inquiries. Experimental results demonstrate the efficacy of our approach.
2024
ConceptMath: A Bilingual Concept-wise Benchmark for Measuring Mathematical Reasoning of Large Language Models
Yanan Wu | Jie Liu | Xingyuan Bu | Jiaheng Liu | Zhanhui Zhou | Yuanxing Zhang | Chenchen Zhang | Zhiqi Bai | Haibin Chen | Tiezheng Ge | Wanli Ouyang | Wenbo Su | Bo Zheng
Findings of the Association for Computational Linguistics: ACL 2024
This paper introduces ConceptMath, a bilingual (English and Chinese), fine-grained benchmark that evaluates concept-wise mathematical reasoning of Large Language Models (LLMs). Unlike traditional benchmarks that evaluate general mathematical reasoning with an average accuracy, ConceptMath systematically organizes math problems under a hierarchy of math concepts, so that mathematical reasoning can be evaluated at different granularities with concept-wise accuracies. Based on ConceptMath, we then evaluate a broad range of LLMs, and we observe that existing LLMs, though achieving high average accuracies on traditional benchmarks, exhibit significant performance variations across different math concepts and may even fail catastrophically on the most basic ones. We also introduce an efficient fine-tuning strategy to address the weaknesses of existing LLMs. Finally, we hope ConceptMath can guide developers in understanding the fine-grained mathematical abilities of their models and facilitate the growth of foundation models. Code is available at https://github.com/conceptmath/conceptmath.
E2-LLM: Efficient and Extreme Length Extension of Large Language Models
Jiaheng Liu | Zhiqi Bai | Yuanxing Zhang | Chenchen Zhang | Yu Zhang | Ge Zhang | Jiakai Wang | Haoran Que | Yukang Chen | Wenbo Su | Tiezheng Ge | Jie Fu | Wenhu Chen | Bo Zheng
Findings of the Association for Computational Linguistics: ACL 2024
Training Large Language Models (LLMs) to process extensive context lengths incurs prohibitive computational costs. Prevailing techniques for extending context capabilities in LLMs typically require not only additional training procedures but also access to datasets with long context (e.g., sequences of 32K tokens), presupposing substantial GPU expenditures. To address the aforementioned issues, we introduce a novel solution named Efficient and Extreme length extension for Large Language Models (E2-LLM). E2-LLM entails a singular training process over considerably short sequences (e.g., 4K tokens), which greatly mitigates the cost of continual-pretraining or fine-tuning. Within the training phase, we incorporate a dual augmentation strategy with Rotary Position Embeddings (RoPE) that adjusts the scale and position indices across distinct training samples. E2-LLM is meticulously designed to enhance the model’s robustness to diverse relative positions. The experimental results on multiple benchmark datasets demonstrate the superior performance of E2-LLM on demanding tasks of processing long contexts.
Synchronized Video Storytelling: Generating Video Narrations with Structured Storyline
Dingyi Yang | Chunru Zhan | Ziheng Wang | Biao Wang | Tiezheng Ge | Bo Zheng | Qin Jin
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Video storytelling is engaging multimedia content that utilizes video and its accompanying narration to share a story and attract the audience, where a key challenge is creating narrations for recorded visual scenes. Previous studies on dense video captioning and video story generation have made some progress. However, in practical applications, we typically require synchronized narrations for ongoing visual scenes. In this work, we introduce a new task of Synchronized Video Storytelling, which aims to generate synchronous and informative narrations for videos. These narrations, associated with each video clip, should relate to the visual content, integrate relevant knowledge, and have an appropriate word count corresponding to the clip’s duration. Specifically, a structured storyline is beneficial to guide the generation process, ensuring coherence and integrity. To support the exploration of this task, we introduce a new benchmark dataset E-SyncVidStory with rich annotations. Since existing Multimodal LLMs are not effective in addressing this task in one-shot or few-shot settings, we propose a framework named VideoNarrator that can generate a storyline for input videos and simultaneously generate narrations with the guidance of the generated or predefined storyline. We further introduce a set of evaluation metrics to thoroughly assess the generation. Both automatic and human evaluations validate the effectiveness of our approach. Our dataset, code, and evaluations will be released.
MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues
Ge Bai | Jie Liu | Xingyuan Bu | Yancheng He | Jiaheng Liu | Zhanhui Zhou | Zhuoran Lin | Wenbo Su | Tiezheng Ge | Bo Zheng | Wanli Ouyang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The advent of Large Language Models (LLMs) has drastically enhanced dialogue systems. However, comprehensively evaluating the dialogue abilities of LLMs remains a challenge. Previous benchmarks have primarily focused on single-turn dialogues or provided coarse-grained and incomplete assessments of multi-turn dialogues, overlooking the complexity and fine-grained nuances of real-life dialogues. To address this issue, we introduce MT-Bench-101, specifically designed to evaluate the fine-grained abilities of LLMs in multi-turn dialogues. By conducting a detailed analysis of real multi-turn dialogue data, we construct a three-tier hierarchical ability taxonomy comprising 4208 turns across 1388 multi-turn dialogues in 13 distinct tasks. We then evaluate 21 popular LLMs based on MT-Bench-101, conducting comprehensive analyses from both ability and task perspectives and observing differing trends in LLMs’ performance across dialogue turns within various tasks. Further analysis indicates that neither utilizing common alignment techniques nor chat-specific designs has led to obvious enhancements in the multi-turn abilities of LLMs. Extensive case studies suggest that our designed tasks accurately assess the corresponding multi-turn abilities. The data and code are available at https://github.com/mtbench101/mt-bench-101.
2022
CapOnImage: Context-driven Dense-Captioning on Image
Yiqi Gao | Xinglin Hou | Yuanmeng Zhang | Tiezheng Ge | Yuning Jiang | Peng Wang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Existing image captioning systems are dedicated to generating narrative captions for images, which are spatially detached from the image in presentation. However, texts can also be used as decorations on the image to highlight the key points and increase the attractiveness of images. In this work, we introduce a new task called captioning on image (CapOnImage), which aims to generate dense captions at different locations of the image based on contextual information. To fully exploit the surrounding visual context to generate the most suitable caption for each location, we propose a multi-modal pre-training model with multi-level pre-training tasks that progressively learn the correspondence between texts and image locations from easy to difficult. Since the model may generate redundant captions for nearby locations, we further enhance the location embedding with neighbor locations as context. For this new task, we also introduce a large-scale benchmark called CapOnImage2M, which contains 2.1 million product images, each with an average of 4.8 spatially localized captions. Compared with other image captioning model variants, our model achieves the best results in both captioning accuracy and diversity aspects.
Co-authors
- Bo Zheng 4
- Jiaheng Liu 3
- Wenbo Su 3
- Bo Zheng 3
- Zhiqi Bai 2
- Xingyuan Bu 2
- Qin Jin 2
- Zhuoran Lin 2
- Wanli Ouyang 2
- Biao Wang 2
- Chenchen Zhang 2
- Yuanxing Zhang 2
- Zhanhui Zhou 2
- Ge Bai 1
- Yue Cao 1
- Haibin Chen 1
- Wenhu Chen 1
- Yukang Chen 1
- Yang Du 1
- Fanghao 1
- Jie Fu 1
- Leilei Gan 1
- Yiqi Gao 1
- Qiushi Guo 1
- Yancheng He 1
- Xinglin Hou 1
- Jijin Hu 1
- Ziwei Huang 1
- Yuning Jiang 1
- JingquanPeng JingquanPeng 1
- Xubin Li 1
- Jie Liu 1
- Jie Liu 1
- Jingyu Liu 1
- Yong Liu 1
- Quanyu Long 1
- Junpeng Ma 1
- Yinchao Ma 1
- Haoran Que 1
- Ying Shu 1
- Jun Song 1
- Kaiqiang Song 1
- Jiakai Wang 1
- Peng Wang 1
- Wenya Wang 1
- Ziheng Wang 1
- Xiaopeng Wu 1
- Yanan Wu 1
- Dingyi Yang 1
- Hanqing Yang 1
- Cheng Yu 1
- Chunru Zhan 1
- Ge Zhang 1
- Yu Zhang 1
- Yuanmeng Zhang 1
- Zhou Zhao 1
- Zhicheng Zheng 1
- Qiang Zhou (周强) 1
- Sashuai Zhou 1