Shunyu Liu
2026
Reasoning-Guided Exploration for Online DPO
Zetian Hu | Shunyu Liu | Ting-En Lin | Fei Huang | Yongbin Li | Dacheng Tao
Findings of the Association for Computational Linguistics: ACL 2026
Recent work has aimed to enhance the reasoning capabilities of language models, but these methods are often limited to domains with objectively verifiable answers. To overcome this limitation, we introduce Reasoning-Guided Exploration for Online DPO (RGE-DPO), a novel self-play framework designed to improve reasoning on general-domain data. RGE-DPO employs a dual-reward mechanism to evaluate responses by assessing: (1) reasoning quality using a self-rewarding rubric that provides structured evaluation of logical coherence, reasoning depth, and verification behaviors; and (2) response quality using an established reward model trained for aspects like helpfulness and correctness. These two orthogonal evaluation signals enable a comprehensive assessment of different response dimensions without conflating reasoning processes with response content. We then integrate these two evaluation signals through a weighted ranking mechanism to construct the preference pairs, which ensures that responses with superior reasoning processes are preferred when response quality is comparable. Experiments demonstrate that RGE-DPO achieves substantial improvements on instruction-following benchmarks while maintaining competitive performance on verifiable academic benchmarks.
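The weighted ranking step described in this abstract can be illustrated with a minimal sketch. The abstract does not give the actual formula, so everything here is an assumption made for illustration: the function names `rank_positions` and `build_preference_pair`, the weight `w_response`, the tie-aware rank definition, and the choice of best and worst combined rank as the chosen and rejected responses.

```python
def rank_positions(scores):
    """Rank 0 is best. Tied scores share the same rank (count of strictly higher scores)."""
    return [sum(other > s for other in scores) for s in scores]


def build_preference_pair(responses, reasoning_scores, response_scores, w_response=0.7):
    """Hypothetical weighted-rank pairing; not the paper's confirmed procedure.

    Combines the rank under the response-quality signal with the rank under
    the reasoning-quality signal, then returns (chosen, rejected).
    """
    reasoning_ranks = rank_positions(reasoning_scores)
    response_ranks = rank_positions(response_scores)
    combined = [
        w_response * rp + (1.0 - w_response) * rr
        for rr, rp in zip(reasoning_ranks, response_ranks)
    ]
    indices = range(len(responses))
    chosen = min(indices, key=lambda i: combined[i])    # lowest combined rank
    rejected = max(indices, key=lambda i: combined[i])  # highest combined rank
    return responses[chosen], responses[rejected]


# Responses "a" and "b" tie on response quality; "a" has the better reasoning score.
pair = build_preference_pair(
    ["a", "b", "c"],
    reasoning_scores=[0.9, 0.2, 0.5],
    response_scores=[0.8, 0.8, 0.1],
)
```

In this toy input, the two responses with equal response quality are separated only by their reasoning rank, which mirrors the property the abstract states for comparable response quality.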
RubricHub: A Comprehensive and Highly Discriminative Rubric Dataset via Automated Coarse-to-Fine Generation
Sunzhu Li | Jiale Zhao | Huimin Ren | Zhenlin Wei | Yang Zhou | Jingwen Yang | Shunyu Liu | Kaike Zhang | Chen Wei
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Reinforcement Learning with Verifiable Rewards (RLVR) has driven substantial progress in reasoning-intensive domains like mathematics. However, optimizing open-ended generation remains challenging due to the lack of ground truth. While rubric-based evaluation offers a structured proxy for verification, existing methods suffer from scalability bottlenecks and coarse criteria, resulting in a supervision ceiling effect. To address this, we propose an automated Coarse-to-Fine Rubric Generation framework. By synergizing principle-guided synthesis, multi-model aggregation, and difficulty evolution, our approach produces comprehensive and highly discriminative criteria capable of capturing the subtle nuances. Based on this framework, we introduce RubricHub, a large-scale (110k) and multi-domain dataset. We validate its utility through a two-stage post-training pipeline comprising Rubric-based Rejection Sampling Fine-Tuning (RuFT) and Reinforcement Learning (RuRL). Experimental results demonstrate that RubricHub unlocks significant performance gains: our post-trained Qwen3-14B achieves state-of-the-art (SOTA) results on HealthBench (69.3), surpassing proprietary frontier models such as GPT-5.
2025
Dynamic Parallel Tree Search for Efficient LLM Reasoning
Yifu Ding | Wentao Jiang | Shunyu Liu | Yongcheng Jing | Jinyang Guo | Yingjie Wang | Jing Zhang | Zengmao Wang | Ziwei Liu | Bo Du | Xianglong Liu | Dacheng Tao
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Tree of Thoughts (ToT) enhances Large Language Model (LLM) reasoning by structuring problem-solving as a spanning tree. However, recent methods focus on search accuracy while overlooking computational efficiency. The challenges of accelerating ToT lie in the frequent switching of reasoning focus and the redundant exploration of suboptimal solutions. To alleviate this dilemma, we propose Dynamic Parallel Tree Search (DPTS), a novel parallelism framework that aims to dynamically optimize the reasoning path during inference. It includes the Parallelism Streamline in the generation phase to build up a flexible and adaptive parallelism with arbitrary paths by cache management and alignment. Meanwhile, the Search and Transition Mechanism filters potential candidates to dynamically maintain the reasoning focus on more promising solutions with less redundancy. Experiments on Qwen-2.5 and Llama-3 on math and code datasets show that DPTS significantly improves efficiency by 2-4× on average while maintaining or even surpassing existing reasoning algorithms in accuracy, making ToT-based reasoning more scalable and computationally efficient. Code is released at: https://github.com/yifu-ding/DPTS.
Supervised Optimism Correction: Be Confident When LLMs Are Sure
Junjie Zhang | Rushuai Yang | Shunyu Liu | Ting-En Lin | Fei Huang | Yi Chen | Yongbin Li | Dacheng Tao
Findings of the Association for Computational Linguistics: ACL 2025
In this work, we establish a novel theoretical connection between supervised fine-tuning and offline reinforcement learning under the token-level Markov decision process, revealing that large language models indeed learn an implicit Q-function for inference. Through this theoretical lens, we demonstrate that the widely used beam search method suffers from unacceptable over-optimism, where inference errors are inevitably amplified due to inflated Q-value estimations of suboptimal steps. To address this limitation, we propose **S**upervised **O**ptimism **C**orrection (SOC), which introduces a simple yet effective auxiliary loss for token-level Q-value estimations during supervised fine-tuning. Specifically, the auxiliary loss employs implicit value regularization to boost model confidence in expert-demonstrated responses, thereby suppressing over-optimism toward insufficiently supervised responses. Extensive experiments on mathematical reasoning benchmarks, including GSM8K, MATH, and GAOKAO, showcase the superiority of the proposed SOC with beam search across a series of open-source models.
2024
A Regularization-based Transfer Learning Method for Information Extraction via Instructed Graph Decoder
Kedi Chen | Jie Zhou | Qin Chen | Shunyu Liu | Liang He
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Information extraction (IE) aims to extract complex structured information from the text. Numerous datasets have been constructed for various IE tasks, leading to time-consuming and labor-intensive data annotations. Nevertheless, most prevailing methods focus on training task-specific models, while the common knowledge among different IE tasks is not explicitly modeled. Moreover, the same phrase may have inconsistent labels in different tasks, which poses a big challenge for knowledge transfer using a unified model. In this study, we propose a regularization-based transfer learning method for IE (TIE) via an instructed graph decoder. Specifically, we first construct an instruction pool for datasets from all well-known IE tasks, and then present an instructed graph decoder, which decodes various complex structures into a graph uniformly based on corresponding instructions. In this way, the common knowledge shared with existing datasets can be learned and transferred to a new dataset with new labels. Furthermore, to alleviate the label inconsistency problem among various IE tasks, we introduce a task-specific regularization strategy, which does not update the gradients of two tasks with ‘opposite direction’. We conduct extensive experiments on 12 datasets spanning four IE tasks, and the results demonstrate the great advantages of our proposed method.
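One plausible reading of the task-specific regularization in this abstract is to treat two task gradients as having "opposite direction" when their dot product is negative, and to skip the shared update in that case. The sketch below illustrates only that reading. The helper names `dot` and `combine_task_gradients`, the plain-list gradient representation, and the skip-on-conflict rule are assumptions, not the paper's confirmed procedure.

```python
def dot(u, v):
    """Inner product of two equally sized gradient vectors."""
    return sum(a * b for a, b in zip(u, v))


def combine_task_gradients(grad_a, grad_b):
    """Hypothetical conflict check between two task gradients.

    Returns the summed gradient when the two tasks agree in direction
    (non-negative dot product), or None to signal that the update should
    be skipped because the gradients point in opposite directions.
    """
    if len(grad_a) != len(grad_b):
        raise ValueError("gradients must have the same length")
    if dot(grad_a, grad_b) < 0:
        return None  # conflicting tasks: do not update
    return [a + b for a, b in zip(grad_a, grad_b)]


agreeing = combine_task_gradients([1.0, 2.0], [0.5, 1.0])
conflicting = combine_task_gradients([1.0, 0.0], [-1.0, 0.5])
```

Here `agreeing` holds the summed gradient, while `conflicting` is `None` because the dot product of the two gradients is negative.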
Let’s Rectify Step by Step: Improving Aspect-based Sentiment Analysis with Diffusion Models
Shunyu Liu | Jie Zhou | Qunxi Zhu | Qin Chen | Qingchun Bai | Jun Xiao | Liang He
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Aspect-Based Sentiment Analysis (ABSA) stands as a crucial task in predicting the sentiment polarity associated with identified aspects within text. However, a notable challenge in ABSA lies in precisely determining the aspects’ boundaries (start and end indices), especially for long ones, due to users’ colloquial expressions. We propose DiffusionABSA, a novel diffusion model tailored for ABSA, which extracts the aspects progressively step by step. Particularly, DiffusionABSA gradually adds noise to the aspect terms in the training process, subsequently learning a denoising process that progressively restores these terms in a reverse manner. To estimate the boundaries, we design a denoising neural network enhanced by a syntax-aware temporal attention mechanism to chronologically capture the interplay between aspects and surrounding text. Empirical evaluations conducted on eight benchmark datasets underscore the compelling advantages offered by DiffusionABSA when compared against robust baseline models. Our code is publicly available at https://github.com/Qlb6x/DiffusionABSA.
Co-authors
- Dacheng Tao 3
- Qin Chen 2
- Liang He 2
- Yongbin Li 2
- Ting-En Lin 2
- Jie Zhou 2
- Qingchun Bai 1
- Kedi Chen 1
- Yi Chen 1
- Yifu Ding 1
- Bo Du 1
- Jinyang Guo 1
- Zetian Hu 1
- Fei Huang 1
- Fei Huang 1
- Wentao Jiang 1
- Yongcheng Jing 1
- Sunzhu Li 1
- Ziwei Liu 1
- Xianglong Liu 1
- Huimin Ren 1
- Yingjie Wang 1
- Zengmao Wang 1
- Zhenlin Wei 1
- Chen Wei 1
- Jun Xiao 1
- Rushuai Yang 1
- Jingwen Yang 1
- Jing Zhang 1
- Junjie Zhang 1
- Kaike Zhang 1
- Jiale Zhao 1
- Yang Zhou 1
- Qunxi Zhu 1