Yu Cheng
2026
TSPO: Breaking the Double Homogenization Dilemma in Multi-turn Search Policy Optimization
Shichao Ma | Zhiyuan Ma | Ming Yang | Xiaofan Li | Xing Wu | Jintao Du | Yu Cheng | Weiqiang Wang | Qiliang Liu | Zhengyang Zhou | Yang Wang
Findings of the Association for Computational Linguistics: ACL 2026
Multi-turn tool-integrated reasoning enables Large Language Models (LLMs) to solve complex tasks through iterative information retrieval. However, current reinforcement learning (RL) frameworks for search-augmented reasoning rely predominantly on sparse outcome-level rewards, leading to a "Double Homogenization Dilemma." This manifests as (1) process homogenization, where the thinking, reasoning, and tool use involved in generation are ignored, and (2) intra-group homogenization, where coarse-grained outcome rewards make intra-group advantage estimation inefficient for methods such as Group Relative Policy Optimization (GRPO). To address this, we propose Turn-level Stage-aware Policy Optimization (TSPO). TSPO introduces the First-Occurrence Latent Reward (FOLR) mechanism, which allocates partial rewards to the step where the ground-truth answer first appears, thereby preserving process-level signals and increasing reward variance within groups without requiring external reward models or any annotations. Extensive experiments demonstrate that TSPO significantly outperforms state-of-the-art baselines, achieving average performance gains of 24% and 13.6% on Qwen2.5-3B and 7B models, respectively. Code is available at https://github.com/Flipped-May/TSPO.
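The first-occurrence idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the substring match used to detect the answer, and the reward values (`partial=0.5`, `outcome=1.0`) are all assumptions made for illustration. The group normalization follows the usual GRPO-style mean and standard deviation scaling.

```python
from statistics import mean, pstdev


def first_occurrence_turn(turn_texts, answer):
    """Return the index of the first turn whose text contains the answer, else None.

    Case-insensitive substring matching is an assumption, not the paper's rule.
    """
    needle = answer.strip().lower()
    for i, text in enumerate(turn_texts):
        if needle in text.lower():
            return i
    return None


def turn_rewards(turn_texts, answer, final_correct, partial=0.5, outcome=1.0):
    """Assign a partial reward to the first turn that surfaces the answer,
    plus the usual outcome reward on the last turn if the final answer is correct.

    The values of `partial` and `outcome` are hypothetical.
    """
    rewards = [0.0] * len(turn_texts)
    hit = first_occurrence_turn(turn_texts, answer)
    if hit is not None:
        rewards[hit] += partial
    if final_correct and rewards:
        rewards[-1] += outcome
    return rewards


def group_advantages(returns, eps=1e-8):
    """GRPO-style advantages: normalize each return by the group mean and std."""
    mu = mean(returns)
    sigma = pstdev(returns)
    return [(r - mu) / (sigma + eps) for r in returns]


# Two rollouts that both end with a wrong final answer.
rollout_a = ["search: capital of France", "doc: Paris is the capital of France", "answer: Lyon"]
rollout_b = ["search: France", "doc: France is in Europe", "answer: Lyon"]

outcome_only = [0.0, 0.0]  # sparse outcome reward: both rollouts look identical
with_partial = [
    sum(turn_rewards(rollout_a, "Paris", final_correct=False)),
    sum(turn_rewards(rollout_b, "Paris", final_correct=False)),
]

print(group_advantages(outcome_only))
print(group_advantages(with_partial))
```

With outcome-only rewards, both failed rollouts receive the same return and the group advantages collapse to zero. Under the assumed partial reward, the rollout that retrieved the answer is separated from the one that did not, so the group has nonzero variance.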
2025
Training LLMs to be Better Text Embedders through Bidirectional Reconstruction
Chang Su | Dengliang Shi | Siyuan Huang | Jintao Du | Changhua Meng | Yu Cheng | Weiqiang Wang | Zhouhan Lin
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs) have increasingly been explored as powerful text embedders. Existing LLM-based text embedding approaches often leverage the embedding of the final token, typically a reserved special token such as ‘[EOS]’. However, these tokens have not been intentionally trained to capture the semantics of the whole context, limiting their capacity as text embeddings, especially for retrieval and re-ranking tasks. We propose to add a new training stage before contrastive learning to enrich the semantics of the final token embedding. This stage employs bidirectional generative reconstruction tasks, namely EBQ2D (Embedding-Based Query-to-Document) and EBD2Q (Embedding-Based Document-to-Query), which interleave to anchor the ‘[EOS]’ embedding and reconstruct either side of Query-Document pairs. Experimental results demonstrate that our additional training stage significantly improves LLM performance on the Massive Text Embedding Benchmark (MTEB), achieving new state-of-the-art results across different LLM base models and scales.
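The final-token pooling that the abstract refers to can be sketched as follows. This shows only the generic pooling step, not the EBQ2D/EBD2Q training stage, and it is not the authors' code: the function names, the toy hidden states, and the right-padding convention are assumptions for illustration.

```python
import math


def last_token_embedding(hidden_states, attention_mask):
    """Return the hidden state of the last non-padding token.

    hidden_states: list of per-token vectors for one sequence.
    attention_mask: list of 1 (real token) / 0 (padding); right padding is assumed.
    """
    last = sum(attention_mask) - 1
    if last < 0:
        raise ValueError("sequence has no real tokens")
    return hidden_states[last]


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy example: three token positions, the last one is padding.
query_hidden = [[1.0, 0.0], [0.0, 2.0], [9.0, 9.0]]
query_mask = [1, 1, 0]
doc_embedding = [0.0, 1.0]  # hypothetical document embedding

query_embedding = last_token_embedding(query_hidden, query_mask)
print(cosine(query_embedding, doc_embedding))
```

In this setup the query and document embeddings would be compared by cosine similarity for retrieval or re-ranking, which is why the quality of the single pooled vector matters.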