Token-level Proximal Policy Optimization for Query Generation
Yichen Ouyang | Lu Wang | Fangkai Yang | Pu Zhao | Chenghua Huang | Jianfeng Liu | Bochen Pang | Yaming Yang | Yuefeng Zhan | Hao Sun | Qingwei Lin | Saravan Rajmohan | Weiwei Deng | Dongmei Zhang | Feng Sun
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Query generation is a critical task for web search engines (e.g., Google, Bing) and recommendation systems. Recently, state-of-the-art query generation methods have leveraged Large Language Models (LLMs) for their strong capabilities in context understanding and text generation. However, they still face challenges in generating high-quality queries, particularly in inferring user intent from web search interaction history. In this paper, we propose Token-level Proximal Policy Optimization (TPPO), a novel approach designed to empower LLMs to perform better in query generation through fine-tuning. TPPO is based on the Reinforcement Learning from AI Feedback (RLAIF) paradigm and consists of a token-level reward model and a token-level proximal policy optimization module that address the sparse reward challenge in traditional RLAIF frameworks. We conducted experiments on both an open-source dataset and an industrial dataset collected from a globally used search engine, demonstrating that TPPO significantly improves the performance of query generation for LLMs and outperforms its existing competitors.
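The sparse-reward problem the abstract refers to can be made concrete with a small sketch: under a sequence-level reward, only the final generated token carries a nonzero reward signal, whereas a token-level reward model scores every token. The snippet below illustrates this contrast using generalized advantage estimation; the reward values, critic estimates, and gamma/lambda settings are illustrative assumptions, not TPPO's actual formulation.

```python
# Illustrative sketch: per-token advantages under a sparse (sequence-level)
# reward versus a dense (token-level) reward. All numbers are made up.

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized advantage estimation over one generated token sequence."""
    advantages = [0.0] * len(rewards)
    next_value, running = 0.0, 0.0
    for t in reversed(range(len(rewards))):
        # TD residual at token t, then the exponentially weighted backup.
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

values = [0.1, 0.2, 0.3, 0.4]  # hypothetical critic estimates per token

# Sparse reward: a single sequence-level score assigned to the last token.
sparse = gae_advantages([0.0, 0.0, 0.0, 1.0], values)

# Token-level reward: every token receives its own score.
dense = gae_advantages([0.2, 0.3, 0.1, 0.4], values)

print(sparse)
print(dense)
```

With the sparse reward, the credit for the sequence-level score reaches early tokens only through the discounted GAE backup; a token-level reward model supplies a direct learning signal at every position.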