2025
K-order Ranking Preference Optimization for Large Language Models
Shihao Cai | Chongming Gao | Yang Zhang | Wentao Shi | Jizhi Zhang | Keqin Bao | Qifan Wang | Fuli Feng
Findings of the Association for Computational Linguistics: ACL 2025
To adapt large language models (LLMs) to ranking tasks, existing list-wise methods, represented by list-wise Direct Preference Optimization (DPO), focus on optimizing partial-order or full-order list ranking consistency for LLMs to enhance their ranking abilities. However, we argue that optimizing top-K ranking consistency could be more appropriate for real-world applications. There are two main reasons: (1) users are typically concerned with only the top-K results, making top-K ranking more important, and (2) tail items often lack precise feedback, making top-K ranking more reliable. Based on this, we propose K-order Ranking Preference Optimization (KPO) by extending DPO’s Plackett-Luce model to accommodate top-K rankings. Additionally, recognizing that the number of important items can vary across queries, we extend KPO to dynamically determine an appropriate K for different samples and introduce a curriculum learning strategy to boost training efficiency. Extensive experiments demonstrate the effectiveness of KPO, highlighting its high sample efficiency and robustness to noise. The code is available at https://github.com/Lanyu0303/KPO.
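As a rough illustration of what a top-K Plackett-Luce objective looks like, the sketch below computes the log-probability that the first K items of a list fill the top K positions in the given order, while the remaining items stay unordered. It is not taken from the KPO repository: the function name `top_k_pl_log_prob` and the use of plain real-valued scores are assumptions made for illustration (in a DPO-style setup the scores would typically be derived from policy and reference log-probabilities).

```python
import math


def top_k_pl_log_prob(scores, k):
    """Plackett-Luce log-probability of a top-k ordering.

    `scores` lists item scores in the preferred order (best first).
    Only the first k positions are treated as ordered; the tail items
    appear in the denominators but their relative order is ignored.
    """
    if not 0 <= k <= len(scores):
        raise ValueError("k must be between 0 and len(scores)")
    log_prob = 0.0
    for i in range(k):
        remaining = scores[i:]  # items not yet placed, including the tail
        m = max(remaining)      # subtract the max for a stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(s - m) for s in remaining))
        log_prob += scores[i] - log_denominator
    return log_prob


# With equal scores, every top-1 choice among three items is equally likely.
print(top_k_pl_log_prob([0.0, 0.0, 0.0], 1))
```

Setting `k` to the list length recovers the usual full-order Plackett-Luce likelihood, since the last factor is always 1.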
HellaSwag-Pro: A Large-Scale Bilingual Benchmark for Evaluating the Robustness of LLMs in Commonsense Reasoning
Xiaoyuan Li | Moxin Li | Rui Men | Yichang Zhang | Keqin Bao | Wenjie Wang | Fuli Feng | Dayiheng Liu | Junyang Lin
Findings of the Association for Computational Linguistics: ACL 2025
Large language models (LLMs) have shown remarkable capabilities in commonsense reasoning; however, some variations in questions can trigger incorrect responses. Do these models truly understand commonsense knowledge, or just memorize expression patterns? To investigate this question, we present the first extensive robustness evaluation of LLMs in commonsense reasoning. We introduce HellaSwag-Pro, a large-scale bilingual benchmark consisting of 11,200 cases, by designing and compiling seven types of question variants. To construct this benchmark, we propose a two-stage method to develop Chinese HellaSwag, a finely annotated dataset comprising 12,000 instances across 56 categories. We conduct extensive experiments on 41 representative LLMs, revealing that these LLMs are far from robust in commonsense reasoning. Furthermore, this robustness varies depending on the language in which the LLM is tested. This work establishes a high-quality evaluation benchmark, with extensive experiments offering valuable insights to the community in commonsense reasoning for LLMs.
Customizing In-context Learning for Dynamic Interest Adaption in LLM-based Recommendation
Keqin Bao | Ming Yan | Yang Zhang | Jizhi Zhang | Wenjie Wang | Fuli Feng | Xiangnan He
Findings of the Association for Computational Linguistics: ACL 2025
Frequently updating Large Language Model (LLM)-based recommender systems to adapt to dynamic user interests, as is done for traditional ones, is impractical due to high training costs, even with acceleration methods. This work explores the possibility of adapting the model to dynamic user interests without any model-level updates via In-context Learning (ICL), which enables adaptation through few-shot examples within input prompts. While using recent user interactions as ICL demonstrations offers a potential solution for dynamic interest adaptation, existing LLM-based recommenders face critical limitations: recommendation-specific tuning often diminishes the model’s in-context learning ability, and the original LLM’s ICL lacks task-specific optimization for recommendations. To bridge this gap, we introduce RecICL, a framework that establishes recommendation-oriented in-context learning by structuring recent user interactions and current inputs into ICL formats. RecICL achieves dual objectives: (1) preserving fundamental ICL capabilities during recommendation adaptation and (2) dynamically capturing user preference evolution through the most recent interactions. Extensive experiments across multiple benchmarks demonstrate RecICL’s superior performance, achieving better results without model updates. Our implementation is publicly available at https://anonymous.4open.science/r/RecICL-8003.
2024
Text-like Encoding of Collaborative Information in Large Language Models for Recommendation
Yang Zhang | Keqin Bao | Ming Yan | Wenjie Wang | Fuli Feng | Xiangnan He
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
When adapting Large Language Models for Recommendation (LLMRec), it is crucial to integrate collaborative information. Existing methods achieve this by learning collaborative embeddings in LLMs’ latent space from scratch or by mapping from external models. However, they fail to represent the information in a text-like format, which may not align optimally with LLMs. To bridge this gap, we introduce BinLLM, a novel LLMRec method that seamlessly integrates collaborative information through text-like encoding. BinLLM converts collaborative embeddings from external models into binary sequences — a specific text format that LLMs can understand and operate on directly, facilitating the direct usage of collaborative information in text-like format by LLMs. Additionally, BinLLM provides options to compress the binary sequence using dot-decimal notation to avoid excessively long lengths. Extensive experiments validate that BinLLM introduces collaborative information in a manner better aligned with LLMs, resulting in enhanced performance. We release our code at https://github.com/zyang1580/BinLLM.
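The text-like encoding described above can be pictured with a small sketch. It is not the BinLLM implementation: thresholding at zero, the 8-bit grouping, and the helper names `binarize` and `to_dot_decimal` are all assumptions chosen to show how a real-valued embedding might become a binary string and then a shorter dot-decimal string.

```python
def binarize(embedding):
    """Map each embedding dimension to '1' if positive, else '0'.

    Thresholding at zero is an assumption made for this illustration.
    """
    return "".join("1" if x > 0 else "0" for x in embedding)


def to_dot_decimal(bits, group=8):
    """Compress a bit string by writing each `group`-bit chunk as a decimal
    number and joining the chunks with dots, as in an IPv4 address."""
    if len(bits) % group != 0:
        raise ValueError("bit string length must be a multiple of the group size")
    chunks = (bits[i:i + group] for i in range(0, len(bits), group))
    return ".".join(str(int(chunk, 2)) for chunk in chunks)


# A hypothetical 16-dimensional collaborative embedding.
embedding = [0.7, 0.1, -0.3, -0.2, -0.9, -0.1, -0.4, -0.5,
             0.2, -0.6, 0.8, -0.1, 0.3, -0.2, -0.7, -0.4]
bits = binarize(embedding)
print(bits)                  # 16 characters of '0' and '1'
print(to_dot_decimal(bits))  # two dot-separated decimal numbers
```

Each 8-bit group shrinks from eight characters to at most three digits, which is the kind of length reduction the dot-decimal option targets.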
GeoGPT4V: Towards Geometric Multi-modal Large Language Models with Geometric Image Generation
Shihao Cai | Keqin Bao | Hangyu Guo | Jizhi Zhang | Jun Song | Bo Zheng
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Large language models have seen widespread adoption in math problem-solving, yet for geometry problems, which often necessitate visual aids even for humans, the most advanced multi-modal models still struggle to effectively utilize image information. High-quality data is crucial for enhancing the geometric capabilities of multi-modal models, yet existing open-source datasets and related efforts are either too challenging for direct model learning or suffer from misalignment between text and images. To overcome this issue, we introduce a novel pipeline that leverages GPT-4 and GPT-4V to generate relatively basic geometry problems with aligned text and images, facilitating model learning. We have produced a dataset of 4.9K geometry problems and combined it with 19K open-source data to form our GeoGPT4V dataset. Experimental results demonstrate that the GeoGPT4V dataset significantly improves the geometry performance of various models on the MathVista and MathVision benchmarks. The code is available at https://anonymous.4open.science/r/GeoGPT4V-08B2.
Decoding Matters: Addressing Amplification Bias and Homogeneity Issue in Recommendations for Large Language Models
Keqin Bao | Jizhi Zhang | Yang Zhang | Xinyue Huo | Chong Chen | Fuli Feng
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Adapting Large Language Models (LLMs) for recommendation requires careful consideration of the decoding process, given the inherent differences between generating items and natural language. Existing approaches often directly apply LLMs’ original decoding methods. However, we find these methods encounter significant challenges: 1) amplification bias—where standard length normalization inflates scores for items containing tokens with generation probabilities close to 1 (termed ghost tokens), and 2) homogeneity issue—generating multiple similar or repetitive items for a user. To tackle these challenges, we introduce a new decoding approach named Debiasing-Diversifying Decoding (D3). D3 disables length normalization for ghost tokens to alleviate amplification bias, and it incorporates a text-free assistant model to encourage tokens less frequently generated by LLMs for counteracting recommendation homogeneity. Extensive experiments on real-world datasets demonstrate the method’s effectiveness in enhancing accuracy and diversity.
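The amplification bias mentioned above can be reproduced with a toy calculation. The sketch below is not the D3 decoder: the token probabilities are made up, and dropping normalization entirely is only one simple way to remove the effect, shown here to make the arithmetic visible.

```python
import math


def length_normalized_score(token_probs):
    """Average token log-probability, i.e. standard length normalization."""
    return sum(math.log(p) for p in token_probs) / len(token_probs)


def unnormalized_score(token_probs):
    """Plain sum of token log-probabilities, with no length normalization."""
    return sum(math.log(p) for p in token_probs)


# Two hypothetical items with the same "real" uncertainty (one token at 0.2).
# The longer item also contains two near-certain tokens (ghost tokens).
short_item = [0.2]
long_item = [0.2, 0.999, 0.999]

# Ghost tokens add almost nothing to the sum but increase the length,
# so averaging lifts the longer item's score.
print(length_normalized_score(short_item), length_normalized_score(long_item))
print(unnormalized_score(short_item), unnormalized_score(long_item))
```

Under length normalization the longer item scores clearly higher even though both items share the same uncertain token; without normalization the two scores are nearly equal.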
2022
Alibaba-Translate China’s Submission for WMT2022 Metrics Shared Task
Yu Wan | Keqin Bao | Dayiheng Liu | Baosong Yang | Derek F. Wong | Lidia S. Chao | Wenqiang Lei | Jun Xie
Proceedings of the Seventh Conference on Machine Translation (WMT)
In this report, we present our submission to the WMT 2022 Metrics Shared Task. We build our system based on the core idea of UniTE (Unified Translation Evaluation), which unifies source-only, reference-only, and source-reference-combined evaluation scenarios into one single model. Specifically, during the model pre-training phase, we first apply the pseudo-labeled data examples to continually pre-train UniTE. Notably, to reduce the gap between pre-training and fine-tuning, we use data cropping and a ranking-based score normalization strategy. During the fine-tuning phase, we use both Direct Assessment (DA) and Multidimensional Quality Metrics (MQM) data from past years’ WMT competitions. In particular, we collect the results from models with different pre-trained language model backbones, and use different ensembling strategies for the translation directions involved.
Alibaba-Translate China’s Submission for WMT 2022 Quality Estimation Shared Task
Keqin Bao | Yu Wan | Dayiheng Liu | Baosong Yang | Wenqiang Lei | Xiangnan He | Derek F. Wong | Jun Xie
Proceedings of the Seventh Conference on Machine Translation (WMT)
In this paper, we present our submission to the sentence-level MQM benchmark of the Quality Estimation Shared Task, named UniTE (Unified Translation Evaluation). Specifically, our systems employ the framework of UniTE, which combines three types of input formats during training with a pre-trained language model. First, we apply the pseudo-labeled data examples in the continual pre-training phase. Notably, to reduce the gap between pre-training and fine-tuning, we use data cropping and a ranking-based score normalization strategy. For the fine-tuning phase, we use both Direct Assessment (DA) and Multidimensional Quality Metrics (MQM) data from past years’ WMT competitions. Finally, we collect the source-only evaluation results, and ensemble the predictions generated by two UniTE models, whose backbones are XLM-R and InfoXLM, respectively. Results show that our models rank 1st overall in the Multilingual and English-Russian settings, and 2nd overall in the English-German and Chinese-English settings, showing relatively strong performance in this year’s quality estimation competition.