2025
Enhancing Safe and Controllable Protein Generation via Knowledge Preference Optimization
Yuhao Wang | Keyan Ding | Kehua Feng | Zeyuan Wang | Ming Qin | Xiaotong Li | Qiang Zhang | Huajun Chen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Protein language models have emerged as powerful tools for sequence generation, offering substantial advantages in functional optimization and *de novo* design. However, these models also present significant risks of generating harmful protein sequences, such as those that enhance viral transmissibility or evade immune responses. These concerns underscore critical biosafety and ethical challenges. To address these issues, we propose a Knowledge-guided Preference Optimization (KPO) framework that integrates prior knowledge via a Protein Safety Knowledge Graph. This framework utilizes an efficient graph pruning strategy to identify preferred sequences and employs reinforcement learning to minimize the risk of generating harmful proteins. Experimental results demonstrate that KPO effectively reduces the likelihood of producing hazardous sequences while maintaining high functionality, offering a robust safety assurance framework for applying generative models in biotechnology.
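The abstract does not spell out KPO's training objective. As a hedged illustration only, the sketch below shows a generic DPO-style preference loss in PyTorch, assuming the Protein Safety Knowledge Graph yields (safe, harmful) sequence pairs toward which the policy is nudged; every name here is hypothetical, not the paper's API.

```python
# Minimal sketch of a DPO-style preference objective, assuming KPO trains the
# protein language model to prefer safe sequences over harmful ones identified
# via the knowledge graph. All names are illustrative, not the paper's API.
import torch
import torch.nn.functional as F

def preference_loss(policy_logp_safe, policy_logp_harmful,
                    ref_logp_safe, ref_logp_harmful, beta=0.1):
    """Widen the policy's log-likelihood margin for (safe, harmful) sequence
    pairs relative to a frozen reference model."""
    policy_margin = policy_logp_safe - policy_logp_harmful
    ref_margin = ref_logp_safe - ref_logp_harmful
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Dummy per-sequence log-probabilities for a batch of 4 preference pairs.
policy_safe = torch.randn(4, requires_grad=True)
policy_harmful = torch.randn(4, requires_grad=True)
ref_safe, ref_harmful = torch.randn(4), torch.randn(4)
preference_loss(policy_safe, policy_harmful, ref_safe, ref_harmful).backward()
```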
EvolveBench: A Comprehensive Benchmark for Assessing Temporal Awareness in LLMs on Evolving Knowledge
Zhiyuan Zhu | Yusheng Liao | Zhe Chen | Yuhao Wang | Yunfeng Guan | Yanfeng Wang | Yu Wang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) are trained on extensive historical corpora, but their ability to understand time and maintain temporal awareness of time-evolving factual knowledge remains limited. Previous studies often neglect the critical aspect of utilizing knowledge from various sources. To address this gap, we introduce EvolveBench, a comprehensive benchmark that evaluates temporal competence along five key dimensions: Cognition, which examines the ability to recall and contextualize historical facts; Awareness, which tests LLMs’ awareness of temporal misalignment between external inputs and the temporal context of a query; Trustworthiness, which assesses whether models can identify and appropriately refuse queries based on invalid timestamps; Understanding, which focuses on interpreting both explicit dates and implicit historical markers; and Reasoning, which evaluates the capacity to analyze temporal relationships and draw accurate inferences. Evaluating 15 widely used LLMs on EvolveBench shows that GPT-4o achieves the highest average EM score of 79.36, while the open-source Llama3.1-70B demonstrates notable strength in handling temporally misaligned contexts with an average score of 72.47. Despite these advances, all models still struggle to handle temporally misaligned contexts. Our code and dataset are available at https://github.com/zzysjtuiwct/EvolveBench.
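For readers unfamiliar with the EM metric cited above, here is a minimal exact-match scorer. It assumes the common SQuAD-style answer normalization; EvolveBench's exact recipe may differ.

```python
# Illustrative exact-match (EM) scoring with SQuAD-style normalization.
import re
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def em_score(predictions: list[str], golds: list[str]) -> float:
    """Percentage of predictions exactly matching gold after normalization."""
    hits = sum(normalize(p) == normalize(g) for p, g in zip(predictions, golds))
    return 100.0 * hits / len(golds)

print(em_score(["The Eiffel Tower", "1969"], ["Eiffel Tower", "1968"]))  # 50.0
```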
LLMTreeRec: Unleashing the Power of Large Language Models for Cold-Start Recommendations
Wenlin Zhang | Chuhan Wu | Xiangyang Li | Yuhao Wang | Kuicai Dong | Yichao Wang | Xinyi Dai | Xiangyu Zhao | Huifeng Guo | Ruiming Tang
Proceedings of the 31st International Conference on Computational Linguistics
The lack of training data gives rise to the system cold-start problem in recommendation systems, making it difficult for them to provide effective recommendations. To address this problem, Large Language Models (LLMs) can model recommendation tasks as language analysis tasks and provide zero-shot results based on their vast open-world knowledge. However, the large scale of the item corpus poses a challenge to LLMs, leading to substantial token consumption that makes them impractical to deploy in real-world recommendation systems. To tackle this challenge, we introduce a tree-based LLM recommendation framework, LLMTreeRec, which structures all items into an item tree to improve the efficiency of the LLM’s item retrieval. LLMTreeRec achieves state-of-the-art performance under the system cold-start setting on two widely used datasets and is even competitive with conventional deep recommendation systems that use substantial training data. Furthermore, LLMTreeRec outperforms the baseline model in an A/B test on a Huawei industrial system. Consequently, LLMTreeRec demonstrates its effectiveness as an industry-friendly solution and has been successfully deployed online.
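As a rough sketch of the item-tree idea, the snippet below asks a (stubbed) LLM to pick one branch per level, so prompt size scales with tree depth rather than catalogue size. The `ask_llm` function is a hypothetical placeholder, not LLMTreeRec's actual interface.

```python
# Minimal sketch of tree-based item retrieval in the spirit of LLMTreeRec:
# the LLM chooses one branch per level instead of scoring the whole corpus.

def ask_llm(user_profile: str, options: list[str]) -> int:
    """Placeholder: return the index of the option deemed most relevant.
    A real system would prompt an LLM here."""
    return 0

class ItemTree:
    def __init__(self, label: str, children=None, items=None):
        self.label = label
        self.children = children or []   # internal nodes
        self.items = items or []         # leaf-level candidate items

def retrieve(tree: ItemTree, user_profile: str) -> list[str]:
    node = tree
    while node.children:  # one LLM call per level of the tree
        choice = ask_llm(user_profile, [c.label for c in node.children])
        node = node.children[choice]
    return node.items

catalogue = ItemTree("root", children=[
    ItemTree("electronics", items=["phone", "laptop"]),
    ItemTree("books", items=["novel", "textbook"]),
])
# With the stub always picking index 0, this returns the electronics leaf.
print(retrieve(catalogue, "user who reads a lot"))
```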
Bridging Relevance and Reasoning: Rationale Distillation in Retrieval-Augmented Generation
Pengyue Jia | Derong Xu | Xiaopeng Li | Zhaocheng Du | Xiangyang Li | Yichao Wang | Yuhao Wang | Qidong Liu | Maolin Wang | Huifeng Guo | Ruiming Tang | Xiangyu Zhao
Findings of the Association for Computational Linguistics: ACL 2025
The reranker and generator are two critical components in the Retrieval-Augmented Generation (RAG) pipeline, responsible for ranking relevant documents and generating responses, respectively. However, due to differences in pre-training data and objectives, there is an inevitable gap between the documents ranked as relevant by the reranker and those required by the generator to support answering the query. To address this gap, we propose RADIO, a novel and practical preference alignment framework with RAtionale DIstillatiOn. Specifically, we first propose a rationale extraction method that leverages the reasoning capabilities of large language models (LLMs) to extract the rationales necessary for answering the query. Subsequently, a rationale-based alignment process is designed to rerank the documents based on the extracted rationales and fine-tune the reranker to align its preferences. We conduct extensive experiments on two tasks across three datasets to demonstrate the effectiveness of our approach compared to baseline methods. Our code is released online to ease reproduction.
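A hedged sketch of the rationale-based reranking step: extract a rationale with an LLM (stubbed here) and order documents by how well they cover it. Simple token overlap stands in for whatever relevance scorer RADIO actually uses.

```python
# Sketch of rationale-based reranking in the spirit of RADIO. The rationale
# extraction call is stubbed; token overlap is a stand-in scorer.

def extract_rationale(query: str) -> str:
    """Placeholder for an LLM call returning the evidence needed to answer."""
    return "transformer architecture introduced 2017"

def overlap_score(rationale: str, doc: str) -> float:
    """Fraction of rationale tokens that appear in the document."""
    r, d = set(rationale.lower().split()), set(doc.lower().split())
    return len(r & d) / max(len(r), 1)

def rerank(query: str, docs: list[str]) -> list[str]:
    rationale = extract_rationale(query)
    return sorted(docs, key=lambda doc: overlap_score(rationale, doc), reverse=True)

docs = [
    "BERT fine-tuning tips and tricks.",
    "The transformer architecture was introduced in 2017.",
]
print(rerank("When was the transformer introduced?", docs))
```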
Breaking Down Power Barriers in On-Device Streaming ASR: Insights and Solutions
Yang Li | Yuan Shangguan | Yuhao Wang | Liangzhen Lai | Ernie Chang | Changsheng Zhao | Yangyang Shi | Vikas Chandra
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)
Power consumption plays a crucial role in on-device streaming speech recognition, significantly influencing the user experience. This study explores how the configuration of weight parameters in speech recognition models affects their overall energy efficiency. We found that the influence of these parameters on power consumption varies depending on factors such as invocation frequency and memory allocation. Leveraging these insights, we propose design principles that enhance on-device speech recognition models by reducing power consumption with minimal impact on accuracy. Our approach, which adjusts model components based on their specific energy sensitivities, achieves up to 47% lower energy usage while preserving comparable model accuracy and improving real-time performance compared to leading methods.
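To make the invocation-frequency insight concrete, here is a toy first-order energy model; the components and numbers are invented for illustration and are not from the paper.

```python
# Toy energy model, assuming (as the abstract suggests) that a component's
# energy cost scales with its parameter count times how often it is invoked.
# Component names and figures are hypothetical.

COMPONENTS = {
    # name: (params in millions, invocations per second of streaming audio)
    "encoder": (60, 25),    # runs on every audio frame
    "predictor": (20, 5),   # runs only when a token is emitted
    "joiner": (5, 125),     # runs per (frame, token) pair
}

def energy_proxy(components: dict) -> float:
    """Relative energy: weight fetches per second (params x invocation rate)."""
    return sum(params * freq for params, freq in components.values())

base = energy_proxy(COMPONENTS)
# Halving the frequently invoked joiner saves more energy than halving the
# rarely invoked predictor, despite the predictor having 4x the parameters.
small_joiner = {**COMPONENTS, "joiner": (2.5, 125)}
small_predictor = {**COMPONENTS, "predictor": (10, 5)}
print(f"baseline={base:.1f}, "
      f"small joiner saves {base - energy_proxy(small_joiner):.1f}, "
      f"small predictor saves {base - energy_proxy(small_predictor):.1f}")
```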
2024
MM-SAP: A Comprehensive Benchmark for Assessing Self-Awareness of Multimodal Large Language Models in Perception
Yuhao Wang | Yusheng Liao | Heyang Liu | Hongcheng Liu | Yanfeng Wang | Yu Wang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in visual perception and understanding. However, these models also suffer from hallucinations, which limit their reliability as AI systems. We believe that these hallucinations are partially due to the models’ struggle with understanding what they can and cannot perceive from images, a capability we refer to as self-awareness in perception. Despite its importance, this aspect of MLLMs has been overlooked in prior studies. In this paper, we aim to define and evaluate the self-awareness of MLLMs in perception. To do this, we first introduce the knowledge quadrant in perception, which helps define what MLLMs know and do not know about images. Using this framework, we propose a novel benchmark, the Self-Awareness in Perception for MLLMs (MM-SAP), specifically designed to assess this capability. We apply MM-SAP to a variety of popular MLLMs, offering a comprehensive analysis of their self-awareness and providing detailed insights. The experimental results reveal that current MLLMs possess limited self-awareness capabilities, pointing to a crucial area for future advancement in the development of trustworthy MLLMs. Code and data are available at https://github.com/YHWmz/MM-SAP.
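The knowledge-quadrant idea can be illustrated with a small sketch: cross whether the needed evidence is visible in the image with whether it falls inside the model's knowledge, and read off the self-aware behavior. The labels below paraphrase the concept and are not MM-SAP's exact terms.

```python
# Illustrative "knowledge quadrant" for perception, assuming MM-SAP's framing:
# is the needed information present in the image, and is it within the model's
# knowledge? Labels are paraphrased, not the paper's terminology.

def expected_behavior(info_in_image: bool, within_knowledge: bool) -> str:
    if info_in_image and within_knowledge:
        return "answer: evidence is perceivable and interpretable"
    if info_in_image:
        return "acknowledge: visible, but beyond the model's knowledge"
    if within_knowledge:
        return "refuse: known topic, but the image does not show it"
    return "refuse: neither perceivable nor known"

for flags in [(True, True), (True, False), (False, True), (False, False)]:
    print(flags, "->", expected_behavior(*flags))
```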