Mengqi Liao

2026

With reasoning becoming the generative paradigm for large language models, the memory bottleneck caused by KV cache during the inference phase has become a critical factor limiting high-concurrency service capabilities. Although existing KV cache eviction methods address the memory issue, most of them are impractical for industrial-grade applications. This paper introduces Compressed PagedAttention, a method that combines token-wise KV cache eviction with PagedAttention. We propose a comprehensive scheduling strategy and support prefix caching and asynchronous compression for Compressed PagedAttention. Based on this, we have developed a high-concurrency inference engine, Zipage. On large-scale mathematical reasoning tasks, Zipage achieves around 95% of the performance of Full KV inference engines while delivering over 2.1× speedup.

2025

Reasoning large language models (LLMs) excel in complex tasks, which has drawn significant attention to reinforcement learning (RL) for LLMs. However, existing approaches allocate an equal number of rollouts to all questions during the RL process, which is inefficient. This inefficiency stems from the fact that training on simple questions yields limited gains, whereas more rollouts are needed for challenging questions to sample correct answers. Furthermore, while RL improves response precision, it limits the model’s exploration ability, potentially resulting in a performance cap below that of the base model prior to RL. To address these issues, we propose a mechanism for dynamically allocating rollout budgets based on the difficulty of the problems, enabling more efficient RL training. Additionally, we introduce an adaptive dynamic temperature adjustment strategy to maintain the entropy at a stable level, thereby encouraging sufficient exploration. This enables LLMs to improve response precision while preserving their exploratory ability to uncover potential correct pathways. The code and data are available at: https://anonymous.4open.science/r/E3-RL4LLMs-DB28

2024

KPatch: Knowledge Patch to Pre-trained Language Model for Zero-Shot Stance Detection on Social Media
Shuohao Lin | Wei Chen | Yunpeng Gao | Zhishu Jiang | Mengqi Liao | Zhiyu Zhang | Shuyuan Zhao | Huaiyu Wan
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Zero-shot stance detection on social media (ZSSD-SM) aims to distinguish the attitude in tweets towards an unseen target. Previous work captures latent variables between source and target domains to perform this task, but the lack of context knowledge hinders detection performance. Recent studies have been devoted to obtaining accurate representations of tweets by bringing in additional facts from a knowledge graph (KG), showing promising performance. However, these knowledge injection methods still suffer from two challenges: (i) the pipeline of knowledge injection causes error accumulation, and (ii) irrelevant knowledge makes them fail to understand the semantics. In this paper, we propose a novel knowledge injection method for ZSSD-SM, which adopts two training stages, namely knowledge compression and task guidance, to flexibly inject knowledge into the pre-trained language model (PLM) and adaptively expand the tweets' context. Specifically, in the knowledge compression stage, the latent representation of the KG is reconstructed by the triplet denoising task and compressed into external matrices; in the task guidance stage, the frozen matrices are employed to guide the PLM to adaptively extract its own context-related knowledge and then complete the fine-tuning on the ZSSD-SM task. Extensive experiments on multiple datasets show the effectiveness of our proposed method. The code is available at: https://github.com/ShuohaoLin/KPatch.