Peisong Wang

2026

Parameter-efficient fine-tuning (PEFT) has become a prevalent approach for adapting large language models (LLMs). However, low-rank adaptation methods face an inherent trade-off: improving target task performance can compromise pre-trained world knowledge, while aggressively constraining updates to preserve world knowledge may hinder improvements in the target task. Furthermore, most current methods fail to account for layer-wise differences in adaptation sensitivity, resulting in suboptimal preservation of world knowledge and task adaptation. To address these challenges, we propose Fisher-Optimized Adaptive Low Rank and Singular-Vector Selection (FARSS), an effective framework for knowledge-preserving fine-tuning. This framework introduces two key innovations. First, we propose a Fisher-guided adaptive rank allocation strategy, which assigns smaller ranks to shallow layers that are critical for preserving world knowledge, and larger ranks to deep layers that are essential for task adaptation. Second, we introduce a task-aware initialization method that integrates singular value information with layer-specific second-order statistics estimated from activation and gradient covariances, enabling efficient and task-sensitive low-rank updates. We evaluated several models across various tasks, and the experimental results show that our approach outperforms existing PEFT methods, including LoRA, Corda, and KaSA, achieving a balance between preserving world knowledge and enhancing target task performance. The code is available at https://github.com/chenyehuang/FARSS.
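
The rank allocation idea can be illustrated with a minimal sketch. The abstract does not state the actual allocation rule, so everything here is an assumption: the function name `allocate_ranks`, the per-layer scores (taken as given, with a higher score meaning a more task-sensitive layer), and the proportional split of a fixed rank budget are all hypothetical and are not the FARSS implementation.

```python
def allocate_ranks(layer_scores, total_rank, min_rank=1):
    """Split a total rank budget across layers in proportion to a per-layer score.

    Hypothetical rule for illustration only: every layer gets `min_rank`,
    and the remaining budget is shared proportionally to `layer_scores`.
    """
    n = len(layer_scores)
    if total_rank < n * min_rank:
        raise ValueError("budget too small to give every layer min_rank")
    spare = total_rank - n * min_rank
    total = sum(layer_scores)
    # Fractional share of the spare budget for each layer.
    shares = [spare * s / total for s in layer_scores]
    ranks = [min_rank + int(x) for x in shares]
    # Hand out the ranks lost to truncation, largest fractional part first.
    leftover = total_rank - sum(ranks)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        ranks[i] += 1
    return ranks


# Assumed scores that grow with depth, so deeper layers receive larger ranks.
scores = [1, 2, 3, 4]
ranks = allocate_ranks(scores, total_rank=32)
```

With scores that increase with depth, the returned ranks are non-decreasing from shallow to deep layers and always sum to the budget.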

Assessing how well a large language model (LLM) understands humans, rather than merely text, remains an open challenge. To bridge the gap, we introduce Sentient Agent as a Judge (SAGE), an automated evaluation framework that measures an LLM’s higher-order social cognition. SAGE instantiates a “Sentient Agent”, an LLM-powered agent that simulates human-like emotional changes and inner thoughts to provide a more realistic evaluation of the tested model in multi-turn conversations. At every turn, the agent reasons about (i) how its emotion changes, (ii) how it feels, and (iii) how it should reply, yielding a numerical emotion trajectory and interpretable inner thoughts. Experiments on 100 supportive-dialogue scenarios show that the final Sentient emotion score correlates strongly with Barrett-Lennard Relationship Inventory (BLRI) ratings and utterance-level empathy metrics, validating psychological fidelity. Human evaluation further demonstrates 85.3% consistency between the agent’s emotional reasoning and human judgments. We also build a public Sentient Leaderboard covering 18 commercial and open-source models that uncovers substantial gaps (up to 4×) between frontier systems (GPT-4o-Latest, Gemini2.5-Pro) and earlier baselines, gaps not reflected in conventional leaderboards (e.g., Arena). SAGE thus provides a principled, scalable, and interpretable tool for tracking progress toward genuinely empathetic and socially adept language agents.

2025

As the parameter size of language models becomes extremely large, fine-tuning them with limited resources has become a challenging task. Recent advances in parameter-efficient fine-tuning (PEFT) techniques allow for adjustments to only a minor fraction of the parameters of these LLMs. Yet, most PEFT methods may suffer from the following limitations: (1) As the rank decreases sharply, PEFT methods like LoRA and Adapter tuning exhibit significant performance degradation in downstream tasks. (2) An accuracy gap between these methods and full fine-tuning (Full-FT) still exists. To tackle these problems, we propose a Low-Rank Direct Attention Adaptation (LoRaDA) method for efficient LLM fine-tuning. Specifically, we introduce a novel Low-rank Multi-head Attention Map Module (LMAM), which can bring negative attention to self-attention modules and learn low-rank attention weights directly, capturing the characteristics of downstream tasks. Furthermore, LMAM can serve as a plug-in to existing methods, such as LoRA and Adapter, providing state-of-the-art performance even in extremely low-rank settings. Extensive experiments on various downstream tasks demonstrate the superior performance of our LoRaDA method. Specifically, LoRaDA even outperforms the full fine-tuning method by up to 2.1% on the GLUE benchmark. As a plug-in, LMAM boosts the accuracy of LoRA by up to 27.7% with LLaMA-7B on the Commonsense Reasoning benchmark.

RQT: Hierarchical Residual Quantization for Multi-Model Compression
Chen Tianqi | Peisong Wang | Weixiang Xu | Zeyu Zhu | Jian Cheng
Findings of the Association for Computational Linguistics: ACL 2025

Delta compression methods focus on efficiently serving multiple uniquely fine-tuned models, each tailored to specific tasks and user requirements. These approaches decompose a fine-tuned LLM into a base model and corresponding delta weights, which are compressed using low-rank or low-bit representations to reduce storage costs. However, their effectiveness is highly sensitive to the magnitude of the model deltas—a factor directly influenced by the scale of the training data. We propose the Residual Quantization Tree (RQT), a hierarchical quantization framework that automatically shares low-bit integer weights across similar fine-tuned models. The RQT construction employs a two-phase greedy algorithm: a bottom-up aggregation of models based on weight matrix similarity, and top-down residual quantization, in which each node optimizes the quantization parameters and then delegates residual errors to child nodes. We evaluate RQT on fine-tuned models across mathematics, coding, chatbot, and Chinese LLMs. The results show that RQT achieves an average accuracy degradation of approximately 3% (comparable to previous 4-bit post-training quantization) while maintaining an effective bitwidth of around 2 bits.
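
The top-down residual step can be sketched on a fixed two-level tree. This is a simplified illustration under stated assumptions, not the RQT algorithm: the parent is taken to be the plain mean of the model deltas, quantization is symmetric and uniform with a max-based scale, and the names `quantize`, `dequantize` and `two_level_residual` are hypothetical. The bottom-up similarity grouping and the optimization of quantization parameters described in the abstract are not modeled.

```python
def quantize(values, bits):
    """Symmetric uniform quantization: returns (integer codes, scale)."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    return [c * scale for c in codes]


def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_level_residual(deltas, parent_bits=2, child_bits=2):
    """Quantize a shared parent, then each model's residual against it."""
    n = len(deltas[0])
    # Assumption: the shared parent is the element-wise mean of the deltas.
    mean = [sum(d[i] for d in deltas) / len(deltas) for i in range(n)]
    parent = dequantize(*quantize(mean, parent_bits))
    recon = []
    for d in deltas:
        # The child node only stores what the parent failed to capture.
        residual = [x - p for x, p in zip(d, parent)]
        child = dequantize(*quantize(residual, child_bits))
        recon.append([p + c for p, c in zip(parent, child)])
    return parent, recon


deltas = [[0.9, -0.2, 0.4, 0.1], [1.1, -0.4, 0.2, 0.3]]
parent, recon = two_level_residual(deltas)
```

Because the parent codes are stored once and shared, only the child residual codes are paid for per model, which is the storage saving that sharing is meant to provide.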

State Space Models (SSMs), such as Mamba, have recently demonstrated potential in language understanding tasks, positioning them as competitors to transformer architectures. However, our investigations reveal that the Mamba architecture still has room for further optimization—not only in linear projections but also in state caches, which contribute significantly to memory consumption, particularly after quantizing the former into low bits. After a theoretical analysis of the causes of outliers in states, we propose Decoupled Scale Quantization (DSQ), which mitigates outliers in both the state and channel dimensions by applying separate quantization scales. To preserve the selective ability of quantized Mamba, we introduce Efficient Selectivity Reconstruction (ESR), a novel quantization simulation scheme in block-wise reconstruction that enables fast parallel scan algorithms with the non-linear quantization function. We demonstrate the effectiveness of Q-Mamba across various quantization settings, model sizes, and both generation and zero-shot tasks. In particular, for Mamba2-2.7B with W8A8H4 (8-bit weights and activations, 4-bit state caches) quantization, Q-Mamba achieves a 50% reduction in memory consumption with only a 2.13% average accuracy degradation on zero-shot tasks.
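
The idea of separate quantization scales along the state and channel dimensions can be illustrated with a small sketch. The abstract does not say how the scales are computed, so the max-based rule below, the function names, and the toy 2×2 state are all assumptions for illustration rather than the DSQ method itself. The sketch compares the round-trip error against a single per-tensor scale.

```python
def quantize_decoupled(state, bits=4):
    """state[c][s]: rows are channels, columns are state dimensions."""
    qmax = 2 ** (bits - 1) - 1
    n_ch, n_st = len(state), len(state[0])
    # One scale per state dimension (column) absorbs column-wise outliers.
    col = [max(abs(state[c][s]) for c in range(n_ch)) or 1.0 for s in range(n_st)]
    # One scale per channel (row), computed after dividing out the column scales.
    row = [
        (max(abs(state[c][s]) / col[s] for s in range(n_st)) or 1.0) / qmax
        for c in range(n_ch)
    ]
    codes = [
        [round(state[c][s] / (row[c] * col[s])) for s in range(n_st)]
        for c in range(n_ch)
    ]
    return codes, row, col


def dequantize_decoupled(codes, row, col):
    return [[codes[c][s] * row[c] * col[s] for s in range(len(col))]
            for c in range(len(row))]


def roundtrip_per_tensor(state, bits=4):
    """Baseline: a single scale shared by the whole tensor."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for r in state for v in r) or 1.0
    scale = peak / qmax
    return [[round(v / scale) * scale for v in r] for r in state]


def total_abs_error(a, b):
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Toy state with one large channel and one large state dimension.
state = [[0.3, -70.0], [9.0, 500.0]]
codes, row, col = quantize_decoupled(state)
err_decoupled = total_abs_error(state, dequantize_decoupled(codes, row, col))
err_tensor = total_abs_error(state, roundtrip_per_tensor(state))
```

On this toy input the single per-tensor scale is dominated by the largest entry and rounds the small entries to zero, while the product of a row scale and a column scale keeps them representable.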

EAC-MoE: Expert-Selection Aware Compressor for Mixture-of-Experts Large Language Models
Yuanteng Chen | Yuantian Shao | Peisong Wang | Jian Cheng
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Mixture-of-Experts (MoE) has demonstrated promising potential in scaling LLMs. However, it is hindered by two critical challenges: (1) substantial GPU memory consumption to load all experts; (2) a low number of activated parameters does not translate into an equivalent inference speedup. In this work, we propose EAC-MoE, an Expert-Selection Aware Compressor for MoE-LLMs, which deeply aligns with the characteristics of MoE from the perspectives of quantization and pruning, and introduces two modules to address these two challenges respectively: (1) The expert selection bias caused by low-bit quantization is a major factor contributing to the performance degradation in MoE-LLMs. Based on this, we propose Quantization with Expert-Selection Calibration (QESC), which mitigates the expert selection bias by calibrating the routers within the MoE; (2) There are always certain experts that are not crucial for the corresponding tasks, yet still add inference latency. Therefore, we propose Pruning based on Expert-Selection Frequency (PESF), which significantly improves inference speed by pruning less frequently used experts for the current task. Extensive experiments demonstrate that our approach significantly reduces memory usage and improves inference speed with minimal performance degradation.
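
Frequency-based expert pruning can be sketched as follows. The abstract does not specify the pruning criterion, so the fixed frequency `threshold`, the function names, and the toy routing trace are hypothetical; this is a sketch of the general idea, not the PESF implementation.

```python
from collections import Counter


def selection_frequency(routing, num_experts):
    """routing: one list of selected expert ids per token."""
    counts = Counter(e for token in routing for e in token)
    total = sum(counts.values())
    return [counts.get(e, 0) / total for e in range(num_experts)]


def prune_experts(routing, num_experts, threshold):
    """Keep experts whose share of all selections is at least `threshold`."""
    freq = selection_frequency(routing, num_experts)
    keep = {e for e in range(num_experts) if freq[e] >= threshold}
    # Drop pruned experts from each token's selection; a token may end up
    # with fewer experts than before (or none, which a real system must handle).
    pruned = [[e for e in token if e in keep] for token in routing]
    return keep, pruned


# Toy top-2 routing trace for four tokens and four experts.
routing = [[0, 1], [0, 2], [0, 1], [1, 3]]
keep, pruned = prune_experts(routing, num_experts=4, threshold=0.2)
```

In this trace experts 0 and 1 each account for three of the eight selections, so they survive the 0.2 threshold, while experts 2 and 3 are skipped and no longer contribute compute for the affected tokens.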

Recent studies have demonstrated the effectiveness of LLM test-time scaling. However, existing approaches to incentivize LLMs’ deep thinking abilities generally require large-scale data or significant training efforts. Meanwhile, it remains unclear how to improve the thinking abilities of less powerful base models. In this work, we introduce S²R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Specifically, we first initialize LLMs with iterative self-verification and self-correction behaviors through supervised fine-tuning on carefully curated data. The self-verification and self-correction skills are then further strengthened by outcome-level and process-level reinforcement learning with minimized resource requirements. Our results demonstrate that, with only 3.1k behavior initialization samples, Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data. We also discuss the effect of different RL strategies on enhancing LLMs’ deep reasoning. Extensive experiments and analysis based on three base models across both in-domain and out-of-domain benchmarks validate the effectiveness of S²R.