Wei Ju


2026

This paper investigates the problem of safe decoding for Large Language Models (LLMs) during inference, particularly under jailbreak attacks. Previous approaches typically either detect malicious content or regulate the decoding alignment of LLMs to mitigate such attacks. Although effective in defending against attacks, these methods often over-reject benign content, limiting their generalizability in real-world scenarios where harmful and benign information coexist. Towards this end, we propose an innovative framework named Sequence-level risk Accumulation for calibrating test-time alignment (SEAT). Specifically, SEAT introduces a reward-guided branch decoding paradigm to incorporate safety awareness during generation. To balance the detection of harmful content with the accurate response to benign information, SEAT employs a sequence-level risk monitor that smooths risk signals over the entire sequence, preventing over-confident refusals for certain tokens. Furthermore, we conduct extensive experiments on four attack benchmarks and two neutral datasets, comparing SEAT with eight state-of-the-art baselines. Consequently, the results demonstrate that SEAT achieves superior performance both in defending against jailbreak attacks and in generating high-quality responses on neutral datasets. Our code is available at https://github.com/ShanwenTan/SEAT.
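To make the idea of smoothing risk over a sequence concrete, here is a minimal sketch. The abstract does not specify how SEAT accumulates risk, so the exponential moving average, the `decay` and `threshold` values, and the function names below are all illustrative assumptions rather than the authors' method.

```python
def smoothed_risk(token_risks, decay=0.8):
    """Exponential moving average over per-token risk scores.

    Assumption: each score lies in [0, 1]; the EMA form is hypothetical.
    """
    smoothed = []
    running = 0.0  # start from zero so a single early spike is damped
    for risk in token_risks:
        running = decay * running + (1.0 - decay) * risk
        smoothed.append(running)
    return smoothed


def should_refuse(token_risks, threshold=0.5, decay=0.8):
    """Refuse only when the accumulated (smoothed) risk crosses the threshold."""
    return any(r > threshold for r in smoothed_risk(token_risks, decay))
```

Under these assumptions, one isolated high-risk token in an otherwise benign sequence does not trigger a refusal, while a sustained run of high-risk tokens does.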
This paper studies the problem of test-time adaptation for vision-language models (VLMs). Recent approaches typically measure prediction entropy to build a cache of confident samples for logit refinement. However, these confident samples tend to lie close to class prototypes and cover only a limited part of the data distribution, which can result in biased predictions as the distribution evolves. Towards this end, we propose a novel approach named Diversity-attended Dynamic Caching with Asymmetric Quantization (DANCE) for test-time adaptation of VLMs. The core of DANCE is to maintain a dynamic cache of diversity-aware test samples, which supports efficient logit adjustment via asymmetric quantization. In particular, we first generate multiple augmented views of each sample and aggregate their outputs from pre-trained VLMs via a consistency-aware mechanism. More importantly, we construct a dynamic cache that stores the most reliable and diverse samples to cover evolving test distributions. To measure diversity efficiently, we quantize cached samples and compute the asymmetric similarity between query samples and memory samples, which guides cache updates by iteratively replacing the samples with the lowest scores. Finally, we leverage the asymmetric similarity to the quantized prototype representations in the dynamic cache to update logits under distribution shifts. Extensive experiments on various benchmark datasets validate the superiority of the proposed DANCE in different settings.
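For readers unfamiliar with the entropy-based confidence that the abstract attributes to earlier cache methods, the following sketch shows the standard computation. It illustrates that baseline criterion only, not DANCE itself; the `max_entropy` cutoff and the function names are assumptions chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prediction_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def is_confident(logits, max_entropy=0.5):
    """Hypothetical cache-admission rule: keep only low-entropy predictions."""
    return prediction_entropy(logits) < max_entropy
```

A rule like `is_confident` admits only sharply peaked predictions, which is why such caches can end up concentrated near prototypes, the limitation the abstract points out.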

2025

Multi-modal large language models (MLLMs) have recently achieved great success in processing and understanding information from diverse modalities (e.g., text, audio, and visual signals). Despite their growing popularity, there remains a lack of comprehensive evaluation measuring the audio-visual capabilities of these models, especially in diverse scenarios (e.g., distribution shifts and adversarial attacks). In this paper, we present a multifaceted evaluation of the audio-visual capability of MLLMs, focusing on four key dimensions: effectiveness, efficiency, generalizability, and robustness. Through extensive experiments, we find that MLLMs exhibit strong zero-shot and few-shot generalization abilities, enabling them to achieve great performance with limited data. However, their success relies heavily on the vision modality, which impairs performance when visual input is corrupted or missing. Additionally, while MLLMs are susceptible to adversarial samples, they demonstrate greater robustness compared to traditional models. The experimental results and our observations provide new insights into the audio-visual capabilities of MLLMs, highlighting areas for improvement and offering guidance for future research.
This paper studies the problem of text-attributed hypergraph self-supervised representation learning, which aims to generate discriminative representations of hypergraphs without any annotations for downstream tasks. However, real-world hypergraphs can contain incomplete signals, which can degrade representation learning, especially under label scarcity. Towards this end, we introduce a new perspective that leverages large language models to enhance hypergraph self-supervised learning and propose a novel data-centric approach named Hybrid Hypergraph Enhancement with LLM-based Agents (HEAL). The core of HEAL is to generate informative nodes and hyperedges through multi-round interaction with LLM-based agents. In particular, we first retrieve similar samples for each node so that the node expansion agent can produce different views. To generate challenging samples, we measure the gradients for each augmented view and select the most informative one using an evaluation agent. From the structural view, we adopt a topology refinement agent to incorporate new hyperedges that recover missing structural signals. The enhanced hypergraphs are then incorporated into a self-supervised learning framework to learn discriminative representations. Extensive experiments on several datasets validate the effectiveness of HEAL in comparison with a wide range of baselines.
Large Multimodal Models (LMMs) exhibit impressive cross-modal understanding and reasoning abilities, often assessed through multiple-choice questions (MCQs) that include an image, a question, and several options. However, many benchmarks used for such evaluations suffer from systematic biases. Remarkably, Large Language Models (LLMs) without any visual perception capabilities achieve non-trivial performance, undermining the credibility of these evaluations. To address this issue while maintaining the efficiency of MCQ evaluations, we propose MMEVALPRO, a benchmark designed to avoid Type-I errors through a trilogy evaluation pipeline and more rigorous metrics. For each original question from existing benchmarks, human annotators augment it by creating one perception question and one knowledge anchor question through a meticulous annotation process. MMEVALPRO comprises 2,138 question triplets, totaling 6,414 distinct questions. Two-thirds of these questions are manually labeled by human experts, while the rest are sourced from existing benchmarks (MMMU, ScienceQA, and MathVista). Compared with the existing benchmarks, our experiments with the latest LLMs and LMMs demonstrate that MMEVALPRO is **more challenging** (the best LMM lags behind human performance by 31.73%, compared to an average gap of 8.03% in previous benchmarks) and **more trustworthy** (the best LLM trails the best LMM by 23.09%, whereas the gap for previous benchmarks is just 14.64%). Our in-depth analysis explains the reason for the large performance gap and justifies the trustworthiness of evaluation, underscoring its significant potential for advancing future research.
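The triplet structure suggests a stricter score than per-question accuracy. The sketch below assumes a metric that credits a triplet only when the original, perception, and knowledge anchor questions are all answered correctly. The abstract does not define its metrics, so this rule and the function names are illustrative assumptions.

```python
def triplet_accuracy(results):
    """Fraction of triplets with all three questions answered correctly.

    results: list of (origin_ok, perception_ok, knowledge_ok) booleans.
    Assumption: this all-or-nothing rule is one possible "rigorous" metric.
    """
    if not results:
        return 0.0
    return sum(all(triplet) for triplet in results) / len(results)


def per_question_accuracy(results):
    """Ordinary accuracy over every individual question, for comparison."""
    flat = [ok for triplet in results for ok in triplet]
    return sum(flat) / len(flat) if flat else 0.0


# Counts stated in the abstract: 2,138 triplets of three questions each.
NUM_TRIPLETS = 2138
TOTAL_QUESTIONS = NUM_TRIPLETS * 3          # 6,414 questions in total
NEWLY_LABELED = TOTAL_QUESTIONS * 2 // 3    # two-thirds, i.e. 4,276
```

Under this assumed rule, a model that guesses the original question correctly without answering the supporting questions earns no credit for that triplet, so the triplet score can never exceed the per-question score.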
Supervised fine-tuning (SFT) is crucial in adapting large language models (LLMs) to a specific domain or task. However, only a limited amount of labeled data is available in practical applications, which poses a severe challenge for SFT in yielding satisfactory results. Therefore, a data-efficient framework that can fully exploit labeled and unlabeled data for LLM fine-tuning is highly desirable. Towards this end, we introduce a **semi-supervised fine-tuning (SemiFT)** task and a framework named **SemiEvol** for LLM alignment in a propagate-and-select manner. For knowledge propagation, SemiEvol adopts a bi-level approach, propagating knowledge from labeled data to unlabeled data through both in-weight and in-context methods. For knowledge selection, SemiEvol incorporates a collaborative learning mechanism, selecting higher-quality pseudo-response samples. We conducted experiments using GPT-4o-mini and Llama-3.1 on seven general or domain-specific datasets, demonstrating significant improvements in model performance on target data. Furthermore, we compared SemiEvol with SFT and self-evolution methods, highlighting its practicality in hybrid data scenarios. GitHub Repository: [https://github.com/luo-junyu/SemiEvol](https://github.com/luo-junyu/SemiEvol).
Traffic flow forecasting aims to predict future traffic flows based on historical traffic conditions and the road network. It is an important problem in intelligent transportation systems, with a plethora of methods being proposed. Existing efforts mainly focus on capturing and utilizing spatio-temporal dependencies to predict future traffic flows. Though promising, they fall short in adapting to test-time environmental changes in traffic conditions. To tackle this challenge, we propose to introduce large language models (LLMs) to help traffic flow forecasting and design a novel method named Large Language Model Enhanced Traffic Flow Predictor (LEAF). LEAF adopts two branches, capturing different spatio-temporal relations using graph and hypergraph structures, respectively. The two branches are first pre-trained individually, and during test time, they yield different predictions. Based on these predictions, a large language model is used to select the most likely result. Then, a ranking loss is applied as the learning objective to enhance the prediction ability of the two branches. Extensive experiments on several datasets demonstrate the effectiveness of LEAF. Our code is available at https://github.com/YushengZhao/LEAF.
This paper studies the problem of time series forecasting, which aims to generate future predictions given historical trajectories. Recent studies have applied large language models (LLMs) to time series forecasting, usually aligning the time series space with the textual space and producing future predictions through strong autoregressive reasoning abilities. Despite their remarkable progress, these approaches usually lack an understanding of holistic temporal patterns and are prone to error accumulation. Towards this end, this paper proposes a simple yet effective framework that marries Large Language Diffusion Model with time series forecasting (LEAF). The core of our framework is to generate future predictions with a diffusion model from a holistic view. In particular, we first introduce a tokenization module to convert time series into tokens and then adopt language diffusion models to capture the temporal dependencies. In this way, we can transform the masked time series into the full set of predictions using the remasking strategy. Extensive experiments on various benchmark datasets validate the effectiveness of the proposed LEAF in comparison to various baselines.
Post-training of Large Language Models (LLMs) is crucial for unlocking their task generalization potential and domain-specific capabilities. However, the current LLM post-training paradigm faces significant data challenges, including the high costs of manual annotation and diminishing marginal returns on data scales. Therefore, achieving data-efficient post-training has become a key research question. In this paper, we present the first systematic survey of data-efficient LLM post-training from a data-centric perspective. We propose a taxonomy of data-efficient LLM post-training methods, covering data selection, data quality enhancement, synthetic data generation, data distillation and compression, and self-evolving data ecosystems. We summarize representative approaches in each category and outline future research directions. By examining the challenges in data-efficient LLM post-training, we highlight open problems and propose potential research avenues. We hope our work inspires further exploration into maximizing the potential of data utilization in large-scale model training. Paper List: https://github.com/luo-junyu/Awesome-Data-Efficient-LLM