Rongrong Ji


2026

Factual knowledge stored in Large Language Models (LLMs) inevitably becomes outdated or erroneous over time, making it critical to update these models without incurring the high cost of retraining. Existing sequential knowledge editing methods predominantly rely on strict orthogonal projection to preserve previously edited knowledge. However, this excessive constraint limits gradient expressiveness, resulting in a significant degradation of model generalization and overall performance as the number of edits increases. To address this challenge, we propose Dual-Importance Projection Editing (DipEdit). This method leverages Singular Value Decomposition (SVD) to identify critical gradient subspaces and introduces a dual mechanism comprising "accumulated importance" and "projection importance." Unlike traditional approaches that enforce strict orthogonality, DipEdit dynamically scales gradient components parallel to key subspaces based on their projection importance rather than discarding them directly. This approach enhances the model’s adaptability to new knowledge while maximally preserving historical knowledge. Extensive experiments conducted on five mainstream LLMs using the ZsRE and Counterfact datasets demonstrate that DipEdit effectively handles thousands of sequential edits. The proposed method achieves an average comprehensive performance improvement of 10.36% and effectively maintains the model’s general capabilities on downstream tasks. Code is available at: https://github.com/czhhhla/DipEdit.
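The contrast between strict orthogonal projection and importance-scaled projection can be illustrated with a minimal sketch. It assumes the protected subspace is already given as an orthonormal basis (in practice it might come from an SVD) and that each direction carries an importance weight in [0, 1]. The function name `soft_project` and the weights are hypothetical, since the abstract does not specify how the importance scores are computed.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def soft_project(grad, basis, importance):
    """Shrink the component of grad along each orthonormal basis vector.

    importance = 1.0 removes the component entirely (strict orthogonal
    projection); importance = 0.0 leaves it untouched. Values in between
    scale the component by (1 - importance). Hypothetical illustration only.
    """
    out = list(grad)
    for u, w in zip(basis, importance):
        coeff = dot(grad, u)  # size of the component along u
        out = [o - w * coeff * ui for o, ui in zip(out, u)]
    return out


basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # assumed protected directions
grad = [2.0, 4.0, 6.0]

strict = soft_project(grad, basis, [1.0, 1.0])  # discards both components
soft = soft_project(grad, basis, [1.0, 0.5])    # keeps half of the second
print(strict, soft)
```

With all weights set to 1.0 the sketch reduces to the strict projection that the abstract argues is too restrictive.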
Algorithm Visualization (AV) helps students build mental models by animating algorithm execution states. Recent LLM-based systems such as CODE2VIDEO generate AV videos in an end-to-end manner. However, this paradigm requires the system to simultaneously simulate algorithm flow and satisfy video rendering constraints (element layout, color schemes, etc.), a complex task that induces LLM hallucinations. This results in reduced execution success rates, element overlap, and inter-frame inconsistencies. To address these challenges, we propose ALGOGEN, a novel paradigm that decouples algorithm execution from rendering. We first introduce Visualization Trace Algebra (VTA), a monoid over algorithm visual states and operations. The LLM then generates a Python tracker that simulates algorithm flow and outputs VTA-JSON traces, a JSON encoding of VTA. For rendering, we define a Rendering Style Language (RSL) to templatize algorithm layouts. A deterministic renderer then compiles algorithm traces with RSL into Manim, LaTeX/TikZ, or Three.js outputs (respectively a Python animation engine, a LaTeX vector graphics package, and a JavaScript 3D rendering library). Evaluated on a LeetCode AV benchmark of 200 tasks, ALGOGEN improves the average success rate by 17.3 percentage points over end-to-end methods (99.8% vs. 82.5%). These results demonstrate that our decoupling paradigm effectively mitigates LLM hallucinations in complex AV tasks, providing a more reliable solution for automated generation of high-quality algorithm visualizations. Demo videos and code are available at: .
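A tracker of the kind described above might look like the following sketch. The operation names (`init`, `compare`, `swap`, `done`) and the field layout are hypothetical and are not the actual VTA-JSON schema, which the abstract does not define. The point is only that the tracker runs the algorithm and records states, leaving all layout and styling to a separate renderer.

```python
import json


def bubble_sort_trace(values):
    """Run bubble sort and record each visual operation as a plain dict.

    The trace schema here is an assumption made for illustration.
    """
    arr = list(values)
    trace = [{"op": "init", "array": list(arr)}]
    n = len(arr)
    for i in range(n):
        for j in range(n - 1 - i):
            trace.append({"op": "compare", "i": j, "j": j + 1})
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]
                trace.append({"op": "swap", "i": j, "j": j + 1, "array": list(arr)})
    trace.append({"op": "done", "array": list(arr)})
    return trace


trace = bubble_sort_trace([3, 1, 2])
print(json.dumps(trace, indent=2))
```

Because a trace is an ordered list of operations, two traces combine by concatenation and the empty list acts as the identity, which is one simple way to read the monoid structure mentioned in the abstract.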
Large Language Models (LLMs) demonstrate strong generation and reasoning abilities, but they still face challenges in long-term memory retention and multi-turn conversational consistency. Existing memory-augmented methods often incorporate full dialog histories without filtering, resulting in information redundancy and inference latency. Inspired by the episodic memory mechanism in human cognition, we abstract conversational context into Episodic Memory Units (EMUs). We then propose a comprehensive framework, Episodic Memory Agent (EMA), along with a filtering decision module called MemDecider. Specifically, EMA organizes and retrieves EMUs to support response generation, while MemDecider filters information to reduce noise and improve overall performance. Experiments on two widely used benchmarks show that EMA maintains competitive performance, and that integrating MemDecider into other methods reduces their token consumption by an average of 11.48% while improving overall performance. Code is available at https://github.com/Hongyi4221/EMA.

2025

The Mixture of Experts (MoE) architecture enables efficient model scaling through conditional computation, where only a subset of parameters is activated per input. However, this distributed architecture poses unprecedented challenges for model compression, as conventional quantization methods optimized for dense networks prove inadequate. This paper introduces a specialized quantization framework for MoE architectures, motivated by our discovery that weight matrices across expert networks exhibit distinctive channel-wise outlier distributions, necessitating a more nuanced compression approach. Through theoretical analysis incorporating Fisher Information matrices and condition number characteristics, we establish a fundamental relationship between layer functionality and quantization sensitivity, demonstrating that down-projection layers inherently demand higher precision than up-projection layers. Leveraging these insights, we develop an automated channel-wise quantization framework that dynamically determines optimal bit-width allocations while maintaining minimal computational overhead through efficient statistical approximations. When evaluated on the Mixtral-8x7b-v0.1 architecture, our methodology demonstrates a 3.96% improvement over existing state-of-the-art approaches across natural language understanding benchmarks, while achieving superior compression ratios.
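To make the idea of channel-wise bit-width allocation concrete, the sketch below applies plain symmetric uniform quantization to each channel with its own bit-width. It is a generic illustration, not the paper's method: the bit-widths are supplied by hand, whereas the abstract describes selecting them automatically, and all function names are hypothetical.

```python
def quantize_channel(weights, bits):
    """Symmetric uniform quantization of one channel; returns dequantized values.

    Assumes bits >= 2 so that at least one positive level exists.
    """
    qmax = 2 ** (bits - 1) - 1
    # Fall back to a scale of 1.0 for an all-zero channel.
    scale = max(abs(w) for w in weights) / qmax or 1.0
    return [round(w / scale) * scale for w in weights]


def quantize_matrix(channels, bit_widths):
    """Quantize each channel with its own (hand-picked) bit-width."""
    return [quantize_channel(ch, b) for ch, b in zip(channels, bit_widths)]


def quant_error(weights, bits):
    """Total absolute error introduced by quantizing one channel."""
    return sum(abs(w - q) for w, q in zip(weights, quantize_channel(weights, bits)))


outlier_channel = [0.1, -0.2, 0.05, 8.0]  # one large outlier dominates the scale
print(quant_error(outlier_channel, 4), quant_error(outlier_channel, 8))
```

A channel with a single large outlier loses most of its small weights at low precision, which is the kind of sensitivity that motivates giving some channels more bits than others.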
Hypernetworks are a class of meta-networks that generate weights for main neural networks. Their unique parameter spaces necessitate exploring suitable optimization strategies to enhance performance, especially for language models. However, a comprehensive investigation into optimization strategies for hypernetworks remains absent. To address this gap, we analyze the loss landscape of hypernetworks and propose that restart optimization strategies can improve their performance for language models. We find that hypernetworks have inherently more complicated loss landscapes compared to conventional networks due to their distinct parameter spaces. Consequently, a restart strategy that periodically resets the learning rate can facilitate better convergence for hypernetworks. Through experiments on instruction tuning and multi-task training, we demonstrate that the restart strategy consistently enhances the performance of hypernetworks for language models, often more effectively than for conventional deep neural networks. Our findings highlight the importance of tailored optimization techniques to unlock the full potential of hypernetworks in natural language processing tasks.
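A restart strategy that periodically resets the learning rate can be sketched as a cosine schedule with warm restarts. The specific schedule shape, the function name `restart_lr`, and the default values below are assumptions chosen for illustration, since the abstract does not state which restart schedule or hyperparameters were used.

```python
import math


def restart_lr(step, base_lr=1e-3, min_lr=1e-5, period=100):
    """Cosine-decayed learning rate that jumps back to base_lr every `period` steps."""
    t = step % period  # position inside the current cycle
    cosine = 0.5 * (1.0 + math.cos(math.pi * t / period))
    return min_lr + (base_lr - min_lr) * cosine


# Three full cycles: the rate decays within a cycle and resets at steps 100 and 200.
schedule = [restart_lr(s) for s in range(300)]
print(schedule[0], schedule[99], schedule[100])
```

Each reset gives the optimizer a chance to leave a poor region of a complicated loss landscape, at the cost of temporarily larger updates.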
Jailbreak attacks craft specific prompts or append adversarial suffixes to prompts, thereby inducing language models to generate harmful or unethical content and bypassing the model’s safety guardrails. With the recent rise of large language models (LLMs), there is a growing focus on jailbreak attacks to probe their safety. While current white-box attacks typically focus on meticulously identifying adversarial suffixes for specific models, their effectiveness and efficiency diminish when applied to different LLMs. In this paper, we propose a Monte Carlo Tree Search (MCTS) based Prompt Auto-generation (MPA) method to enhance the effectiveness and efficiency of attacks across various models. MPA automatically searches for and generates adversarial suffixes for valid jailbreak attacks. Specifically, we first identify a series of action candidates that could potentially trick LLMs into providing harmful responses. To streamline the exploration of adversarial suffixes, we design a prior confidence probability for each MCTS node. We then iteratively auto-generate adversarial prompts using the MCTS framework. Extensive experiments on multiple open-source models (like Llama, Gemma, and Mistral) and closed-source models (such as ChatGPT) show that our proposed MPA surpasses existing methods in search efficiency as well as attack effectiveness. The code is available at https://github.com/KDEGroup/MPA.
While long-context large language models (LLMs) exhibit remarkable document processing capabilities, their prohibitively high training costs often hinder customized applications. To mitigate this issue, we propose __Sequential Chunk-wise Optimization (SeCO)__, a memory-efficient training paradigm that partitions lengthy inputs into manageable chunks. Each chunk independently constructs its computational graph and performs localized backpropagation, ensuring that only one chunk’s forward activations are stored in memory. Building on SeCO, we further introduce __Sparse Chunk-wise Optimization (SpaCO)__, which reduces computational overhead by selectively propagating gradients to specific chunks and incorporates a carefully designed compensation factor to ensure unbiased gradient estimation. SpaCO decouples the computational cost of backpropagation from the context length, enabling training time to gradually converge to inference time as sequences become longer. Implemented as lightweight training wrappers, both SeCO and SpaCO offer substantial practical benefits. For example, when fine-tuning an 8B model with LoRA on a single RTX 3090 GPU, SeCO expands the maximum sequence length from 1K to 16K tokens, while SpaCO trains up to 3× faster than SeCO under the same experimental setup. These innovations provide new insights into optimizing long-context models, making them more accessible for practical applications. We have open-sourced the code at https://anonymous.4open.science/r/seco-CCBD.
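The role of a compensation factor in keeping a sparse gradient estimate unbiased can be shown with a generic inverse-probability sketch. This is not SpaCO's actual factor, which the abstract does not specify. Here each chunk's gradient is assumed to be kept independently with probability `keep_prob` and rescaled by `1 / keep_prob`, and scalar values stand in for real gradients.

```python
from itertools import product


def sparse_estimate(chunk_grads, mask, keep_prob):
    """Sum the gradients of kept chunks, each rescaled by 1 / keep_prob."""
    return sum(g / keep_prob for g, keep in zip(chunk_grads, mask) if keep)


def expected_estimate(chunk_grads, keep_prob):
    """Exact expectation over all keep/drop masks (feasible for a few chunks)."""
    total = 0.0
    for mask in product([0, 1], repeat=len(chunk_grads)):
        kept = sum(mask)
        prob = keep_prob ** kept * (1 - keep_prob) ** (len(mask) - kept)
        total += prob * sparse_estimate(chunk_grads, mask, keep_prob)
    return total


chunk_grads = [0.5, -1.25, 2.0, 0.75]  # scalar stand-ins for per-chunk gradients
full = sum(chunk_grads)
print(full, expected_estimate(chunk_grads, 0.25))
```

Averaged over all possible masks, the rescaled sparse estimate matches the full sum of chunk gradients, even though any single step backpropagates through only a fraction of the chunks.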

2024

Given the long textual product information and the product image, Multi-modal Product Summarization (MPS) aims to increase customers’ desire to purchase by highlighting product characteristics with a short textual summary. Existing MPS methods can produce promising results. Nevertheless, they still 1) lack end-to-end product summarization, 2) lack multi-grained multi-modal modeling, and 3) lack multi-modal attribute modeling. To improve MPS, we propose an end-to-end multi-grained multi-modal attribute-aware product summarization method (MMAPS) for generating high-quality product summaries in e-commerce. MMAPS jointly models product attributes and generates product summaries. We design several multi-grained multi-modal tasks to better guide the multi-modal learning of MMAPS. Furthermore, we model product attributes based on both text and image modalities so that multi-modal product characteristics can be manifested in the generated summaries. Extensive experiments on a real large-scale Chinese e-commerce dataset demonstrate that our model outperforms state-of-the-art product summarization methods w.r.t. several summarization metrics. Our code is publicly available at: https://github.com/KDEGroup/MMAPS.
Code pre-trained language models (CPLMs) have received great attention since they can benefit various tasks that facilitate software development and maintenance. However, CPLMs are trained on massive open-source code, raising concerns about potential data infringement. This paper launches the study of detecting unauthorized code use in CPLMs, i.e., the Code Membership Inference (CMI) task. We design a framework, Buzzer, for different settings of CMI. Buzzer deploys several inference techniques, including signal extraction from pre-training tasks, hard-to-learn sample calibration, and weighted inference, to identify code membership status accurately. Extensive experiments show that CMI can be achieved with high accuracy using Buzzer. Hence, Buzzer can serve as a CMI tool and help protect intellectual property rights. The implementation of Buzzer is available at: https://github.com/KDEGroup/Buzzer
This paper introduces AnyText, an all-encompassing framework for the task of In-Image Machine Translation (IIMT), which includes multilingual text translation and text fusion within images. Our framework leverages the strengths of large-scale models, such as Large Language Models (LLMs) and text-guided diffusion models, to incorporate contextual cues from both textual and visual elements during translation. The few-shot learning capability of LLMs allows for the translation of fragmented texts by considering the overall context. Meanwhile, diffusion models’ advanced inpainting and editing abilities make it possible to fuse translated text seamlessly into the original image while preserving its style and realism. Our framework can be constructed entirely using open-source models and requires no training, making it highly accessible and easily expandable. To encourage advancement in the IIMT task, we have meticulously compiled a test dataset called MTIT6, which consists of multilingual text image translation data from six language pairs.

2016