Dazhen Deng


2026

Multi-round batch knowledge editing often suffers from performance degradation as edits accumulate. Focusing on the locate-then-edit paradigm, we analyze this phenomenon from a spectral perspective and identify a previously overlooked structural factor: the model's intrinsic knowledge and its historical edit memories exhibit markedly different spectral characteristics and information distributions, yet they are naively coupled and jointly inverted during editing. Based on this insight, we propose SpecEdit, a module that improves model editing. SpecEdit performs spectral decoupling to isolate editing-critical directions and reduce destructive coupling, followed by spectral-structure-aware information compensation and spectral fusion to construct a refined closed-form solution. The module integrates seamlessly into existing editing methods without altering their original optimization procedures. Experiments on multiple LLMs and editing methods show that SpecEdit consistently improves performance, demonstrating that modeling spectral structure is an effective, interpretable approach and a promising direction for future research.
Large language models (LLMs) are increasingly applied in specialized domains such as finance and healthcare, where they introduce unique safety risks. Domain-specific datasets of harmful prompts remain scarce and still largely rely on manual construction; public datasets mainly focus on explicit harmful prompts, which modern LLM defenses can often detect and refuse. In contrast, implicit harmful prompts—expressed through indirect domain knowledge—are harder to detect and better reflect real-world threats. We identify two challenges: transforming domain knowledge into actionable constraints and increasing the implicitness of generated harmful prompts. To address them, we propose an end-to-end framework that first performs knowledge-graph-guided harmful prompt generation to systematically produce domain-relevant prompts, and then applies two-strategy obfuscation rewriting to convert explicit harmful prompts into implicit variants via direct and context-enhanced rewriting. This framework yields high-quality datasets combining strong domain relevance with implicitness, enabling more realistic red-teaming and advancing LLM safety research. We release our code and datasets on GitHub.