Zhihui Fu

2026

Puns are a common form of rhetorical wordplay that exploits polysemy and phonetic similarity to create humor. In multimodal puns, visual and textual elements synergize to ground the literal sense and evoke the figurative meaning simultaneously. Although Vision-Language Models (VLMs) are widely used in multimodal understanding and generation, their ability to understand puns has not been systematically studied due to a scarcity of rigorous benchmarks. To address this, we first propose a multimodal pun generation pipeline. We then introduce MultiPun, a dataset comprising diverse types of puns alongside adversarial non-pun distractors. Our evaluation reveals that most models struggle to distinguish genuine puns from these distractors. Moreover, we propose both prompt-level and model-level strategies to enhance pun comprehension, with an average improvement of 16.5% in F1 scores. Our findings provide valuable insights for developing future VLMs that master the subtleties of human-like humor via cross-modal reasoning.

pdf bib abs

POP: Prefill-Only Pruning for Efficient Large Model Inference
Junhui He | Zhihui Fu | Jun Wang | Qingan Li
Findings of the Association for Computational Linguistics: ACL 2026

Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated remarkable capabilities.However, their deployment is hindered by significant computational costs. Existing structured pruning methods, while hardware-efficient, often suffer from significant accuracy degradation. In this paper, we argue that this failure stems from a stage-agnostic pruning approach that overlooks the asymmetric roles between the prefill and decode stages. By introducing a virtual gate mechanism, our importance analysis reveals that deep layers are critical for next-token prediction (decode) but largely redundant for context encoding (prefill). Leveraging this insight, we propose Prefill-Only Pruning (POP), a stage-aware inference strategy that safely omits deep layers during the computationally intensive prefill stage while retaining the full model for the sensitive decode stage. To enable the transition between stages, we introduce independent Key-Value (KV) projections to maintain cache integrity, and a boundary handling strategy to ensure the accuracy of the first generated token. Extensive experiments on Llama-3.1, Qwen3-VL, and Gemma-3 across diverse modalities demonstrate that POP achieves up to 1.37× speedup in prefill latency with minimal performance loss, effectively overcoming the accuracy-efficiency trade-off limitations of existing structured pruning methods.

pdf bib abs

Safety-aligned large language models (LLMs) are increasingly deployed in real-world pipelines, yet this deployment also enlarges the supply-chain attack surface: adversaries can distribute backdoored checkpoints that behave normally under standard evaluation but jailbreak when a hidden trigger is present. Recent post-hoc weight-editing methods offer an efficient approach to injecting such backdoors by directly modifying model weights to map a trigger to an attacker-specified response. However, existing methods typically optimize a token-level mapping that forces an affirmative prefix (e.g., “Sure”), which does not guarantee sustained harmful output—the model may begin with apparent agreement yet revert to safety-aligned refusal within a few decoding steps. We address this reliability gap by shifting the backdoor objective from surface tokens to internal representations. We extract a steering vector that captures the difference between compliant and refusal behaviors, and compile it into a persistent weight modification that activates only when the trigger is present. To preserve stealthiness and benign utility, we impose a null-space constraint so that the injected edit remains dormant on clean inputs. The method is efficient, requiring only a small set of examples and admitting a closed-form solution. Across multiple safety-aligned LLMs and jailbreak benchmarks, our method achieves high triggered attack success while maintaining non-triggered safety and general utility.

pdf bib abs

Autoregressive (AR) decoding in large language models (LLMs) is latency-bounded by strictly sequential token generation.Speculative decoding mitigates this bottleneck by letting a fast drafter propose multi-token candidates that are then verified in parallel by the target model; yet most existing systems still rely on AR drafters, limiting wall-clock gains.We present **DiffuSpec**, which repurposes a *diffusion language model* (DLM) as a *parallel* drafter to generate multi-token proposals in a single forward pass while remaining compatible with standard AR verifiers.However, DLM drafting presents unique challenges: 1) bidirectional conditioning produces a token lattice where locally optimal tokens may fail to form a valid causal sequence; 2) the mechanism requires tuning the draft length, which induces a speed–quality trade-off. To address these issues, we introduce (i) *Causal-consistency Path Search* (CPS) to extract verifier-aligned causal paths from the lattice, and (ii) an *Adaptive Draft-Length* (ADL) controller that adjusts proposal lengths using online acceptance feedback.Across benchmarks, DiffuSpec achieves up to 3× wall-clock speedup and consistently outperforms strong baselines, demonstrating diffusion-based drafting as a competitive alternative to AR drafters for speculative decoding.

2024

pdf bib abs

PocketLLM: Enabling On-Device Fine-Tuning for Personalized LLMs
Dan Peng | Zhihui Fu | Jun Wang
Proceedings of the Fifth Workshop on Privacy in Natural Language Processing

Recent advancements in large language models (LLMs) have indeed showcased their impressive capabilities. On mobile devices, the wealth of valuable, non-public data generated daily holds great promise for locally fine-tuning personalized LLMs, while maintaining privacy through on-device processing. However, the constraints of mobile device resources pose challenges to direct on-device LLM fine-tuning, mainly due to the memory-intensive nature of derivative-based optimization required for saving gradients and optimizer states. To tackle this, we propose employing derivative-free optimization techniques to enable on-device fine-tuning of LLM, even on memory-limited mobile devices. Empirical results demonstrate that the RoBERTa-large model and OPT-1.3B can be fine-tuned locally on the OPPO Reno 6 smartphone using around 4GB and 6.5GB of memory respectively, using derivative-free optimization techniques. This highlights the feasibility of on-device LLM fine-tuning on mobile devices, paving the way for personalized LLMs on resource-constrained devices while safeguarding data privacy.