Jun Wang
Other people with similar names: Jun Wang, Jun Wang, Jun Wang, Jun Wang
Unverified author pages with similar names: Jun Wang
2026
“I See What You Did There”: Can Large Vision-Language Models Understand Multimodal Puns?
Naen Xu | Jiayi Sheng | Changjiang Li | Chunyi Zhou | Yuyuan Li | Tianyu Du | Jun Wang | Zhihui Fu | Jinbao Li | Shouling Ji
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Puns are a common form of rhetorical wordplay that exploits polysemy and phonetic similarity to create humor. In multimodal puns, visual and textual elements synergize to ground the literal sense and evoke the figurative meaning simultaneously. Although Vision-Language Models (VLMs) are widely used in multimodal understanding and generation, their ability to understand puns has not been systematically studied due to a scarcity of rigorous benchmarks. To address this, we first propose a multimodal pun generation pipeline. We then introduce MultiPun, a dataset comprising diverse types of puns alongside adversarial non-pun distractors. Our evaluation reveals that most models struggle to distinguish genuine puns from these distractors. Moreover, we propose both prompt-level and model-level strategies to enhance pun comprehension, with an average improvement of 16.5% in F1 scores. Our findings provide valuable insights for developing future VLMs that master the subtleties of human-like humor via cross-modal reasoning.
POP: Prefill-Only Pruning for Efficient Large Model Inference
Junhui He | Zhihui Fu | Jun Wang | Qingan Li
Findings of the Association for Computational Linguistics: ACL 2026
Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated remarkable capabilities. However, their deployment is hindered by significant computational costs. Existing structured pruning methods, while hardware-efficient, often suffer from significant accuracy degradation. In this paper, we argue that this failure stems from a stage-agnostic pruning approach that overlooks the asymmetric roles between the prefill and decode stages. By introducing a virtual gate mechanism, our importance analysis reveals that deep layers are critical for next-token prediction (decode) but largely redundant for context encoding (prefill). Leveraging this insight, we propose Prefill-Only Pruning (POP), a stage-aware inference strategy that safely omits deep layers during the computationally intensive prefill stage while retaining the full model for the sensitive decode stage. To enable the transition between stages, we introduce independent Key-Value (KV) projections to maintain cache integrity, and a boundary handling strategy to ensure the accuracy of the first generated token. Extensive experiments on Llama-3.1, Qwen3-VL, and Gemma-3 across diverse modalities demonstrate that POP achieves up to 1.37× speedup in prefill latency with minimal performance loss, effectively overcoming the accuracy-efficiency trade-off limitations of existing structured pruning methods.
Compiling Activation Steering into Weights via Null-Space Constraints for Stealthy Backdoors
Rui Yin | Tianxu Han | Naen Xu | Changjiang Li | Ping He | Chunyi Zhou | Jun Wang | Zhihui Fu | Tianyu Du | Jinbao Li | Shouling Ji
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Safety-aligned large language models (LLMs) are increasingly deployed in real-world pipelines, yet this deployment also enlarges the supply-chain attack surface: adversaries can distribute backdoored checkpoints that behave normally under standard evaluation but jailbreak when a hidden trigger is present. Recent post-hoc weight-editing methods offer an efficient approach to injecting such backdoors by directly modifying model weights to map a trigger to an attacker-specified response. However, existing methods typically optimize a token-level mapping that forces an affirmative prefix (e.g., “Sure”), which does not guarantee sustained harmful output—the model may begin with apparent agreement yet revert to safety-aligned refusal within a few decoding steps. We address this reliability gap by shifting the backdoor objective from surface tokens to internal representations. We extract a steering vector that captures the difference between compliant and refusal behaviors, and compile it into a persistent weight modification that activates only when the trigger is present. To preserve stealthiness and benign utility, we impose a null-space constraint so that the injected edit remains dormant on clean inputs. The method is efficient, requiring only a small set of examples and admitting a closed-form solution. Across multiple safety-aligned LLMs and jailbreak benchmarks, our method achieves high triggered attack success while maintaining non-triggered safety and general utility.
Agent-Dice: Disentangling Knowledge Updates via Geometric Consensus for Agent Continual Learning
Zheng Wu | Xingyu Lou | Xinbei Ma | Yansi Li | Weiwen Liu | Weinan Zhang | Jun Wang | Zhuosheng Zhang
Findings of the Association for Computational Linguistics: ACL 2026
Large Language Model (LLM)-based agents significantly extend the utility of LLMs by interacting with dynamic environments. However, enabling agents to continually learn new tasks without catastrophic forgetting remains a critical challenge, known as the stability–plasticity dilemma. In this work, we argue that this dilemma fundamentally arises from the failure to explicitly distinguish between common knowledge shared across tasks and conflicting knowledge introduced by task-specific interference. To address this, we propose Agent-Dice, a parameter fusion framework based on directional consensus evaluation. Concretely, Agent-Dice disentangles knowledge updates through a two-stage process: geometric consensus filtering to prune conflicting gradients, and curvature-based importance weighting to amplify shared semantics. We provide a rigorous theoretical analysis that establishes the validity of the proposed fusion scheme and offers insight into the origins of the stability–plasticity dilemma. Extensive experiments on GUI agents and tool-use agent domains demonstrate that Agent-Dice exhibits outstanding continual learning performance with minimal computational overhead and parameter updates.
DiffuSpec: Unlocking Diffusion Language Models for Speculative Decoding
Guanghao Li | Zhihui Fu | Min Fang | Qibin Zhao | Ming Tang | Chun Yuan | Jun Wang
Findings of the Association for Computational Linguistics: ACL 2026
Autoregressive (AR) decoding in large language models (LLMs) is latency-bounded by strictly sequential token generation. Speculative decoding mitigates this bottleneck by letting a fast drafter propose multi-token candidates that are then verified in parallel by the target model; yet most existing systems still rely on AR drafters, limiting wall-clock gains. We present DiffuSpec, which repurposes a diffusion language model (DLM) as a parallel drafter to generate multi-token proposals in a single forward pass while remaining compatible with standard AR verifiers. However, DLM drafting presents unique challenges: 1) bidirectional conditioning produces a token lattice where locally optimal tokens may fail to form a valid causal sequence; 2) the mechanism requires tuning the draft length, which induces a speed–quality trade-off. To address these issues, we introduce (i) Causal-consistency Path Search (CPS) to extract verifier-aligned causal paths from the lattice, and (ii) an Adaptive Draft-Length (ADL) controller that adjusts proposal lengths using online acceptance feedback. Across benchmarks, DiffuSpec achieves up to 3× wall-clock speedup and consistently outperforms strong baselines, demonstrating diffusion-based drafting as a competitive alternative to AR drafters for speculative decoding.
2025
STaR-SQL: Self-Taught Reasoner for Text-to-SQL
Mingqian He | Yongliang Shen | Wenqi Zhang | Qiuying Peng | Jun Wang | Weiming Lu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Generating step-by-step “chain-of-thought” rationales has proven effective for improving the performance of large language models on complex reasoning tasks. However, applying such techniques to structured tasks, such as text-to-SQL, remains largely unexplored. In this paper, we introduce Self-Taught Reasoner for text-to-SQL (STaR-SQL), a novel approach that reframes SQL query generation as a reasoning-driven process. Our method prompts the LLM to produce detailed reasoning steps for SQL queries and fine-tunes it on rationales that lead to correct outcomes. Unlike traditional methods, STaR-SQL dedicates additional test-time computation to reasoning, thereby positioning LLMs as spontaneous reasoners rather than mere prompt-based agents. To further scale the inference process, we incorporate an outcome-supervised reward model (ORM) as a verifier, which enhances SQL query accuracy. Experimental results on the challenging Spider benchmark demonstrate that STaR-SQL significantly improves text-to-SQL performance, achieving an execution accuracy of 86.6%. This surpasses a few-shot baseline by 31.6% and a baseline fine-tuned to predict answers directly by 18.0%. Additionally, STaR-SQL outperforms agent-like prompting methods that leverage more powerful yet closed-source models such as GPT-4. These findings underscore the potential of reasoning-augmented training for structured tasks and open the door to extending self-improving reasoning models to text-to-SQL generation and beyond.
AskToAct: Enhancing LLMs Tool Use via Self-Correcting Clarification
Xuan Zhang | Yongliang Shen | Zhe Zheng | Linjuan Wu | Wenqi Zhang | Yuchen Yan | Qiuying Peng | Jun Wang | Weiming Lu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs) have demonstrated remarkable capabilities in tool learning. In real-world scenarios, user queries are often ambiguous and incomplete, requiring effective clarification. However, existing interactive clarification approaches face two critical limitations: reliance on manually constructed datasets, which inherently constrains training data scale and diversity, and lack of error correction mechanisms during multi-turn clarification, leading to error accumulation that compromises both accuracy and efficiency. We present AskToAct, which addresses these challenges by exploiting the structural mapping between queries and their tool invocation solutions. Our key insight is that tool parameters naturally represent explicit user intents. By systematically removing key parameters from queries while retaining them as ground truth, we enable automated construction of high-quality training data. We further enhance model robustness through error-correction pairs and selective masking, enabling dynamic error detection during clarification interactions. Comprehensive experiments demonstrate that AskToAct significantly outperforms existing approaches, achieving above 57% accuracy in recovering critical unspecified intents and enhancing clarification efficiency by an average of 10.46% while maintaining high accuracy in tool invocation. Our framework exhibits robust performance across different model architectures and successfully generalizes to entirely unseen APIs without additional training, achieving performance comparable to GPT-4o with substantially fewer computational resources.
OAgents: An Empirical Study of Building Effective Agents
He Zhu | Tianrui Qin | King Zhu | Heyuan Huang | Yeyi Guan | Jinxiang Xia | Hanhao Li | Yi Yao | Ningning Wang | Pai Liu | Tianhao Peng | Xin Gui | Li Xiaowan | Yuhui Liu | Xiangru Tang | Jian Yang | Ge Zhang | Xitong Gao | Yuchen Eleanor Jiang | Changwang Zhang | Jun Wang | Jiaheng Liu | Wangchunshu Zhou
Findings of the Association for Computational Linguistics: EMNLP 2025
Recently, Agentic AI has become an increasingly popular field of research. However, we argue that current practices in agent research are far from standard, rigorous scientific research, which makes it hard to conduct apples-to-apples comparisons among and against existing methods. As a result, it remains unclear how different design choices in an agent framework impact its effectiveness, and measuring progress in agent research remains very hard. In this work, we conduct a systematic empirical study on the GAIA benchmark to investigate the impact of different popular design choices within key agent components in a fair and rigorous way. To begin with, we find that the lack of a standard evaluation protocol makes previous works, even the open-sourced ones, not reproducible, and the variance between different random runs is often non-negligible. Therefore, we first introduce a more robust evaluation protocol to make comparisons more stable. Our empirical study then unveils which components and designs, as well as correlations between these designs, are the keys for building effective agents, while others are redundant despite seemingly making sense. With the insights gained from our empirical study, we build and open-source OAgents, a new foundation agent framework that achieves state-of-the-art performance among open-source projects, providing a good starting point and guidelines for building effective agents. More importantly, OAgents supports various design choices for agent components in a modularized way, facilitating future scientific research on Agentic AI.
HammerBench: Fine-Grained Function-Calling Evaluation in Real Mobile Assistant Scenarios
Jun Wang | Jiamu Zhou | Xihuai Wang | Xiaoyun Mo | Haoyu Zhang | Qiqiang Lin | Cheng Jin | Muning Wen | Weinan Zhang | Qiuying Peng | Jun Wang
Findings of the Association for Computational Linguistics: ACL 2025
Evaluating the performance of LLMs in multi-turn human-agent interactions presents significant challenges, particularly due to the complexity and variability of user behavior. In this paper, we introduce HammerBench, a novel benchmark framework for assessing LLMs’ function-calling capabilities in real-world, multi-turn dialogues. HammerBench simulates diverse mobile assistant use cases, incorporating imperfect instructions, dynamic question-answer trajectories, intent and argument shifts, and the indirect use of external information through pronouns. To construct this benchmark, we curate a comprehensive dataset derived from popular mobile app functionalities and anonymized user logs, complemented by a cost-effective data generation pipeline leveraging open-source models. HammerBench is further augmented with fine-grained interaction snapshots and metrics, enabling detailed evaluation of function-calling performance across individual conversational turns. We demonstrate the effectiveness of HammerBench by evaluating several leading LLMs and uncovering key performance trends. Our experiments reveal that different types of parameter name errors are a significant source of failure across different interaction scenarios, highlighting critical areas for further improvement in LLM robustness for mobile assistant applications.
DB-Explore: Automated Database Exploration and Instruction Synthesis for Text-to-SQL
Haoyuan Ma | Yongliang Shen | Hengwei Liu | Wenqi Zhang | Haolei Xu | Qiuying Peng | Jun Wang | Weiming Lu
Findings of the Association for Computational Linguistics: EMNLP 2025
Recent text-to-SQL systems powered by large language models (LLMs) have demonstrated remarkable performance in translating natural language queries into SQL. However, these systems often struggle with complex database structures and domain-specific queries, as they primarily focus on enhancing logical reasoning and SQL syntax while overlooking the critical need for comprehensive database understanding. To address this limitation, we propose DB-Explore, a novel framework that systematically aligns LLMs with database knowledge through automated exploration and instruction synthesis. DB-Explore constructs database graphs to capture complex relational schemas, leverages GPT-4 to systematically mine structural patterns and semantic knowledge, and synthesizes instructions to distill this knowledge for efficient fine-tuning of LLMs. Our framework enables comprehensive database understanding through diverse sampling strategies and automated instruction generation, bridging the gap between database structures and language models. Experiments conducted on the SPIDER and BIRD benchmarks validate the effectiveness of DB-Explore, achieving an execution accuracy of 67.0% on BIRD and 87.8% on SPIDER. Notably, our open-source implementation based on Qwen2.5-Coder-7B achieves state-of-the-art results at minimal computational cost, outperforming several GPT-4-driven Text-to-SQL systems.
Co-authors
- Zhihui Fu 4
- Qiuying Peng 4
- Weiming Lu 3
- Yongliang Shen 3
- Wenqi Zhang 3
- Tianyu Du 2
- Shouling Ji 2
- Changjiang Li 2
- Jinbao Li 2
- Naen Xu 2
- Weinan Zhang 2
- Chunyi Zhou 2
- Min Fang 1
- Xitong Gao 1
- Yeyi Guan 1
- Xin Gui 1
- Tianxu Han 1
- Mingqian He 1
- Junhui He 1
- Ping He 1
- Heyuan Huang 1
- Yuchen Eleanor Jiang 1
- Cheng Jin 1
- Yuyuan Li 1
- Qingan Li 1
- Hanhao Li 1
- Yansi Li 1
- Guanghao Li 1
- Qiqiang Lin 1
- Pai Liu 1
- Yuhui Liu 1
- Jiaheng Liu 1
- Hengwei Liu 1
- Weiwen Liu 1
- Xingyu Lou 1
- Haoyuan Ma 1
- Xinbei Ma 1
- Xiaoyun Mo 1
- Tianhao Peng 1
- Tianrui Qin 1
- Jiayi Sheng 1
- Xiangru Tang 1
- Ming Tang 1
- Ningning Wang 1
- Jun Wang 1
- Xihuai Wang 1
- Muning Wen 1
- Linjuan Wu 1
- Zheng Wu 1
- Jinxiang Xia 1
- Li Xiaowan 1
- Haolei Xu 1
- Yuchen Yan 1
- Jian Yang 1
- Yi Yao 1
- Rui Yin 1
- Chun Yuan 1
- Xuan Zhang 1
- Ge Zhang 1
- Changwang Zhang 1
- Haoyu Zhang 1
- Zhuosheng Zhang 1
- Qibin Zhao 1
- Zhe Zheng 1
- Wangchunshu Zhou 1
- Jiamu Zhou 1
- He Zhu 1
- King Zhu 1