Junzhe Wang
2026
LLMEval-Fair: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models
Ming Zhang | Yujiong Shen | Jingyi Deng | Yuhui Wang | Huayu Sha | Kexin Tan | Qiyuan Peng | Yue Zhang | Junzhe Wang | Shichun Liu | Yueyuan Huang | Jingqi Tong | Changhao Jiang | Yilong Wu | Zhihao Zhang | Mingqi Wu | Mingxu Chai | Zhiheng Xi | Shihan Dou | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting, critical issues that obscure true model capabilities. To address this, we introduce LLMEval-Fair, a framework for dynamic evaluation of LLMs. LLMEval-Fair is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run. Its automated pipeline ensures integrity via contamination-resistant data curation, a novel anti-cheating architecture, and a calibrated LLM-as-a-judge process achieving 90% agreement with human experts, complemented by a relative ranking system for fair comparison. A 30-month longitudinal study of nearly 60 leading models reveals a performance ceiling on knowledge memorization and exposes data contamination vulnerabilities undetectable by static benchmarks. The framework demonstrates exceptional robustness in ranking stability and consistency, providing strong empirical validation for the dynamic evaluation paradigm. LLMEval-Fair offers a robust and credible methodology for assessing the true capabilities of LLMs beyond leaderboard scores, promoting the development of more trustworthy evaluation standards.
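The idea of drawing a fresh, unseen test set for each evaluation run can be illustrated with a minimal sketch. This is not the LLMEval-Fair pipeline: the function name `sample_unseen_test_set`, the integer question IDs, and the bookkeeping set `served_ids` are all hypothetical, and the abstract does not describe how the real system tracks which questions have been used.

```python
import random


def sample_unseen_test_set(bank_ids, served_ids, k, seed=None):
    """Draw k question IDs that no earlier run has received.

    bank_ids   -- iterable of all question IDs in the bank (hypothetical)
    served_ids -- set of IDs already handed out; updated in place
    k          -- number of questions to draw for this run
    seed       -- optional seed so a run can be reproduced
    """
    rng = random.Random(seed)
    # Sort so the result depends only on the seed, not on set ordering.
    unseen = sorted(set(bank_ids) - set(served_ids))
    if k > len(unseen):
        raise ValueError("not enough unseen questions left in the bank")
    chosen = rng.sample(unseen, k)
    served_ids.update(chosen)  # mark as used so later runs skip them
    return chosen
```

Calling the function twice with the same `served_ids` set yields two disjoint test sets, which is the property the sketch is meant to show.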
AgentGym2: Benchmarking Large Language Model Agents in De-Idealized Real-World Environments
Zhiheng Xi | Dingwen Yang | Jiaqi Liu | Jixuan Huang | Honglin Guo | Baodai Huang | Tinggang Chen | Qi Zhang | Zhonghang Lu | Chenyu Liu | Jiajun Sun | Jiazheng Zhang | Dingwei Zhu | Xin Guo | Junzhe Wang | Zhihao Zhang | Yuming Yang | Junjie Ye | Minghe Gao | Dongrui Liu | Jiaming Ji | Guohao Li | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Language agents, i.e., LLM agents, are progressing rapidly and are increasingly deployed in production environments. This trend underscores the urgent need for rigorous and realistic evaluations. However, most existing benchmarks evaluate agents in simplified, idealized settings. They typically rely on pre-packaged tool interfaces, overlook critical steps, and assume inputs are clean and fully specified. Consequently, they understate the difficulty of real deployments, where uncertainty and noise are ubiquitous and agents must proactively explore the environment to uncover new tools. To bridge this gap, we present AgentGym2, a new evaluation framework with task instances grounded in real-world end-to-end working demands. Beyond reasoning and planning, it measures agents’ ability to execute end-to-end procedures, discover tools via exploration, compose tools for unseen tasks, and remain robust to noisy and underspecified information. Experiments on 15 proprietary and open-source models show that even SOTA systems like Gemini and GPT-5 struggle on AgentGym2, revealing a substantial gap between the capability of current agents and the demands of real-world applications.
Enhancing LLM-based Search Agents via Contribution Weighted Group Relative Policy Optimization
Junzhe Wang | Zhiheng Xi | Yajie Yang | Hao Luo | Shihan Dou | Tao Gui | Qi Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Search agents extend Large Language Models (LLMs) beyond static parametric knowledge by enabling access to up-to-date and long-tail information unavailable during pretraining. While reinforcement learning has been widely adopted for training such agents, existing approaches face key limitations: process supervision often suffers from unstable value estimation, whereas outcome supervision struggles with credit assignment due to sparse, trajectory-level rewards. To bridge this gap, we propose Contribution-Weighted GRPO (CW-GRPO), a framework that integrates process supervision into group relative policy optimization. Instead of directly optimizing process rewards, CW-GRPO employs an LLM judge to assess the retrieval utility and reasoning correctness at each search round, producing per-round contribution weights. These weights are used to rescale outcome-based advantages along the trajectory, enabling fine-grained credit assignment without sacrificing optimization stability. Experiments on multiple knowledge-intensive benchmarks show that CW-GRPO outperforms standard GRPO by 5.0% on Qwen3-8B and 6.3% on Qwen3-1.7B, leading to more effective search behaviors. Additional analysis reveals that successful trajectories exhibit concentrated contributions in specific rounds, providing empirical insight into search agent tasks. Our code is available at https://github.com/zsxmwjz/CW-GRPO.
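The rescaling step described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the released CW-GRPO code: the per-round judge scores are taken as given non-negative numbers, and normalizing them to mean 1 within each trajectory is an assumption made here so that the average advantage over rounds equals the trajectory-level advantage. The function names are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantage: z-score of each outcome reward in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def contribution_weighted_advantages(rewards, round_scores, eps=1e-8):
    """Rescale each trajectory's advantage by per-round contribution weights.

    rewards      -- one outcome reward per trajectory in the group
    round_scores -- one list of non-negative judge scores per trajectory,
                    one score per search round (assumed to be given)
    Returns one list of per-round advantages per trajectory.
    """
    advantages = group_relative_advantages(rewards, eps)
    weighted = []
    for adv, scores in zip(advantages, round_scores):
        n = len(scores)
        total = sum(scores)
        if total <= 0:
            # Assumption: fall back to uniform weights when the judge gives no signal.
            weights = [1.0] * n
        else:
            # Assumption: normalize so the weights average to 1 per trajectory.
            weights = [n * s / total for s in scores]
        weighted.append([adv * w for w in weights])
    return weighted
```

With two trajectories whose rewards are 1.0 and 0.0, a round judged three times as useful as another receives three times the share of that trajectory's advantage, while the sign still comes from the outcome reward alone.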
2025
FlowMalTrans: Unsupervised Binary Code Translation for Malware Detection Using Flow-Adapter Architecture
Minghao Hu | Junzhe Wang | Weisen Zhao | Qiang Zeng | Lannan Luo
Findings of the Association for Computational Linguistics: EMNLP 2025
Applying deep learning to malware detection has drawn great attention due to its notable performance. With the increasing prevalence of cyberattacks targeting IoT devices, there is a parallel rise in the development of malware across various Instruction Set Architectures (ISAs). It is thus important to extend malware detection capacity to multiple ISAs. However, training a deep learning-based malware detection model usually requires a large number of labeled malware samples. The process of collecting and labeling sufficient malware samples to build datasets for each ISA is labor-intensive and time-consuming. To reduce the burden of data collection, we propose to leverage the ideas of Neural Machine Translation (NMT) and Normalizing Flows (NFs) for malware detection. Specifically, when dealing with malware in a certain ISA, we translate it to an ISA with sufficient malware samples (like X86-64). This allows us to apply a model trained on one ISA to analyze malware from another ISA. Our approach reduces the data collection effort by enabling malware detection across multiple ISAs using a model trained on a single ISA.
AgentGym: Evaluating and Training Large Language Model-based Agents across Diverse Environments
Zhiheng Xi | Yiwen Ding | Wenxiang Chen | Boyang Hong | Honglin Guo | Junzhe Wang | Xin Guo | Dingwen Yang | Chenyang Liao | Wei He | Songyang Gao | Lu Chen | Rui Zheng | Yicheng Zou | Tao Gui | Qi Zhang | Xipeng Qiu | Xuanjing Huang | Zuxuan Wu | Yu-Gang Jiang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) have emerged as a promising foundation to build generally-capable agents (LLM-based agents) that can handle multi-turn decision-making tasks across various environments. However, the community lacks a unified interactive framework that covers diverse environments for comprehensive evaluation of agents, and enables exploration and learning for their self-improvement. To address this, we propose AgentGym, a framework featuring 7 real-world scenarios, 14 environments, and 89 tasks for unified, real-time, and concurrent agent interaction. We construct an expanded instruction set, high-quality trajectories, and a comprehensive benchmarking suite for developing LLM-based agents. Moreover, AgentGym supports interactive exploration and learning for agents through multi-turn interactions and real-time feedback. Based on AgentGym, we take the initial step to develop LLM-based agents that can handle diverse tasks via methods like self-improvement or reinforcement learning. Experimental results show that the trained agents can achieve results comparable to commercial models. We hope our work can help the community develop more advanced LLM-based agents. We release the code, dataset, benchmark, and checkpoints at https://agentgym.github.io/.
2024
Learning Cross-Architecture Instruction Embeddings for Binary Code Analysis in Low-Resource Architectures
Junzhe Wang | Qiang Zeng | Lannan Luo
Findings of the Association for Computational Linguistics: NAACL 2024
Binary code analysis is indispensable for a variety of software security tasks. Applying deep learning to binary code analysis has drawn great attention because of its notable performance. Today, source code is frequently compiled for various Instruction Set Architectures (ISAs). It is thus critical to expand binary analysis capabilities to multiple ISAs. Given a binary analysis task, the scale of available data on different ISAs varies. As a result, the rich datasets (e.g., malware) for certain ISAs, such as x86, lead to a disproportionate focus on these ISAs and neglect of other ISAs, such as PowerPC, which suffer from the “data scarcity” problem. To address the problem, we propose to learn cross-architecture instruction embeddings (CAIE), where semantically-similar instructions, regardless of their ISAs, have close embeddings in a shared space. Consequently, we can transfer a model trained on a data-rich ISA to another ISA with less available data. We consider four ISAs (x86, ARM, MIPS, and PowerPC) and conduct both intrinsic and extrinsic evaluations (including malware detection and function similarity comparison). The results demonstrate the effectiveness of our approach to generate high-quality CAIE with good transferability.
2023
Farewell to Aimless Large-scale Pretraining: Influential Subset Selection for Language Model
Xiao Wang | Weikang Zhou | Qi Zhang | Jie Zhou | SongYang Gao | Junzhe Wang | Menghan Zhang | Xiang Gao | Yun Wen Chen | Tao Gui
Findings of the Association for Computational Linguistics: ACL 2023
Pretrained language models have achieved remarkable success in various natural language processing tasks. However, pretraining has recently shifted toward larger models and larger data, which has resulted in significant computational and energy costs. In this paper, we propose Influence Subset Selection (ISS) for language models, which explicitly utilizes end-task knowledge to select a tiny subset of the pretraining corpus. Specifically, ISS selects the samples that will provide the most positive influence on the performance of the end task. Furthermore, we design a gradient matching-based influence estimation method, which can drastically reduce the computation time of influence. With only 0.45% of the data and a three-orders-of-magnitude lower computational cost, ISS outperformed pretrained models (e.g., RoBERTa) on eight datasets covering four domains.
Coarse-to-fine Few-shot Learning for Named Entity Recognition
Ruotian Ma | Zhang Lin | Xuanting Chen | Xin Zhou | Junzhe Wang | Tao Gui | Qi Zhang | Xiang Gao | Yun Wen Chen
Findings of the Association for Computational Linguistics: ACL 2023
Recently, Few-shot Named Entity Recognition has received wide attention with the growing need for NER models to learn new classes with minimized annotation costs. However, one common yet understudied situation is to transfer a model trained with coarse-grained classes to recognize fine-grained classes, such as separating a product category into sub-classes. We find that existing few-shot NER solutions are not suitable for such a situation since they do not consider the sub-class discrimination during coarse training and various granularity of new classes during few-shot learning. In this work, we introduce the Coarse-to-fine Few-shot NER (C2FNER) task and propose an effective solution. Specifically, during coarse training, we propose a cluster-based prototype margin loss to learn group-wise discriminative representations, so as to benefit fine-grained learning. Targeting various granularity of new classes, we separate the coarse classes into extra-fine clusters and propose a novel prototype retrieval and bootstrapping algorithm to retrieve representative clusters for each fine class. We then adopt a mixture prototype loss to efficiently learn the representations of fine classes. We conduct experiments on both in-domain and cross-domain C2FNER settings with various target granularity, and the proposed method shows superior performance over the baseline methods.
RE-Matching: A Fine-Grained Semantic Matching Method for Zero-Shot Relation Extraction
Jun Zhao | WenYu Zhan | Xin Zhao | Qi Zhang | Tao Gui | Zhongyu Wei | Junzhe Wang | Minlong Peng | Mingming Sun
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Semantic matching is a mainstream paradigm of zero-shot relation extraction, which matches a given input with a corresponding label description. The entities in the input should exactly match their hypernyms in the description, while the irrelevant contexts should be ignored when matching. However, general matching methods lack explicit modeling of the above matching pattern. In this work, we propose a fine-grained semantic matching method tailored for zero-shot relation extraction. Guided by the above matching pattern, we decompose the sentence-level similarity score into the entity matching score and context matching score. Considering that not all contextual words contribute equally to the relation semantics, we design a context distillation module to reduce the negative impact of irrelevant components on context matching. Experimental results show that our method achieves higher matching accuracy and more than 10 times faster inference speed, compared with the state-of-the-art methods.
Learning “O” Helps for Learning More: Handling the Unlabeled Entity Problem for Class-incremental NER
Ruotian Ma | Xuanting Chen | Zhang Lin | Xin Zhou | Junzhe Wang | Tao Gui | Qi Zhang | Xiang Gao | Yun Wen Chen
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
As the categories of named entities rapidly increase, the deployed NER models are required to keep updating toward recognizing more entity types, creating a demand for class-incremental learning for NER. Considering the privacy concerns and storage constraints, the standard paradigm for class-incremental NER updates the models with training data only annotated with the new classes, yet the entities from other entity classes are regarded as “Non-entity” (or “O”). In this work, we conduct an empirical study on the “Unlabeled Entity Problem” and find that it leads to severe confusion between “O” and entities, decreasing class discrimination of old classes and reducing the model’s ability to learn new classes. To solve the Unlabeled Entity Problem, we propose a novel representation learning method to learn discriminative representations for the entity classes and “O”. Specifically, we propose an entity-aware contrastive learning method that adaptively detects entity clusters in “O”. Furthermore, we propose two effective distance-based relabeling strategies for better learning the old classes. We introduce a more realistic and challenging benchmark for class-incremental NER, and the proposed method achieves up to 10.62% improvement over the baseline methods.
2022
Divide and Conquer: Text Semantic Matching with Disentangled Keywords and Intents
Yicheng Zou | Hongwei Liu | Tao Gui | Junzhe Wang | Qi Zhang | Meng Tang | Haixiang Li | Daniel Wang
Findings of the Association for Computational Linguistics: ACL 2022
Text semantic matching is a fundamental task that has been widely used in various scenarios, such as community question answering, information retrieval, and recommendation. Most state-of-the-art matching models, e.g., BERT, directly perform text comparison by processing each word uniformly. However, a query sentence generally comprises content that calls for different levels of matching granularity. Specifically, keywords represent factual information such as action, entity, and event that should be strictly matched, while intents convey abstract concepts and ideas that can be paraphrased into various expressions. In this work, we propose a simple yet effective training strategy for text semantic matching in a divide-and-conquer manner by disentangling keywords from intents. Our approach can be easily combined with pre-trained language models (PLM) without influencing their inference efficiency, achieving stable performance improvements against a wide range of PLMs on three benchmarks.
Co-authors
- Tao Gui 9
- Qi Zhang 7
- Zhiheng Xi 4
- Yun Wen Chen 3
- Xiang Gao 3
- Xuan-Jing Huang (黄萱菁) 3
- Xuanting Chen 2
- Shihan Dou 2
- Songyang Gao 2
- Honglin Guo 2
- Zhang Lin 2
- Lannan Luo 2
- Ruotian Ma 2
- Dingwen Yang 2
- Qiang Zeng 2
- Qi Zhang 2
- Zhihao Zhang 2
- Xin Zhou 2
- Yicheng Zou 2
- Mingxu Chai 1
- Lu Chen 1
- Tinggang Chen 1
- Wenxiang Chen 1
- Jingyi Deng 1
- Yiwen Ding 1
- Minghe Gao 1
- Xin Guo 1
- Xin Guo 1
- Wei He 1
- Boyang Hong 1
- Minghao Hu 1
- Baodai Huang 1
- Jixuan Huang 1
- Yueyuan Huang 1
- Jiaming Ji 1
- Changhao Jiang 1
- Yu-Gang Jiang 1
- Guohao Li 1
- Haixiang Li 1
- Chenyang Liao 1
- Chenyu Liu 1
- Dongrui Liu 1
- Hongwei Liu 1
- Jiaqi Liu 1
- Shichun Liu 1
- Zhonghang Lu 1
- Hao Luo 1
- Minlong Peng 1
- Qiyuan Peng 1
- Xipeng Qiu (邱锡鹏) 1
- Huayu Sha 1
- Yujiong Shen 1
- Jiajun Sun 1
- Mingming Sun 1
- Kexin Tan 1
- Meng Tang 1
- Jingqi Tong 1
- Daniel Wang 1
- Xiao Wang 1
- Yuhui Wang 1
- Zhongyu Wei (魏忠钰) 1
- Mingqi Wu 1
- Yilong Wu 1
- Zuxuan Wu 1
- Yajie Yang 1
- Yuming Yang 1
- Junjie Ye (叶俊杰) 1
- Wenyu Zhan 1
- Jiazheng Zhang 1
- Menghan Zhang 1
- Ming Zhang 1
- Qi Zhang 1
- Yue Zhang 1
- Jun Zhao 1
- Weisen Zhao 1
- Xin Zhao 1
- Rui Zheng 1
- Jie Zhou 1
- Weikang Zhou 1
- Dingwei Zhu 1