Chong Wang
2026
RealSec-bench: A Benchmark for Evaluating Secure Code Generation in Real-World Repositories
Yanlin Wang | Ziyao Zhang | Chong Wang | Xinyi Xu | Mingwei Liu | Yong Wang | Jiachi Chen | Zibin Zheng
Findings of the Association for Computational Linguistics: ACL 2026
Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, but their proficiency in producing secure code remains a critical, under-explored area. Existing benchmarks often fall short by relying on synthetic vulnerabilities or evaluating functional correctness in isolation, failing to capture the complex interplay between functionality and security found in real-world software. To address this gap, we introduce RealSec-bench, a new benchmark for secure code generation meticulously constructed from real-world, high-risk Java repositories. Our methodology employs a multi-stage pipeline that combines systematic SAST scanning with CodeQL, LLM-based false positive elimination, and rigorous human expert validation. The resulting benchmark contains 105 instances grounded in real-world repository contexts, spanning 19 Common Weakness Enumeration (CWE) types and exhibiting a wide diversity of data flow complexities, including vulnerabilities with up to 34-hop inter-procedural dependencies. Using RealSec-bench, we conduct an extensive empirical study on 5 popular LLMs. We introduce a novel composite metric, SecurePass@K, to assess both functional correctness and security simultaneously. We find that while Retrieval-Augmented Generation (RAG) techniques can improve functional correctness, they provide negligible benefits to security. Furthermore, explicitly prompting models with general security guidelines often leads to compilation failures, harming functional correctness without reliably preventing vulnerabilities. Our work highlights the gap between functional and secure code generation in current LLMs. Our code and data are available at https://github.com/DeepSoftwareAnalytics/Realsec-code-Bench.
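The abstract does not define SecurePass@K, so the following is only a rough sketch of how a composite metric of this kind could be computed. It assumes (this is not taken from the paper) that a generated sample counts as a success only when it both passes the functional tests and is free of the target vulnerability, and it plugs that count into the widely used unbiased pass@k estimator. The function names and the `(functional, secure)` tuple layout are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k).

    n: number of generated samples, c: number of successful samples.
    """
    if n - c < k:
        # Every size-k subset must contain at least one success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def secure_pass_at_k(samples: list[tuple[bool, bool]], k: int) -> float:
    """Hypothetical composite metric (an assumption, not the paper's definition).

    Each sample is (functional_ok, secure_ok); a sample counts as a
    success only when both flags are True.
    """
    n = len(samples)
    c = sum(1 for functional_ok, secure_ok in samples if functional_ok and secure_ok)
    return pass_at_k(n, c, k)


# Four generations for one task; only the first is both correct and secure.
samples = [(True, True), (True, False), (False, True), (False, False)]
print(secure_pass_at_k(samples, k=1))
print(secure_pass_at_k(samples, k=2))
```

Under this assumption, a sample that is functionally correct but still vulnerable contributes nothing, which is one way a single number can reflect both properties at once.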
Taming System Complexity: Demystifying Software Engineering Agents in Diagnosing Linux Kernel Faults
Zhenhao Zhou | Zhuochen Huang | Yike He | Chong Wang | Jiajun Wang | Yijian Wu | Xin Peng | Yiling Lou
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The Linux kernel is a critical system, serving as the foundation for numerous systems. Bugs in the Linux kernel can cause serious consequences, affecting billions of users. Fault localization (FL), which aims at identifying the buggy code elements in software, plays an essential role in software quality assurance. While recent LLM agents have achieved promising accuracy in FL on recent benchmarks like SWE-bench, it remains unclear how well these methods perform in the Linux kernel, where FL is much more challenging due to the large-scale code base, limited observability, and diverse impact factors. In this paper, we introduce LinuxFLBench, an FL benchmark constructed from real-world Linux kernel bugs. We conduct an empirical study to assess the performance of state-of-the-art LLM agents on the Linux kernel. Our initial results reveal that existing agents struggle with this task, achieving a best top-1 accuracy of only 41.6% at file level. To address this challenge, we propose LinuxFL+, an enhancement framework designed to improve the FL effectiveness of LLM agents for the Linux kernel. LinuxFL+ substantially improves the FL accuracy of all studied agents (e.g., 7.2% - 11.2% accuracy increase) with minimal costs.
2025
Efficient Universal Goal Hijacking with Semantics-guided Prompt Organization
Yihao Huang | Chong Wang | Xiaojun Jia | Qing Guo | Felix Juefei-Xu | Jian Zhang | Yang Liu | Geguang Pu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Universal goal hijacking is a kind of prompt injection attack that forces LLMs to return a target malicious response for arbitrary normal user prompts. Previous methods achieve high attack performance but are cumbersome and time-consuming. Also, they have concentrated solely on optimization algorithms, overlooking the crucial role of the prompt. To this end, we propose a method called POUGH that incorporates an efficient optimization algorithm and two semantics-guided prompt organization strategies. Specifically, our method starts with a sampling strategy to select representative prompts from a candidate pool, followed by a ranking strategy that prioritizes them. Given the sequentially ranked prompts, our method employs an iterative optimization algorithm to generate a fixed suffix that can be concatenated to arbitrary user prompts for universal goal hijacking. Experiments conducted on four popular LLMs and ten types of target responses verify the effectiveness of the method.
Benchmarking LLMs and LLM-based Agents in Practical Vulnerability Detection for Code Repositories
Alperen Yildiz | Sin G Teo | Yiling Lou | Yebo Feng | Chong Wang | Dinil Mon Divakaran
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Models (LLMs) have shown promise in software vulnerability detection, particularly on function-level benchmarks like Devign and BigVul. However, real-world detection requires interprocedural analysis, as vulnerabilities often emerge through multi-hop function calls rather than isolated functions. While repository-level benchmarks like ReposVul and VulEval introduce interprocedural context, they remain computationally expensive, lack pairwise evaluation of vulnerability fixes, and explore limited context retrieval, limiting their practicality. We introduce JITVul, a JIT vulnerability detection benchmark linking each function to its vulnerability-introducing and fixing commits. Built from 879 CVEs spanning 91 vulnerability types, JITVul enables comprehensive evaluation of detection capabilities. Our results show that ReAct Agents, leveraging thought-action-observation and interprocedural context, perform better than LLMs in distinguishing vulnerable from benign code. While prompting strategies like Chain-of-Thought help LLMs, ReAct Agents require further refinement. Both methods show inconsistencies, either misidentifying vulnerabilities or over-analyzing security guards, indicating significant room for improvement.
2020
Multi-Domain Neural Machine Translation with Word-Level Adaptive Layer-wise Domain Mixing
Haoming Jiang | Chen Liang | Chong Wang | Tuo Zhao
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Many multi-domain neural machine translation (NMT) models achieve knowledge transfer by enforcing one encoder to learn shared embedding across domains. However, this design lacks adaptation to individual domains. To overcome this limitation, we propose a novel multi-domain NMT model using individual modules for each domain, on which we apply word-level, adaptive and layer-wise domain mixing. We first observe that words in a sentence are often related to multiple domains. Hence, we assume each word has a domain proportion, which indicates its domain preference. Then word representations are obtained by mixing their embedding in individual domains based on their domain proportions. We show this can be achieved by carefully designing multi-head dot-product attention modules for different domains, and eventually taking weighted averages of their parameters by word-level layer-wise domain proportions. Through this, we can achieve effective domain knowledge sharing and capture fine-grained domain-specific knowledge as well. Our experiments show that our proposed model outperforms existing ones in several NMT tasks.
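As a rough illustration of the word-level mixing idea described in this abstract, the sketch below blends one word's per-domain vectors using that word's domain proportions. It is a simplification under stated assumptions: the proportions are obtained here with a softmax over hypothetical per-domain scores, and the mixing is applied directly to vectors rather than to attention-module parameters as in the paper. The names `softmax` and `mix_word` are illustrative only.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Turn raw per-domain scores into proportions that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_word(domain_vectors: list[list[float]], domain_scores: list[float]) -> list[float]:
    """Weighted average of one word's per-domain vectors.

    domain_vectors[d] is the word's representation in domain d;
    domain_scores[d] is a hypothetical score for how strongly the word
    belongs to domain d (assumed input, not specified by the abstract).
    """
    proportions = softmax(domain_scores)
    dim = len(domain_vectors[0])
    return [
        sum(p * vec[i] for p, vec in zip(proportions, domain_vectors))
        for i in range(dim)
    ]


# A word with equal preference for two domains gets the plain average.
print(mix_word([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]))
```

Because each word carries its own proportions, two words in the same sentence can lean on different domains, which is the fine-grained behavior the abstract points to.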
2018
Subgoal Discovery for Hierarchical Dialogue Policy Learning
Da Tang | Xiujun Li | Jianfeng Gao | Chong Wang | Lihong Li | Tony Jebara
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Developing agents to engage in complex goal-oriented dialogues is challenging partly because the main learning signals are very sparse in long conversations. In this paper, we propose a divide-and-conquer approach that discovers and exploits the hidden structure of the task to enable efficient policy learning. First, given successful example dialogues, we propose the Subgoal Discovery Network (SDN) to divide a complex goal-oriented task into a set of simpler subgoals in an unsupervised fashion. We then use these subgoals to learn a multi-level policy by hierarchical reinforcement learning. We demonstrate our method by building a dialogue agent for the composite task of travel planning. Experiments with simulated and real users show that our approach performs competitively against a state-of-the-art method that requires human-defined subgoals. Moreover, we show that the learned subgoals are often human comprehensible.
2014
Dynamic Language Models for Streaming Text
Dani Yogatama | Chong Wang | Bryan R. Routledge | Noah A. Smith | Eric P. Xing
Transactions of the Association for Computational Linguistics, Volume 2
We present a probabilistic language model that captures temporal dynamics and conditions on arbitrary non-linguistic context features. These context features serve as important indicators of language changes that are otherwise difficult to capture using text data by itself. We learn our model in an efficient online fashion that is scalable for large, streaming data. With five streaming datasets from two different genres—economics news articles and social media—we evaluate our model on the task of sequential language modeling. Our model consistently outperforms competing models.
Co-authors
- Yiling Lou 2
- Jiachi Chen 1
- Dinil Mon Divakaran 1
- Yebo Feng 1
- Jianfeng Gao 1
- Qing Guo 1
- Yike He 1
- Yihao Huang 1
- Zhuochen Huang 1
- Tony Jebara 1
- Xiaojun Jia 1
- Haoming Jiang 1
- Felix Juefei-Xu 1
- Xiujun Li 1
- Lihong Li 1
- Chen Liang 1
- Yang Liu 1
- Mingwei Liu 1
- Xin Peng 1
- Geguang Pu 1
- Bryan R. Routledge 1
- Noah A. Smith 1
- Da Tang 1
- Sin G Teo 1
- Yanlin Wang 1
- Yong Wang 1
- Jiajun Wang 1
- Yijian Wu 1
- Eric Xing 1
- Xinyi Xu 1
- Alperen Yildiz 1
- Dani Yogatama 1
- Jian Zhang 1
- Ziyao Zhang 1
- Tuo Zhao 1
- Zibin Zheng 1
- Zhenhao Zhou 1