Weikang Zhou


2025

Multi-Programming Language Sandbox for LLMs
Shihan Dou | Jiazheng Zhang | Jianxiang Zang | Yunbo Tao | Weikang Zhou | Haoxiang Jia | Shichun Liu | Yuming Yang | Shenxi Wu | Zhiheng Xi | Muling Wu | Rui Zheng | Changze Lv | Limao Xiong | Shaoqing Zhang | Lin Zhang | Wenyu Zhan | Rongxiang Weng | Jingang Wang | Xunliang Cai | Yueming Wu | Ming Wen | Yixin Cao | Tao Gui | Xipeng Qiu | Qi Zhang | Xuanjing Huang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)

We introduce MPLSandbox, an out-of-the-box multi-programming language sandbox designed to provide unified and comprehensive feedback from compilers and analysis tools to Large Language Models (LLMs). It automatically identifies the programming language of the code and compiles and executes it within an isolated sub-sandbox to ensure safety and stability. In addition, MPLSandbox integrates both traditional and LLM-based code analysis tools, providing comprehensive analysis of generated code. It can also be effortlessly integrated into the training and deployment of LLMs to improve the quality and correctness of generated code, and it helps researchers streamline their workflows for various LLM-based code-related tasks, reducing development cost. To validate the effectiveness of MPLSandbox, we conduct extensive experiments, integrating it into several training and deployment scenarios and employing it to optimize workflows for a wide range of downstream code tasks. Our goal is to enhance researcher productivity on LLM-based code tasks by simplifying and automating workflows through delegation to MPLSandbox.
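
As a rough illustration of the execution loop this abstract describes, below is a minimal Python sketch of language identification followed by isolated execution with captured feedback. The RUNNERS table, the keyword-based classifier, and the subprocess-with-timeout isolation are all assumptions made for exposition; MPLSandbox's actual sub-sandboxes presumably use far stronger isolation (e.g., containers), and none of these names come from the tool's API.

```python
import os
import subprocess
import tempfile

# Illustrative mapping from language to (file extension, run command).
# These commands are assumptions for this sketch, not MPLSandbox's API.
RUNNERS = {
    "python": (".py", ["python3"]),
    "javascript": (".js", ["node"]),
    "bash": (".sh", ["bash"]),
}

def guess_language(code: str) -> str:
    """Toy keyword heuristic; the real system presumably uses a
    trained language classifier."""
    if "console.log" in code:
        return "javascript"
    if code.startswith("#!/bin/bash") or "echo " in code:
        return "bash"
    return "python"

def run_in_sandbox(code: str, timeout: float = 5.0) -> dict:
    """Write the snippet to a temp file and run it in a subprocess
    with a timeout, returning compiler/runtime feedback."""
    lang = guess_language(code)
    ext, cmd = RUNNERS[lang]
    with tempfile.NamedTemporaryFile("w", suffix=ext, delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(cmd + [path], capture_output=True,
                              text=True, timeout=timeout)
        return {"language": lang, "returncode": proc.returncode,
                "stdout": proc.stdout, "stderr": proc.stderr}
    except subprocess.TimeoutExpired:
        return {"language": lang, "returncode": None,
                "stdout": "", "stderr": "timed out"}
    finally:
        os.unlink(path)

print(run_in_sandbox('print("hello from the sandbox")'))
```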

Ground Every Sentence: Improving Retrieval-Augmented LLMs with Interleaved Reference-Claim Generation
Sirui Xia | Xintao Wang | Jiaqing Liang | Yifei Zhang | Weikang Zhou | Jiaji Deng | Fei Yu | Yanghua Xiao
Findings of the Association for Computational Linguistics: NAACL 2025

Retrieval-Augmented Generation (RAG) has been widely adopted to enhance Large Language Models (LLMs) on knowledge-intensive tasks. To improve credibility and verifiability in RAG systems, Attributed Text Generation (ATG) has been proposed, which adds citations to the retrieved knowledge in LLM-generated responses. Prior methods mainly adopt coarse-grained attribution, with passage-level or paragraph-level references or citations, which falls short in verifiability. This paper proposes ReClaim (Refer & Claim), a fine-grained ATG method that alternates the generation of references and answers step by step. Unlike previous coarse-grained attribution, ReClaim provides sentence-level citations in long-form question-answering tasks. Through extensive experiments across a range of settings, we verify the effectiveness of ReClaim, which achieves a citation accuracy rate of 90%.
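
A minimal sketch of the interleaved reference-then-claim loop described above, assuming a hypothetical `generate` callable standing in for the LLM and a word-overlap heuristic standing in for reference selection; neither is the paper's actual method.

```python
# Sketch of interleaved reference-claim generation in the spirit of
# ReClaim: pick a supporting reference, then generate the claim it
# grounds, so every sentence carries a sentence-level citation.
# `toy_generate` and the overlap heuristic are stand-ins.

def select_reference(question: str, passages: list[str]) -> str:
    # Stand-in retrieval step: pick the passage sharing the most
    # words with the question.
    def overlap(p: str) -> int:
        return len(set(p.lower().split()) & set(question.lower().split()))
    return max(passages, key=overlap)

def answer_with_citations(question, passages, generate, max_steps=5):
    """Alternate reference selection and claim generation step by step."""
    output = []
    for _ in range(max_steps):
        ref = select_reference(question, passages)
        claim = generate(question, ref, [c for _, c in output])
        if not claim:  # the model signals it is done
            break
        output.append((ref, claim))
    return output

def toy_generate(question, reference, history):
    # Toy stand-in for an LLM call: emit one claim, then stop.
    if history:
        return ""
    return f"Per the cited source, {reference.rstrip('.').lower()}."

passages = [
    "Coarse-grained citations point to whole passages.",
    "Sentence-level citations make each claim verifiable.",
]
for ref, claim in answer_with_citations(
        "Why do sentence-level citations help verifiability?",
        passages, toy_generate):
    print(f"[{ref}] {claim}")
```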

Order Matters: Investigate the Position Bias in Multi-constraint Instruction Following
Jie Zeng | Qianyu He | Qingyu Ren | Jiaqing Liang | Weikang Zhou | Zeye Sun | Fei Yu | Yanghua Xiao
Findings of the Association for Computational Linguistics: ACL 2025

Real-world instructions with multiple constraints pose a significant challenge to existing large language models (LLMs). We observe that LLMs exhibit dramatic performance fluctuations when the order of the incorporated constraints is perturbed, yet no existing work has systematically investigated this position bias in multi-constraint instruction following. To bridge this gap, we design a probing task in which we quantitatively measure the difficulty distribution of the constraints with a novel Constraint Difficulty Distribution Index (CDDI). The experimental results show that LLMs perform better when the constraints are presented in a “hard-to-easy” order, and this preference generalizes to LLMs with different architectures and parameter sizes. Additionally, we conduct an explanation study that offers an intuitive insight into the correlation between the LLM’s attention and the constraint order. Our code and dataset are publicly available at https://github.com/meowpass/PBIF.
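
To make the probing setup concrete, here is a toy sketch of reordering constraints by difficulty before building a prompt. The constraint texts and difficulty scores are invented for illustration; only the “hard-to-easy” preference mirrors what the abstract reports.

```python
# Toy probe of constraint-order effects: build the same prompt with
# constraints sorted hard-to-easy vs. easy-to-hard, then compare the
# model's pass rates. The difficulty scores are invented placeholders.

constraints = [
    ("respond in exactly three sentences", 0.9),  # (text, assumed difficulty)
    ("use a formal tone", 0.6),
    ("mention the word 'order'", 0.4),
]

def order_constraints(cs, hard_first=True):
    """Sort constraints by difficulty; the paper reports LLMs do
    better when the hardest constraint comes first."""
    return sorted(cs, key=lambda c: c[1], reverse=hard_first)

def build_prompt(task, cs):
    lines = [task, "Constraints:"]
    lines += [f"{i}. {text}" for i, (text, _) in enumerate(cs, start=1)]
    return "\n".join(lines)

print(build_prompt("Summarize the article.", order_constraints(constraints)))
print()
print(build_prompt("Summarize the article.",
                   order_constraints(constraints, hard_first=False)))
```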

Step-by-Step Mastery: Enhancing Soft Constraint Following Ability of Large Language Models
Qingyu Ren | Jie Zeng | Qianyu He | Jiaqing Liang | Yanghua Xiao | Weikang Zhou | Zeye Sun | Fei Yu
Findings of the Association for Computational Linguistics: ACL 2025

It is crucial for large language models (LLMs) to follow instructions that involve multiple constraints. In real-world scenarios, user instructions often contain soft constraints, which are semantically related and cannot be verified through rule-based methods, posing a challenge for LLMs. To enhance the soft constraint following ability of LLMs, we first design a pipeline that automatically constructs datasets with high-quality outputs for instructions containing soft constraints. To fully utilize the positive and negative samples generated during data construction, we adopt Direct Preference Optimization (DPO) as the training method. Furthermore, since the number of constraints is an indicator of the difficulty of soft constraints, we design a curriculum learning paradigm based on constraint quantity. We experimentally evaluate the effectiveness of our methods in improving LLMs’ soft constraint following ability and analyze the factors driving the improvements.
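
A small sketch of the curriculum idea, assuming preference pairs annotated with the number of soft constraints in their instruction; the `train_dpo` placeholder stands in for an actual DPO update, and none of this is the paper's code.

```python
from collections import defaultdict

# Curriculum sketch: schedule DPO preference pairs by the number of
# soft constraints in the instruction, from few to many. Pairs are
# (instruction, chosen_output, rejected_output, n_constraints);
# the contents are illustrative placeholders.
pairs = [
    ("Write a poem that is cheerful.", "...", "...", 1),
    ("Write a poem that is cheerful and nostalgic.", "...", "...", 2),
    ("Write a cheerful, nostalgic poem in a child's voice.", "...", "...", 3),
]

def curriculum_stages(pairs):
    """Group preference pairs by constraint count so training can
    move from easier (fewer constraints) to harder instructions."""
    stages = defaultdict(list)
    for pair in pairs:
        stages[pair[3]].append(pair)
    return [stages[n] for n in sorted(stages)]

for n, stage in enumerate(curriculum_stages(pairs), start=1):
    # train_dpo(model, stage)  # placeholder for one DPO training stage
    print(f"stage {n}: {len(stage)} pair(s)")
```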

2023

Farewell to Aimless Large-scale Pretraining: Influential Subset Selection for Language Model
Xiao Wang | Weikang Zhou | Qi Zhang | Jie Zhou | SongYang Gao | Junzhe Wang | Menghan Zhang | Xiang Gao | Yun Wen Chen | Tao Gui
Findings of the Association for Computational Linguistics: ACL 2023

Pretrained language models have achieved remarkable success in various natural language processing tasks. However, pretraining has recently shifted toward larger models and larger data, resulting in significant computational and energy costs. In this paper, we propose Influence Subset Selection (ISS) for language models, which explicitly utilizes end-task knowledge to select a tiny subset of the pretraining corpus. Specifically, ISS selects the samples that will provide the most positive influence on end-task performance. Furthermore, we design a gradient-matching-based influence estimation method that drastically reduces the time needed to compute influence. With only 0.45% of the data and three orders of magnitude less computation, ISS outperforms pretrained models (e.g., RoBERTa) on eight datasets covering four domains.
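
One natural reading of gradient-matching influence estimation is a similarity score between each pretraining sample's gradient and an aggregate end-task gradient, keeping the best-aligned samples. The numpy sketch below illustrates that idea with random placeholder gradients; it should not be read as ISS's exact formulation.

```python
import numpy as np

# Gradient-matching sketch: score each pretraining sample by the
# cosine similarity between its gradient and the end-task gradient,
# then keep the top-scoring ~0.45% (the fraction used in the paper).
# The random vectors stand in for real per-sample gradients.
rng = np.random.default_rng(0)
pretrain_grads = rng.normal(size=(10_000, 128))  # per-sample gradients
task_grad = rng.normal(size=128)                 # aggregate end-task gradient

def influence_scores(sample_grads, task_grad):
    """Higher cosine similarity ~ more positive expected influence
    on the end task."""
    norms = np.linalg.norm(sample_grads, axis=1) * np.linalg.norm(task_grad)
    return sample_grads @ task_grad / norms

scores = influence_scores(pretrain_grads, task_grad)
k = int(0.0045 * len(scores))          # keep ~0.45% of the corpus
selected = np.argsort(scores)[-k:]     # indices of the chosen subset
print(f"selected {len(selected)} of {len(scores)} pretraining samples")
```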