Shiqi Wang
2026
Parallel-SFT: Improving Zero-Shot Cross-Programming-Language Transfer for Code RL
Zhaofeng Wu | Shiqi Wang | Boya Peng | Anuj Kumar Goyal | Melanie Kambadur | Sebastian Ruder | Yoon Kim | Chloe Bi
Findings of the Association for Computational Linguistics: ACL 2026
Modern language models demonstrate impressive coding capabilities in common programming languages (PLs), such as C++ and Python, but their performance in lower-resource PLs is often limited by training data availability. In principle, however, most programming skills are universal across PLs, so the capability acquired in one PL should transfer to others. In this work, we propose the task of zero-shot cross-programming-language transfer for code RL. We find that, for Llama-3.1, RL training for code generation in a source PL fails to improve, and sometimes even degrades, the performance on other target PLs. To address this, we hypothesize that effective RL transfer requires a generalizable SFT initialization before RL. We thus propose **Parallel-SFT**, an SFT strategy that incorporates "parallel programs" (functionally equivalent code implemented in multiple PLs) into the data mixture. We demonstrate that this improves transferability: when we subsequently perform RL on our Parallel-SFT model, we observe better generalization to unseen PLs. Analysis of the model's internal representations reveals that Parallel-SFT leads to a more functionality-centric latent space, where equivalent programs across PLs are more tightly clustered, which we hypothesize contributes to the improved transferability.
2025
Planning-Aware Code Infilling via Horizon-Length Prediction
Yifeng Ding | Hantian Ding | Shiqi Wang | Qing Sun | Varun Kumar | Zijian Wang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Fill-in-the-Middle (FIM), or infilling, has become integral to code language models, enabling generation of missing code given both left and right contexts. However, the current FIM training paradigm, which performs next-token prediction (NTP) over reordered sequences, often leaves models struggling to generate content that aligns well with the surrounding context. We hypothesize that NTP alone is insufficient for models to learn effective planning conditioned on the distant right context, a critical factor for successful code infilling. To overcome this, we propose Horizon-Length Prediction (HLP), a novel training objective that teaches models to predict the number of remaining middle tokens at each step. HLP advances FIM with lookahead planning, enabling models to inherently learn infilling boundaries for arbitrary left and right contexts without relying on dataset-specific post-processing. Our evaluation across different model families and sizes shows that HLP significantly improves FIM performance, by up to 24% (relative), on diverse benchmarks at both the file level and the repository level. Furthermore, the enhanced planning capability gained through HLP boosts model performance on code reasoning. Importantly, HLP incurs negligible training overhead and no additional inference cost, ensuring its practicality for real-world scenarios.
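The abstract describes the HLP target only as "the number of remaining middle tokens at each step". As a rough illustration, the sketch below builds such targets for a middle span of known length. The function name `horizon_targets`, the choice to count the token currently being predicted, and the optional normalization are all assumptions of this sketch, not details confirmed by the abstract.

```python
def horizon_targets(middle_len, normalize=False):
    """Per-position horizon targets for a middle span of `middle_len` tokens.

    Position t (0-indexed) is the step at which the t-th middle token is
    predicted. Its target is the number of middle tokens still to be
    generated, counting that token (an assumption of this sketch).
    """
    if middle_len <= 0:
        return []
    remaining = [middle_len - t for t in range(middle_len)]
    if normalize:
        # Optional scaling into (0, 1]; also an assumption, not from the abstract.
        return [r / middle_len for r in remaining]
    return remaining


print(horizon_targets(4))                  # [4, 3, 2, 1]
print(horizon_targets(4, normalize=True))  # [1.0, 0.75, 0.5, 0.25]
```

In a training setup these targets would presumably supervise an auxiliary prediction alongside the usual next-token loss; how that is wired into the model is not specified here.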
2024
Token Alignment via Character Matching for Subword Completion
Ben Athiwaratkun | Shiqi Wang | Mingyue Shang | Yuchen Tian | Zijian Wang | Sujan Kumar Gonugondla | Sanjay Krishna Gouda | Robert Kwiatkowski | Ramesh Nallapati | Parminder Bhatia | Bing Xiang
Findings of the Association for Computational Linguistics: ACL 2024
Generative models, widely utilized in various applications, can often struggle with prompts corresponding to partial tokens. This struggle stems from tokenization, where partial tokens fall out of distribution during inference, leading to incorrect or nonsensical outputs. This paper examines a technique to alleviate the tokenization artifact on text completion in generative models, maintaining performance even in regular non-subword cases. The method, termed token alignment, involves backtracking to the last complete tokens and ensuring the model’s generation aligns with the prompt. This approach showcases marked improvement across many partial token scenarios, including nuanced cases like space-prefix and partial indentation, with only a minor time increase. The technique and analysis detailed in this paper contribute to the continuous advancement of generative models in handling partial inputs, bearing relevance for applications like code and text completion.
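The core idea of token alignment, backtracking and then keeping generation consistent with the prompt, can be sketched with plain string matching. The example below is a minimal illustration under stated assumptions: the toy vocabulary, the helper names `allowed_tokens` and `consume`, and the prefix-matching rule are hypothetical and are not taken from the paper's implementation.

```python
def allowed_tokens(vocab, pending):
    """Return the tokens that agree with the unmatched prompt characters.

    `pending` holds the prompt characters left over after backtracking to
    the last complete token. A token is allowed if it extends `pending`
    or is itself a prefix of `pending`.
    """
    if not pending:
        return list(vocab)
    return [t for t in vocab if t and (t.startswith(pending) or pending.startswith(t))]


def consume(pending, token):
    """Drop the characters of `pending` that the chosen token covers."""
    return pending[len(token):]


# Hypothetical toy vocabulary; a real tokenizer has far more entries.
vocab = ["p", "pr", "print", "int", "ri", "("]

pending = "pri"                        # characters after the last complete token
print(allowed_tokens(vocab, pending))  # ['p', 'pr', 'print']

pending = consume(pending, "pr")       # suppose the model picks "pr"
print(allowed_tokens(vocab, pending))  # ['int']
```

In a real decoder the allowed set would typically be applied as a mask over the model's next-token scores, and the constraint is lifted once `pending` is empty.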
CodeFort: Robust Training for Code Generation Models
Yuhao Zhang | Shiqi Wang | Haifeng Qian | Zijian Wang | Mingyue Shang | Linbo Liu | Sanjay Krishna Gouda | Baishakhi Ray | Murali Krishna Ramanathan | Xiaofei Ma | Anoop Deoras
Findings of the Association for Computational Linguistics: EMNLP 2024
Code generation models are not robust to small perturbations, which often lead to incorrect generations and significantly degrade the performance of these models. Although improving the robustness of code generation models is crucial to enhancing user experience in real-world applications, existing research efforts do not address this issue. To fill this gap, we propose CodeFort, a framework to improve the robustness of code generation models. CodeFort generalizes a large variety of code perturbations to enrich the training data and enables various robust training strategies that mix data augmentation, batch augmentation, adversarial logits pairing, and contrastive learning, all carefully designed to support high-throughput training. Extensive evaluations show that we increase the average robust pass rates of baseline CodeGen models from 14.79 to 21.74. Notably, we decrease the robustness drop rate from 95.02% to 54.95% against code-syntax perturbations.
2023
ReCode: Robustness Evaluation of Code Generation Models
Shiqi Wang | Zheng Li | Haifeng Qian | Chenghao Yang | Zijian Wang | Mingyue Shang | Varun Kumar | Samson Tan | Baishakhi Ray | Parminder Bhatia | Ramesh Nallapati | Murali Krishna Ramanathan | Dan Roth | Bing Xiang
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Code generation models have achieved impressive performance. However, they tend to be brittle as slight edits to a prompt could lead to very different generations; these robustness properties, critical for user experience when deployed in real-life applications, are not well understood. Most existing works on robustness in text or code tasks have focused on classification, while robustness in generation tasks is an uncharted area and to date there is no comprehensive benchmark for robustness in code generation. In this paper, we propose ReCode, a comprehensive robustness evaluation benchmark for code generation models. We customize over 30 transformations specifically for code on docstrings, function and variable names, code syntax, and code format. They are carefully designed to be natural in real-life coding practice, preserve the original semantic meaning, and thus provide multifaceted assessments of a model’s robustness performance. With human annotators, we verified that over 90% of the perturbed prompts do not alter the semantic meaning of the original prompt. In addition, we define robustness metrics for code generation models considering the worst-case behavior under each type of perturbation, taking advantage of the fact that executing the generated code can serve as objective evaluation. We demonstrate ReCode on SOTA models using HumanEval, MBPP, as well as function completion tasks derived from them. Interesting observations include: CodeGen is more robust than InCoder and GPT-J; models are most sensitive to syntax perturbations; and robustness evaluation is more challenging on MBPP than on HumanEval.
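The worst-case view of robustness described above can be illustrated with a small calculation. The sketch below is a simplification: it assumes a single generation per prompt, counts a problem as robustly solved only if the generation passes for every perturbed variant of its prompt, and reports the relative drop from the unperturbed pass rate. The function names and the sample data are hypothetical, and the paper's exact metric definitions may differ.

```python
def pass_rate(nominal):
    """Fraction of problems solved on the original (unperturbed) prompts."""
    return sum(nominal.values()) / len(nominal)


def robust_pass_rate(perturbed):
    """Fraction of problems solved under every perturbed variant (worst case)."""
    return sum(all(variants) for variants in perturbed.values()) / len(perturbed)


def robust_drop(nominal_rate, robust_rate):
    """Relative drop from the nominal pass rate to the worst-case pass rate."""
    return (nominal_rate - robust_rate) / nominal_rate if nominal_rate else 0.0


# Hypothetical execution outcomes: True means the generated code passed its tests.
nominal = {"p1": True, "p2": True, "p3": False, "p4": True}
perturbed = {
    "p1": [True, True, True],
    "p2": [True, False, True],   # one failing variant makes p2 non-robust
    "p3": [False, False, True],
    "p4": [True, True, True],
}

base = pass_rate(nominal)
robust = robust_pass_rate(perturbed)
print(base, robust, round(robust_drop(base, robust), 3))  # 0.75 0.5 0.333
```

Taking the minimum over variants, rather than the average, is what makes the metric worst-case: a single failing perturbation is enough to mark a problem as non-robust.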
Co-authors
- Zijian Wang 4
- Mingyue Shang 3
- Parminder Bhatia 2
- Sanjay Krishna Gouda 2
- Varun Kumar 2
- Ramesh Nallapati 2
- Haifeng Qian 2
- Murali Krishna Ramanathan 2
- Baishakhi Ray 2
- Bing Xiang 2
- Ben Athiwaratkun 1
- Chloe Bi 1
- Anoop Deoras 1
- Yifeng Ding 1
- Hantian Ding 1
- Sujan Kumar Gonugondla 1
- Anuj Kumar Goyal 1
- Melanie Kambadur 1
- Yoon Kim 1
- Robert Kwiatkowski 1
- Zheng Li 1
- Linbo Liu 1
- Xiaofei Ma 1
- Boya Peng 1
- Dan Roth 1
- Sebastian Ruder 1
- Qing Sun 1
- Samson Tan 1
- Yuchen Tian 1
- Zhaofeng Wu 1
- Chenghao Yang 1
- Yuhao Zhang 1