Aishan Liu
2026
Scaling Laws for Code: Every Programming Language Matters
Jian Yang | Shuyue Guo | Linzheng Chai | Wei Zhang | Aishan Liu | Chuan Hao | Zhoujun Li | Xin Zhao | Xianglong Liu | Weifeng Lv | Bryan Dai
Findings of the Association for Computational Linguistics: ACL 2026
Large language models (LLMs) are powerful but costly to train, with scaling laws predicting performance from model size, data, and compute. However, different programming languages (PLs) have varying impacts during pre-training that significantly affect base model performance, leading to inaccurate performance prediction. Existing works focus on language-agnostic settings, neglecting the inherently multilingual nature of modern software development. Therefore, it is first necessary to investigate the scaling laws of different PLs, and then consider their mutual influences to arrive at the final multilingual scaling law. In this paper, we present the first systematic exploration of scaling laws for multilingual code pre-training, conducting more than 1,000 experiments (equivalent to over 336,000 H800 hours) across multiple PLs, model sizes (0.2B to 14B parameters), and dataset sizes (1T tokens). We establish scaling laws for code LLMs across multiple programming languages, showing that interpreted languages benefit more from increased scale than compiled ones. Multilingual pre-training provides synergistic benefits, especially between syntactically similar languages, with parallel pairing (concatenating code with translations) significantly enhancing cross-lingual abilities. We propose a proportion-dependent multilingual scaling law that optimally allocates training tokens by prioritizing high-utility languages (e.g., Python), balancing high-synergy pairs (e.g., JavaScript-TypeScript), and reducing allocation to fast-saturating languages (e.g., Rust), achieving superior performance across all languages compared to uniform distribution.
LoopCoder: Scaling Code Intelligence via Looped Language Models
Jian Yang | Wei Zhang | Shuyue Guo | Yizhi Li | Linzheng Chai | Zhengmao Ye | Shukai Liu | Yuyang Song | Jiajun Wu | Che Liu | Tianyu Zheng | Siwei Wu | Leo L | Xudong Ma | Chuan Hao | Ran Tao | Yan Xing | Jianzhou Wang | Mingjie Tang | Aishan Liu | Zhoujun Li | Xianglong Liu | Weifeng Lv | Bryan Dai
Findings of the Association for Computational Linguistics: ACL 2026
While large language models (LLMs) have mastered syntax-level code generation, complex algorithmic reasoning remains a challenge, typically addressed by scaling model depth and parameter count. Universal Transformers (UT) offer a compelling alternative by introducing a recurrent inductive bias that aligns with the recursive nature of programming logic. However, training looped architectures at scale has historically been hindered by severe instability and optimization difficulties associated with backpropagation through time (BPTT). We present LoopCoder (40B-A80B) pre-trained on 12T+ code and general tokens, along with LoopCoder-Thinking and LoopCoder-Instruct variants—the first large-scale looped transformer for code, achieving comparable performance to standard dense architectures with more parameters. Unlike prior approaches that restrict recurrence to small-scale tasks, we implement a comprehensive looped training protocol spanning both pre-training and post-training phases. We initiate the model via dense-to-loop transformation, folding a pre-trained dense checkpoint to initialize a recurrent block, followed by rigorous looped pre-training and specialized post-training for instruction following and reasoning. Our results establish a robust recipe for scaling coding intelligence via recurrent computation, proving that dense checkpoints serve as an optimal foundation for evolving into dynamic, looped reasoners.
SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents
Zonghao Ying | Yangguang Shao | Jianle Gan | Gan Xu | Wenxin Zhang | Quanchen Zou | Junzheng Shi | Zhenfei Yin | Mingchuan Zhang | Aishan Liu | Xianglong Liu
Findings of the Association for Computational Linguistics: ACL 2026
Large vision–language model (LVLM)-based web agents are emerging as powerful automation tools but face severe security risks in real-world deployment. Existing benchmarks offer limited coverage, typically isolating user-level prompts from environmental threats, thus failing to capture the full spectrum of vulnerabilities. To address this, we present SecureWebArena, the first holistic security benchmark for web agents. SecureWebArena features a unified suite of six realistic web environments with 2,970 adversarial trajectories, covering a structured taxonomy of six attack vectors that span both user-level and environment-level manipulations. Crucially, we introduce a multi-layered evaluation protocol that dissects agent failures across internal reasoning, behavioral execution, and task outcomes, enabling fine-grained risk analysis beyond simple success metrics. Experiments on 9 representative LVLMs reveal universal vulnerabilities to subtle manipulations and uncover significant trade-offs between model specialization and security. SecureWebArena establishes a rigorous foundation for advancing the development of trustworthy web agents.
Uncovering Strategic Egoism Behaviors in Large Language Models
Yaoyuan Zhang | Zonghao Ying | Aishan Liu | Jian Yang | Tianlin Li | Yaodong Yang | Xianglong Liu
Findings of the Association for Computational Linguistics: ACL 2026
Large language models (LLMs) exhibit growing safety and alignment risks, hindering their deployment in high-stakes decision-making scenarios. In this paper, we identify a previously underexplored risk: similar to humans, LLMs can exhibit egoistic decision-making, in which they pursue short-term self-benefits through improper means while disregarding collective welfare and ethical constraints. We term this phenomenon Strategic Egoism (SE). To systematically evaluate SE, we introduce SEBench, a benchmark comprising 880 decision-making scenarios across 11 domains involving explicit profit temptations, which measures egoistic behavior along 6 psychologically grounded dimensions (e.g., rule circumvention). Each scenario adopts a single-role decision-making setting with carefully designed choice options to elicit self-serving strategies. Extensive experiments on 9 proprietary LLMs reveal that SE behaviors are widespread, with an average occurrence rate of 67.96%, and frequently manifest as manipulative coercion. Notably, we find that models more susceptible to profit temptations also exhibit broader safety deficiencies, including higher toxicity, lower truthfulness, increased jailbreak vulnerability, and elevated Dark Triad–style trait scores. Drawing inspiration from psychological interventions, we further propose SEGuard, a lightweight mitigation that reinforces situational constraints and suppresses egoistic tactics.
2025
Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models
Zonghao Ying | Deyue Zhang | Zonglei Jing | Yisong Xiao | Quanchen Zou | Aishan Liu | Siyuan Liang | Xiangzheng Zhang | Xianglong Liu | Dacheng Tao
Findings of the Association for Computational Linguistics: EMNLP 2025
Multi-turn jailbreak attacks simulate real-world human interactions by engaging large language models (LLMs) in iterative dialogues, exposing critical safety vulnerabilities. However, existing methods often struggle to balance semantic coherence with attack effectiveness, resulting in either benign semantic drift or ineffective detection evasion. To address this challenge, we propose Reasoning-Augmented Conversation (RACE), a novel multi-turn jailbreak framework that reformulates harmful queries into benign reasoning tasks and leverages LLMs’ strong reasoning capabilities to compromise safety alignment. Specifically, we introduce an attack state machine framework to systematically model problem translation and iterative reasoning, ensuring coherent query generation across multiple turns. Building on this framework, we design gain-guided exploration, self-play, and rejection feedback modules to preserve attack semantics, enhance effectiveness, and sustain reasoning-driven attack progression. Extensive experiments on multiple LLMs demonstrate that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios, with attack success rates (ASRs) increasing by up to 96%. Notably, our approach achieves an average ASR of 83.3% against leading commercial models, including Gemini 2.0 Flash Thinking and OpenAI o1, underscoring its potency.
ELBA-Bench: An Efficient Learning Backdoor Attacks Benchmark for Large Language Models
Xuxu Liu | Siyuan Liang | Mengya Han | Yong Luo | Aishan Liu | Xiantao Cai | Zheng He | Dacheng Tao
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Generative large language models are crucial in natural language processing, but they are vulnerable to backdoor attacks, where subtle triggers compromise their behavior. Although backdoor attacks against LLMs are constantly emerging, existing benchmarks remain limited in attack coverage, metric system integrity, and backdoor attack alignment. Moreover, existing pre-trained backdoor attacks are idealized in practice due to resource access constraints. Therefore, we establish ELBA-Bench, a comprehensive and unified framework that allows attackers to inject backdoors through parameter-efficient fine-tuning (e.g., LoRA) or fine-tuning-free techniques (e.g., in-context learning). ELBA-Bench provides over 1,300 experiments encompassing implementations of 12 attack methods, 18 datasets, and 12 LLMs. Extensive experiments provide new findings on the strengths and limitations of various attack strategies. For instance, PEFT attacks consistently outperform fine-tuning-free approaches in classification tasks while showing strong cross-dataset generalization, with optimized triggers boosting robustness; task-relevant backdoor optimization techniques or attack prompts, along with clean and adversarial demonstrations, can enhance backdoor attack success while preserving model performance on clean samples. Additionally, we introduce a universal toolbox designed for standardized backdoor attack research at https://github.com/NWPUliuxx/ELBA_Bench, with the goal of propelling further progress in this vital area.
2020
Dialogue Policies for Learning Board Games through Multimodal Communication
Maryam Zare | Ali Ayub | Aishan Liu | Sweekar Sudhakara | Alan Wagner | Rebecca Passonneau
Proceedings of the 21th Annual Meeting of the Special Interest Group on Discourse and Dialogue
This paper presents MDP policy learning for agents to learn strategic behavior (how to play board games) during multimodal dialogues. Policies are trained offline in simulation, with dialogues carried out in a formal language. The agent has a temporary belief state for the dialogue, and a persistent knowledge store represented as an extensive-form game tree. How well the agent learns a new game from a dialogue with a simulated partner is evaluated by how well it plays the game, given its dialogue-final knowledge state. During policy training, we control for the simulated dialogue partner’s level of informativeness in responding to questions. The agent learns best when its trained policy matches the current dialogue partner’s informativeness. We also present a novel data collection for training natural language modules. Human subjects who engaged in dialogues with a baseline system rated the system’s language skills as above average. Further, results confirm that human dialogue partners also vary in their informativeness.
Co-authors
- Xianglong Liu 5
- Jian Yang 3
- Zonghao Ying 3
- Linzheng Chai 2
- Bryan Dai 2
- Shuyue Guo 2
- Chuan Hao 2
- Zhoujun Li 2
- Siyuan Liang 2
- Weifeng Lv 2
- Dacheng Tao 2
- Wei Zhang 2
- Quanchen Zou 2
- Ali Ayub 1
- Xiantao Cai 1
- Jianle Gan 1
- Mengya Han 1
- Zheng He (何铮) 1
- Zonglei Jing 1
- Leo L 1
- Yizhi Li 1
- Tianlin Li 1
- Shukai Liu 1
- Che Liu 1
- Xuxu Liu 1
- Yong Luo 1
- Xudong Ma 1
- Rebecca J. Passonneau 1
- Yangguang Shao 1
- Junzheng Shi 1
- Yuyang Song 1
- Sweekar Sudhakara 1
- Mingjie Tang 1
- Ran Tao 1
- Alan Wagner 1
- Jianzhou Wang 1
- Jiajun Wu 1
- Siwei Wu 1
- Yisong Xiao 1
- Yan Xing 1
- Gan Xu 1
- Yaodong Yang (杨耀东) 1
- Zhengmao Ye 1
- Zhenfei Yin 1
- Maryam Zare 1
- Deyue Zhang 1
- Xiangzheng Zhang 1
- Wenxin Zhang 1
- Mingchuan Zhang 1
- Yaoyuan Zhang 1
- Wayne Xin Zhao 1
- Tianyu Zheng 1