Lin Shi


2025

pdf bib
Prompting Large Language Models to Tackle the Full Software Development Lifecycle: A Case Study
Bowen Li | Wenhan Wu | Ziwei Tang | Lin Shi | John Yang | Jinyang Li | Shunyu Yao | Chen Qian | Binyuan Hui | Qicheng Zhang | Zhiyin Yu | He Du | Ping Yang | Dahua Lin | Chao Peng | Kai Chen
Proceedings of the 31st International Conference on Computational Linguistics

Recent advancements in large language models (LLMs) have significantly enhanced their coding capabilities. However, existing benchmarks predominantly focused on simplified or isolated aspects of coding, such as single-file code generation or repository issue debugging, falling short of measuring the full spectrum of challenges raised by real-world programming activities. In this case study, we explore the performance of LLMs across the entire software development lifecycle with DevEval, encompassing stages including software design, environment setup, implementation, acceptance testing, and unit testing. DevEval features four programming languages, multiple domains, high-quality data collection, and carefully designed and verified metrics for each task. Empirical studies show that current LLMs, including GPT-4, fail to solve the challenges presented within DevEval. Our findings offer actionable insights for the future development of LLMs toward real-world programming applications.

pdf bib
Vulnerability of Text-to-Image Models to Prompt Template Stealing: A Differential Evolution Approach
Yurong Wu | Fangwen Mu | Qiuhong Zhang | Jinjing Zhao | Xinrun Xu | Lingrui Mei | Yang Wu | Lin Shi | Junjie Wang | Zhiming Ding | Yiwei Wang
Findings of the Association for Computational Linguistics: ACL 2025

Prompt trading has emerged as a significant intellectual property concern in recent years, where vendors entice users by showcasing sample images before selling prompt templates that can generate similar images. This work investigates a critical security vulnerability: attackers can steal prompt templates using only a limited number of sample images. To investigate this threat, we introduce Prism, a prompt-stealing benchmark consisting of 50 templates and 450 images, organized into Easy and Hard difficulty levels. To identify the vulnerabity of VLMs to prompt stealing, we propose EvoStealer, a novel template stealing method that operates without model fine-tuning by leveraging differential evolution algorithms. The system first initializes population sets using multimodal large language models (MLLMs) based on predefined patterns, then iteratively generates enhanced offspring through MLLMs. During evolution, EvoStealer identifies common features across offspring to derive generalized templates. Our comprehensive evaluation conducted across open-source (InternVL2-26B) and closed-source models (GPT-4o and GPT-4o-mini) demonstrates that EvoStealer’s stolen templates can reproduce images highly similar to originals and effectively generalize to other subjects, significantly outperforming baseline methods with an average improvement of over 10%. Moreover, our cost analysis reveals that EvoStealer achieves template stealing with negligible computational expenses. Our code and dataset are available at https://whitepagewu.github.io/evostealer-site.

pdf bib
Judging with Many Minds: Do More Perspectives Mean Less Prejudice? On Bias Amplification and Resistance in Multi-Agent Based LLM-as-Judge
Chiyu Ma | Enpei Zhang | Yilun Zhao | Wenjun Liu | Yaning Jia | Peijun Qing | Lin Shi | Arman Cohan | Yujun Yan | Soroush Vosoughi
Findings of the Association for Computational Linguistics: EMNLP 2025

LLM-as-Judge has emerged as a scalable alternative to human evaluation, enabling large language models (LLMs) to provide reward signals in trainings. While recent work has explored multi-agent extensions such as multi-agent debate and meta-judging to enhance evaluation quality, the question of how intrinsic biases manifest in these settings remains underexplored. In this study, we conduct a systematic analysis of four diverse bias types: position bias, verbosity bias, chain-of-thought bias, and bandwagon bias. We evaluate these biases across two widely adopted multi-agent LLM-as-Judge frameworks: Multi-Agent-Debate and LLM-as-Meta-Judge. Our results show that debate framework amplifies biases sharply after the initial debate, and this increased bias is sustained in subsequent rounds, while meta-judge approaches exhibit greater resistance. We further investigate the incorporation of PINE, a leading single-agent debiasing method, as a bias-free agent within these systems. The results reveal that this bias-free agent effectively reduces biases in debate settings but provides less benefit in meta-judge scenarios. Our work provides a comprehensive study of bias behavior in multi-agent LLM-as-Judge systems and highlights the need for targeted bias mitigation strategies in collaborative evaluation settings.

pdf bib
Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge
Lin Shi | Chiyu Ma | Wenhua Liang | Xingjian Diao | Weicheng Ma | Soroush Vosoughi
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

LLM-as-a-Judge has emerged as a promising alternative to human evaluators across various tasks, yet inherent biases—particularly position bias, the tendency to favor solutions based on their position within the prompt—compromise its reliability. This exploratory study evaluates position bias in LLM judges across pairwise and list-wise comparison settings, introducing three metrics: repetition stability, position consistency, and preference fairness. Our experiments, involving 15 LLM judges across MTBench and DevBench with 22 tasks and approximately 40 solution-generating models, result in over 150,000 evaluation instances. We identify Judge-Level, Candidate-Level, and Task-Level factors contributing to bias. The findings confirm that position bias is not due to random chance and varies significantly across judges and tasks. While position bias is weakly influenced by the length of prompt components, it is strongly affected by the quality gap between solutions. Our agreement and disagreement analysis among judges further provides insights into the distribution of judging difficulty across the dataset, and highlights the potential for dataset modifications.

2017

pdf bib
Sogou Neural Machine Translation Systems for WMT17
Yuguang Wang | Shanbo Cheng | Liyang Jiang | Jiajun Yang | Wei Chen | Muze Li | Lin Shi | Yanfeng Wang | Hongtao Yang
Proceedings of the Second Conference on Machine Translation