Jack W. Stokes

2026

Apeiron: A Scalable LLM-agentic Framework for Autonomous Full-lifecycle Demand-optimized Application Synthesis
Junyan Cheng | Ankit Srivastava | Jessie Zeng | Milenko Drinic | Jack W. Stokes
Findings of the Association for Computational Linguistics: ACL 2026

We introduce Apeiron, a scalable and extensible framework for addressing *amorphous* user demands through autonomous, full-lifecycle application synthesis. Apeiron models the unstructured app development process as a heuristic optimization problem combining (i) a Computer-Use Agent (CUA) evaluator that simulates personas and demands, (ii) an *Activity Tracer* that grounds feedback in code-level interaction traces, and (iii) a *Locality Controller* that constrains changes during continuous integration and delivery (CI/CD). Furthermore, we introduce an innovative data generation approach using CUA-as-a-Judge to tackle data scarcity. Across 300 app scenarios, 2,400 personas, and 46,338 demands, Apeiron outperformed baselines by 10.7% in CUA ratings and 27.8% in user-demand task scores. The optimization process enhances task scores by 64.7%, and the tracer contributes a 25.1% gain. In CI/CD, Apeiron effectively restores 96.9% of the pre-shift mean CUA rating in one optimization step with <30% code changes in response to 30% demand shifts. Finally, a user study (N=18) shows that our CUA ratings strongly correlate with human judgment (Spearman’s 𝜌=0.685) and that users prefer Apeiron-synthesized apps over baselines.

Are My Optimized Prompts Compromised? Exploring Vulnerabilities of LLM-based Optimizers
Andrew Zhao | Reshmi Ghosh | Vitor R. Carvalho | Emily Lawton | Keegan Hines | Gao Huang | Jack W. Stokes
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Large language model (LLM) systems increasingly power everyday AI applications such as chatbots, computer-use assistants, and autonomous robots, where performance often depends on manually crafted prompts. LLM-based prompt optimizers reduce that effort by iteratively refining prompts from scored feedback, yet the security of this optimization stage remains underexamined. We present the first systematic analysis of poisoning risks in LLM-based prompt optimization. Using HarmBench, we find systems are substantially more vulnerable to manipulated feedback than to query poisoning alone: feedback-based attacks raise attack success rate (ASR) by up to ΔASR = 0.48. We introduce a simple fake reward attack that requires no access to the reward model and significantly increases vulnerability. We also propose a lightweight highlighting defense that reduces the fake reward ΔASR from 0.23 to 0.07 without degrading utility. These results establish prompt optimization pipelines as a first-class attack surface and motivate stronger safeguards for feedback channels and optimization frameworks.

2025

LLMs often fail to meet the specialized needs of distinct user groups due to their one-size-fits-all approach, and there is limited understanding of what personalization each group expects. To address this, we propose GPA, a group-aware personalization framework that captures context-specific preference variations and steers LLMs accordingly. Our approach involves: (1) Group-Aware Preference Extraction, which distills divergent preferences from real-world conversation logs into interpretable rubrics, and (2) Tailored Response Generation, using (a) GPA-CT, which adapts responses using learnt rubrics, and (b) GPA-FT, which finetunes models using rubric-guided synthetic data. Automatic and human evaluations confirm that GPA improves group alignment without compromising performance on standard instruction-following benchmarks.