Qianou Ma
2026
What Prompts Don’t Say: Understanding and Managing Underspecification in LLM Prompts
Chenyang Yang | Yike Shi | Qianou Ma | Michael Xieyang Liu | Christian Kaestner | Tongshuang Wu
Findings of the Association for Computational Linguistics: ACL 2026
Prompt underspecification is a common challenge when interacting with LLMs. In this paper, we present an in-depth analysis of this problem, showing that while LLMs can often infer unspecified requirements by default (41.1%), such behavior is fragile: underspecified prompts are 2x as likely to regress across model or prompt changes, sometimes with accuracy drops exceeding 20%. This instability makes it difficult to reliably build LLM applications. Moreover, simply specifying all requirements does not consistently help, as models have limited instruction-following ability and requirements can conflict. Standard prompt optimizers likewise provide little benefit. To address these issues, we propose requirements-aware prompt optimization mechanisms that improve performance by 4.8% on average over baselines. We further advocate for a systematic process of proactive requirements discovery, evaluation, and monitoring to better manage prompt underspecification in practice.
2025
SPHERE: An Evaluation Card for Human-AI Systems
Dora Zhao | Qianou Ma | Xinran Zhao | Chenglei Si | Chenyang Yang | Ryan Louie | Ehud Reiter | Diyi Yang | Tongshuang Wu
Findings of the Association for Computational Linguistics: ACL 2025
In the era of Large Language Models (LLMs), establishing effective evaluation methods and standards for diverse human-AI interaction systems is increasingly challenging. To encourage more transparent documentation and facilitate discussion on human-AI system evaluation design options, we present SPHERE, an evaluation card that encompasses five key dimensions: 1) What is being evaluated? 2) How is the evaluation conducted? 3) Who is participating in the evaluation? 4) When is the evaluation conducted? 5) How is the evaluation validated? We conduct a review of 39 human-AI systems using SPHERE, outlining current evaluation practices and areas for improvement. We provide three recommendations for improving the validity and rigor of evaluation practices.