Juntai Cao
2025
Why Prompt Design Matters and Works: A Complexity Analysis of Prompt Search Space in LLMs
Xiang Zhang | Juntai Cao | Chenyu You | Dujian Ding
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Despite the remarkable successes of Large Language Models (LLMs), the underlying Transformer architecture has inherent limitations in handling complex reasoning tasks. Chain-of-Thought (CoT) prompting has emerged as a practical workaround, but most CoT-based methods rely on a single generic prompt like “think step by step,” with no task-specific adaptation. These approaches expect the model to discover an effective reasoning path on its own, forcing it to search through a vast prompt space. In contrast, much work has explored task-specific prompt designs to boost performance. However, these designs are typically developed through trial and error and lack theoretical grounding. As a result, prompt engineering remains largely ad hoc and unguided. In this paper, we provide a theoretical framework that explains why some prompts succeed while others fail. We show that prompts function as selectors, extracting specific task-relevant information from the model’s full hidden state during CoT reasoning. Each prompt defines a unique trajectory through the answer space, and the choice of this trajectory is crucial for task performance and future navigation in the answer space. We analyze the complexity of finding optimal prompts and the size of the prompt space for a given task. Our theory reveals principles behind effective prompt design and shows that naive CoT, which relies on a model-self-guided prompt like “think step by step,” can severely hinder performance. Our experiments show that optimal prompt search can yield over a 50% improvement on reasoning tasks, and our work provides a theoretical foundation for prompt engineering.
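As a rough illustration of the kind of prompt search this paper analyzes, the sketch below scores a handful of candidate prompts on a small validation set and keeps the best one. The `search_best_prompt` helper, the `llm` callable, the exact-match scoring, and the example templates are illustrative assumptions, not the paper's actual procedure.

```python
from typing import Callable, Sequence


def search_best_prompt(
    llm: Callable[[str], str],           # hypothetical text-in/text-out model call
    candidate_prompts: Sequence[str],    # task-specific prompt templates to compare
    dev_set: Sequence[tuple[str, str]],  # (question, gold_answer) pairs
) -> tuple[str, float]:
    """Return the candidate prompt with the highest exact-match accuracy on dev_set."""
    best_prompt, best_acc = "", -1.0
    for template in candidate_prompts:
        correct = 0
        for question, gold in dev_set:
            prediction = llm(template.format(question=question))
            correct += int(gold.strip().lower() in prediction.strip().lower())
        accuracy = correct / max(len(dev_set), 1)
        if accuracy > best_acc:
            best_prompt, best_acc = template, accuracy
    return best_prompt, best_acc


# Hypothetical candidates: one generic CoT prompt and two task-specific alternatives.
candidates = [
    "Q: {question}\nLet's think step by step.",
    "Q: {question}\nFirst list the given quantities, then compute the answer.",
    "Q: {question}\nRestate the goal, then solve one sub-problem at a time.",
]
```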
Multi2: Multi-Agent Test-Time Scalable Framework for Multi-Document Processing
Juntai Cao | Xiang Zhang | Raymond Li | Jiaqi Wei | Chuyuan Li | Shafiq Joty | Giuseppe Carenini
Proceedings of The 5th New Frontiers in Summarization Workshop
Recent advances in test-time scaling have shown promising results in improving large language model performance through strategic computation allocation during inference. While this approach has demonstrated strong improvements in reasoning tasks, its application to natural language generation tasks, particularly summarization, remains unexplored. Among generation tasks, multi-document summarization (MDS) presents unique challenges by requiring models to extract and synthesize essential information across multiple lengthy documents. Unlike reasoning tasks, MDS demands a more nuanced approach to prompt design and ensemble methods, as no single “best-overall” prompt can satisfy diverse summarization requirements. This inherent diversity in summarization needs calls for a systematic study of how different prompting strategies can be combined to improve performance. We propose a novel framework that harnesses prompt diversity to enhance MDS performance. Our approach generates multiple candidate summaries using carefully designed prompt variations and then ensembles them through aggregation methods to produce refined summaries. This prompt diversity enables models to capture different aspects and perspectives of the source documents, leading to more comprehensive and higher-quality summaries. To evaluate our method, we also introduce two new LLM-based metrics: the Preference Alignment Score (PAS) and the LLM Atom-Content-Unit score (LLM-ACU), which assess summary quality while addressing the positional bias inherent in automatic evaluation performed by LLMs. Our experiments demonstrate that leveraging prompt diversity significantly enhances summary quality, while also revealing the practical scaling boundaries for MDS tasks.
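To make the prompt-diversity idea concrete, here is a minimal sketch of generating candidate summaries under different prompt variants and fusing them with a final aggregation call. The `diverse_summarize` helper, the `llm` callable, and the example templates are illustrative assumptions rather than the Multi2 implementation.

```python
from typing import Callable, Sequence


def diverse_summarize(
    llm: Callable[[str], str],       # hypothetical text-in/text-out model call
    documents: Sequence[str],        # the multi-document input
    prompt_variants: Sequence[str],  # templates emphasizing different summary needs
    aggregation_prompt: str,         # template that fuses the candidate summaries
) -> str:
    """Generate one candidate summary per prompt variant, then fuse them."""
    joined_docs = "\n\n---\n\n".join(documents)
    candidates = [
        llm(variant.format(documents=joined_docs)) for variant in prompt_variants
    ]
    joined_candidates = "\n\n".join(
        f"Candidate {i + 1}:\n{summary}" for i, summary in enumerate(candidates)
    )
    return llm(aggregation_prompt.format(candidates=joined_candidates))


# Illustrative variants targeting different aspects of the source documents.
variants = [
    "Summarize the key events across these documents:\n{documents}",
    "Summarize points of agreement and disagreement across these documents:\n{documents}",
    "Write a concise summary for a reader who has not seen any of the sources:\n{documents}",
]
aggregation = (
    "Merge the candidate summaries below into a single comprehensive, "
    "non-redundant summary:\n{candidates}"
)
```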