Bei Wang


2026

People often rely on large language models (LLMs) in situations where they are ill-suited. This miscalibration is understandable: seeing LLMs compose poetry and answer complex questions can lead users to assume, incorrectly, that they will also handle simple tasks, such as basic arithmetic, without error. Prior work has attempted to address this issue by clustering instance embeddings to identify regions where an LLM is likely to fail, then automatically describing the patterns within those regions. These inferred “failure patterns” are taught to users to reduce overreliance. Yet, this approach has not been fully successful. In this paper, we investigate why.We first examine whether the negative results stem from an absence of meaningful failure patterns. Using two datasets, we group instances by their meta-labels and evaluate LLM performance within each group. We then define criteria to identify groups that are both sufficiently large and exhibit high error rates. This process reveals multiple meta-label groups that meet these criteria, indicating that actionable failure patterns do, in fact, exist. Next, we test whether prompting- and embedding-based methods can reliably surface these known failure patterns. This step is critical: if such patterns cannot be surfaced automatically, they cannot be communicated to users. We observe mixed performance across methods, which may explain the limited success of prior approaches. Finally, we revisit how teaching effectiveness is measured. We propose evaluating whether users can apply learned failure patterns to anticipate when an LLM is likely to err. A user study shows that instruction based on this metric yields measurable improvements, unlike standard human–AI team accuracy metrics. Overall, our findings suggest that teaching failure patterns can be an effective way to mitigate overreliance, but its success depends on improved automated methods for discovering these patterns and on evaluation metrics like ours.

2024

By allowing models to predict without task-specific training, in-context learning (ICL) with pretrained LLMs has enormous potential in NLP. However, a number of problems persist in ICL. In particular, its performance is sensitive to the choice and order of in-context examples. Given the same set of in-context examples with different orderings, model performance may vary from near random to near state-of-the-art. In this work, we formulate in-context example ordering as an optimization problem. We examine three problem settings that differ in the assumptions they make about what is known about the task. Inspired by the idea of learning from label proportions, we propose two principles for in-context example ordering guided by model’s probability predictions. We apply our proposed principles to thirteen text classification datasets and nine different autoregressive LLMs with 700M to 13B parameters. We demonstrate that our approach outperforms the baselines by improving the classification accuracy, reducing model miscalibration, and also by selecting better in-context examples.