Pavel Tikhonov
2026
Confidence Leaps in LLM Reasoning: Early Stopping and Cross-Model Transfer
Pavel Tikhonov | Ivan Oseledets | Elena Tutubalina
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)
We challenge the common assumption that Large Language Models (LLMs) build confidence gradually during reasoning. Instead, we find that conviction is often reached in a discrete "moment of insight", characterized by a sudden and sharp increase in an answer’s probability, a phenomenon we term a "confidence leap". Leveraging this discovery, we introduce a training-free, model-agnostic early-stopping heuristic that halts generation upon detecting such a leap, significantly reducing the generation length without sacrificing accuracy. We also demonstrate that the reasoning text leading up to this leap is semantically potent and transferable: feeding this partial reasoning to a different model family substantially boosts its performance. This suggests that the "confidence leap" marks a shared, interpretable reasoning milestone, not just a model-specific statistical artifact.
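The early-stopping idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how a leap is detected, so the function name `find_confidence_leap`, the `min_jump` threshold of 0.3, and the assumption that the answer's probability is already available once per reasoning step are all hypothetical choices made for illustration.

```python
from typing import Optional, Sequence


def find_confidence_leap(
    probs: Sequence[float], min_jump: float = 0.3
) -> Optional[int]:
    """Return the index of the first step whose answer probability rises by
    at least `min_jump` over the previous step, or None if no such step exists.

    `probs` is assumed to hold the probability of the candidate answer
    measured after each reasoning step; `min_jump` is an illustrative
    threshold, not a value taken from the paper.
    """
    for i in range(1, len(probs)):
        if probs[i] - probs[i - 1] >= min_jump:
            return i
    return None


# Hypothetical trace: confidence stays flat, then jumps sharply at step 4.
trace = [0.05, 0.07, 0.06, 0.10, 0.72, 0.90, 0.93]
stop_at = find_confidence_leap(trace)  # generation could be halted here
```

Under these assumptions, a generation loop would stop at the returned step and keep only the reasoning text produced up to that point, which is also the partial reasoning that the abstract describes passing to a different model.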
One Task Vector is not Enough: A Large-Scale Study for In-Context Learning
Pavel Tikhonov | Ivan Oseledets | Elena Tutubalina
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
In-context learning (ICL) enables Large Language Models (LLMs) to adapt to new tasks using few examples, with task vectors, defined as specific hidden-state activations, hypothesized to encode task information. Existing studies are limited by small-scale benchmarks, restricting comprehensive analysis. We introduce QᴜɪᴛᴇAFᴇᴡ, a novel dataset of 3,096 diverse few-shot tasks, each with 30 input-output pairs derived from the Alpaca dataset. Experiments with Llama-3-8B on QᴜɪᴛᴇAFᴇᴡ reveal: (1) task vector performance peaks at an intermediate layer (e.g., 15th), (2) effectiveness varies significantly by task type, and (3) complex tasks rely on multiple, subtask-specific vectors rather than a single vector, suggesting distributed task knowledge representation.