Mateusz Bystroński


2026

We argue that LLM-based coding agents frequently fail to solve problems that lie within the model’s capacity, and that the bottleneck is often the conditioning context rather than the model itself. We formalize this for the full class of Turing-computable problems with verifiable specifications and introduce a framework that recasts coding as optimization over conditioning contexts that influence the generation of natural-language solution intentions. Guided by execution feedback, the method searches this continuous context space to steer a coding agent toward correct solutions. The method operates as a plug-in layer that can wrap any coding agent without modifying its architecture or weights. On SWE-Bench Verified, our method raises the resolution rate of a weak, quantized 24B open-weight model to parity with frontier models +25× its size.
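The idea of searching a continuous context space under execution feedback can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the paper's algorithm: the abstract does not name the search procedure, so a simple accept-if-not-worse random search is swapped in, and `search_context` and `toy_feedback` are hypothetical names. In a real setting the score would come from running the agent's output against a verifiable specification (for example, the fraction of tests passed).

```python
import random


def search_context(score, dim, steps=200, sigma=0.3, seed=0):
    """Accept-if-not-worse random search over a continuous context vector.

    `score` is a hypothetical stand-in for execution feedback: it maps a
    context vector to a number, where higher is better.
    """
    rng = random.Random(seed)
    best = [0.0] * dim
    best_score = score(best)
    for _ in range(steps):
        # Propose a small Gaussian perturbation of the current best context.
        candidate = [x + rng.gauss(0.0, sigma) for x in best]
        candidate_score = score(candidate)
        # Keep the candidate only if feedback does not get worse.
        if candidate_score >= best_score:
            best, best_score = candidate, candidate_score
    return best, best_score


def toy_feedback(context):
    """Toy objective (assumption): peaks when every coordinate equals 1.0."""
    return -sum((x - 1.0) ** 2 for x in context)


if __name__ == "__main__":
    context, final_score = search_context(toy_feedback, dim=3)
    print(context, final_score)
```

Because the loop only touches the context vector and a black-box score, it leaves the scored system unchanged, which mirrors the plug-in property described above.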
Starting from the observation that conditioning a poetry-writing prompt with a pancake recipe leads an LLM to produce a coherent poem incorporating pancake-related content and, more broadly, that such contexts arrange themselves into a structured semantic vector space, we argue that this renders the space explorable. By sampling it and using the resulting continuous representations to condition an LLM’s generation distribution, we can systematically expand the model’s reachable semantic range. We introduce a framework that requires no modification of LLM parameters and operationalizes this idea by constructing a conditioning distribution from a small set of diverse anchor generations. This distribution conditions the LLM’s generation via an xRAG-style projector. Our experiments demonstrate that this manifold-based conditioning substantially increases generative diversity, with direct benefits for enhancing divergent thinking, a core facet of creativity, in language models.
Before a tax authority can issue a ruling, it must receive a complete description of the taxpayer’s situation, yet no benchmark measures whether language models can systematically elicit all relevant facts through dialogue. We introduce FSDBench (Factual State Discovery Benchmark), in which a discovery agent questions a simulated taxpayer grounded in a real tax document. The dataset comprises 500 narratives from official Polish tax interpretations, decomposed into 32,874 atomic facts with validated supported precision (97.6%), atomicity (93.8%), and sentence coverage (96.0%). Experiments with four models show that even the best system recovers only 77% of facts on easy samples and under 49% on hard samples after 50 turns. These findings establish conversational fact elicitation as a challenging open problem requiring retrieval-augmented and adaptive questioning strategies.

2025

Large Language Models (LLMs) are typically trained to predict the next token in a sequence. However, their internal representations often encode signals that go beyond immediate next-token prediction. In this work, we investigate whether these hidden states also carry information about the remaining length of the generated output, an implicit form of foresight (CITATION). We formulate this as a regression problem where, at generation step t, the target is the number of remaining tokens y_t = T - t, with T as the total output length. We propose two approaches: (1) an aggregation-based model that combines hidden states from multiple transformer layers ℓ ∈ {8, …, 15} using element-wise operations such as mean or sum, and (2) a Layerwise Graph Regressor that treats layerwise hidden states as nodes in a fully connected graph and applies a Graph Neural Network (GNN) to predict y_t. Both models operate on frozen LLM embeddings without requiring end-to-end fine-tuning. Accurately estimating remaining output length has both theoretical and practical implications. From an interpretability standpoint, it suggests that LLMs internally track their generation progress. From a systems perspective, it enables optimizations such as output-length-aware scheduling (CITATION). Our graph-based model achieves state-of-the-art performance on the Alpaca dataset using LLaMA-3-8B-Instruct, reducing normalized mean absolute error (NMAE) by over 50% in short-output scenarios.
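The regression target and the aggregation step described above can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, the step index is assumed to run over t = 1, …, T (the abstract does not state the indexing), and plain Python lists stand in for real hidden-state tensors. The regressor that maps the aggregated vector to y_t is not shown.

```python
def remaining_length_targets(total_len):
    """Targets y_t = T - t for steps t = 1..T (indexing is an assumption)."""
    return [total_len - t for t in range(1, total_len + 1)]


def aggregate_layers(hidden_by_layer, layers=range(8, 16), op="mean"):
    """Combine one step's hidden vectors from the chosen layers element-wise.

    `hidden_by_layer` maps a layer index to that layer's hidden vector
    (a list of floats); layers 8..15 follow the range given in the abstract.
    """
    selected = [hidden_by_layer[layer] for layer in layers]
    # zip(*selected) groups the same coordinate across all selected layers.
    summed = [sum(values) for values in zip(*selected)]
    if op == "sum":
        return summed
    if op == "mean":
        return [s / len(selected) for s in summed]
    raise ValueError(f"unsupported op: {op!r}")


if __name__ == "__main__":
    # Toy hidden states: layer l has the 2-dimensional vector [l, 1.0].
    toy_hidden = {layer: [float(layer), 1.0] for layer in range(16)}
    print(remaining_length_targets(4))
    print(aggregate_layers(toy_hidden, op="mean"))
    print(aggregate_layers(toy_hidden, op="sum"))
```

Since the aggregation is a fixed element-wise operation on frozen hidden states, only the downstream regressor would need training, consistent with the statement that no end-to-end fine-tuning is required.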