Tomasz Jan Kajdanowicz

2026

FactSelfCheck: Fact-Level Black-Box Hallucination Detection for LLMs
Albert Sawczyn | Jakub Binkowski | Denis Janiak | Bogdan Gabrys | Tomasz Jan Kajdanowicz
Findings of the Association for Computational Linguistics: EACL 2026

Large Language Models (LLMs) frequently generate hallucinated content, posing significant challenges for applications where factuality is crucial. While existing hallucination detection methods typically operate at the sentence level or passage level, we propose FactSelfCheck, a novel zero-resource black-box sampling-based method that enables fine-grained fact-level detection. Our approach represents text as interpretable knowledge graphs consisting of facts in the form of triples, providing clearer insights into content factuality than traditional approaches. Through analyzing factual consistency across multiple LLM responses, we compute fine-grained hallucination scores without requiring external resources or training data. Our evaluation demonstrates that FactSelfCheck performs competitively with leading sentence-level sampling-based methods while providing more detailed and interpretable insights. Most notably, our fact-level approach significantly improves hallucination correction, achieving a 35.5% increase in factual content compared to the baseline, while sentence-level SelfCheckGPT yields only a 10.6% improvement. The granular nature of our detection enables more precise identification and correction of hallucinated content. Additionally, we contribute FavaMultiSamples, a novel dataset that addresses a gap in the field by providing the research community with a second dataset for evaluating sampling-based methods.
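The idea of scoring individual facts by their consistency across sampled responses can be illustrated with a minimal sketch. It is not the FactSelfCheck implementation: it assumes the triples have already been extracted, and it replaces any model-based consistency judgment with exact triple matching. The function name `fact_hallucination_scores` and the example triples are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def fact_hallucination_scores(
    response_facts: Iterable[Triple],
    sample_facts: List[Set[Triple]],
) -> Dict[Triple, float]:
    """Score each fact by how rarely it reappears in sampled responses.

    Assumption: a fact counts as supported by a sample only when the exact
    same triple occurs in that sample. A score of 0.0 means every sample
    contains the fact; 1.0 means no sample does.
    """
    if not sample_facts:
        raise ValueError("at least one sampled response is required")
    scores: Dict[Triple, float] = {}
    for fact in response_facts:
        support = sum(1 for facts in sample_facts if fact in facts)
        scores[fact] = 1.0 - support / len(sample_facts)
    return scores


# Hypothetical triples from the main response and four sampled responses.
capital = ("Paris", "capital_of", "France")
population = ("Paris", "population", "10M")
samples = [
    {capital},
    {capital, ("Paris", "population", "2M")},
    {capital},
    set(),
]
scores = fact_hallucination_scores([capital, population], samples)
```

Under these assumptions, a fact that recurs in most samples receives a low score, while a fact that appears in none receives the maximum score of 1.0.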

Beyond Discrete Search: Divergent Thinking as Intention Optimization in Latent Space
Mateusz Bystroński | Grzegorz Piotrowski | Tomasz Jan Kajdanowicz
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

We argue that LLM-based coding agents frequently fail to solve problems that lie within the model’s capacity and the bottleneck is often the conditioning context rather than the model itself. We formalize this for the full class of Turing-computable problems with verifiable specifications and introduce a framework that recasts coding as optimization over conditioning contexts that influence the generation of natural-language solution intentions. Guided by execution feedback, the method searches this continuous context space to steer a coding agent toward correct solutions. The method operates as a plug-in layer that can wrap any coding agent without modifying its architecture or weights. On SWE-Bench Verified, our method raises the resolution rate of a weak, quantized 24B open-weight model to parity with frontier models +25× its size.

Continuous Context Sampling Allows Extending Diversity Boundaries of Large Language Models
Mateusz Bystroński | Doheon Han | Nitesh V. Chawla | Tomasz Jan Kajdanowicz
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

Starting from the observation that conditioning a poetry-writing prompt with a pancake recipe leads an LLM to produce a coherent poem incorporating pancake-related content and, more broadly, that such contexts arrange themselves into a structured semantic vector space, we argue that this renders the space explorable. By sampling it and using the resulting continuous representations to condition an LLM’s generation distribution, we can systematically expand the model’s reachable semantic range. We introduce a framework that requires no modification of LLM parameters and operationalizes this idea by constructing a conditioning distribution from a small set of diverse anchor generations. This distribution conditions LLM’s generation via an xRAG-style projector. Our experiments demonstrate that this manifold-based conditioning substantially increases generative diversity, with direct benefits for enhancing divergent thinking, a core facet of creativity, in language models.

Factual State Discovery Benchmark: Evaluating Fact Elicitation in Polish Tax Law
Mateusz Bystroński | Kamil Tagowski | Denis Janiak | Julia Farganus | Lukasz Augustyniak | Monika Kajdanowicz | Tomasz Jan Kajdanowicz
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

Before a tax authority can issue a ruling, it must receive a complete description of the taxpayer’s situation, yet no benchmark measures whether language models can systematically elicit all relevant facts through dialogue. We introduce FSDBench (Factual State Discovery Benchmark), in which a discovery agent questions a simulated taxpayer grounded in a real tax document. The dataset comprises 500 narratives from official Polish tax interpretations, decomposed into 32 874 atomic facts with validated supported precision (97.6%), atomicity (93.8%), and sentence coverage (96.0%). Experiments with four models show that even the best system recovers only 77% of facts on easy samples and under 49% on hard samples after 50 turns. These findings establish conversational fact elicitation as a challenging open problem requiring retrieval-augmented and adaptive questioning strategies.

2025

The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs
Denis Janiak | Jakub Binkowski | Albert Sawczyn | Bogdan Gabrys | Ravid Shwartz-Ziv | Tomasz Jan Kajdanowicz
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Large language models (LLMs) have revolutionized natural language processing, yet their tendency to hallucinate poses serious challenges for reliable deployment. Despite numerous hallucination detection methods, their evaluations often rely on ROUGE, a metric based on lexical overlap that misaligns with human judgments. Through comprehensive human studies, we demonstrate that while ROUGE exhibits high recall, its extremely low precision leads to misleading performance estimates. In fact, several established detection methods show performance drops of up to 45.9% when assessed using human-aligned metrics like LLM-as-Judge. Moreover, our analysis reveals that simple heuristics based on response length can rival complex detection techniques, exposing a fundamental flaw in current evaluation practices. We argue that adopting semantically aware and robust evaluation frameworks is essential to accurately gauge the true performance of hallucination detection methods, ultimately ensuring the trustworthiness of LLM outputs.

Hallucination Detection in LLMs Using Spectral Features of Attention Maps
Jakub Binkowski | Denis Janiak | Albert Sawczyn | Bogdan Gabrys | Tomasz Jan Kajdanowicz
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Large Language Models (LLMs) have demonstrated remarkable performance across various tasks but remain prone to hallucinations. Detecting hallucinations is essential for safety-critical applications, and recent methods leverage attention map properties to this end, though their effectiveness remains limited. In this work, we investigate the spectral features of attention maps by interpreting them as adjacency matrices of graph structures. We propose the LapEigvals method, which utilises the top-k eigenvalues of the Laplacian matrix derived from the attention maps as an input to hallucination detection probes. Empirical evaluations demonstrate that our approach achieves state-of-the-art hallucination detection performance among attention-based methods. Extensive ablation studies further highlight the robustness and generalisation of LapEigvals, paving the way for future advancements in the hallucination detection domain.
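A minimal sketch of treating an attention map as a graph and taking the top-k eigenvalues of its Laplacian follows. It is not the LapEigvals implementation. It assumes a causal (lower-triangular) attention map, defines each node's degree as the column sum (the attention a token receives), and uses L = D - A. Under those assumptions L is itself lower-triangular, so its eigenvalues are simply its diagonal entries. The function name `laplacian_topk_eigvals` is hypothetical.

```python
from typing import List, Sequence


def laplacian_topk_eigvals(attn: Sequence[Sequence[float]], k: int) -> List[float]:
    """Top-k eigenvalues of L = D - A for a lower-triangular attention map.

    Assumptions (illustrative, not taken from the paper):
      * attn[i][j] is the attention from token i to token j, with j <= i;
      * the degree of node j is the column sum of attn (attention received).
    A triangular matrix has its eigenvalues on the diagonal, so no
    eigendecomposition is needed here.
    """
    n = len(attn)
    for i in range(n):
        if len(attn[i]) != n:
            raise ValueError("attention map must be square")
        for j in range(i + 1, n):
            if attn[i][j] != 0:
                raise ValueError("attention map must be lower-triangular")
    # Degree of node j: total attention flowing into token j.
    degree = [sum(attn[i][j] for i in range(n)) for j in range(n)]
    # Diagonal of L = D - A, which equals its spectrum for triangular L.
    eigvals = [degree[j] - attn[j][j] for j in range(n)]
    return sorted(eigvals, reverse=True)[:k]


# Toy 3-token causal attention map; each row sums to 1.
toy_attention = [
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.25, 0.25, 0.5],
]
features = laplacian_topk_eigvals(toy_attention, k=2)
```

In a probing setup, such per-head feature vectors would be collected across heads and layers and passed to a classifier; that stage is omitted here.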

When Will the Tokens End? Graph-Based Forecasting for LLMs Output Length
Grzegorz Piotrowski | Mateusz Bystroński | Mikołaj Hołysz | Jakub Binkowski | Grzegorz Chodak | Tomasz Jan Kajdanowicz
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

Large Language Models (LLMs) are typically trained to predict the next token in a sequence. However, their internal representations often encode signals that go beyond immediate next-token prediction. In this work, we investigate whether these hidden states also carry information about the remaining length of the generated output, an implicit form of foresight (CITATION). We formulate this as a regression problem where, at generation step t, the target is the number of remaining tokens y_t = T - t, with T as the total output length. We propose two approaches: (1) an aggregation-based model that combines hidden states from multiple transformer layers ℓ ∈ {8, …, 15} using element-wise operations such as mean or sum, and (2) a Layerwise Graph Regressor that treats layerwise hidden states as nodes in a fully connected graph and applies a Graph Neural Network (GNN) to predict y_t. Both models operate on frozen LLM embeddings without requiring end-to-end fine-tuning. Accurately estimating remaining output length has both theoretical and practical implications. From an interpretability standpoint, it suggests that LLMs internally track their generation progress. From a systems perspective, it enables optimizations such as output-length-aware scheduling (CITATION). Our graph-based model achieves state-of-the-art performance on the Alpaca dataset using LLaMA-3-8B-Instruct, reducing normalized mean absolute error (NMAE) by over 50% in short-output scenarios.
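The regression target y_t = T - t and the element-wise aggregation over layers 8 to 15 can be sketched as follows. This is an illustration only, not the authors' code. It assumes steps are numbered t = 1, …, T (so the final target is 0) and that per-layer hidden states are available as plain lists of floats keyed by layer index. Both function names are hypothetical.

```python
from typing import Dict, Iterable, List


def remaining_length_targets(total_len: int) -> List[int]:
    """Targets y_t = T - t for t = 1..T (assumed 1-based step indexing)."""
    return [total_len - t for t in range(1, total_len + 1)]


def aggregate_layers(
    hidden_by_layer: Dict[int, List[float]],
    layers: Iterable[int] = range(8, 16),
    op: str = "mean",
) -> List[float]:
    """Combine hidden states from the chosen layers element-wise.

    hidden_by_layer maps a layer index to that layer's hidden vector at one
    generation step. The result would be the input to a regressor for y_t.
    """
    selected = [hidden_by_layer[layer] for layer in layers]
    dim = len(selected[0])
    summed = [sum(vec[d] for vec in selected) for d in range(dim)]
    if op == "sum":
        return summed
    if op == "mean":
        return [s / len(selected) for s in summed]
    raise ValueError(f"unsupported aggregation: {op}")


# Toy hidden states: 16 layers, 2 dimensions each.
toy_hidden = {layer: [float(layer), 1.0] for layer in range(16)}
targets = remaining_length_targets(4)
mean_features = aggregate_layers(toy_hidden)
sum_features = aggregate_layers(toy_hidden, op="sum")
```

The graph-based variant would instead keep the per-layer vectors separate as node features; that model is not sketched here.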