Keyi Wang


2026

Insurance claims adjudication demands not only accurate decisions but also interpretable reasoning grounded in policy clauses. However, existing benchmarks are limited to information retrieval or simple multiple-choice setups, which fail to require step-by-step inferences from facts to conclusions. To address this gap, we introduce InsLogicBench, a benchmark providing complete reasoning traces that link factual inputs, relevant policy clauses, and final verdicts. We construct the dataset using a controllable synthesis framework based on the Nested Toulmin Model. By capturing the defeasible logic of insurance policies through hierarchical truth assignment and enforcing validity via consistency verification, we ensure interpretability and logical rigor across generated examples. We evaluate eight Large Language Models (LLMs) on InsLogicBench. Results show significant difficulties in handling exception clauses and verifying missing conditions. Notably, models often produce correct final decisions but fail to provide precise justifications, highlighting a critical discrepancy between their decision accuracy and logical reasoning capabilities.
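To make the idea of defeasible policy logic concrete, here is a minimal sketch of how a nested Toulmin-style argument could be evaluated against a set of facts. It is an illustration only, not the InsLogicBench synthesis framework: the `Argument` class, the `holds` function, and the example clause names are all hypothetical, and the evaluation rule (a claim holds when all its grounds are present and no rebuttal holds) is an assumed simplification of "hierarchical truth assignment".

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Argument:
    """Hypothetical nested Toulmin-style node: a claim, its grounds, and rebuttals."""
    claim: str
    grounds: list[str]                      # facts that must all be present
    rebuttals: list[Argument] = field(default_factory=list)  # exceptions, themselves arguments


def holds(arg: Argument, facts: set[str]) -> bool:
    """Assumed rule: the claim holds if every ground is a known fact
    and none of its rebuttals (evaluated recursively) holds."""
    if not all(g in facts for g in arg.grounds):
        return False  # a missing condition defeats the claim
    return not any(holds(r, facts) for r in arg.rebuttals)


# Illustrative policy: pay the claim unless an exclusion applies,
# where the exclusion is itself defeated by a waiver rider.
waiver = Argument("exclusion waived", ["waiver_rider"])
exclusion = Argument("intoxication exclusion applies", ["driver_intoxicated"], [waiver])
coverage = Argument("pay claim", ["policy_active", "accident"], [exclusion])

base = {"policy_active", "accident"}
print(holds(coverage, base))                                          # True
print(holds(coverage, base | {"driver_intoxicated"}))                 # False
print(holds(coverage, base | {"driver_intoxicated", "waiver_rider"})) # True
```

The three calls mirror the failure modes the abstract highlights: a verdict can flip when an exception clause fires, flip back when the exception is itself defeated, and fail outright when a required condition is missing from the facts.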
While extensive research has evaluated LLMs on complex reasoning tasks, the foundational building blocks of logical reasoning remain underexplored. We introduce IIBench, a benchmark evaluating immediate inference (elementary operations over categorical propositions). Our evaluation reveals that even state-of-the-art models exhibit systematic deficiencies in immediate inference, and establishes immediate inference as foundational: it mediates approximately 40% of the effect on syllogistic reasoning, with near-perfect correlation (0.98) across reasoning benchmarks. Our analysis reveals that models lack robust operator grounding, oscillating between structural reasoning and surface pattern matching with inconsistent handling of quantifiers and negation.
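For readers unfamiliar with immediate inference, the sketch below encodes three textbook operations over the four categorical forms (A: All S are P, E: No S are P, I: Some S are P, O: Some S are not P). The abstract does not list which operations IIBench covers, so the choice of conversion, obversion, and contraposition is an assumption, and the `Prop` class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Prop:
    """A categorical proposition; form is one of "A", "E", "I", "O"."""
    form: str
    subject: str
    predicate: str


def complement(term: str) -> str:
    """Negate a term: 'mammals' <-> 'non-mammals'."""
    return term[4:] if term.startswith("non-") else "non-" + term


def convert(p: Prop) -> Optional[Prop]:
    """Simple conversion (swap subject and predicate); valid only for E and I."""
    if p.form in ("E", "I"):
        return Prop(p.form, p.predicate, p.subject)
    return None


# Obversion flips the quality (A<->E, I<->O) and complements the predicate.
_OBVERSE = {"A": "E", "E": "A", "I": "O", "O": "I"}


def obvert(p: Prop) -> Prop:
    """Obversion; valid for all four forms."""
    return Prop(_OBVERSE[p.form], p.subject, complement(p.predicate))


def contrapose(p: Prop) -> Optional[Prop]:
    """Full contraposition (swap and complement both terms); valid only for A and O."""
    if p.form in ("A", "O"):
        return Prop(p.form, complement(p.predicate), complement(p.subject))
    return None


a = Prop("A", "cats", "mammals")           # All cats are mammals
print(obvert(a))                           # No cats are non-mammals
print(contrapose(a))                       # All non-mammals are non-cats
print(convert(a))                          # None: "All mammals are cats" does not follow
```

Each operation touches only quantifier, negation, and term order, which is where the abstract reports inconsistent model behavior. Contraposition can also be recovered by composing the other two operations (obvert, convert, obvert), which gives a simple consistency check.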