Xinman Liu


2026

Verifying complex real-world claims against diverse and potentially unreliable open-web sources requires balancing comprehensive evidence coverage against rigorous assessment of source reliability. Current automated fact-checking approaches often fail to address both together: they lose contextual dependencies between parts of a claim and apply trust signals monolithically at the document level. We introduce ClaimCLAIRE, a multi-component fact-checking agent that integrates four key innovations: (1) iterative component-aware decomposition with exhaustiveness validation, (2) holistic evidence gathering using a ReAct agent that maintains cross-component semantic awareness, (3) trust-modulated retrieval that weights evidence by source credibility to mitigate the influence of misinformation, and (4) adaptive gap-filling to address recall bottlenecks in under-supported sub-claims. Evaluated on the AVeriTeC benchmark, ClaimCLAIRE achieves 84.27% accuracy and a macro-F1 of 0.806. Our systematic ablations demonstrate that while decomposition alone can degrade performance, its integration with trust-aware retrieval and adaptive gap-filling yields a pipeline where component-level verdicts, source trust ratings, and deterministic AND-logic synthesis together support transparent, accountable fact verification.
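As an illustration of two ideas named above, the following minimal Python sketch shows one way trust-modulated ranking and deterministic AND-logic synthesis could look. It is not ClaimCLAIRE's implementation: the multiplicative weighting, the label strings, and the names `rank_evidence`, `ComponentVerdict`, and `synthesize` are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical label strings; the actual label set is not given in the abstract.
SUPPORTED = "supported"
REFUTED = "refuted"
NOT_ENOUGH = "not_enough_evidence"


def rank_evidence(items: List[Tuple[str, float, float]]) -> List[Tuple[str, float]]:
    """Rank evidence by relevance * trust (an assumed weighting scheme).

    Each item is (evidence_id, relevance in [0, 1], source_trust in [0, 1]).
    """
    scored = [(eid, relevance * trust) for eid, relevance, trust in items]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


@dataclass
class ComponentVerdict:
    component: str  # text of the sub-claim
    verdict: str    # one of the label strings above


def synthesize(verdicts: List[ComponentVerdict]) -> str:
    """Deterministic AND-logic: the claim is supported only if every
    component is supported; any refuted component refutes the claim."""
    if not verdicts:
        return NOT_ENOUGH
    labels = [v.verdict for v in verdicts]
    if REFUTED in labels:
        return REFUTED
    if all(label == SUPPORTED for label in labels):
        return SUPPORTED
    return NOT_ENOUGH
```

Because the synthesis step is a fixed rule rather than a model call, the final verdict can be traced back to the specific component that caused it, which is the kind of transparency the abstract describes.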
Most existing math benchmarks for LLMs focus on evaluating whether models produce correct solutions. In educational settings, however, it is equally important to understand whether LLMs grasp the pedagogical intent behind math problems, beyond simply arriving at the right answer. Tagging curriculum standards is challenging for the same reason: distinguishing fine-grained standards requires understanding subtle pedagogical distinctions. In this paper, we use the MathFish benchmark, which frames curriculum alignment as a multi-label prediction task over 385 Common Core State Standards, to evaluate a three-stage pipeline inspired by observed failure modes in retrieval and structural reasoning: curriculum-informed hard negatives (M1), a cross-encoder reranker (M2), and a ReAct agent paired with an LLM-as-a-judge critic (M3). We additionally evaluate a training-free alternative (A1) that combines hybrid sparse-dense retrieval with curriculum-graph reranking. M3 achieves 31.3% exact-match accuracy, approximately 6.5× higher than the three-shot GPT-4-Turbo baseline. Error analysis shows that, despite these improvements, the pipeline still struggles with missing predictions, grade-level misalignment, and sibling-standard confusion, reinforcing that precise curriculum alignment remains a fundamentally difficult problem in educational NLP.
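To make the exact-match metric concrete, here is a minimal Python sketch of exact-match accuracy for multi-label prediction, assuming a problem counts as correct only when the predicted set of standards equals the gold set. The function name and the example standard codes are illustrative, and the benchmark's own scoring may differ in detail.

```python
from typing import List, Set


def exact_match_accuracy(predicted: List[Set[str]], gold: List[Set[str]]) -> float:
    """Fraction of problems whose predicted label set equals the gold set exactly."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(1 for p, g in zip(predicted, gold) if p == g)
    return hits / len(gold)


# Illustrative data only: the second prediction adds an extra sibling
# standard, so it does not count as an exact match.
predictions = [{"4.NF.A.1"}, {"3.OA.A.1", "3.OA.A.2"}]
references = [{"4.NF.A.1"}, {"3.OA.A.1"}]
score = exact_match_accuracy(predictions, references)
```

Under this definition a single missing or extra standard makes the whole prediction wrong, which is why missing predictions and sibling-standard confusion weigh so heavily on the score.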