Mark Gahegan


2026

Large language models (LLMs) have expanded the potential for AI-assisted scientific claim verification, yet existing systems often exhibit unverifiable attributions, shallow evidence mapping, and hallucinated citations. We present SciTrue, a claim verification system providing source-level accountability and evidence traceability. SciTrue links each claim component to explicit, verifiable scientific sources, enabling users to inspect and challenge model inferences, addressing limitations of both general-purpose and search-augmented LLMs. In a human evaluation of 300 attributions, SciTrue achieves high fidelity in summary traceability, attribution accuracy, and context alignment, substantially outperforming RAG-based baselines such as GPT-4o-search-preview and Perplexity Sonar Pro. These results underscore the importance of principled attribution and context-aware reasoning in AI-assisted scientific verification. A system demo is available at .

2023

Training machine learning models to perform scientific fact-checking is challenging because of an expertise bottleneck that limits the availability of appropriate training datasets. In this task, models use textual evidence to verify scientific claims, which requires data with extensive domain-expert annotation. Consequently, existing scientific fact-checking datasets are both few in number and limited in size. These limitations do not apply to multiple-choice question datasets, which are plentiful because domain exams are a necessity in the modern education system. As one of the first steps towards addressing the scarcity of fact-checking datasets in scientific domains, we propose Multi2Claim, a pipeline for automatically converting multiple-choice questions into fact-checking data. By applying the proposed pipeline, we generated two large-scale datasets for scientific fact-checking: Med-Fact for the medical domain and Gsci-Fact for the general science domain. These two datasets are among the first examples of large-scale scientific fact-checking datasets. We developed baseline models for the verdict prediction task using each dataset. Additionally, we demonstrated that the datasets can be used to improve performance, as measured by the weighted F1 metric, on existing fact-checking datasets such as SciFact, HEALTHVER, COVID-Fact, and CLIMATE-FEVER. In some cases, performance improved by up to 26%.
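For readers unfamiliar with the weighted F1 metric mentioned above, the following is a minimal sketch of how it is commonly computed: per-class F1 scores are averaged with weights proportional to each class's share of the true labels. The function name `weighted_f1` and the verdict labels `SUPPORTED` and `REFUTED` are illustrative assumptions, not taken from the paper, and this is not the authors' evaluation code.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average per-class F1, weighting each class by its share of true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one example is required")

    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n_true - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n_true / total) * f1
    return score


# Hypothetical verdict labels, used only for illustration.
gold = ["SUPPORTED", "SUPPORTED", "REFUTED", "REFUTED"]
pred = ["SUPPORTED", "REFUTED", "REFUTED", "REFUTED"]
print(round(weighted_f1(gold, pred), 4))
```

Weighting by class support keeps a frequent verdict class from being treated the same as a rare one, which matters when fact-checking datasets have imbalanced labels.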