Proceedings of the Eighth Fact Extraction and VERification Workshop (FEVER)

Mubashara Akhtar, Rami Aly, Christos Christodoulopoulos, Oana Cocarascu, Zhijiang Guo, Arpit Mittal, Michael Schlichtkrull, James Thorne, Andreas Vlachos (Editors)


Anthology ID:
2025.fever-1
Month:
July
Year:
2025
Address:
Vienna, Austria
Venues:
FEVER | WS
SIG:
Publisher:
Association for Computational Linguistics
URL:
https://preview.aclanthology.org/acl25-workshop-ingestion/2025.fever-1/
DOI:
ISBN:
978-1-959429-53-1
PDF:
https://preview.aclanthology.org/acl25-workshop-ingestion/2025.fever-1.pdf

Proceedings of the Eighth Fact Extraction and VERification Workshop (FEVER)
Mubashara Akhtar | Rami Aly | Christos Christodoulopoulos | Oana Cocarascu | Zhijiang Guo | Arpit Mittal | Michael Schlichtkrull | James Thorne | Andreas Vlachos

Automated Claim–Evidence Extraction for Political Discourse Analysis: A Large Language Model Approach to Rodong Sinmun Editorials
Gyuri Choi | Hansaem Kim

This study investigates the feasibility of automating political discourse analysis using large language models (LLMs), with a focus on 87 editorials from Rodong Sinmun, North Korea’s official newspaper. We introduce a structured analytical framework that integrates Chain-of-Thought prompting for claim–evidence extraction and a GPT-4o–based automated evaluation system (G-Eval). Experimental results demonstrate that LLMs possess emerging discourse-level reasoning capabilities, showing notably improved alignment with expert analyses under one-shot prompting conditions. However, the models often reproduced ideological rhetoric uncritically or generated interpretive hallucinations, highlighting the risks of fully automated analysis. To address these issues, we propose a Hybrid Human-in-the-Loop evaluation framework that combines expert judgment with automated scoring. This study presents a novel approach to analyzing politically sensitive texts and offers empirical insights into the quantitative assessment of ideological discourse, underscoring the scalability and potential of automation-driven methodologies.

Language Model Re-rankers are Fooled by Lexical Similarities
Lovisa Hagström | Ercong Nie | Ruben Halifa | Helmut Schmid | Richard Johansson | Alexander Junge

Language model (LM) re-rankers are used to refine retrieval results for retrieval-augmented generation (RAG). They are more expensive than lexical matching methods like BM25 but assumed to better process semantic information and the relations between the query and the retrieved answers. To understand whether LM re-rankers always live up to this assumption, we evaluate 6 different LM re-rankers on the NQ, LitQA2 and DRUID datasets. Our results show that LM re-rankers struggle to outperform a simple BM25 baseline on DRUID. Leveraging a novel separation metric based on BM25 scores, we explain and identify re-ranker errors stemming from lexical dissimilarities. We also investigate different methods to improve LM re-ranker performance and find these methods mainly useful for NQ. Taken together, our work identifies and explains weaknesses of LM re-rankers and points to the need for more adversarial and realistic datasets for their evaluation.
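
The abstract does not spell out the separation metric, so the following Python sketch is only an assumed illustration of the underlying idea: score a query against a gold passage and a distractor with BM25 and treat the score gap as a measure of lexical separation. The function name and the gap definition are placeholders, not the authors' metric.

    # Illustrative sketch: a BM25 score gap as a rough proxy for lexical separation
    # between a query and candidate passages (assumed form, not the paper's metric).
    from rank_bm25 import BM25Okapi

    def bm25_separation(query: str, gold: str, distractor: str) -> float:
        tokenized = [doc.lower().split() for doc in (gold, distractor)]
        bm25 = BM25Okapi(tokenized)
        scores = bm25.get_scores(query.lower().split())
        # A small or negative gap flags cases where surface overlap favors the
        # distractor, the situation in which re-rankers are reported to struggle.
        return float(scores[0] - scores[1])

    q = "who wrote the novel dracula"
    gold = "Dracula is an 1897 Gothic horror novel by Bram Stoker."
    distractor = "The novel inspired many films about Count Dracula."
    print(bm25_separation(q, gold, distractor))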

Portuguese Automated Fact-checking: Information Retrieval with Claim extraction
Juliana Gomes | Eduardo Garcia | Arlindo Rodrigues Galvão Filho

Current Portuguese Automated Fact-Checking (AFC) research often relies on datasets lacking integrated external evidence crucial for comprehensive verification. This study addresses this gap by systematically enriching Portuguese misinformation datasets. We retrieve web evidence by simulating user information-seeking behavior, guided by core claims extracted using Large Language Models (LLMs). Additionally, we apply a semi-automated validation framework to enhance dataset reliability. Our analysis reveals that inherent dataset characteristics impact data properties, evidence retrieval, and AFC model performance. While enrichment generally improves detection, its efficacy varies, influenced by challenges such as self-reinforcing online misinformation and API limitations. This work contributes enriched datasets, associating original texts with retrieved evidence and LLM-extracted claims, to foster future evidence-based fact-checking research. The code and enriched data for this study are available at https://github.com/ju-resplande/pt_afc.

Multilingual Symptom Detection on Social Media: Enhancing Health-related Fact-checking with LLMs
Saidah Zahrotul Jannah | Elyanah Aco | Shaowen Peng | Shoko Wakamiya | Eiji Aramaki

Social media has emerged as a valuable source for early pandemic detection, as repeated mentions of symptoms by users may signal the onset of an outbreak. However, to be a reliable system, validation through fact-checking and verification against official health records is essential. Without this step, systems risk spreading misinformation to the public. The effectiveness of these systems also depends on their ability to process data in multiple languages, given the multilingual nature of social media data. Yet, many NLP datasets and disease surveillance systems remain heavily English-centric, leading to significant performance gaps for low-resource languages. This issue is especially critical in Southeast Asia, where symptom expression may vary culturally and linguistically. Therefore, this study evaluates the symptom detection capabilities of LLMs in social media posts across multiple languages, models, and symptoms to enhance health-related fact-checking. Our results reveal significant language-based discrepancies, with European languages outperforming under-resourced Southeast Asian languages. Furthermore, we identify symptom-specific challenges, particularly in detecting respiratory illnesses such as influenza, which LLMs tend to overpredict. The overestimation or misclassification of symptom mentions can lead to false alarms or public misinformation when deployed in real-world settings. This underscores the importance of symptom detection as a critical first step in medical fact-checking within early outbreak detection systems.

When Scale Meets Diversity: Evaluating Language Models on Fine-Grained Multilingual Claim Verification
Hanna Shcharbakova | Tatiana Anikina | Natalia Skachkova | Josef Van Genabith

The rapid spread of multilingual misinformation requires robust automated fact verification systems capable of handling fine-grained veracity assessments across diverse languages. While large language models have shown remarkable capabilities across many NLP tasks, their effectiveness for multilingual claim verification with nuanced classification schemes remains understudied. We conduct a comprehensive evaluation of five state-of-the-art language models on the X-Fact dataset, which spans 25 languages with seven distinct veracity categories. Our experiments compare small language models (encoder-based XLM-R and mT5) with recent decoder-only LLMs (Llama 3.1, Qwen 2.5, Mistral Nemo) using both prompting and fine-tuning approaches. Surprisingly, we find that XLM-R (270M parameters) substantially outperforms all tested LLMs (7-12B parameters), achieving 57.7% macro-F1 compared to the best LLM performance of 16.9%. This represents a 15.8% improvement over the previous state-of-the-art (41.9%), establishing new performance benchmarks for multilingual fact verification. Our analysis reveals problematic patterns in LLM behavior, including systematic difficulties in leveraging evidence and pronounced biases toward frequent categories in imbalanced data settings. These findings suggest that for fine-grained multilingual fact verification, smaller specialized models may be more effective than general-purpose large models, with important implications for practical deployment of fact-checking systems.

Less Can be More: An Empirical Evaluation of Small and Large Language Models for Sentence-level Claim Detection
Andrew Bell

Sentence-level claim detection is a critical first step in the fact-checking process. While Large Language Models (LLMs) seem well-suited for claim detection, their computational cost poses challenges for real-world deployment. This paper investigates the effectiveness of both small and large pretrained Language Models for the task of claim detection. We conduct a comprehensive empirical evaluation using BERT, ModernBERT, RoBERTa, Llama, and ChatGPT-based models. Our results reveal that smaller models, when finetuned appropriately, can achieve competitive performance with significantly lower computational overhead on in-domain tasks. Notably, we also find that BERT-based models transfer poorly to out-of-domain sentence-level claim detection. We discuss the implications of these findings for practitioners and highlight directions for future research.

RAG based Question Answering of Korean Laws and Precedents
Kiho Seo | Takehito Utsuro

We propose a method of improving the performance of question answering based on the interpretation of criminal law regulations in the Korean language by using large language models. In this study, we develop a system that accumulates legislative texts and case precedents related to criminal procedures published on the Internet. The system searches for relevant legal provisions and precedents related to the query under the RAG (Retrieval-Augmented Generation) framework. It generates accurate responses to questions by conducting reasoning through large language models based on these relevant laws and precedents. As an application example of this system, it can be utilized to support decision making in investigations and legal interpretation scenarios within the field of Korean criminal law.

FACT5: A Novel Benchmark and Pipeline for Nuanced Fact-Checking of Complex Statements
Shayan Chowdhury | Sunny Fang | Smaranda Muresan

Fact-checking complex statements is integral to combating misinformation, but manual approaches are time-consuming, while automated approaches often oversimplify truthfulness into binary classifications and rely on resource-intensive models. This paper introduces: (i) FACT5, a curated dataset of 150 real-world statements with five ordinal classes of truthfulness, designed to capture the nuanced nature of factual accuracy and (ii) an open-source end-to-end pipeline using large language models (LLMs) that decomposes statements into atomic claims, generates targeted questions, retrieves evidence from the web, and produces justified verdicts. We evaluate our pipeline on FACT5 using Mistral-7B-v0.3 and Google’s Gemini-1.5-Flash. Our findings demonstrate significant improvements over baseline LLM performance, with Mistral-7B showing a 71.9% reduction in MSE for pass@3 evaluation. The FACT5 dataset, pipeline implementation, and evaluation framework are anonymized and provided at https://github.com/shayantist/FACT5/, and a demo of the pipeline can be interacted with at https://fact5check.streamlit.app/.
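
As a structural sketch of this decompose-question-retrieve-judge design (not the released implementation linked above), the Python skeleton below stubs out the LLM and web-retrieval calls; the function names, the five-point scale labels, and the averaging-based aggregation are illustrative assumptions.

    # Structural sketch of a decompose -> question -> retrieve -> judge pipeline.
    # All LLM and web-retrieval calls are hypothetical stubs; see the authors'
    # repository above for the actual implementation.
    from statistics import mean

    FIVE_POINT_SCALE = ["false", "mostly-false", "half-true", "mostly-true", "true"]

    def decompose(statement: str) -> list[str]:
        # Placeholder: an LLM would split the statement into atomic claims.
        return [statement]

    def generate_questions(claim: str) -> list[str]:
        # Placeholder: an LLM would produce targeted verification questions.
        return [f"What evidence supports or refutes: {claim}"]

    def retrieve_evidence(question: str) -> list[str]:
        # Placeholder: a web search would return evidence snippets.
        return ["(retrieved evidence snippet)"]

    def judge(claim: str, evidence: list[str]) -> int:
        # Placeholder: an LLM would return an index into FIVE_POINT_SCALE.
        return 2  # neutral default ("half-true")

    def fact_check(statement: str) -> str:
        scores = []
        for claim in decompose(statement):
            evidence = [e for q in generate_questions(claim) for e in retrieve_evidence(q)]
            scores.append(judge(claim, evidence))
        # Assumed aggregation: average the per-claim ordinal scores and round.
        return FIVE_POINT_SCALE[round(mean(scores))]

    print(fact_check("The unemployment rate fell to a 50-year low last year."))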

Correcting Hallucinations in News Summaries: Exploration of Self-Correcting LLM Methods with External Knowledge
Juraj Vladika | Ihsan Soydemir | Florian Matthes

While large language models (LLMs) have shown remarkable capabilities to generate coherent text, they suffer from the issue of hallucinations – factually inaccurate statements. Among the numerous approaches to tackling hallucinations, self-correcting methods are especially promising. They leverage the multi-turn nature of LLMs to iteratively generate verification questions that inquire about additional evidence, answer them with internal or external knowledge, and use that to refine the original response with the new corrections. These methods have been explored for encyclopedic generation, but less so for domains like news summaries. In this work, we investigate two state-of-the-art self-correcting systems by applying them to correct hallucinated summaries, using evidence from three search engines. We analyze the results and provide insights into the systems’ performance, revealing interesting practical findings on the benefits of search engine snippets and few-shot prompts, as well as high alignment between G-Eval and human evaluation.
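
A minimal sketch of such a self-correction loop is shown below, with stubbed llm() and search() helpers standing in for a chat model and a search-engine API; the prompts and the fixed number of rounds are illustrative, not the configurations evaluated in the paper.

    # Minimal sketch of an iterative self-correction loop for a hallucinated summary.
    # llm() and search() are hypothetical stubs standing in for a chat model and a
    # search-engine API; prompts and the fixed number of rounds are illustrative.
    def llm(prompt: str) -> str:
        return "(model output)"  # placeholder

    def search(query: str, k: int = 3) -> list[str]:
        return ["(search snippet)"] * k  # placeholder

    def self_correct(summary: str, rounds: int = 2) -> str:
        for _ in range(rounds):
            questions = llm(
                "List verification questions for the factual statements in this summary:\n"
                + summary
            ).splitlines()
            answers = []
            for q in questions:
                snippets = "\n".join(search(q))
                answers.append(llm(f"Answer '{q}' using only these snippets:\n{snippets}"))
            summary = llm(
                "Revise the summary so it is consistent with the answers below.\n"
                f"Summary:\n{summary}\nAnswers:\n" + "\n".join(answers)
            )
        return summary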

The Law of Knowledge Overshadowing: Towards Understanding, Predicting, and Preventing LLM Hallucination
Yuji Zhang | Sha Li | Cheng Qian | Jiateng Liu | Pengfei Yu | Chi Han | Yi R. Fung | Kathleen McKeown | ChengXiang Zhai | Manling Li | Heng Ji

Hallucination is a persistent challenge in large language models (LLMs), where even with rigorous quality control, models often generate distorted facts. This paradox, in which error generation continues despite high-quality training data, calls for a deeper understanding of the underlying LLM mechanisms. To address it, we propose a novel concept: knowledge overshadowing, where a model’s dominant knowledge can obscure less prominent knowledge during text generation, causing the model to fabricate inaccurate details. Building on this idea, we introduce a novel framework to quantify factual hallucinations by modeling knowledge overshadowing. Central to our approach is the log-linear law, which predicts that the rate of factual hallucination increases linearly with the logarithmic scale of (1) Knowledge Popularity, (2) Knowledge Length, and (3) Model Size. The law provides a means to preemptively quantify hallucinations, offering foresight into their occurrence even before model training or inference. Building on the overshadowing effect, we propose a new decoding strategy, CoDA, to mitigate hallucinations, which notably enhances model factuality on Overshadow (27.9%), MemoTrap (13.1%) and NQ-Swap (18.3%). Our findings not only deepen understanding of the underlying mechanisms behind hallucinations but also provide actionable insights for developing more predictable and controllable language models.
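
Written out as a formula (with coefficient names chosen here for illustration; the paper's exact parameterization may differ), the log-linear law says the hallucination rate R grows linearly in the logarithms of knowledge popularity P, knowledge length L, and model size S:

    R \;\approx\; \alpha \log P \;+\; \beta \log L \;+\; \gamma \log S \;+\; c,
    \qquad \alpha, \beta, \gamma > 0.

Under a form like this, multiplying any one of the three factors by a constant adds a roughly fixed increment to the predicted hallucination rate, which is what makes the law usable before training or inference.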

GQC: LLM-Based Grouped QA Consolidation for Open-Domain Fact Verification at AVeriTeC
Dongzhuoran Zhou | Roxana Pop | Yuqicheng Zhu | Evgeny Kharlamov

Structured fact verification benchmarks like AVeriTeC decompose claims into QA pairs to support fine-grained reasoning. However, current systems generate QA pairs independently for each evidence sentence, leading to redundancy, drift, and noise. We introduce a modular LLM-based QA consolidation module that jointly filters, clusters, and rewrites QA pairs at the claim level. Experiments show that this method improves evidence quality and veracity prediction accuracy. Our analysis also highlights the impact of model scale and alignment on downstream performance.
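
As a rough sketch of the clustering part of such a consolidation step (the filtering and rewriting are LLM-based in the paper and not shown), the snippet below greedily groups near-duplicate QA pairs for one claim by embedding similarity; the encoder choice and threshold are assumptions.

    # Sketch of the clustering part of a claim-level QA consolidation step: embed the
    # QA pairs generated for one claim and greedily drop near-duplicates by cosine
    # similarity. The encoder and threshold are assumptions; LLM-based filtering and
    # rewriting are not shown.
    from sentence_transformers import SentenceTransformer, util

    encoder = SentenceTransformer("all-MiniLM-L6-v2")

    def consolidate(qa_pairs: list[str], threshold: float = 0.85) -> list[str]:
        embeddings = encoder.encode(qa_pairs)
        kept, kept_emb = [], []
        for text, emb in zip(qa_pairs, embeddings):
            if all(float(util.cos_sim(emb, e)) < threshold for e in kept_emb):
                kept.append(text)      # new cluster representative
                kept_emb.append(emb)
            # else: near-duplicate of an already kept QA pair, drop it
        return kept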

(Fact) Check Your Bias
Eivind Morris Bakke | Nora Winger Heggelund

Automatic fact verification systems increasingly rely on large language models (LLMs). We investigate how parametric knowledge biases in these models affect fact-checking outcomes of the HerO system (baseline for FEVER-25). We examine how the system is affected by: (1) potential bias in Llama 3.1’s parametric knowledge and (2) intentionally injected bias. When prompted directly to perform fact-verification, Llama 3.1 labels nearly half the claims as “Not Enough Evidence”. Using only its parametric knowledge, it is able to reach a verdict on the remaining half of the claims. In the second experiment, we prompt the model to generate supporting, refuting, or neutral fact-checking documents. These prompts significantly influence retrieval outcomes, with approximately 50% of retrieved evidence being unique to each perspective. Notably, the model sometimes refuses to generate supporting documents for claims it believes to be false, creating an inherent negative bias. Despite differences in retrieved evidence, final verdict predictions show stability across prompting strategies. The code is available at: https://github.com/eibakke/FEVER-8-Shared-Task
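
For illustration, prompt templates for the three perspectives might look like the following; the exact wording used in the paper's experiments is not reproduced here.

    # Illustrative prompt templates for the perspective-injection experiment; the
    # exact wording used in the paper is not reproduced here.
    PERSPECTIVE_PROMPTS = {
        "support": "Write a fact-checking document arguing that this claim is TRUE: {claim}",
        "refute": "Write a fact-checking document arguing that this claim is FALSE: {claim}",
        "neutral": "Write a balanced fact-checking document about this claim: {claim}",
    }

    def build_prompt(perspective: str, claim: str) -> str:
        return PERSPECTIVE_PROMPTS[perspective].format(claim=claim)

    print(build_prompt("refute", "Coffee consumption cures the common cold."))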

EMULATE: A Multi-Agent Framework for Determining the Veracity of Atomic Claims by Emulating Human Actions
Spencer Hong | Meng Luo | Xinyi Wan

Determining the veracity of atomic claims is an imperative component of many recently proposed fact-checking systems. Many approaches tackle this problem by first retrieving evidence by querying a search engine and then performing classification by providing the evidence set and atomic claim to a large language model, but this process deviates from what a human would do in order to perform the task. Recent work attempted to address this issue by proposing iterative evidence retrieval, allowing for evidence to be collected several times and only when necessary. Continuing along this line of research, we propose a novel claim verification system, called EMULATE, which is designed to better emulate human actions through the use of a multi-agent framework where each agent performs a small part of the larger task, such as ranking search results according to predefined criteria or evaluating webpage content. Extensive experiments on several benchmarks show clear improvements over prior work, demonstrating the efficacy of our new multi-agent framework. Our code is available at https://github.com/qqqube/EMULATE.

SemQA: Evaluating Evidence with Question Embeddings and Answer Entailment for Fact Verification
Kjetil Indrehus | Caroline Vannebo | Roxana Pop

Automated fact-checking (AFC) of factual claims requires efficiency and accuracy. Existing evaluation frameworks like Ev2R achieve strong semantic grounding but incur substantial computational cost, while simpler metrics based on overlap or one-to-one matching often misalign with human judgments. In this paper, we introduce SemQA, a lightweight and accurate evidence-scoring metric that combines transformer-based question scoring with bidirectional NLI entailment on answers. We evaluate SemQA by conducting human evaluations, analyzing correlations with existing metrics, and examining representative examples.
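
A rough sketch of the two ingredients named above, transformer-based question similarity and bidirectional entailment over answers, is given below; the simple averaging used to combine them and the stubbed entails() helper are assumptions, not the SemQA formula itself.

    # Rough sketch of the two ingredients: transformer-based question similarity and
    # bidirectional entailment over answers. The averaging used to combine them and
    # the entails() stub are assumptions, not the SemQA formula itself.
    from sentence_transformers import SentenceTransformer, util

    encoder = SentenceTransformer("all-MiniLM-L6-v2")

    def entails(premise: str, hypothesis: str) -> float:
        # Placeholder for an NLI model returning P(entailment); plug in any NLI
        # classifier here.
        return 1.0 if hypothesis.lower() in premise.lower() else 0.0

    def semqa_like_score(pred_q: str, pred_a: str, gold_q: str, gold_a: str) -> float:
        q_sim = float(util.cos_sim(encoder.encode(pred_q), encoder.encode(gold_q)))
        a_ent = 0.5 * (entails(pred_a, gold_a) + entails(gold_a, pred_a))  # bidirectional
        return 0.5 * (q_sim + a_ent)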

The 2nd Automated Verification of Textual Claims (AVeriTeC) Shared Task: Open-weights, Reproducible and Efficient Systems
Mubashara Akhtar | Rami Aly | Yulong Chen | Zhenyun Deng | Michael Schlichtkrull | Chenxi Whitehouse | Andreas Vlachos

In the First Automated Verification of Textual Claims (AVeriTeC) shared task, participating teams developed systems that, for each claim, retrieve evidence from the web and predict its veracity. While there was progress in automated fact-checking for real-world claims, the majority of the systems proposed relied on closed-weights large language models, which rendered them expensive to run and less reproducible. To ameliorate this issue, in this year’s edition of the AVeriTeC shared task we required systems to use only open-weights models that could be run on a single GPU with 23GB of RAM, and to take one minute or less to return verdicts accompanied by evidence retrieved from a precompiled knowledge store. The shared task received 7 submissions, 6 of which exceeded the accuracy of our baseline on the test set while running in under a minute per claim on the hardware we had specified. The winning team was CTU AIC with an AVeriTeC score of 33.17%. In this paper we describe the shared task in detail and highlight key findings.

Team HUMANE at AVeriTeC 2025: HerO 2 for Efficient Fact Verification
Yejun Yoon | Jaeyoon Jung | Seunghyun Yoon | Kunwoo Park

This paper presents HerO 2, Team HUMANE’s system for the AVeriTeC shared task at the FEVER-25 workshop. HerO 2 is an enhanced version of HerO, the best-performing open-source model from the previous year’s challenge. It improves evidence quality through document summarization and answer reformulation, optimizes veracity prediction via post-training quantization under computational constraints, and enhances overall system performance by integrating updated language model (LM) backbones. HerO 2 ranked second on the leaderboard while achieving the shortest runtime among the top three systems, demonstrating both high efficiency and strong potential for real-world fact verification. The code is available at https://github.com/ssu-humane/HerO2.

Exploring Semantic Filtering Heuristics For Efficient Claim Verification
Max Upravitelev | Premtim Sahitaj | Arthur Hilbert | Veronika Solopova | Jing Yang | Nils Feldhus | Tatiana Anikina | Simon Ostermann | Vera Schmitt

Given the limited computational and financial resources of news agencies, real-life usage of fact-checking systems requires fast response times. For this reason, our submission to the FEVER-8 claim verification shared task focuses on optimizing the efficiency of such pipelines built around subtasks such as evidence retrieval and veracity prediction. We propose the Semantic Filtering for Efficient Fact Checking (SFEFC) strategy, which is inspired by the FEVER-8 baseline and designed with the goal of reducing the number of LLM calls and other computationally expensive subroutines. Furthermore, we explore the reuse of cosine similarities initially calculated within a dense retrieval step to retrieve the top 10 most relevant evidence sentence sets. We use these sets for semantic filtering methods based on similarity scores and create filters for particularly hard classification labels “Not Enough Information” and “Conflicting Evidence/Cherrypicking” by identifying thresholds for potentially relevant information and the semantic variance within these sets. Compared to the parallelized FEVER-8 baseline, which takes 33.88 seconds on average to process a claim according to the FEVER-8 shared task leaderboard, our non-parallelized system remains competitive in regard to AVeriTeC retrieval scores while reducing the runtime to 7.01 seconds, achieving the fastest average runtime per claim.
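
The thresholding idea can be sketched as follows, reusing the cosine similarities already computed during dense retrieval; the threshold values below are placeholders, not the tuned ones from the paper.

    # Sketch of the threshold-based filtering idea: reuse the cosine similarities
    # from dense retrieval to pre-assign the two hard labels before any LLM call.
    # The threshold values are placeholders, not the tuned ones from the paper.
    import numpy as np

    NEI_THRESHOLD = 0.30       # assumed: max similarity below this -> likely NEI
    CONFLICT_VARIANCE = 0.05   # assumed: variance above this -> possible conflict

    def filter_label(similarities: np.ndarray) -> str | None:
        """similarities: cosine scores of the top retrieved evidence sets for a claim."""
        if similarities.max() < NEI_THRESHOLD:
            return "Not Enough Information"
        if similarities.var() > CONFLICT_VARIANCE:
            return "Conflicting Evidence/Cherrypicking"
        return None  # fall through to the full LLM-based veracity prediction

    print(filter_label(np.array([0.12, 0.10, 0.08])))  # -> Not Enough Information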

OldJoe at AVeriTeC: In-context learning for fact-checking
Farah Ftouhi | Russel Dsouza | Lance Calvin Lim Gamboa | Asim Abbas | Mubashir Ali | Yue Feng | Mark G. Lee | Venelin Kovatchev

In this paper, we present the system proposed by our team OldJoe for the 8th edition of the AVeriTeC shared task, as part of the FEVER workshop. The objective of this task is to verify the factuality of real-world claims. Our approach integrates open-source large language models, SQL, and in-context learning. We begin by embedding the knowledge store using a pretrained embedding language model and storing the outputs in a SQL database. Subsequently, we prompt an LLM to craft relevant questions based on the input claim, which are then used to guide the retrieval process. We further prompt the LLM to generate answers to the questions and predict the veracity of the original claim. Our system scored 0.49 on the HU-METEOR AVeriTeC score on the dev set and 0.15 on the Ev2R recall on the test set. Due to the time constraint, we were unable to conduct additional experiments or further hyperparameter tuning. As a result, we adopted this pipeline configuration centered on the Qwen3-14B-AWQ model as our final submission strategy. The full pipeline is available on GitHub: https://github.com/farahft/OldJoe
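
A minimal sketch of the embed-and-store-in-SQL step is given below, using SQLite and a random stand-in embedder (so retrieval here is not meaningful); the schema and retrieval query are assumptions rather than the team's implementation.

    # Minimal sketch of the embed-and-store-in-SQL step using SQLite. The embedder
    # is a random stand-in (so retrieval here is not meaningful); swap in a real
    # sentence encoder for actual use. Schema and query are assumptions.
    import json
    import sqlite3

    import numpy as np

    def embed(text: str) -> np.ndarray:
        # Placeholder for a pretrained embedding model.
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        v = rng.normal(size=64)
        return v / np.linalg.norm(v)

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE store (id INTEGER PRIMARY KEY, text TEXT, emb TEXT)")
    for doc in ["First knowledge-store document.", "Second knowledge-store document."]:
        conn.execute("INSERT INTO store (text, emb) VALUES (?, ?)",
                     (doc, json.dumps(embed(doc).tolist())))

    def retrieve(question: str, k: int = 1) -> list[str]:
        q = embed(question)
        rows = conn.execute("SELECT text, emb FROM store").fetchall()
        ranked = sorted(rows, key=lambda r: -float(np.dot(q, np.array(json.loads(r[1])))))
        return [text for text, _ in ranked[:k]]

    print(retrieve("What does the first document say?"))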

SANCTUARY: An Efficient Evidence-based Automated Fact Checking System
Arbaaz Dharmavaram | Saqib Hakak

With the growing volume of misinformation online, automated fact-checking systems are becoming increasingly important. This paper presents SANCTUARY, an efficient pipeline for evidence-based verification of real-world claims. Our approach consists of three stages: Hypothetical Question & Passage Generation, a two-step Retrieval-Augmented Generation (RAG) hybrid evidence retrieval, and structured reasoning and prediction, which leverages two lightweight Large Language Models (LLMs). On the challenging AVeriTeC benchmark, our system achieves 25.27 points on the new AVeriTeC score (Ev2R recall), outperforming the previous state-of-the-art baseline by 5 absolute points (1.25× relative improvement). Sanctuary demonstrates that careful retrieval, reasoning strategies and well-integrated language models can substantially advance automated fact-checking performance.

Fathom: A Fast and Modular RAG Pipeline for Fact-Checking
Farrukh Bin Rashid | Saqib Hakak

We present Fathom, a Retrieval-Augmented Generation (RAG) pipeline for automated fact-checking, built entirely using lightweight open-source language models. The system begins with HyDE-style question generation to expand the context around each claim, followed by a dual-stage retrieval process using BM25 and semantic similarity to gather relevant evidence. Finally, a lightweight LLM performs veracity prediction, producing both a verdict and supporting rationale. Despite relying on smaller models, our system achieved an AVeriTeC score of 0.2043 on the test set, a 0.99% absolute improvement over the baseline and 0.378 on the dev set, marking a 27.7% absolute improvement.
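
The dual-stage retrieval step can be sketched as a simple fusion of BM25 and dense cosine scores; the min-max normalization, the interpolation weight, and the encoder choice below are assumptions for illustration, not the Fathom configuration.

    # Sketch of a dual-stage retrieval step that fuses BM25 with dense cosine
    # similarity. The min-max normalization, interpolation weight, and encoder
    # choice are assumptions for illustration.
    from rank_bm25 import BM25Okapi
    from sentence_transformers import SentenceTransformer, util

    encoder = SentenceTransformer("all-MiniLM-L6-v2")

    def hybrid_retrieve(query: str, docs: list[str], k: int = 3, alpha: float = 0.5) -> list[str]:
        bm25 = BM25Okapi([d.lower().split() for d in docs])
        sparse = list(bm25.get_scores(query.lower().split()))
        dense = util.cos_sim(encoder.encode(query), encoder.encode(docs))[0].tolist()

        def norm(xs):  # min-max normalize each score list before interpolating
            lo, hi = min(xs), max(xs)
            return [(x - lo) / (hi - lo + 1e-9) for x in xs]

        fused = [alpha * s + (1 - alpha) * d for s, d in zip(norm(sparse), norm(dense))]
        ranked = sorted(zip(fused, docs), reverse=True)
        return [doc for _, doc in ranked[:k]]

    docs = ["Bram Stoker wrote Dracula in 1897.", "Many films feature Count Dracula."]
    print(hybrid_retrieve("who wrote dracula", docs, k=1))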

Graph-of-Thoughts for Fact-Checking with Large Language Models
Sascha Rolinger | Jin Liu

We present a fact-checking system developed for the 2025 Automated Verification of Textual Claims (AVeriTeC) shared task, leveraging the Graph-of-Thoughts (GoT) prompting scheme. The GoT approach facilitates iterative refinement during fact-checking by conditioning question generation on previous answers and enabling the incorporation of multiple evidence documents per question, thereby mitigating the impact of factually incorrect evidence. The efficiency requirements of the shared task are addressed by restricting the width and depth of the thought graph. Additionally, an efficient stopping criterion is derived from the dataset’s Not Enough Information (NEI) label. Our system utilizes fine-tuned open-source Large Language Models (LLMs) for question generation, question answering, and final verdict prediction. Empirical results demonstrate competitive performance against top-performing systems in the AVeriTeC shared task and improvements over the baseline method. Our code is publicly available.
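
A skeleton of a width- and depth-bounded GoT loop with an NEI-based early stop might look like the following; the stubbed LLM call, the concrete bounds, and the stopping rule are illustrative assumptions, not the authors' configuration.

    # Skeleton of a width- and depth-bounded Graph-of-Thoughts fact-checking loop
    # with an NEI-based early stop. The stubbed LLM call, the bounds, and the
    # stopping rule are illustrative assumptions.
    MAX_DEPTH, MAX_WIDTH = 3, 2

    def ask_llm(prompt: str) -> str:
        return "(model output)"  # placeholder for a fine-tuned open-source LLM

    def verify(claim: str) -> str:
        qa_history: list[tuple[str, str]] = []
        for _ in range(MAX_DEPTH):
            # Condition new questions on all previous answers (the GoT refinement step).
            context = "\n".join(f"Q: {q}\nA: {a}" for q, a in qa_history)
            questions = ask_llm(
                f"Claim: {claim}\nPrevious QA:\n{context}\n"
                f"Generate up to {MAX_WIDTH} new verification questions."
            ).splitlines()[:MAX_WIDTH]
            for q in questions:
                qa_history.append((q, ask_llm(f"Answer using retrieved evidence: {q}")))
            context = "\n".join(f"Q: {q}\nA: {a}" for q, a in qa_history)
            verdict = ask_llm(f"Claim: {claim}\nQA:\n{context}\nVerdict?")
            if verdict != "Not Enough Information":  # assumed early-stopping rule
                return verdict
        return "Not Enough Information"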

AIC CTU@FEVER 8: On-premise fact checking through long context RAG
Herbert Ullrich | Jan Drchal

In this paper, we present our fact-checking pipeline, which scored first in the FEVER 8 shared task. Our fact-checking system is a simple two-step RAG pipeline based on our last year’s submission. We show how the pipeline can be redeployed on-premise, achieving state-of-the-art fact-checking performance (in terms of the Ev2R test score) even under the constraint of a single Nvidia A10 GPU, 23GB of graphical memory, and a 60-second running time per claim.
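
As a minimal sketch of a two-step retrieve-then-generate pipeline of this kind (not the AIC CTU code; both steps are hypothetical stubs), the claim and a long evidence context are passed to a local model in a single generation call:

    # Minimal two-step retrieve-then-generate sketch (not the AIC CTU code): pull
    # relevant chunks from the precompiled knowledge store, then pass the claim and
    # a long evidence context to a local LLM in one generation call.
    def retrieve(claim: str, k: int = 20) -> list[str]:
        return ["(retrieved evidence chunk)"] * k  # placeholder retrieval step

    def generate(prompt: str) -> str:
        return "(QA pairs and verdict)"  # placeholder for an on-premise LLM

    def fact_check(claim: str) -> str:
        context = "\n\n".join(retrieve(claim))  # step 1: long-context evidence
        return generate(                         # step 2: single generation call
            f"Claim: {claim}\n\nEvidence:\n{context}\n\n"
            "Generate question-answer pairs from the evidence and a final verdict."
        )

    print(fact_check("The Eiffel Tower was completed in 1889."))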