Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025)

Amanpreet Singh (Editor)


Anthology ID:
2025.sdp-1
Month:
July
Year:
2025
Address:
Vienna, Austria
Venues:
sdp | WS
SIG:
Publisher:
Association for Computational Linguistics
URL:
https://preview.aclanthology.org/acl25-workshop-ingestion/2025.sdp-1/
DOI:
ISBN:
979-8-89176-265-7
PDF:
https://preview.aclanthology.org/acl25-workshop-ingestion/2025.sdp-1.pdf

pdf bib
Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025)
Amanpreet Singh

pdf bib
Overview of the Fifth Workshop on Scholarly Document Processing
Tirthankar Ghosal | Philipp Mayr | Anita De Waard | Aakanksha Naik | Amanpreet Singh | Dayne Freitag | Georg Rehm | Sonja Schimmler | Dan Li

The workshop on Scholarly Document Processing (SDP) started in 2020 to accelerate research, inform policy, and educate the public on natural language processing for scientific text. The fifth iteration of the workshop, SDP 2025, was held at the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025) in Vienna as a hybrid event. The workshop saw a great increase in interest, with 26 submissions, of which 11 were accepted for the research track. The program consisted of a research track, xx invited talks, and four shared tasks: (1) SciHal25: Hallucination Detection for Scientific Content, (2) SciVQA: Scientific Visual Question Answering, (3) ClimateCheck: Scientific Fact-checking of Social Media Posts on Climate Change, and (4) Software Mention Detection in Scholarly Publications (SOMD 25). In addition to the four shared task overview papers, xx shared task reports were accepted. The program was geared towards NLP, information extraction, information retrieval, and data mining for scholarly documents, with an emphasis on identifying and providing solutions to open challenges.

pdf bib
TeXpert: A Multi-Level Benchmark for Evaluating LaTeX Code Generation by LLMs
Sahil Kale | Vijaykant Nadadur

LaTeX’s precision and flexibility in typesetting have made it the gold standard for the preparation of scientific documentation. Large Language Models (LLMs) present a promising opportunity for researchers to produce publication-ready material using LaTeX with natural language instructions, yet current benchmarks completely lack evaluation of this ability. By introducing TeXpert, our benchmark dataset with natural language prompts for generating LaTeX code focused on components of scientific documents across multiple difficulty levels, we conduct an in-depth analysis of LLM performance in this regard and identify frequent error types. Our evaluation across open and closed-source LLMs highlights multiple key findings: LLMs excelling on standard benchmarks perform poorly in LaTeX generation with a significant accuracy drop-off as the complexity of tasks increases; open-source models like DeepSeek v3 and DeepSeek Coder strongly rival closed-source counterparts in LaTeX tasks; and formatting and package errors are unexpectedly prevalent, suggesting a lack of diverse LaTeX examples in the training datasets of most LLMs. Our dataset, code, and model evaluations are available on GitHub at https://github.com/knowledge-verse-ai/TeXpert.
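
Since the abstract's error analysis points to formatting and package errors, a basic sanity check on generated LaTeX is whether it compiles at all. The sketch below is a generic compilability check under the assumption that a pdflatex installation is available; it is not the TeXpert scoring procedure itself.

```python
# Hedged sketch: compile a generated LaTeX snippet to flag formatting/package errors.
# This is a generic sanity check, not the TeXpert benchmark's evaluation code.
import pathlib
import subprocess
import tempfile

def compiles(latex_source: str) -> bool:
    """Return True if pdflatex builds the source without errors (assumes a TeX install)."""
    with tempfile.TemporaryDirectory() as tmp:
        tex_file = pathlib.Path(tmp) / "doc.tex"
        tex_file.write_text(latex_source)
        result = subprocess.run(
            ["pdflatex", "-interaction=nonstopmode", "-halt-on-error", tex_file.name],
            cwd=tmp,
            capture_output=True,
        )
        return result.returncode == 0

print(compiles(r"\documentclass{article}\begin{document}Hello, SDP!\end{document}"))
```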

pdf bib
MathD2: Towards Disambiguation of Mathematical Terms
Shufan Jiang | Mary Ann Tan | Harald Sack

In mathematical literature, terms can have multiple meanings based on context. Manual disambiguation across scholarly articles demands massive effort from mathematicians. This paper addresses the challenge of automatically determining whether two definitions of a mathematical term are semantically different. Specifically, the difficulties involved and how contextualized textual representations can help resolve the problem are investigated. A new dataset for mathematical term disambiguation, MathD2, is constructed from ProofWiki’s disambiguation pages. Three approaches based on contextualized textual representations are then studied: (1) supervised classification based on the embedding of the concatenated definition and title; (2) zero-shot prediction based on semantic textual similarity (STS) between definition and title; and (3) zero-shot LLM prompting. The first two approaches achieve accuracy greater than 0.9 on the ground truth dataset, demonstrating the effectiveness of our methods for the automatic disambiguation of mathematical definitions. Our dataset and source code are available here: https://github.com/sufianj/MathTermDisambiguation.
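
As an illustration of the second approach (zero-shot STS), the sketch below scores the similarity of two definition texts with a sentence-embedding model and treats them as the same sense above a threshold; the checkpoint name and threshold are assumptions for illustration, not the paper's configuration.

```python
# Hedged sketch of zero-shot STS-based disambiguation: two definitions are treated as
# the same sense when their embedding similarity exceeds a threshold.
# Checkpoint and threshold are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed general-purpose encoder

def same_sense(definition_a: str, definition_b: str, threshold: float = 0.7) -> bool:
    embeddings = model.encode([definition_a, definition_b], convert_to_tensor=True)
    similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
    return similarity >= threshold

print(same_sense(
    "A group is a set together with an associative binary operation, an identity, and inverses.",
    "A group is a collection of database users that share access permissions.",
))
```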

pdf bib
GraphTranslate: Predicting Clinical Trial Translation using Graph Neural Networks on Biomedical Literature
Emily Muller | Justin Boylan-Toomey | Jack Ekinsmyth | Arne Robben | María De La Paz Cardona | Antonia Langfelder

The translation of basic science into clinical interventions represents a critical yet prolonged pathway in biomedical research, with significant implications for human health. While previous translation prediction approaches have focused on citation-based and metadata metrics or semantic analysis, the complex network structure of scientific knowledge remains under-explored. In this work, we present a novel graph neural network approach that leverages both semantic and structural information to predict which research publications will lead to clinical trials. Our model analyses a comprehensive dataset of 19 million publication nodes, using transformer-based title and abstract sentence embeddings within their citation network context. We demonstrate that our graph-based architecture, which employs attention mechanisms over local citation neighbourhoods, outperforms traditional convolutional approaches by effectively capturing knowledge flow patterns (F1 improvement of 4.5 and 3.5 percentage points for direct and indirect translation). Our metadata is carefully selected to eliminate potential biases from researcher-specific information, while maintaining predictive power through network structural features. Notably, our model achieves state-of-the-art performance using only content-based features, showing that language inherently captures many of the predictive features of translation. Through rigorous validation on a held-out time window (2021), we demonstrate generalisation across different biomedical domains and provide insights into early indicators of translational research potential. Our system offers immediate practical value for research funders, enabling evidence-based assessment of translational potential during grant review processes.
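
The abstract describes attention over local citation neighbourhoods applied to transformer embeddings of titles and abstracts; a minimal PyTorch Geometric sketch of that kind of architecture could look as follows. Dimensions, layer choices, and the two-layer design are illustrative assumptions, not the authors' model.

```python
# Minimal sketch of a citation-graph node classifier with neighbourhood attention,
# in the spirit of the approach described above (not the authors' implementation).
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv

class CitationGAT(torch.nn.Module):
    def __init__(self, in_dim: int = 768, hidden: int = 128, num_classes: int = 2, heads: int = 4):
        super().__init__()
        self.gat1 = GATConv(in_dim, hidden, heads=heads)
        self.gat2 = GATConv(hidden * heads, num_classes, heads=1)

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        # x: per-publication title/abstract embeddings; edge_index: citation edges
        h = F.elu(self.gat1(x, edge_index))
        return self.gat2(h, edge_index)  # per-node "leads to a clinical trial" logits
```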

pdf bib
The ClimateCheck Dataset: Mapping Social Media Claims About Climate Change to Corresponding Scholarly Articles
Raia Abu Ahmad | Aida Usmanova | Georg Rehm

The rapid spread of misinformation on and through social media poses a significant challenge to public understanding of climate change and evidence-based policymaking. While natural language processing techniques have been used to analyse online discourse on climate change, no existing resources link social media claims to scientific literature. Thus, we introduce ClimateCheck, a human-annotated dataset that connects 435 unique, climate-related English claims in lay language to scientific abstracts. Each claim is connected to at least one and at most seventeen abstracts, resulting in 3,048 annotated claim-abstract pairs. The dataset aims to facilitate fact-checking and claim verification by leveraging scholarly document processing to improve access to scientific evidence in online discussions about climate change.

pdf bib
Analyzing the Evolution of Scientific Misconduct Based on the Language of Retracted Papers
Christof Bless | Andreas Waldis | Angelina Parfenova | Maria A. Rodriguez | Andreas Marfurt

Amid rising numbers of organizations producing counterfeit scholarly articles, it is important to quantify the prevalence of scientific misconduct. We assess the feasibility of automated text-based methods to determine the rate of scientific misconduct by analyzing linguistic differences between retracted and non-retracted papers. We find that retracted works show distinct phrase patterns and higher word repetition. Motivated by this, we evaluate two misconduct detection methods, a mixture distribution approach and a Transformer-based one. The best models achieve high accuracy (>0.9 F1) on detection of paper mill articles and automatically generated content, making them viable tools for flagging papers for closer review. We apply the classifiers to more than 300,000 paper abstracts to quantify misconduct over time and find that our estimation methods accurately reproduce trends observed in the real data.

pdf bib
Collage: Decomposable Rapid Prototyping for Co-Designed Information Extraction on Scientific PDFs
Sireesh Gururaja | Yueheng Zhang | Guannan Tang | Tianhao Zhang | Kevin Murphy | Yu-Tsen Yi | Junwon Seo | Anthony Rollett | Emma Strubell

Recent years in NLP have seen the continued development of domain-specific information extraction tools for scientific documents, alongside the release of increasingly multimodal pretrained language models. While applying and evaluating these new, general-purpose language model systems in specialized domains has never been easier, it remains difficult to compare them with models developed specifically for those domains, which tend to accept a narrower range of input formats, and are difficult to evaluate in the context of the original documents. Meanwhile, the general-purpose systems are often black-box and give little insight into preprocessing (like conversion to plain text or markdown) that can have significant downstream impact on their results. In this work, we present Collage, a tool intended to support the co-design of information extraction systems on scientific PDFs between NLP developers and scientists by facilitating the rapid prototyping, visualization, and comparison of different information extraction models, regardless of their input modality. For scientists, Collage provides side-by-side visualization and comparison of multiple models of different input and output modalities in the context of the PDF content they are applied to; for developers, Collage allows the rapid deployment of new models by abstracting away PDF preprocessing and visualization into easily extensible software interfaces. Further, we enable both developers and scientists to inspect, debug, and better understand modeling pipelines by providing granular views of intermediate states of processing. We demonstrate our system in the context of information extraction to assist with literature review in materials science.

pdf bib
Literature discovery with natural language queries
Anna Kiepura | Jessica Lam | Nianlong Gu | Richard Hahnloser

Literature discovery is a critical component of scientific research. Modern discovery systems leveraging Large Language Models (LLMs) are increasingly adopted for their ability to process natural language queries (NLQs). To assess the robustness of such systems, we compile two NLQ datasets and submit them to nine widely used discovery platforms. Our findings reveal that LLM-based search engines struggle with precisely formulated queries, often producing numerous false positives. However, precision improves when LLMs are used not for direct retrieval but to convert NLQs into structured keyword-based queries. As a result, hybrid systems that integrate both LLM-driven and keyword-based approaches outperform purely keyword-based or purely LLM-based discovery methods.
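
The finding that precision improves when an LLM rewrites the NLQ into a structured keyword query, rather than retrieving directly, suggests a simple intermediate step such as the hedged sketch below; the model name and prompt wording are assumptions for illustration, not the systems studied in the paper.

```python
# Hedged sketch: use an LLM to turn a natural language query into a Boolean keyword
# query for a conventional search backend. Model and prompt are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def nlq_to_keyword_query(nlq: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model
        messages=[{
            "role": "user",
            "content": (
                "Rewrite this literature-search request as a short Boolean keyword "
                f"query (terms joined with AND/OR, no prose): {nlq}"
            ),
        }],
    )
    return response.choices[0].message.content.strip()

print(nlq_to_keyword_query("studies linking sleep deprivation to memory consolidation in mice"))
```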

pdf bib
Literature-Grounded Novelty Assessment of Scientific Ideas
Simra Shahid | Marissa Radensky | Raymond Fok | Pao Siangliulue | Daniel S Weld | Tom Hope

Automated scientific idea generation systems have made remarkable progress, yet the automatic evaluation of idea novelty remains a critical and underexplored challenge. Manual evaluation of novelty through literature review is labor-intensive, prone to error due to subjectivity, and impractical at scale. To address these issues, we propose the **Idea Novelty Checker**, an LLM-based retrieval-augmented generation (RAG) framework that leverages a two-stage retrieve-then-rerank approach. The Idea Novelty Checker first collects a broad set of relevant papers using keyword and snippet-based retrieval, then refines this collection through embedding-based filtering followed by facet-based LLM re-ranking. It incorporates expert-labeled examples to guide the system in comparing papers for novelty evaluation and in generating literature-grounded reasoning. Our extensive experiments demonstrate that our novelty checker achieves approximately 13% higher agreement than existing approaches. Ablation studies further showcase the importance of the facet-based re-ranker in identifying the most relevant literature for novelty evaluation.

pdf bib
Data Gatherer: LLM-Powered Dataset Reference Extraction from Scientific Literature
Pietro Marini | Aécio Santos | Nicole Contaxis | Juliana Freire

Despite growing emphasis on data sharing and the proliferation of open datasets, researchers face significant challenges in discovering relevant datasets for reuse and systematically identifying dataset references within scientific literature. We present Data Gatherer, an automated system that leverages large language models to identify and extract dataset references from scientific publications. To evaluate our approach, we developed and curated two high-quality benchmark datasets specifically designed for dataset identification tasks. Our experimental evaluation demonstrates that Data Gatherer achieves high precision and recall in automated dataset reference extraction, reducing the time and effort required for dataset discovery while improving the systematic identification of data sources in scholarly literature.

pdf bib
Predicting The Scholarly Impact of Research Papers Using Retrieval-Augmented LLMs
Tamjid Azad | Ibrahim Al Azher | Sagnik Ray Choudhury | Hamed Alhoori

Assessing a research paper’s scholarly impact is an important phase in the scientific research process; however, metrics typically take some time after publication to accurately capture the impact. Our study examines how Large Language Models (LLMs) can predict scholarly impact accurately. We utilize Retrieval-Augmented Generation (RAG) to examine the degree to which the LLM performance improves compared to zero-shot prompting. Results show that LLama3-8b with RAG achieved the best overall performance, while Gemma-7b benefited the most from RAG, exhibiting the most significant reduction in Mean Absolute Error (MAE). Our findings suggest that retrieval-augmented LLMs offer a promising approach for early research evaluation. Our code and dataset for this project are publicly available.

pdf bib
Document Attribution: Examining Citation Relationships using Large Language Models
Vipula Rawte | Ryan A. Rossi | Franck Dernoncourt | Nedim Lipka

As Large Language Models (LLMs) are increasingly applied to document-based tasks - such as document summarization, question answering, and information extraction - where user requirements focus on retrieving information from provided documents rather than relying on the model’s parametric knowledge, ensuring the trustworthiness and interpretability of these systems has become a critical concern. A central approach to addressing this challenge is attribution, which involves tracing the generated outputs back to their source documents. However, since LLMs can produce inaccurate or imprecise responses, it is crucial to assess the reliability of these citations. To tackle this, our work proposes two techniques. (1) A zero-shot approach that frames attribution as a straightforward textual entailment task. Our method using flan-ul2 demonstrates an improvement of 0.27% and 2.4% over the best baseline on the ID and OOD sets of AttributionBench (CITATION), respectively. (2) We also explore the role of the attention mechanism in enhancing the attribution process. Using a smaller LLM, flan-t5-small, the F1 scores outperform the baseline across almost all layers, except layer 4 and layers 8 through 11.
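
A minimal sketch of technique (1), framing attribution as textual entailment with an instruction-tuned model, is given below. The smaller flan-t5-base checkpoint and the prompt wording are assumptions for illustration; the paper reports results with flan-ul2.

```python
# Hedged sketch: zero-shot attribution as textual entailment with an instruction-tuned
# model. Checkpoint and prompt wording are illustrative assumptions.
from transformers import pipeline

entailment = pipeline("text2text-generation", model="google/flan-t5-base")

def is_attributable(source_passage: str, generated_claim: str) -> bool:
    prompt = (
        f"Premise: {source_passage}\n"
        f"Hypothesis: {generated_claim}\n"
        "Does the premise entail the hypothesis? Answer yes or no."
    )
    answer = entailment(prompt, max_new_tokens=5)[0]["generated_text"].strip().lower()
    return answer.startswith("yes")

print(is_attributable(
    "The report states that revenue grew 12% in 2023.",
    "Revenue increased by 12% in 2023.",
))
```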

pdf bib
SOMD2025: A Challenging Shared Tasks for Software Related Information Extraction
Sharmila Upadhyaya | Wolfgang Otto | Frank Krüger | Stefan Dietze

The use of software in acquiring, analyzing, and interpreting research data underscores its role as an essential artifact of scientific inquiry. Understanding and tracing the provenance of software in research supports reproducible and collaborative research. In this paper, we present an overview of the second iteration of the Software Mention Detection (SOMD) shared task, part of the Scholarly Document Processing (SDP) workshop held in conjunction with ACL 2025. We aim to encourage participants to develop optimized approaches for software mention detection, as well as additional attribute and relation extraction tasks, on the provided gold-standard benchmark. Our shared task has two phases. In the first, participants implement a joint framework for NER and RE on the given dataset, while the second phase uses an out-of-distribution dataset to evaluate the generalizability of the methods proposed in Phase I. The competition (March-April 2025) attracted 18 participants and spanned two months. Four teams finished the competition and submitted full system descriptions. Participants applied various approaches, including joint and pipeline models, and explored data augmentation with LLM-generated samples. The evaluation was based on the macro-F1 score for both NER and RE, with the average reported as the SOMD-score. The winning teams achieved a SOMD-score of 0.89 in Phase I and 0.63 in Phase II, demonstrating the challenge of generalization.

pdf bib
From In-Distribution to Out-of-Distribution: Joint Loss for Improving Generalization in Software Mention and Relation Extraction
Stasa Mandic | Georg Niess | Roman Kern

Identifying software entities and their semantic relations in scientific texts is key for reproducibility and machine-readable knowledge graphs, yet models struggle with domain variability and sparse supervision. We address this by evaluating joint Named Entity Recognition (NER) and Relation Extraction (RE) models on the SOMD 2025 shared task, emphasizing generalization to out-of-domain scholarly texts. We propose a unified training objective that jointly optimizes both tasks using a shared loss function and demonstrates that joint loss formulations can improve out-of-domain robustness compared to disjoint training. Our results reveal significant performance gaps between in- and out-of-domain settings, prompting critical reflections on modeling strategies for software knowledge extraction. Notably, our approach ranked 1st in Phase 2 (out-of-distribution) and 2nd in Phase 1 (in-distribution) in the SOMD 2025 shared task, showing strong generalization and robust performance across domains.
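
A schematic sketch of the shared-loss idea (one encoder, separate NER and RE heads, a single weighted objective) is given below. The dimensions, label counts, and loss weighting are illustrative assumptions rather than the authors' configuration.

```python
# Schematic sketch of jointly optimizing NER and RE with a shared loss
# (illustrative dimensions and weighting; not the authors' implementation).
import torch
import torch.nn as nn

class JointNerRe(nn.Module):
    def __init__(self, encoder: nn.Module, hidden: int = 768,
                 num_entity_tags: int = 9, num_relations: int = 5):
        super().__init__()
        self.encoder = encoder                       # any token encoder returning (B, T, H)
        self.ner_head = nn.Linear(hidden, num_entity_tags)
        self.re_head = nn.Linear(2 * hidden, num_relations)

    def forward(self, tokens, head_positions, tail_positions):
        h = self.encoder(tokens)                     # (batch, seq_len, hidden)
        ner_logits = self.ner_head(h)                # (batch, seq_len, num_entity_tags)
        pair_repr = torch.cat([h[:, head_positions], h[:, tail_positions]], dim=-1)
        re_logits = self.re_head(pair_repr)          # (batch, num_pairs, num_relations)
        return ner_logits, re_logits

def joint_loss(ner_logits, ner_gold, re_logits, re_gold, alpha: float = 0.5):
    ce = nn.CrossEntropyLoss()
    ner_loss = ce(ner_logits.flatten(0, 1), ner_gold.flatten())
    re_loss = ce(re_logits.flatten(0, 1), re_gold.flatten())
    return alpha * ner_loss + (1 - alpha) * re_loss
```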

pdf bib
SOMD 2025: Fine-tuning ModernBERT for In- and Out-of-Distribution NER and Relation Extraction of Software Mentions in Scientific Texts
Vaghawan Ojha | Projan Shakya | Kristina Ghimire | Kashish Bataju | Ashwini Mandal | Sadikshya Gyawali | Manish Dahal | Manish Awale | Shital Adhikari | Sanjay Rijal

Software mentions are ubiquitous yet remain irregularly referenced in scientific texts. In this paper, we utilized the dataset and evaluation criteria defined by the Software Mention Detection (SOMD 2025) competition to solve the problem of Named Entity Recognition (NER) and Relation Extraction (RE) on input sentences from scientific texts. During the competition, we achieved a leading F1 SOMD score of 0.89 in Phase I by first fine-tuning ModernBERT for NER and then using the extracted entity pairs for RE. Additionally, we trained a model that jointly optimizes entity and relation losses, leading to an improvement in F1 SOMD score to 0.92. Retraining the same model on an augmented dataset, we achieved the second-best F1 SOMD score of 0.55 in Phase II. In the Open Submission phase, we experimented with adaptive fine-tuning, achieving an F1 SOMD score of 0.6, with the best macro average for NER being 0.69. Our work shows the effectiveness of fine-tuning for a niche task like software mention detection despite limited data, and the promise of adaptive fine-tuning on an out-of-distribution (OOD) dataset.
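
A minimal sketch of the first pipeline stage described above, fine-tuning an encoder for token-level software-mention tagging, is given below; the checkpoint name and label count are assumptions for illustration, and the actual training loop is left to a standard Trainer setup.

```python
# Hedged sketch of stage one: token classification with an encoder such as ModernBERT.
# Checkpoint name and number of labels are illustrative assumptions.
from transformers import AutoModelForTokenClassification, AutoTokenizer

checkpoint = "answerdotai/ModernBERT-base"  # assumed base model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=9)

# Per-token tag scores for one sentence; fine-tune with transformers.Trainer on the
# SOMD training split, then pair the predicted software entities for the RE stage.
tokens = tokenizer("We analysed the data with SPSS version 26.", return_tensors="pt")
logits = model(**tokens).logits  # shape: (1, seq_len, num_labels)
```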

pdf bib
Inductive Learning on Heterogeneous Graphs Enhanced by LLMs for Software Mention Detection
Gabriel Silva | Mário Rodriges | António Teixeira | Marlene Amorim

This paper explores the synergy between Knowledge Graphs (KGs), Graph Machine Learning (Graph ML), and Large Language Models (LLMs) for multilingual Named Entity Recognition (NER) and Relation Extraction (RE), specifically targeting software mentions within the SOMD 2025 challenge. We propose a methodology where documents are first transformed into heterogeneous KGs enriched with linguistic features (Universal Dependencies) and external knowledge (entity linking). An inductive GraphSAGE model, operating on PyTorch Geometric’s ‘HeteroData‘ structure with dynamically generated multilingual embeddings, performs node classification tasks. For NER, Graph ML identifies candidate entities and types, with an LLM (DeepSeek v3) acting as a validation layer. For RE, Graph ML predicts dependency path convergence points indicative of relations, while the LLM classifies the relation type and direction based on entity context. Our results demonstrate the potential of this hybrid approach, showing significant post-competition performance gains described in this paper (NER Phase 2 Macro F1 improved from 0.2953 to 0.4364; RE Phase 1 Macro F1 of 0.3355) and highlighting the benefits of integrating structured graph learning with LLM reasoning for information extraction.

pdf bib
Extracting Software Mentions and Relations using Transformers and LLM-Generated Synthetic Data at SOMD 2025
Pranshu Rastogi | Rajneesh Tiwari

As part of the SOMD 2025 shared task on Software Mention Detection, we addressed the problem of detecting and disambiguating software mentions in academic texts, a very important but underappreciated factor in research transparency and reproducibility. Software is an essential building block of scientific activity, but it often does not receive formal citation in scholarly literature, and many informal mentions are hard to follow and analyse. In order to enhance research accessibility and interpretability, we built a system that identifies software mentions and their properties (e.g., version numbers, URLs) as named entities and classifies relationships between them. Our dataset contained approximately 1,100 manually annotated sentences from full-text scholarly articles, representing diverse types of software such as operating systems and applications. We fine-tuned DeBERTa-based models for the Named Entity Recognition (NER) task and handled Relation Extraction (RE) as a classification problem over entity pairs. Due to the limited dataset size, we employed Large Language Models to create synthetic training data for augmentation. Our system achieved strong performance, with a 65% F1 score on NER (ranking 2nd in the test phase), a 47% F1 score on RE, and a combined macro F1 of 56%, demonstrating the effectiveness of our approach.

pdf bib
SciVQA 2025: Overview of the First Scientific Visual Question Answering Shared Task
Ekaterina Borisova | Nikolas Rauscher | Georg Rehm

This paper provides an overview of the First Scientific Visual Question Answering (SciVQA) shared task conducted as part of the Fifth Scholarly Document Processing workshop (SDP 2025). SciVQA aims to explore the capabilities of current multimodal large language models (MLLMs) in reasoning over figures from scholarly publications for question answering (QA). The main focus of the challenge is on closed-ended visual and non-visual QA pairs. We developed the novel SciVQA benchmark comprising 3,000 images of figures and a total of 21,000 QA pairs. The shared task received seven submissions, with the best performing system achieving an average F1 score of approx. 0.86 across ROUGE-1, ROUGE-L, and BertScore metrics. Participating teams explored various fine-tuning and prompting strategies, as well as augmenting the SciVQA dataset with out-of-domain data and incorporating relevant context from source publications. The findings indicate that while MLLMs demonstrate strong performance on SciVQA, they face challenges in visual reasoning and still fall behind human judgments.

pdf bib
Visual Question Answering on Scientific Charts Using Fine-Tuned Vision-Language Models
Florian Schleid | Jan Strich | Chris Biemann

Scientific charts often encapsulate the core findings of research papers, making the ability to answer questions about these charts highly valuable. This paper explores recent advancements in scientific chart visual question answering (VQA) enabled by large Vision Language Models (VLMs) and newly curated datasets. As part of the SciVQA shared task from the 5th Workshop on Scholarly Document Processing, we develop and evaluate multimodal systems capable of answering diverse question types - including multiple-choice, yes/no, unanswerable, and infinite answer set questions - based on chart images extracted from scientific literature. We investigate the effects of zero-shot and one-shot prompting, as well as supervised fine-tuning (SFT), on the performance of Qwen2.5-VL models (7B and 32B variants). We also experimented with adding training data from domain-specific datasets (SpiQA and ArXivQA). Our fine-tuned Qwen2.5-VL 32B model achieves a substantial improvement over the GPT-4o-mini baseline and reaches 4th place in the shared task, highlighting the effectiveness of domain-specific fine-tuning. We published the code for the experiments.

pdf bib
ExpertNeurons at SciVQA-2025: Retrieval Augmented VQA with Vision Language Model (RAVQA-VLM)
Nagaraj N Bhat | Joydeb Mondal | Srijon Sarkar

We introduce RAVQA-VLM, a novel Retrieval-Augmented Generation (RAG) architecture with a Vision Language Model for the SciVQA challenge, which targets closed-ended visual and non-visual questions over scientific figures drawn from ACL Anthology and arXiv papers (Borisova and Rehm, 2025). Our system first encodes each input figure and its accompanying metadata (caption, figure ID, type) into dense embeddings, then retrieves context passages from the full PDF of the source paper via a Dense Passage Retriever (Karpukhin et al., 2020). The retrieved contexts are concatenated with the question and passed to a vision-capable generative backbone (e.g., Phi-3.5, Pixtral-12B, Mixtral-24B-small, InterVL-3-14B) fine-tuned on the 15.1K SciVQA training examples (Yang et al., 2023; Pramanick et al., 2024). We jointly optimize retrieval and generation end-to-end to minimize answer loss and mitigate hallucinations (Lewis et al., 2020; Rujun Han and Castelli, 2024). On the SciVQA test set, RAVQA-VLM achieves significant improvements over parametric-only baselines, with relative gains of +5% ROUGE-1 and +5% ROUGE-L, demonstrating the efficacy of RAG for multimodal scientific QA.

pdf bib
Coling-UniA at SciVQA 2025: Few-Shot Example Retrieval and Confidence-Informed Ensembling for Multimodal Large Language Models
Christian Jaumann | Annemarie Friedrich | Rainer Lienhart

This paper describes our system for the SciVQA 2025 Shared Task on Scientific Visual Question Answering. Our system employs an ensemble of two Multimodal Large Language Models and various few-shot example retrieval strategies. The model and few-shot setting are selected based on the figure and question type. We also select answers based on the models’ confidence levels. On the blind test data, our system ranks third out of seven with an average F1 score of 85.12 across ROUGE-1, ROUGE-L, and BERTScore. Our code is publicly available.

pdf bib
Instruction-tuned QwenChart for Chart Question Answering
Viviana Ventura | Lukas Amadeus Kleybolte | Alessandra Zarcone

Charts, where information is delivered holistically by visual and textual features, represent a challenge when it comes to downstream tasks such as chart question answering, where both kinds of information contribute to the task. The standard approach is to decouple the task into two steps: first extracting information from the charts, or representing it as a table, text, or code, and then applying a second reasoning step to output the answers. Today, the advancements in visual encoding of Visual Large Language Models (VLLMs) have shown their capability to solve such complex tasks without using in-between representations of the charts or massive in-domain training. Our new instruction fine-tuned and chain-of-thought model QwenChart shows that even on a complex new benchmark such as SciVQA, general models can achieve great performance with low-cost training, matching the capabilities that LLMs have shown in unimodal downstream tasks. An out-of-domain evaluation showed satisfactory results, albeit with an expected drop in performance.

pdf bib
Enhancing Scientific Visual Question Answering through Multimodal Reasoning and Ensemble Modeling
Prahitha Movva | Naga Harshita Marupaka

Scholarly articles convey valuable information not only through unstructured text but also via (semi-)structured figures such as charts and diagrams. Automatically interpreting the semantics of knowledge encoded in these figures can be beneficial for downstream tasks such as question answering (QA). Current approaches to visual question answering often struggle with the precision required for scientific data interpretation, particularly in handling numerical values, multi-step reasoning over visual elements, and maintaining consistency between visual observation and textual reasoning. We present our approach to the SciVQA 2025 shared task, focusing on answering visual and non-visual questions grounded in scientific figures from scholarly articles. Our strongest individual model, InternVL3, achieved ROUGE-1 and ROUGE-L F1 scores of 0.740 and a BERTScore of 0.983 on the SciVQA test split. We also developed an ensemble model with multiple multimodal small language models (MSLMs). Guided by error analysis on the validation split, our ensemble approach achieves significant improvements over individual models and obtained ROUGE-1 and ROUGE-L F1 scores of 0.735 and 0.734, respectively, and a BERTScore of 0.979 on the SciVQA test split. Our findings underscore the effectiveness of prompt optimization, chain-of-thought reasoning, and ensemble modeling in improving the model’s ability in visual question answering.

pdf bib
The ClimateCheck Shared Task: Scientific Fact-Checking of Social Media Claims about Climate Change
Raia Abu Ahmad | Aida Usmanova | Georg Rehm

Misinformation in public discourse on global and significant issues like climate change is often facilitated through social media. However, current systems do not address fact-checking climate-related claims against trustworthy, evidence-based sources, such as scientific publications. We organised the ClimateCheck shared task at the 5th Scholarly Document Processing (SDP) Workshop, co-located with ACL 2025 in Vienna, Austria. The task featured two subtasks: 1. Abstracts retrieval given a claim, and 2. Claim verification based on the retrieved abstract. ClimateCheck had 27 registered users with active participation from 13 teams, ten of which submitted results for the first subtask and three for the second. The winning team achieved a Recall@10 score of 0.66 and a Binary Preference score of 0.49 for subtask I, and an F1 score of 0.73 for subtask II. Their method combined sparse retrieval using BM25, an ensemble of fine-tuned cross-encoder models using BGE-rerankers, and large language models for classification.
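
For readers unfamiliar with the retrieval metric reported for subtask I, the sketch below gives a generic Recall@k computation for a single claim; this is a generic formulation, not the organisers' evaluation script.

```python
# Generic Recall@k for one claim: the fraction of that claim's relevant abstracts
# found among the top-k retrieved ones (illustrative, not the official scorer).
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

print(recall_at_k(["d3", "d7", "d1"], {"d1", "d9"}, k=10))  # -> 0.5
```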

pdf bib
Winning ClimateCheck: A Multi-Stage System with BM25, BGE-Reranker Ensembles, and LLM-based Analysis for Scientific Abstract Retrieval
Junjun Wang | Kunlong Chen | Zhaoqun Chen | Peng He | Wenlu Zheng

The ClimateCheck shared task addresses the critical challenge of grounding social media claims about climate change in scientific literature. This paper details our winning approach. For abstract retrieval, we propose a multi-stage pipeline: (1) initial candidate generation from a corpus of ~400,000 abstracts using BM25; (2) fine-grained reranking of these candidates using an ensemble of BGE-Reranker cross-encoder models, fine-tuned with a specialized training set incorporating both random and hard negative samples; and (3) final list selection based on an RRF-ensembled score. For the verification aspect, we leverage Gemini 2.5 Pro to classify the relationship (Supports, Refutes, Not Enough Information) between claims and the retrieved abstracts, guided by carefully engineered prompts. Our system achieved first place in both subtasks, demonstrating the efficacy of combining robust sparse retrieval, powerful neural rerankers, strategic negative sampling, and LLM-based semantic analysis for connecting social media discourse to scientific evidence. Part of the example code: https://anonymous.4open.science/r/climatecheck_solution-1120
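
Step (3) of the pipeline merges the reranker outputs with reciprocal rank fusion; a compact, generic RRF sketch is shown below. The constant k=60 is the commonly used default, assumed here, and this is not the authors' released code.

```python
# Generic reciprocal rank fusion (RRF) over several ranked candidate lists.
# k=60 is the commonly used default; this is an illustrative sketch.
from collections import defaultdict

def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse two reranker outputs for one claim.
print(rrf([["a", "b", "c"], ["b", "a", "d"]]))
```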

pdf bib
Comparing LLMs and BERT-based Classifiers for Resource-Sensitive Claim Verification in Social Media
Max Upravitelev | Nicolau Duran-Silva | Christian Woerle | Giuseppe Guarino | Salar Mohtaj | Jing Yang | Veronika Solopova | Vera Schmitt

The overwhelming volume of content being published at any given moment poses a significant challenge for the design of automated fact-checking (AFC) systems on social media, requiring particular attention to efficiency. As in other fields, systems built upon LLMs have achieved good results on different AFC benchmarks. However, the application of LLMs is accompanied by high resource requirements. The energy consumption of LLMs poses a significant challenge from an ecological perspective, while remaining a bottleneck in latency-sensitive scenarios like AFC within social media. Therefore, we propose a system built upon fine-tuned smaller BERT-based models. When evaluated on the ClimateCheck dataset against decoder-only LLMs, our best fine-tuned model outperforms Phi 4 and approaches Qwen3 14B in reasoning mode, while significantly reducing runtime per claim. Our findings demonstrate that small encoder-only models fine-tuned for specific tasks can still provide a substantive alternative to large decoder-only LLMs, especially in efficiency-concerned settings.

pdf bib
AlexUNLP-FMT at ClimateCheck Shared Task: Hybrid Retrieval with Adaptive Similarity Graph-based Reranking for Climate-related Social Media Claims Fact Checking
Mahmoud Fathallah | Nagwa El-Makky | Marwan Torki

In this paper, we describe our work on the ClimateCheck shared task at the Scholarly Document Processing (SDP) workshop at ACL 2025. We focused on subtask 1: Abstracts Retrieval. The task involved retrieving relevant paper abstracts from a large corpus to verify claims made on social media about climate change. We explored various retrieval and ranking techniques, including fine-tuning transformer-based dense retrievers, sparse retrieval methods, and reranking using cross-encoder models. Our final and best-performing system uses a hybrid retrieval approach combining BM25 sparse retrieval and a fine-tuned Stella model for dense retrieval, followed by an MS MARCO-trained MiniLM cross-encoder model for reranking. We adapt an iterative graph-based reranking approach that leverages a document similarity graph built over the corpus to dynamically update the candidate pool for reranking. This system achieved a score of 0.415 on the final test set for subtask 1, securing 3rd place on the final leaderboard.
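
A compact sketch of the hybrid first-stage retrieval described above (BM25 candidates unioned with dense-retrieval candidates before cross-encoder reranking) is given below. The toy corpus, checkpoint names, and pool size are illustrative assumptions, not the submitted system.

```python
# Hedged sketch of hybrid sparse + dense candidate generation before reranking.
# Corpus, checkpoints, and pool sizes are illustrative assumptions.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

abstracts = [
    "Observed warming is driven primarily by anthropogenic greenhouse gas emissions.",
    "Coral bleaching events have increased in frequency over recent decades.",
]
bm25 = BM25Okapi([a.lower().split() for a in abstracts])
dense_model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for a fine-tuned dense retriever
corpus_emb = dense_model.encode(abstracts, convert_to_tensor=True)

def candidate_pool(claim: str, k: int = 100) -> set[int]:
    sparse_scores = bm25.get_scores(claim.lower().split())
    sparse_top = sorted(range(len(abstracts)), key=lambda i: sparse_scores[i], reverse=True)[:k]
    claim_emb = dense_model.encode(claim, convert_to_tensor=True)
    dense_top = util.semantic_search(claim_emb, corpus_emb, top_k=min(k, len(abstracts)))[0]
    return set(sparse_top) | {hit["corpus_id"] for hit in dense_top}

print(candidate_pool("humans are causing global warming"))
```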

pdf bib
ClimateCheck2025: Multi-Stage Retrieval Meets LLMs for Automated Scientfic Fact-Checking
Anna Kiepura | Jessica Lam

Misinformation on social media poses significant risks, particularly when it concerns critical scientific issues such as climate change. One promising direction for mitigation is the development of automated fact-checking systems that verify claims against authoritative scientific sources. In this work, we present our solution to the ClimateCheck2025 shared task, which involves retrieving and classifying scientific abstracts as evidence for or against given claims. Our system is built around a multi-stage hybrid retrieval pipeline that integrates lexical, sparse neural, and dense neural retrievers, followed by cross-encoder and large language model (LLM)-based reranking stages. For stance classification, we employ prompting strategies with LLMs to determine whether a retrieved abstract supports, refutes, or provides no evidence for a given claim. Our approach achieves the second-highest overall score across both subtasks of the benchmark and significantly surpasses the official baseline by 53.79% on average across Recall@2, Recall@5, Recall@10, and B-Pref. Notably, we achieve state-of-the-art performance in Recall@2. These results highlight the effectiveness of combining structured retrieval architectures with the emergent reasoning capabilities of LLMs for scientific fact verification, especially in domains where reliable human annotation is scarce and timely intervention is essential.

pdf bib
Overview of the SciHal25 Shared Task on Hallucination Detection for Scientific Content
Dan Li | Bogdan Palfi | Colin Zhang | Jaiganesh Subramanian | Adrian Raudaschl | Yoshiko Kakita | Anita De Waard | Zubair Afzal | Georgios Tsatsaronis

This paper provides an overview of the Hallucination Detection for Scientific Content (SciHal) shared task held at the 2025 ACL Scholarly Document Processing workshop. The task invites participants to detect hallucinated claims in answers to research-oriented questions generated by real-world GenAI-powered research assistants. The task is formulated as a multi-label classification problem: each instance consists of a question, an answer, an extracted claim, and supporting reference abstracts. Participants are asked to label claims under two subtasks: (1) coarse-grained detection with labels Entailment, Contradiction, or Unverifiable; and (2) fine-grained detection with a more detailed taxonomy comprising 8 types. The dataset consists of 500 research-oriented questions collected over one week from a generative assistant tool. These questions were rewritten using GPT-4o and manually reviewed to address potential privacy or commercial concerns. In total, 10,000 reference abstracts were retrieved, and 4,592 claims were extracted from the assistant’s answers. Each claim is annotated with hallucination labels. The dataset is divided into 3,592 training, 500 validation, and 500 test instances. Subtask 1 saw 88 submissions across 10 teams, while subtask 2 saw 39 submissions across 6 teams, resulting in a total of 5 published technical reports. This paper summarizes the task design, dataset, participation, and key findings.

pdf bib
Detecting Hallucinations in Scientific Claims by Combining Prompting Strategies and Internal State Classification
Yupeng Cao | Chun-Nam Yu | K.p. Subbalakshmi

Large Language Model (LLM)-based research assistant tools demonstrate impressive capabilities, yet their outputs may contain hallucinations that compromise reliability. Therefore, detecting hallucinations in automatically generated scientific content is essential. The SciHal2025: Hallucination Detection for Scientific Content challenge at ACL 2025 provides a valuable platform for advancing this goal. This paper presents our solution to the SciHal2025 challenge. Our method combines several prompting strategies with fine-tuned base LLMs. We first benchmarked multiple LLMs on the SciHal dataset. Next, we developed a detection pipeline that integrates few-shot and chain-of-thought prompting. Hidden representations extracted from the LLMs serve as features for an auxiliary classifier, further improving accuracy. Finally, we fine-tuned the selected base LLMs to enhance end-to-end performance. We present comprehensive experimental results and discuss the implications of our findings for future hallucination detection research for scientific content.
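
The hidden-state feature idea can be sketched generically as follows: pool a model's internal representations of a (claim, reference) pair and train a lightweight classifier on top. The encoder, pooling strategy, and classifier below are assumptions for illustration, not the authors' setup.

```python
# Hedged sketch: use pooled hidden states of a (claim, reference) pair as features
# for an auxiliary hallucination classifier. Encoder and classifier are assumptions.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

checkpoint = "distilbert-base-uncased"  # stand-in encoder
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
encoder = AutoModel.from_pretrained(checkpoint)

def hidden_features(claim: str, reference: str) -> list[float]:
    inputs = tokenizer(claim, reference, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = encoder(**inputs, output_hidden_states=True)
    # Mean-pool the last hidden layer into a fixed-size feature vector.
    return outputs.hidden_states[-1].mean(dim=1).squeeze(0).tolist()

# features = [hidden_features(c, r) for c, r in train_pairs]
# clf = LogisticRegression(max_iter=1000).fit(features, train_labels)
```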

pdf bib
A.M.P at SciHal2025: Automated Hallucination Detection in Scientific Content via LLMs and Prompt Engineering
Le Nguyen Anh Khoa | Thìn Đặng Văn

This paper presents our system developed for SciHal2025: Hallucination Detection for Scientific Content. The primary goal of this task is to detect hallucinated claims based on the corresponding reference. Our methodology leverages strategic prompt engineering to enhance LLMs’ ability to accurately distinguish between factual assertions and hallucinations in scientific contexts. Moreover, we discovered that aggregating the fine-grained classification results from the more complex subtask (subtask 2) into the simplified label set required for the simpler subtask (subtask 1) significantly improved performance compared to direct classification for subtask 1. This work contributes to the development of more reliable AI-powered research tools by providing a systematic framework for hallucination detection in scientific content.
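
The aggregation trick described above amounts to a many-to-one label mapping from the fine-grained subtask 2 taxonomy onto the coarse subtask 1 labels; a toy sketch follows, with illustrative (not official) fine-grained category names.

```python
# Toy sketch of aggregating fine-grained (subtask 2) predictions into the coarse
# subtask 1 label set. The fine-grained names below are illustrative assumptions.
FINE_TO_COARSE = {
    "claim_supported_by_reference": "Entailment",
    "claim_contradicted_by_reference": "Contradiction",
    "claim_not_addressed_by_reference": "Unverifiable",
    # ... remaining fine-grained categories map analogously
}

def coarse_label(fine_label: str) -> str:
    return FINE_TO_COARSE.get(fine_label, "Unverifiable")

print(coarse_label("claim_contradicted_by_reference"))  # -> Contradiction
```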

pdf bib
SciBERT Meets Contrastive Learning: A Solution for Scientific Hallucination Detection
Crivoi Carla | Ana Sabina Uban

As AI systems become more involved in scientific research, there is growing concern about the accuracy of their outputs. Tools powered by large language models can generate summaries and answers that appear well-formed, but sometimes include claims that are not actually supported by the cited references. In this paper, we focus on identifying these hallucinated claims. We propose a system built on SciBERT and contrastive learning to detect whether a scientific claim can be inferred from the referenced content. Our method was evaluated in the SciHal 2025 shared task, which includes both coarse and fine-grained hallucination labels. The results show that our model performs well on supported and clearly unsupported claims, but struggles with ambiguous or low-resource categories. These findings highlight both the promise and the limitations of current models in improving the trustworthiness of AI-generated scientific content.

pdf bib
Natural Language Inference Fine-tuning for Scientific Hallucination Detection
Tim Schopf | Juraj Vladika | Michael Färber | Florian Matthes

Modern generative Large Language Models (LLMs) are capable of generating text that sounds coherent and convincing, but they are also prone to producing hallucinations: facts that contradict world knowledge. Even in the case of Retrieval-Augmented Generation (RAG) systems, where relevant context is first retrieved and passed in the input, the generated facts can contradict, or not be verifiable from, the provided references. This has motivated SciHal 2025, a shared task that focuses on the detection of hallucinations in scientific content. The two subtasks focused on: (1) predicting whether a claim from a generated LLM answer is entailed, contradicted, or unverifiable by the used references; and (2) predicting a fine-grained category of erroneous claims. Our best performing approach used an ensemble of fine-tuned encoder-only ModernBERT and DeBERTa-v3 models for classification. Out of nine competing teams, our approach achieved first place in subtask 1 and second place in subtask 2.

pdf bib
From RAG to Reality: Coarse-Grained Hallucination Detection via NLI Fine-Tuning
Daria Galimzianova | Aleksandr Boriskin | Grigory Arshinov

We present our submission to SciHal Subtask 1: coarse-grained hallucination detection for scientific question answering. We frame hallucination detection as an NLI-style three-way classification (entailment, contradiction, unverifiable) and show that simple fine-tuning of NLI-adapted encoder models on task data outperforms more elaborate feature-based pipelines and large language model prompting. In particular, DeBERTa-V3-large, a model pretrained on five diverse NLI corpora, achieves the highest weighted F1 on the public leaderboard. We additionally explore a pipeline combining joint claim–reference embeddings and NLI softmax probabilities fed into a classifier, but find its performance consistently below direct encoder fine-tuning. Our findings demonstrate that, for reference-grounded hallucination detection, targeted encoder fine-tuning remains the most accurate and efficient approach.
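
The NLI-style formulation can be sketched with an off-the-shelf NLI-adapted cross-encoder, reading the neutral class as "unverifiable". The checkpoint choice and label handling below are assumptions for illustration, not the submitted system.

```python
# Hedged sketch of the NLI-style three-way setup: classify a (reference, claim) pair
# as entailment / contradiction / neutral, with neutral read as "unverifiable".
# Checkpoint choice is an illustrative assumption.
from transformers import pipeline

nli = pipeline(
    "text-classification",
    model="MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli",
)

LABEL_MAP = {"entailment": "entailment", "contradiction": "contradiction", "neutral": "unverifiable"}

def classify(reference: str, claim: str) -> str:
    result = nli({"text": reference, "text_pair": claim})
    prediction = result[0] if isinstance(result, list) else result
    return LABEL_MAP[prediction["label"].lower()]

print(classify(
    "Global mean surface temperature has risen by about 1.1 °C since pre-industrial times.",
    "The planet has not warmed at all over the last century.",
))
```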