Andrew Bell


2025

Sentence-level claim detection is a critical first step in the fact-checking process. While Large Language Models (LLMs) seem well-suited for claim detection, their computational cost poses challenges for real-world deployment. This paper investigates the effectiveness of both small and large pretrained language models for the task of claim detection. We conduct a comprehensive empirical evaluation using BERT, ModernBERT, RoBERTa, Llama, and ChatGPT-based models. Our results reveal that smaller models, when finetuned appropriately, can achieve competitive performance with significantly lower computational overhead on in-domain tasks. Notably, we also find that BERT-based models transfer poorly to out-of-domain sentence-level claim detection tasks. We discuss the implications of these findings for practitioners and highlight directions for future research.
Large Language Models (LLMs) have been shown to be susceptible to jailbreak attacks, that is, adversarial attacks used to elicit high-risk behavior from a model, which highlights the critical need to safeguard widely-deployed models. Safeguarding approaches, which include fine-tuning models or having LLMs “self-reflect,” may lengthen the inference time of a model, incur a computational penalty, reduce the semantic fluency of an output, and restrict “normal” model behavior. Importantly, these Safety-Performance Trade-offs (SPTs) remain an understudied area. In this work, we make three contributions: (1) We introduce SAFENUDGE, a novel safeguard that combines Controlled Text Generation and “nudging.” SAFENUDGE triggers during text generation while a jailbreak attack is being executed, and can reduce successful jailbreak attempts by between 28.1% and 37.3% by guiding the LLM towards a safe response. It adds minimal latency to inference and has a negligible impact on the semantic fluency of outputs. (2) SAFENUDGE supports tunable SPTs, meaning practitioners can set their own tolerance for the trade-off between safety and restrictions on normal model behavior. (3) We release the source code for SAFENUDGE at https://github.com/joaopfonseca/SafeNudge. It is open source and compatible with the HuggingFace transformers library.