Rom Himelstein


2025

Leveraging NTPs for Efficient Hallucination Detection in VLMs
Ofir Azachi | Kfir Eliyahu | Eyal El Ani | Rom Himelstein | Roi Reichart | Yuval Pinter | Nitay Calderon
Proceedings of the 1st Workshop on Confabulation, Hallucinations and Overgeneration in Multilingual and Practical Settings (CHOMPS 2025)

Hallucinations of vision-language models (VLMs), which are misalignments between visual content and generated text, undermine the reliability of VLMs. One common approach for detecting them employs the same VLM, or a different one, to assess generated outputs. This process is computationally intensive and increases model latency. In this paper, we explore an efficient on-the-fly method for hallucination detection by training traditional ML models over signals based on the VLM’s next-token probabilities (NTPs). NTPs provide a direct quantification of model uncertainty. We hypothesize that high uncertainty (i.e., a low NTP value) is strongly associated with hallucinations. To test this, we introduce a dataset of 1,400 human-annotated statements derived from VLM-generated content, each labeled as hallucinated or not, and use it to evaluate our lightweight NTP-based method. Our results demonstrate that NTP-based features are valuable predictors of hallucinations, enabling fast and simple ML models to achieve performance comparable to that of strong VLMs. Furthermore, augmenting these NTPs with linguistic NTPs, computed by feeding only the generated text back into the VLM, enhances hallucination detection performance. Finally, integrating hallucination prediction scores from VLMs into the NTP-based models leads to better performance than using either VLMs or NTPs alone. We hope this study paves the way for simple, lightweight solutions that enhance the reliability of VLMs. All data is publicly available at https://huggingface.co/datasets/wrom/Language-Vision-Hallucinations.
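
A minimal sketch of the general recipe the abstract describes: summarize a statement's per-token NTPs into a small feature vector and fit a traditional ML classifier on human-labeled statements. The specific features and toy data below are illustrative assumptions, not the paper's exact pipeline:

```python
# Sketch: NTP-based hallucination detection with a lightweight classifier.
# Assumes we already have, for each generated statement, the probability
# the VLM assigned to each of its tokens (one NTP per token).
import numpy as np
from sklearn.linear_model import LogisticRegression

def ntp_features(token_probs):
    """Aggregate a statement's per-token NTPs into a fixed-size vector.

    Low NTP values signal high model uncertainty, which the paper
    hypothesizes is strongly associated with hallucinations.
    """
    logp = np.log(np.clip(token_probs, 1e-12, 1.0))
    return np.array([
        logp.mean(),       # average confidence
        logp.min(),        # most uncertain token
        logp.std(),        # confidence variability
        float(len(logp)),  # statement length
    ])

# Hypothetical data: one NTP sequence per statement, with a 0/1
# hallucination label from human annotation.
statements = [np.array([0.9, 0.8, 0.95]), np.array([0.4, 0.2, 0.6])]
labels = [0, 1]

X = np.stack([ntp_features(p) for p in statements])
clf = LogisticRegression().fit(X, labels)
print(clf.predict_proba(X)[:, 1])  # hallucination scores
```

In this setup, detection costs only the forward pass the VLM already performs during generation plus a cheap classifier call, which is what makes the approach on-the-fly rather than requiring a second VLM pass.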

Jailbreak Attack Initializations as Extractors of Compliance Directions
Amit LeVi | Rom Himelstein | Yaniv Nemcovsky | Avi Mendelson | Chaim Baskin
Findings of the Association for Computational Linguistics: EMNLP 2025

Safety-aligned LLMs respond to prompts with either compliance or refusal, each corresponding to distinct directions in the model’s activation space. Recent studies have shown that initializing attacks via self-transfer from other prompts significantly enhances their performance. However, the underlying mechanisms of these initializations remain unclear, and attacks utilize arbitrary or hand-picked initializations. This work shows that each gradient-based jailbreak attack, together with the initializations derived from it, gradually converges to a single compliance direction that suppresses refusal, thereby enabling an efficient transition from refusal to compliance. Based on this insight, we propose CRI, an initialization framework that aims to project unseen prompts further along compliance directions. We demonstrate our approach on multiple attacks, models, and datasets, achieving an increased attack success rate (ASR) and reduced computational overhead, highlighting the fragility of safety-aligned LLMs.
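
A minimal sketch of the underlying idea: estimate a compliance direction in activation space from prompts the model complied with versus refused, then rank candidate attack initializations by how far they project along it. The difference-of-means estimator and the toy activations below are assumptions for illustration, not CRI's actual implementation:

```python
# Sketch: extracting a "compliance direction" and scoring initializations.
# Assumes we already have hidden-state activations at some layer for
# prompts the model complied with and prompts it refused.
import numpy as np

rng = np.random.default_rng(0)
d = 16  # hidden size (toy value)

# Hypothetical activations: one row per prompt.
comply_acts = rng.normal(0.5, 1.0, size=(32, d))   # model complied
refuse_acts = rng.normal(-0.5, 1.0, size=(32, d))  # model refused

# Difference-of-means direction from refusal toward compliance, normalized.
direction = comply_acts.mean(axis=0) - refuse_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

# Candidate initializations (e.g., suffixes reused via self-transfer),
# represented here by the activations they induce on an unseen prompt.
candidates = rng.normal(0.0, 1.0, size=(5, d))
scores = candidates @ direction  # projection onto the compliance direction

best = int(np.argmax(scores))
print(f"best initialization: {best}, projection score: {scores[best]:.3f}")
```

Picking the initialization that already lies furthest along the compliance direction shortens the optimization path from refusal to compliance, which is the intuition behind the reported reduction in computational overhead.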