Fazl Barez
2026
Make Mechanistic Interpretability Auditable: A Call to Develop Guidelines via Continuous Collaborative Reviewing
Michael Lan | Narmeen Fatimah Oozeer | Chaithanya Bandi | Philip Quirke | Austin Meek | Fazl Barez | Amir Abdullah
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
While mechanistic interpretability (MI) has produced important insights into neural network internals, the field has yet to establish a standardized system to audit experiments. As such, many of its findings remain underutilized in safety-critical applications such as medical AI and autonomous systems, as stakeholders cannot certify their validity. Recent work demonstrates this concretely: two papers found conflicting conclusions for the same behavior, and a third study revealed that both were partially correct but incomparable due to methodological inconsistencies. Without standardized auditing, such ambiguities hinder adoption in high-stakes contexts requiring strong correctness guarantees. We call for the MI community to work towards developing a novel reviewing system that complements peer review via: (1) continuous reviewing supported by a Collaborative Reviewing Platform where meta-science results and discussions (such as critiques, negative results, post-hoc extensions, reproductions, replications, and partial results) that fit outside of papers are organized and discussed, allowing for comments and revisions to be made at any time; (2) generalizing good practices found on this platform into expert-verified guidelines and protocols to improve auditing efficiency; and (3) source-based auditing systems that track the arguments on which claims depend. This position paper encourages constructive debate over the necessity, design and implementation of such a framework, providing early concrete examples to help catalyze these dialogues. Overall, we propose that auditing MI itself is essential for its application in AI safety, industry, and governance.
2025
Same Question, Different Words: A Latent Adversarial Framework for Prompt Robustness
Tingchen Fu | Fazl Barez
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Insensitivity to semantically-preserving variations of prompts (paraphrases) is crucial for reliable behavior and real-world deployment of large language models. However, language models exhibit significant performance degradation with semantically equivalent but differently phrased prompts, and existing solutions either depend on trial-and-error prompt engineering or require computationally expensive inference-time algorithms. In this study, built on the key insight that worst-case prompts exhibit a drift in embedding space, we present Latent Adversarial Paraphrasing (LAP), a dual-loop adversarial framework that iteratively optimizes a trainable perturbation as a “latent continuous paraphrase” and language model performance on these perturbations. Extensive experiments demonstrate the effectiveness of LAP across multiple backbones on the RobustAlpaca benchmark, with a 0.5%-4% absolute improvement in worst-case win-rate.
Trust Me, I’m Wrong: LLMs Hallucinate with Certainty Despite Knowing the Answer
Adi Simhi | Itay Itzhak | Fazl Barez | Gabriel Stanovsky | Yonatan Belinkov
Findings of the Association for Computational Linguistics: EMNLP 2025
Prior work on large language model (LLM) hallucinations has associated them with model uncertainty or inaccurate knowledge. In this work, we define and investigate a distinct type of hallucination, where a model can consistently answer a question correctly, but a seemingly trivial perturbation, which can happen in real-world settings, causes it to produce a hallucinated response with high certainty. This phenomenon, which we dub CHOKE (Certain Hallucinations Overriding Known Evidence), is particularly concerning in high-stakes domains such as medicine or law, where model certainty is often used as a proxy for reliability. We show that CHOKE examples are consistent across prompts, occur in different models and datasets, and are fundamentally distinct from other hallucinations. This difference leads existing mitigation methods to perform worse on CHOKE examples than on general hallucinations. Finally, we introduce a probing-based mitigation that outperforms existing methods on CHOKE hallucinations. These findings reveal an overlooked aspect of hallucinations, emphasizing the need to understand their origins and improve mitigation strategies to enhance LLM safety.
Precise In-Parameter Concept Erasure in Large Language Models
Yoav Gur-Arieh | Clara Haya Suslik | Yihuai Hong | Fazl Barez | Mor Geva
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs) often acquire knowledge during pretraining that is undesirable in downstream deployments, e.g., sensitive information or copyrighted content. Existing approaches for removing such knowledge rely on fine-tuning, training low-rank adapters or fact-level editing, but these are either too coarse, too shallow, or ineffective. In this work, we propose PISCES, a novel framework for precisely erasing entire concepts from model parameters by directly editing directions that encode them in parameter space. PISCES uses a disentangler model to decompose MLP vectors into interpretable features, identifies those associated with a target concept using automated interpretability techniques, and removes them from model parameters. Experiments on Gemma 2 and Llama 3.1 over various concepts show that PISCES achieves modest gains in efficacy over leading erasure methods, reducing accuracy on the target concept to as low as 7.7%, while dramatically improving erasure specificity (by up to 31%) and robustness (by up to 41%). Overall, these results demonstrate that feature-based in-parameter editing enables a more precise and reliable approach for removing conceptual knowledge in language models.
Beyond Linear Steering: Unified Multi-Attribute Control for Language Models
Narmeen Fatimah Oozeer | Luke Marks | Fazl Barez | Amir Abdullah
Findings of the Association for Computational Linguistics: EMNLP 2025
Controlling multiple behavioral attributes in large language models (LLMs) at inference time is a challenging problem due to interference between attributes and the limitations of linear steering methods, which assume additive behavior in activation space and require per-attribute tuning. We introduce K-Steering, a unified and flexible approach that trains a single non-linear multi-label classifier on hidden activations and computes intervention directions via gradients at inference time. This avoids linearity assumptions, removes the need for storing and tuning separate attribute vectors, and allows dynamic composition of behaviors without retraining. To evaluate our method, we propose two new benchmarks, TONEBANK and DEBATEMIX, targeting compositional behavioral control. Empirical results across 3 model families, validated by both activation-based classifiers and LLM-based judges, demonstrate that K-Steering outperforms strong baselines in accurately steering multiple behaviors.
2024
Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions
Clement Neo | Shay B. Cohen | Fazl Barez
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Understanding the inner workings of large language models (LLMs) is crucial for advancing their theoretical foundations and real-world applications. While the attention mechanism and multi-layer perceptrons (MLPs) have been studied independently, their interactions remain largely unexplored. This study investigates how attention heads and next-token neurons interact in LLMs to predict new words. We propose a methodology to identify next-token neurons, find prompts that highly activate them, and determine the upstream attention heads responsible. We then generate and evaluate explanations for the activity of these attention heads in an automated manner. Our findings reveal that some attention heads recognize specific contexts relevant to predicting a token and activate a downstream token-predicting neuron accordingly. This mechanism provides a deeper understanding of how attention heads work with MLP neurons to perform next-token prediction. Our approach offers a foundation for further research into the intricate workings of LLMs and their impact on text generation and understanding.
Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models
Michael Lan | Philip Torr | Fazl Barez
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
While transformer models exhibit strong capabilities on linguistic tasks, their complex architectures make them difficult to interpret. Recent work has aimed to reverse engineer transformer models into human-readable representations called circuits that implement algorithmic functions. We extend this research by analyzing and comparing circuits for similar sequence continuation tasks, which include increasing sequences of Arabic numerals, number words, and months. By applying circuit interpretability analysis, we identify a key sub-circuit in both GPT-2 Small and Llama-2-7B responsible for detecting sequence members and for predicting the next member in a sequence. Our analysis reveals that semantically related sequences rely on shared circuit subgraphs with analogous roles. Additionally, we show that this sub-circuit has effects on various math-related prompts, such as on intervaled circuits, Spanish number word and months continuation, and natural language word problems. Overall, documenting shared computational structures enables better model behavior predictions, identification of errors, and safer editing procedures. This mechanistic understanding of transformers is a critical step towards building more robust, aligned, and interpretable language models.
Large Language Models Relearn Removed Concepts
Michelle Lo | Fazl Barez | Shay B. Cohen
Findings of the Association for Computational Linguistics: ACL 2024
Advances in model editing through neuron pruning hold promise for removing undesirable concepts from large language models. However, it remains unclear whether models have the capacity to reacquire pruned concepts after editing. To investigate this, we evaluate concept relearning in models by tracking concept saliency and similarity in pruned neurons during retraining for named entity recognition tasks. Our findings reveal that models can quickly regain performance post-pruning by relocating advanced concepts to earlier layers and reallocating pruned concepts to primed neurons with similar semantics. This suggests that models exhibit polysemantic capacities and can blend old and new concepts in individual neurons. While neuron pruning provides interpretability into model concepts, our results highlight the challenges of permanent concept removal for improved model safety. Monitoring concept reemergence and developing techniques to mitigate relearning of unsafe concepts will be important directions for more robust model editing. Overall, our work strongly demonstrates the resilience and fluidity of concept representations in LLMs post concept removal.
Proceedings of the First edition of the Workshop on the Scaling Behavior of Large Language Models (SCALE-LLM 2024)
Antonio Valerio Miceli-Barone | Fazl Barez | Shay B. Cohen | Elena Voita | Ulrich Germann | Michal Lukasik
Proceedings of the First edition of the Workshop on the Scaling Behavior of Large Language Models (SCALE-LLM 2024)
2023
Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark
Jason Hoelscher-Obermaier | Julia Persson | Esben Kran | Ioannis Konstas | Fazl Barez
Findings of the Association for Computational Linguistics: ACL 2023
Recent model editing techniques promise to mitigate the problem of memorizing false or outdated associations during LLM training. However, we show that these techniques can introduce large unwanted side effects which are not detected by existing specificity benchmarks. We extend the existing CounterFact benchmark to include a dynamic component and dub our benchmark CounterFact+. Additionally, we extend the metrics used for measuring specificity by a principled KL divergence-based metric. We use this improved benchmark to evaluate recent model editing techniques and find that they suffer from low specificity. Our findings highlight the need for improved specificity benchmarks that identify and prevent unwanted side effects.
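As a rough illustration of what a KL divergence-based specificity check could look like, the sketch below compares a model's next-token distribution before and after an edit. It is a minimal sketch, not the paper's metric: the function name `kl_divergence`, the direction KL(pre || post), and the toy three-token distributions are all assumptions made for illustration.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Return KL(p || q) in nats for two discrete distributions.

    p and q are hypothetical next-token distributions over the same
    vocabulary (same length, same token order).
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same vocabulary")
    # Terms with p_i == 0 contribute nothing; q_i is floored at eps
    # to avoid division by zero.
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Toy distributions (assumed values) on a prompt unrelated to the edit.
pre_edit = [0.7, 0.2, 0.1]
post_edit_unchanged = [0.7, 0.2, 0.1]
post_edit_shifted = [0.1, 0.2, 0.7]

unchanged_score = kl_divergence(pre_edit, post_edit_unchanged)
shifted_score = kl_divergence(pre_edit, post_edit_shifted)
```

Under this reading, a score near zero on unrelated prompts suggests the edit left them alone, while a larger score flags a possible side effect.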
The Larger they are, the Harder they Fail: Language Models do not Recognize Identifier Swaps in Python
Antonio Valerio Miceli Barone | Fazl Barez | Shay B. Cohen | Ioannis Konstas
Findings of the Association for Computational Linguistics: ACL 2023
Large Language Models (LLMs) have successfully been applied to code generation tasks, raising the question of how well these models understand programming. Typical programming languages have invariances and equivariances in their semantics that human programmers intuitively understand and exploit, such as the (near) invariance to the renaming of identifiers. We show that LLMs not only fail to properly generate correct Python code when default function names are swapped, but some of them even become more confident in their incorrect predictions as the model size increases, an instance of the recently discovered phenomenon of Inverse Scaling, which runs contrary to the commonly observed trend of increasing prediction quality with increasing model size. Our findings indicate that, despite their astonishing typical-case performance, LLMs still lack a deep, abstract understanding of the content they manipulate, making them unsuitable for tasks that statistically deviate from their training data, and that mere scaling is not enough to achieve such capability.
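To make the identifier-swap setting concrete, here is a small sketch of the kind of code involved. It is an illustrative assumption, not an example taken from the paper: the function names and the choice of swapping `len` with `sum` are hypothetical. The names are rebound through the `builtins` module because writing `len, sum = sum, len` inside a function would make both names local and raise `UnboundLocalError`.

```python
import builtins


def length_under_swap(items):
    # Inside this function, `len` is bound to the builtin sum
    # and `sum` is bound to the builtin len.
    len, sum = builtins.sum, builtins.len
    # Correct under the swap: `sum` now returns the number of items.
    return sum(items)


def length_ignoring_swap(items):
    # Same swap as above.
    len, sum = builtins.sum, builtins.len
    # Looks natural but is wrong under the swap: this adds the items up.
    return len(items)
```

A reader who tracks the rebinding sees that only the first function returns the length; the second is the continuation that pattern-matches the usual meaning of `len`.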
Co-authors
- Shay B. Cohen 4
- Amir Abdullah 2
- Ioannis Konstas 2
- Michael Lan 2
- Antonio Valerio Miceli-Barone 2
- Narmeen Fatimah Oozeer 2
- Chaithanya Bandi 1
- Yonatan Belinkov 1
- Tingchen Fu 1
- Ulrich Germann 1
- Mor Geva 1
- Yoav Gur-Arieh 1
- Jason Hoelscher-Obermaier 1
- Yihuai Hong 1
- Itay Itzhak 1
- Esben Kran 1
- Michelle Lo 1
- Michal Lukasik 1
- Luke Marks 1
- Austin Meek 1
- Clement Neo 1
- Julia Persson 1
- Philip Quirke 1
- Adi Simhi 1
- Gabriel Stanovsky 1
- Clara Haya Suslik 1
- Philip Torr 1
- Elena Voita 1