Maria Maistro


2025

Normalized AOPC: Fixing Misleading Faithfulness Metrics for Feature Attribution Explainability
Joakim Edin | Andreas Geert Motzfeldt | Casper L. Christensen | Tuukka Ruotsalo | Lars Maaløe | Maria Maistro
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Deep neural network predictions are notoriously difficult to interpret. Feature attribution methods aim to explain these predictions by identifying the contribution of each input feature. Faithfulness, often evaluated using the area over the perturbation curve (AOPC), reflects feature attributions’ accuracy in describing the internal mechanisms of deep neural networks. However, many studies rely on AOPC to compare faithfulness across different models, which we show can lead to false conclusions about models’ faithfulness. Specifically, we find that AOPC is sensitive to variations in the model, resulting in unreliable cross-model comparisons. Moreover, AOPC scores are difficult to interpret in isolation without knowing the model-specific lower and upper limits. To address these issues, we propose a normalization approach, Normalized AOPC (NAOPC), enabling consistent cross-model evaluations and more meaningful interpretation of individual scores. Our experiments demonstrate that this normalization can radically change AOPC results, questioning the conclusions of earlier studies and offering a more robust framework for assessing feature attribution faithfulness. Our code is available at https://github.com/JoakimEdin/naopc.
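As a concrete reference for the metric being normalized, the following is a minimal sketch of AOPC and the proposed min-max normalization, assuming a generic predicted-probability function and perturbation routine. The function names and the way the model-specific bounds (aopc_min, aopc_max) are obtained are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
import numpy as np

def aopc(model_prob, x, ranking, perturb, K):
    """Area Over the Perturbation Curve: average drop in the predicted
    probability as the top-k ranked features are removed, for k = 0..K."""
    base = model_prob(x)
    drops = []
    for k in range(K + 1):
        x_k = perturb(x, ranking[:k])  # mask the k highest-attributed features
        drops.append(base - model_prob(x_k))
    return float(np.mean(drops))

def naopc(aopc_score, aopc_min, aopc_max):
    """Min-max normalize an AOPC score into [0, 1] using model-specific
    lower and upper limits, enabling cross-model comparison."""
    return (aopc_score - aopc_min) / (aopc_max - aopc_min)
```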

A Reality Check on Context Utilisation for Retrieval-Augmented Generation
Lovisa Hagström | Sara Vera Marjanovic | Haeun Yu | Arnav Arora | Christina Lioma | Maria Maistro | Pepa Atanasova | Isabelle Augenstein
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Retrieval-augmented generation (RAG) helps address the limitations of the parametric knowledge embedded within a language model (LM). In real-world settings, retrieved information can vary in complexity, yet most investigations of LM context utilisation have been limited to synthetic text. We introduce DRUID (Dataset of Retrieved Unreliable, Insufficient and Difficult-to-understand contexts) with real-world queries and contexts manually annotated for stance. The dataset is based on the prototypical task of automated claim verification, for which automated retrieval of real-world evidence is crucial. We compare DRUID to synthetic datasets (CounterFact, ConflictQA) and find that artificial datasets often fail to represent the complexity and diversity of realistically retrieved contexts. We show that synthetic datasets exaggerate context characteristics that are rare in real retrieved data, which leads to inflated context utilisation results, as measured by our novel ACU score. Moreover, while previous work has mainly focused on singleton context characteristics to explain context utilisation, correlations between singleton context properties and ACU on DRUID are surprisingly small compared to other properties related to context source. Overall, our work underscores the need for real-world-aligned context utilisation studies to represent and improve performance in real-world RAG settings.
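The abstract does not define the ACU score, so the sketch below is only a hypothetical illustration of the kind of behaviour a context-utilisation score can quantify: how often supplying the retrieved context moves the model's answer to agree with the context's annotated stance. Nothing here should be read as the paper's actual formulation.

```python
def context_utilisation(answers_without, answers_with, context_stances):
    """Hypothetical context-utilisation score (NOT the paper's ACU):
    among examples where the context could change the answer, the
    fraction where the answer with context agrees with the context's
    annotated stance."""
    moved, total = 0, 0
    for a0, a1, stance in zip(answers_without, answers_with, context_stances):
        if a0 != stance:  # the context disagrees with the no-context answer
            total += 1
            moved += int(a1 == stance)
    return moved / total if total else 0.0
```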

As easy as PIE: understanding when pruning causes language models to disagree
Pietro Tropeano | Maria Maistro | Tuukka Ruotsalo | Christina Lioma
Findings of the Association for Computational Linguistics: NAACL 2025

Language Model (LM) pruning compresses a model by removing weights, nodes, or other parts of its architecture. Typically, pruning focuses on the resulting efficiency gains at the cost of effectiveness. However, when looking at how individual data points are affected by pruning, it turns out that a particular subset of data points always bears most of the brunt (in terms of reduced accuracy), but this effect goes unnoticed when reporting the mean accuracy over all data points. These data points are called PIEs and have been studied in image processing, but not in NLP. In a study of various NLP datasets, pruning methods, and levels of compression, we find that PIEs impact inference quality considerably, regardless of class frequency, and that BERT is more prone to this than BiLSTM. We also find that PIEs contain a high proportion of the data points that have the largest influence on how well the model generalises to unseen data. This means that when pruning, with seemingly moderate loss of accuracy across all data points, we in fact severely hurt the data points that matter most. We trace what makes PIEs both hard to predict and impactful to their overall longer and more semantically complex text. These findings are novel and contribute to understanding how LMs are affected by pruning. The code is available at: https://github.com/pietrotrope/AsEasyAsPIE
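For intuition, here is a minimal sketch following the image-processing definition of PIEs (Pruning-Identified Exemplars): examples whose modal prediction differs between a population of unpruned models and a population of pruned models. This is a simplified illustration under that assumption, not the paper's exact pipeline.

```python
import numpy as np

def find_pies(preds_dense, preds_pruned):
    """Identify PIEs: indices of examples whose modal predicted label
    differs between unpruned and pruned model populations.
    preds_* are integer label arrays of shape (n_models, n_examples)."""
    def modal(preds):
        # most frequent predicted label per example across model runs
        return np.array([np.bincount(col).argmax() for col in preds.T])
    dense = modal(np.asarray(preds_dense))
    pruned = modal(np.asarray(preds_pruned))
    return np.where(dense != pruned)[0]
```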

2024

Bridging Cultures in the Kitchen: A Framework and Benchmark for Cross-Cultural Recipe Retrieval
Tianyi Hu | Maria Maistro | Daniel Hershcovich
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

The cross-cultural adaptation of recipes is an important application of identifying and bridging cultural differences in language. The challenge lies in retaining the essence of the original recipe while also aligning with the writing and dietary habits of the target culture. Information Retrieval (IR) offers a way to address this challenge, since it retrieves results from the culinary practices of the target culture while maintaining relevance to the original recipe. We introduce the novel task of cross-cultural recipe retrieval and present a unique Chinese-English cross-cultural recipe retrieval benchmark. Our benchmark is manually annotated with limited resources, using various retrieval models to generate a pool of candidate results for manual annotation. The dataset provides retrieval samples that are culturally adapted but textually diverse, presenting greater challenges. We propose CARROT, a plug-and-play culture-aware recipe information retrieval framework that incorporates culture-aware query rewriting and re-ranking methods, and evaluate it both on our benchmark and against intuitive human judgments. The results show that our framework significantly enhances the preservation of the original recipe and its cultural appropriateness for the target culture. We believe these insights will significantly contribute to future research on cultural adaptation.
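To make the "plug-and-play" structure concrete, here is a hypothetical sketch of a rewrite-retrieve-rerank pipeline of the kind the abstract describes. All function arguments (rewrite, retriever, rerank) are illustrative placeholders, not CARROT's actual API.

```python
def culture_aware_retrieve(query, rewrite, retriever, rerank, top_k=10):
    """Hypothetical culture-aware retrieval pipeline: rewrite the
    source-culture query, retrieve candidates from the target-culture
    corpus, then re-rank with a culture-aware scorer."""
    rewritten = rewrite(query)                 # e.g. adapt ingredient and dish terms
    candidates = retriever(rewritten, k=top_k * 5)  # over-retrieve a candidate pool
    return rerank(query, candidates)[:top_k]   # score against the original recipe
```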

An Unsupervised Approach to Achieve Supervised-Level Explainability in Healthcare Records
Joakim Edin | Maria Maistro | Lars Maaløe | Lasse Borgholt | Jakob Drachmann Havtorn | Tuukka Ruotsalo
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Electronic healthcare records are vital for patient safety as they document conditions, plans, and procedures in both free text and medical codes. Language models have significantly enhanced the processing of such records, streamlining workflows and reducing manual data entry, thereby saving healthcare providers significant resources. However, the black-box nature of these models often leaves healthcare professionals hesitant to trust them. State-of-the-art explainability methods increase model transparency but rely on human-annotated evidence spans, which are costly. In this study, we propose an approach to produce plausible and faithful explanations without needing such annotations. We demonstrate on the automated medical coding task that adversarial robustness training improves explanation plausibility and introduce AttInGrad, a new explanation method superior to previous ones. By combining both contributions in a fully unsupervised setup, we produce explanations of comparable quality, or better, to that of a supervised approach. We release our code and model weights.
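The abstract names AttInGrad but does not spell out the combination rule. The sketch below assumes, based on the name, an elementwise combination of attention and input-x-gradient token attributions; the exact formulation is in the paper, and this is only an illustrative guess.

```python
import torch

def attingrad(attention, input_x_grad):
    """Assumed sketch of AttInGrad (not necessarily the paper's rule):
    combine normalized attention and input-x-gradient per-token scores
    by elementwise product. Both inputs have shape (seq_len,)."""
    a = attention / attention.sum()
    g = input_x_grad.abs() / input_x_grad.abs().sum()
    scores = a * g
    return scores / scores.sum()  # renormalize to a distribution over tokens
```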

DYNAMICQA: Tracing Internal Knowledge Conflicts in Language Models
Sara Vera Marjanovic | Haeun Yu | Pepa Atanasova | Maria Maistro | Christina Lioma | Isabelle Augenstein
Findings of the Association for Computational Linguistics: EMNLP 2024

Knowledge-intensive language understanding tasks require Language Models (LMs) to integrate relevant context, mitigating their inherent weaknesses, such as incomplete or outdated knowledge. However, conflicting knowledge can be present in an LM's parameters, termed intra-memory conflict, which can affect the model's propensity to accept contextual knowledge. To study the effect of intra-memory conflict on an LM's ability to accept relevant context, we utilise two knowledge conflict measures and a novel dataset containing inherently conflicting data, DYNAMICQA. This dataset includes temporally dynamic facts, which can change over time, and disputable dynamic facts, which can change depending on the viewpoint. DYNAMICQA is the first to include real-world knowledge conflicts and provide context to study the link between the different types of knowledge conflicts. We also evaluate several measures on their ability to reflect the presence of intra-memory conflict: semantic entropy and a novel coherent persuasion score. With our extensive experiments, we verify that LMs show a greater degree of intra-memory conflict with dynamic facts compared to facts that have a single truth value. Further, we reveal that facts with intra-memory conflict are harder to update with context, suggesting that retrieval-augmented generation will struggle with the most commonly adapted facts.
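Of the two uncertainty measures evaluated, semantic entropy is the established one: sample several answers, cluster them by semantic equivalence (typically bidirectional entailment), and take the entropy of the cluster distribution. Below is a simplified sketch under that standard definition; the `equivalent` predicate is a placeholder for an entailment check, and the coherent persuasion score is defined only in the paper.

```python
import math

def semantic_entropy(sampled_answers, equivalent):
    """Simplified semantic entropy: greedily cluster sampled answers with
    the `equivalent` predicate (e.g. bidirectional entailment against a
    cluster representative), then compute entropy over cluster sizes."""
    clusters = []
    for ans in sampled_answers:
        for cluster in clusters:
            if equivalent(ans, cluster[0]):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])  # start a new semantic cluster
    n = len(sampled_answers)
    probs = [len(c) / n for c in clusters]
    return -sum(p * math.log(p) for p in probs)
```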