Thomas Bonnier
2025
Error Detection for Multimodal Classification
Thomas Bonnier
Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025)
Machine learning models have proven useful in key applications such as autonomous driving and diagnosis prediction. When a model is deployed under real-world conditions, it is therefore essential to detect potential errors with a trustworthy approach. This monitoring practice makes decision-making safer by helping to avoid catastrophic failures. In this paper, the focus is on multimodal classification. We introduce a method that detects errors using unlabeled data: it leverages fused representations and estimates the probability that a model will fail, based on fault patterns detected in validation data. To improve transparency, we employ a sampling-based approximation of Shapley values in multimodal settings to explain, in terms of feature values, why a prediction is assessed as erroneous. Further, since explanation methods can disagree, we suggest evaluating the consistency of explanations produced by different value functions and algorithms. To show the relevance of our method, we compare it against nine baselines from various domains on tabular-text and text-image datasets, using two multimodal fusion strategies for the classification models. Lastly, we show the usefulness of our explanation algorithm on misclassified samples.
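As an illustration of the sampling-based Shapley approximation mentioned in the abstract, the sketch below estimates per-feature attributions by Monte Carlo permutation sampling, the standard way to approximate Shapley values. All names here are illustrative assumptions rather than the paper's exact implementation; in this setting, `value_fn` might return the estimated probability that the model's prediction is erroneous, and `baseline` supplies reference values used to "remove" features.

```python
import numpy as np

def shapley_sampling(value_fn, x, baseline, n_permutations=200, rng=None):
    """Monte Carlo (permutation-sampling) approximation of Shapley values.

    value_fn: maps a feature vector to a scalar score (hypothetically, the
        estimated probability that the model's prediction is erroneous).
    x: the instance to explain (1-D array of feature values).
    baseline: reference values standing in for "absent" features.
    """
    rng = np.random.default_rng(rng)
    n_features = x.shape[0]
    phi = np.zeros(n_features)
    for _ in range(n_permutations):
        order = rng.permutation(n_features)
        z = baseline.copy()           # start from the all-removed coalition
        prev = value_fn(z)
        for j in order:               # add features one by one in random order
            z[j] = x[j]
            curr = value_fn(z)
            phi[j] += curr - prev     # marginal contribution of feature j
            prev = curr
    return phi / n_permutations       # average over sampled permutations
```

Within each permutation the marginal contributions telescope to value_fn(x) - value_fn(baseline), so the averaged attributions satisfy the Shapley efficiency property up to sampling noise.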
2024
Revisiting Multimodal Transformers for Tabular Data with Text Fields
Thomas Bonnier
Findings of the Association for Computational Linguistics: ACL 2024
Tabular data with text fields can be leveraged in applications such as financial risk assessment or medical diagnosis prediction. When employing multimodal approaches to make predictions from these modalities, it is crucial to make appropriate modeling choices in terms of numerical feature encoding and fusion strategy. In this paper, we focus on multimodal classification tasks based on tabular datasets with text fields. Building on multimodal Transformers, we propose the Tabular-Text Transformer (TTT), a dual-stream tabular/text Transformer network. The architecture includes a distance-to-quantile embedding scheme for numerical features and an overall attention module that jointly considers self-attention and cross-modal attention. Further, we leverage the two well-informed modality streams to estimate whether a prediction is uncertain. To explain uncertainty in terms of feature values, we use a sampling-based approximation of Shapley values in a bimodal context, with two options for the value function. To show the efficacy and relevance of this approach, we compare it to six baselines and measure its ability to quantify and explain uncertainty against various methods. Our code is available at https://github.com/thomas-bonnier/TabularTextTransformer.
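The distance-to-quantile embedding is defined precisely in the paper and the linked repository; as a rough sketch of the general idea only, assuming each scalar feature is represented by its signed distances to quantiles estimated on training data and then projected to the model dimension, one could write:

```python
import numpy as np
import torch
import torch.nn as nn

class DistanceToQuantileEmbedding(nn.Module):
    """Hypothetical sketch of a distance-to-quantile embedding.

    Assumed reading: a numerical feature value is encoded by its signed
    distances to K quantiles precomputed on the training split, then
    linearly projected to the model dimension. The exact scheme used by
    TTT is specified in the paper and repository.
    """
    def __init__(self, train_values: np.ndarray, n_quantiles: int = 8,
                 d_model: int = 64):
        super().__init__()
        qs = np.quantile(train_values, np.linspace(0.0, 1.0, n_quantiles))
        self.register_buffer("quantiles",
                             torch.tensor(qs, dtype=torch.float32))
        self.proj = nn.Linear(n_quantiles, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch,) scalar feature values -> (batch, d_model) embeddings
        dist = x.unsqueeze(-1) - self.quantiles  # signed distance to each quantile
        return self.proj(dist)
```

Relative to a single learned linear embedding of the raw value, a quantile-based representation reflects where a value falls in the training distribution, which can make the encoding less sensitive to the feature's scale and skew.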