Aida Nematzadeh


Reassessing Evaluation Practices in Visual Question Answering: A Case Study on Out-of-Distribution Generalization
Aishwarya Agrawal | Ivana Kajic | Emanuele Bugliarello | Elnaz Davoodi | Anita Gergely | Phil Blunsom | Aida Nematzadeh
Findings of the Association for Computational Linguistics: EACL 2023

Vision-and-language (V&L) models pretrained on large-scale multimodal data have demonstrated strong performance on various tasks such as image captioning and visual question answering (VQA). The quality of such models is commonly assessed by measuring their performance on unseen data that typically comes from the same distribution as the training data. However, when evaluated under out-of-distribution (out-of-dataset) settings for VQA, we observe that these models exhibit poor generalization. We comprehensively evaluate two pretrained V&L models under different settings (i.e. classification and open-ended text generation) by conducting cross-dataset evaluations. We find that these models tend to learn to solve the benchmark, rather than learning the high-level skills required by the VQA task. We also find that in most cases generative models are less susceptible to shifts in data distribution compared to discriminative ones, and that multimodal pretraining is generally helpful for OOD generalization. Finally, we revisit assumptions underlying the use of automatic VQA evaluation metrics, and empirically show that their stringent nature repeatedly penalizes models for correct responses.

MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting
Oscar Mañas | Pau Rodriguez Lopez | Saba Ahmadi | Aida Nematzadeh | Yash Goyal | Aishwarya Agrawal
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Large pre-trained models have proved to be remarkable zero- and (prompt-based) few-shot learners in unimodal vision and language tasks. We propose MAPL, a simple and parameter-efficient method that reuses frozen pre-trained unimodal models and leverages their strong generalization capabilities in multimodal vision-language (VL) settings. MAPL learns a lightweight mapping between the representation spaces of unimodal models using aligned image-text data, and can generalize to unseen VL tasks from just a few in-context examples. The small number of trainable parameters makes MAPL effective at low-data and in-domain learning. Moreover, MAPL’s modularity enables easy extension to other pre-trained models. Extensive experiments on several visual question answering and image captioning benchmarks show that MAPL achieves superior or competitive performance compared to similar methods while training orders of magnitude fewer parameters. MAPL can be trained in just a few hours using modest computational resources and public datasets. We release our code and pre-trained model weights at


Vision-Language Pretraining: Current Trends and the Future
Aishwarya Agrawal | Damien Teney | Aida Nematzadeh
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

In the last few years, there has been an increased interest in building multimodal (vision-language) models that are pretrained on larger but noisier datasets where the two modalities (e.g., image and text) loosely correspond to each other (e.g., Lu et al., 2019; Radford et al., 2021). Given a task (such as visual question answering), these models are then often fine-tuned on task-specific supervised datasets (e.g., Lu et al., 2019; Chen et al., 2020; Tan and Bansal, 2019; Li et al., 2020a,b). In addition to the larger pretraining datasets, the transformer architecture (Vaswani et al., 2017) and in particular self-attention applied to two modalities are responsible for the impressive performance of the recent pretrained models on downstream tasks (Hendricks et al., 2021). In this tutorial, we focus on recent vision-language pretraining paradigms. Our goal is to first provide the background on image–language datasets, benchmarks, and modeling innovations before the multimodal pretraining area. Next we discuss the different families of models used for vision-language pretraining, highlighting their strengths and shortcomings. Finally, we discuss the limits of vision-language pretraining through statistical learning, and the need for alternative approaches such as causal representation learning.

Proceedings of the First Workshop on Learning with Natural Language Supervision
Jacob Andreas | Karthik Narasimhan | Aida Nematzadeh
Proceedings of the First Workshop on Learning with Natural Language Supervision

A Systematic Investigation of Commonsense Knowledge in Large Language Models
Xiang Lorraine Li | Adhiguna Kuncoro | Jordan Hoffmann | Cyprien de Masson d’Autume | Phil Blunsom | Aida Nematzadeh
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Language models (LMs) trained on large amounts of data have shown impressive performance on many NLP tasks under the zero-shot and few-shot setup. Here we aim to better understand the extent to which such models learn commonsense knowledge — a critical component of many NLP applications. We conduct a systematic and rigorous zero-shot and few-shot commonsense evaluation of large pre-trained LMs, where we: (i) carefully control for the LMs’ ability to exploit potential surface cues and annotation artefacts, and (ii) account for variations in performance that arise from factors that are not related to commonsense knowledge. Our findings highlight the limitations of pre-trained LMs in acquiring commonsense knowledge without task-specific supervision; furthermore, using larger models or few-shot evaluation is insufficient to achieve human-level commonsense performance.


Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers
Lisa Anne Hendricks | John Mellor | Rosalia Schneider | Jean-Baptiste Alayrac | Aida Nematzadeh
Transactions of the Association for Computational Linguistics, Volume 9

Recently, multimodal transformer models have gained popularity because their performance on downstream tasks suggests they learn rich visual-linguistic representations. Focusing on zero-shot image retrieval tasks, we study three important factors that can impact the quality of learned representations: pretraining data, the attention mechanism, and loss functions. By pretraining models on six datasets, we observe that dataset noise and language similarity to our downstream task are important indicators of model performance. Through architectural analysis, we learn that models with a multimodal attention mechanism can outperform deeper models with modality-specific attention mechanisms. Finally, we show that successful contrastive losses used in the self-supervised learning literature do not yield similar performance gains when used in multimodal transformers.

Probing Image-Language Transformers for Verb Understanding
Lisa Anne Hendricks | Aida Nematzadeh
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021


Learning to Segment Actions from Observation and Narration
Daniel Fried | Jean-Baptiste Alayrac | Phil Blunsom | Chris Dyer | Stephen Clark | Aida Nematzadeh
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We apply a generative segmental model of task structure, guided by narration, to action segmentation in video. We focus on unsupervised and weakly-supervised settings where no action labels are known during training. Despite its simplicity, our model performs competitively with previous work on a dataset of naturalistic instructional videos. Our model allows us to vary the sources of supervision used in training, and we find that both task structure and narrative language provide large benefits in segmentation quality.


Language Learning and Processing in People and Machines
Aida Nematzadeh | Richard Futrell | Roger Levy
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials

The goal of this tutorial is to bring the fields of computational linguistics and computational cognitive science closer: we will introduce different stages of language acquisition and their parallel problems in NLP. As an example, one of the early challenges children face is mapping the meaning of word labels (such as “cat”) to their referents (the furry animal in the living room). Word learning is similar to the word alignment problem in machine translation. We explain the current computational models of language acquisition, their limitations, and how the insights from these models can be incorporated into NLP applications. Moreover, we discuss how we can take advantage of the cognitive science of language in computational linguistics: for example, by designing cognitively-motivated evaluation tasks or building language-learning inductive biases into our models.


Predicting and Explaining Human Semantic Search in a Cognitive Model
Filip Miscevic | Aida Nematzadeh | Suzanne Stevenson
Proceedings of the 8th Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2018)

Exploiting Attention to Reveal Shortcomings in Memory Models
Kaylee Burns | Aida Nematzadeh | Erin Grant | Alison Gopnik | Tom Griffiths
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

The decision-making processes of deep networks are difficult to understand, and while their accuracy often improves with increased architectural complexity, so too does their opacity. Practical use of machine learning models, especially for question answering applications, demands a system that is interpretable. We analyze the attention of a memory network model to reconcile contradictory performance on a challenging question-answering dataset that is inspired by theory-of-mind experiments. We equate success on questions to task classification, which explains not only test-time failures but also how well the model generalizes to new training conditions.

Evaluating Theory of Mind in Question Answering
Aida Nematzadeh | Kaylee Burns | Erin Grant | Alison Gopnik | Tom Griffiths
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We propose a new dataset for evaluating question answering models with respect to their capacity to reason about beliefs. Our tasks are inspired by theory-of-mind experiments that examine whether children are able to reason about the beliefs of others, in particular when those beliefs differ from reality. We evaluate a number of recent neural models with memory augmentation. We find that all fail on our tasks, which require keeping track of inconsistent states of the world; moreover, the models’ accuracy decreases notably when random sentences are introduced to the tasks at test time.


A Computational Cognitive Model of Novel Word Generalization
Aida Nematzadeh | Erin Grant | Suzanne Stevenson
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing


A Cognitive Model of Semantic Network Learning
Aida Nematzadeh | Afsaneh Fazly | Suzanne Stevenson
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)


A Computational Model of Memory, Attention, and Word Learning
Aida Nematzadeh | Afsaneh Fazly | Suzanne Stevenson
Proceedings of the 3rd Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2012)