Alberto Testoni


Are Current Decoding Strategies Capable of Facing the Challenges of Visual Dialogue?
Amit Kumar Chaudhary | Alex J. Lucassen | Ioanna Tsani | Alberto Testoni
Proceedings of the 15th International Conference on Natural Language Generation

ACT-Thor: A Controlled Benchmark for Embodied Action Understanding in Simulated Environments
Michael Hanna | Federico Pedeni | Alessandro Suglia | Alberto Testoni | Raffaella Bernardi
Proceedings of the 29th International Conference on Computational Linguistics

Artificial agents are nowadays challenged to perform embodied AI tasks. To succeed, agents must understand the meaning of verbs and how their corresponding actions transform the surrounding world. In this work, we propose ACT-Thor, a novel controlled benchmark for embodied action understanding. We use the AI2-THOR simulated environment to produce a controlled setup in which an agent, given a before-image and an associated action command, has to determine what the correct after-image is among a set of possible candidates. First, we assess the feasibility of the task via a human evaluation that resulted in 81.4% accuracy, and very high inter-annotator agreement (84.9%). Second, we design both unimodal and multimodal baselines, using state-of-the-art visual feature extractors. Our evaluation and error analysis suggest that only models that have a very structured representation of the actions together with powerful visual features can perform well on the task. However, they still fall behind human performance in a zero-shot scenario where the model is exposed to unseen (action, object) pairs. This paves the way for a systematic way of evaluating embodied AI agents that understand grounded actions.

pdf bib
A Small but Informed and Diverse Model: The Case of the Multimodal GuessWhat!? Guessing Game
Claudio Greco | Alberto Testoni | Raffaella Bernardi | Stella Frank
Proceedings of the 2022 CLASP Conference on (Dis)embodiment

Pre-trained Vision and Language Transformers achieve high performance on downstream tasks due to their ability to transfer representational knowledge accumulated during pretraining on substantial amounts of data. In this paper, we ask whether it is possible to compete with such models using features based on transferred (pre-trained, frozen) representations combined with a lightweight architecture. We take a multimodal guessing task as our testbed, GuessWhat?!. An ensemble of our lightweight model matches the performance of the finetuned pre-trained transformer (LXMERT). An uncertainty analysis of our ensemble shows that the lightweight transferred representations close the data uncertainty gap with LXMERT, while retaining model diversity leading to ensemble boost. We further demonstrate that LXMERT’s performance gain is due solely to its extra V&L pretraining rather than because of architectural improvements. These results argue for flexible integration of multiple features and lightweight models as a viable alternative to large, cumbersome, pre-trained models.


I’ve Seen Things You People Wouldn’t Believe”: Hallucinating Entities in GuessWhat?!
Alberto Testoni | Raffaella Bernardi
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop

Natural language generation systems have witnessed important progress in the last years, but they are shown to generate tokens that are unrelated to the source input. This problem affects computational models in many NLP tasks, and it is particularly unpleasant in multimodal systems. In this work, we assess the rate of object hallucination in multimodal conversational agents playing the GuessWhat?! referential game. Better visual processing has been shown to mitigate this issue in image captioning; hence, we adapt to the GuessWhat?! task the best visual processing models at disposal, and propose two new models to play the Questioner agent. We show that the new models generate few hallucinations compared to other renowned models available in the literature. Moreover, their hallucinations are less severe (affect task-accuracy less) and are more human-like. We also analyse where hallucinations tend to occur more often through the dialogue: hallucinations are less frequent in earlier turns, cause a cascade hallucination effect, and are often preceded by negative answers, which have been shown to be harder to ground.

The Interplay of Task Success and Dialogue Quality: An in-depth Evaluation in Task-Oriented Visual Dialogues
Alberto Testoni | Raffaella Bernardi
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

When training a model on referential dialogue guessing games, the best model is usually chosen based on its task success. We show that in the popular end-to-end approach, this choice prevents the model from learning to generate linguistically richer dialogues, since the acquisition of language proficiency takes longer than learning the guessing task. By comparing models playing different games (GuessWhat, GuessWhich, and Mutual Friends), we show that this discrepancy is model- and task-agnostic. We investigate whether and when better language quality could lead to higher task success. We show that in GuessWhat, models could increase their accuracy if they learn to ground, encode, and decode also words that do not occur frequently in the training set.

Visually Grounded Follow-up Questions: a Dataset of Spatial Questions Which Require Dialogue History
Tianai Dong | Alberto Testoni | Luciana Benotti | Raffaella Bernardi
Proceedings of Second International Combined Workshop on Spatial Language Understanding and Grounded Communication for Robotics

In this paper, we define and evaluate a methodology for extracting history-dependent spatial questions from visual dialogues. We say that a question is history-dependent if it requires (parts of) its dialogue history to be interpreted. We argue that some kinds of visual questions define a context upon which a follow-up spatial question relies. We call the question that restricts the context: trigger, and we call the spatial question that requires the trigger question to be answered: zoomer. We automatically extract different trigger and zoomer pairs based on the visual property that the questions rely on (e.g. color, number). We manually annotate the automatically extracted trigger and zoomer pairs to verify which zoomers require their trigger. We implement a simple baseline architecture based on a SOTA multimodal encoder. Our results reveal that there is much room for improvement for answering history-dependent questions.

Looking for Confirmations: An Effective and Human-Like Visual Dialogue Strategy
Alberto Testoni | Raffaella Bernardi
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Generating goal-oriented questions in Visual Dialogue tasks is a challenging and longstanding problem. State-Of-The-Art systems are shown to generate questions that, although grammatically correct, often lack an effective strategy and sound unnatural to humans. Inspired by the cognitive literature on information search and cross-situational word learning, we design Confirm-it, a model based on a beam search re-ranking algorithm that guides an effective goal-oriented strategy by asking questions that confirm the model’s conjecture about the referent. We take the GuessWhat?! game as a case-study. We show that dialogues generated by Confirm-it are more natural and effective than beam search decoding without re-ranking.


On the role of effective and referring questions in GuessWhat?!
Mauricio Mazuecos | Alberto Testoni | Raffaella Bernardi | Luciana Benotti
Proceedings of the First Workshop on Advances in Language and Vision Research

Task success is the standard metric used to evaluate referential visual dialogue systems. In this paper we propose two new metrics that evaluate how each question contributes to the goal. First, we measure how effective each question is by evaluating whether the question discards objects that are not the referent. Second, we define referring questions as those that univocally identify one object in the image. We report the new metrics for human dialogues and for state of the art publicly available models on GuessWhat?!. Regarding our first metric, we find that successful dialogues do not have a higher percentage of effective questions for most models. With respect to the second metric, humans make questions at the end of the dialogue that are referring, confirming their guess before guessing. Human dialogues that use this strategy have a higher task success but models do not seem to learn it.

Effective questions in referential visual dialogue
Mauricio Mazuecos | Alberto Testoni | Raffaella Bernardi | Luciana Benotti
Proceedings of the The Fourth Widening Natural Language Processing Workshop

An interesting challenge for situated dialogue systems is referential visual dialog: by asking questions, the system has to identify the referent to which the user refers to. Task success is the standard metric used to evaluate these systems. However, it does not consider how effective each question is, that is how much each question contributes to the goal. We propose a new metric, that measures question effectiveness. As a preliminary study, we report the new metric for state of the art publicly available models on GuessWhat?!. Surprisingly, successful dialogues do not have a higher percentage of effective questions than failed dialogues. This suggests that a system with high task success is not necessarily one that generates good questions.

They Are Not All Alike: Answering Different Spatial Questions Requires Different Grounding Strategies
Alberto Testoni | Claudio Greco | Tobias Bianchi | Mauricio Mazuecos | Agata Marcante | Luciana Benotti | Raffaella Bernardi
Proceedings of the Third International Workshop on Spatial Language Understanding

In this paper, we study the grounding skills required to answer spatial questions asked by humans while playing the GuessWhat?! game. We propose a classification for spatial questions dividing them into absolute, relational, and group questions. We build a new answerer model based on the LXMERT multimodal transformer and we compare a baseline with and without visual features of the scene. We are interested in studying how the attention mechanisms of LXMERT are used to answer spatial questions since they require putting attention on more than one region simultaneously and spotting the relation holding among them. We show that our proposed model outperforms the baseline by a large extent (9.70% on spatial questions and 6.27% overall). By analyzing LXMERT errors and its attention mechanisms, we find that our classification helps to gain a better understanding of the skills required to answer different spatial questions.


Quantifiers in a Multimodal World: Hallucinating Vision with Language and Sound
Alberto Testoni | Sandro Pezzelle | Raffaella Bernardi
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics

Inspired by the literature on multisensory integration, we develop a computational model to ground quantifiers in perception. The model learns to pick, out of nine quantifiers (‘few’, ‘many’, ‘all’, etc.), the one that is more likely to describe the percent of animals in a visual-auditory input containing both animals and artifacts. We show that relying on concurrent sensory inputs increases model performance on the quantification task. Moreover, we evaluate the model in a situation in which only the auditory modality is given, while the visual one is ‘hallucinanted’ either from the auditory input itself or from a linguistic caption describing the quantity of entities in the auditory input. This way, the model exploits prior associations between modalities. We show that the model profits from the prior knowledge and outperforms the auditory-only setting.


Grounded Textual Entailment
Hoa Trong Vu | Claudio Greco | Aliia Erofeeva | Somayeh Jafaritazehjan | Guido Linders | Marc Tanti | Alberto Testoni | Raffaella Bernardi | Albert Gatt
Proceedings of the 27th International Conference on Computational Linguistics

Capturing semantic relations between sentences, such as entailment, is a long-standing challenge for computational semantics. Logic-based models analyse entailment in terms of possible worlds (interpretations, or situations) where a premise P entails a hypothesis H iff in all worlds where P is true, H is also true. Statistical models view this relationship probabilistically, addressing it in terms of whether a human would likely infer H from P. In this paper, we wish to bridge these two perspectives, by arguing for a visually-grounded version of the Textual Entailment task. Specifically, we ask whether models can perform better if, in addition to P and H, there is also an image (corresponding to the relevant “world” or “situation”). We use a multimodal version of the SNLI dataset (Bowman et al., 2015) and we compare “blind” and visually-augmented models of textual entailment. We show that visual information is beneficial, but we also conduct an in-depth error analysis that reveals that current multimodal models are not performing “grounding” in an optimal fashion.