Roberto Dessì

Also published as: Roberto Dessi


Communication breakdown: On the low mutual intelligibility between human and neural captioning
Roberto Dessì | Eleonora Gualdoni | Francesca Franzon | Gemma Boleda | Marco Baroni
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

We compare the zero-shot performance of a neural caption-based image retriever when given as input either human-produced captions or captions generated by a neural captioner. We conduct this comparison on the recently introduced ImageCoDe dataset (Krojer et al. 2022), which contains hard distractors nearly identical to the images to be retrieved. We find that the neural retriever performs much better when fed neural rather than human captions, even though the former, unlike the latter, were generated without awareness of the distractors that make the task hard. Even more remarkably, when the same neural captions are given to human subjects, their retrieval performance is almost at chance level. Our results thus add to the growing body of evidence that, even when the “language” of neural models resembles English, this superficial resemblance might be deeply misleading.
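
To make the retrieval setup in this abstract concrete, here is a minimal sketch of caption-based retrieval among distractors. It is not the paper's system: the cosine-similarity scoring, the toy vectors, and the names `cosine`, `retrieve`, and `chance_accuracy` are all illustrative assumptions. It only shows what "picking the target among near-identical candidates" and "chance level" mean operationally.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(caption_vec: Sequence[float], image_vecs: List[Sequence[float]]) -> int:
    """Return the index of the candidate image most similar to the caption."""
    scores = [cosine(caption_vec, img) for img in image_vecs]
    return max(range(len(scores)), key=scores.__getitem__)


def chance_accuracy(num_candidates: int) -> float:
    """Expected accuracy of a uniform random guess over the candidates."""
    return 1.0 / num_candidates


# Hypothetical embeddings: one caption and three candidate images.
caption = [1.0, 0.0]
candidates = [[0.0, 1.0], [0.9, 0.1], [0.5, 0.5]]
best = retrieve(caption, candidates)
```

In a real evaluation the vectors would come from a pretrained image-text encoder, and accuracy would be compared against `chance_accuracy` for the number of candidates in each set.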

Emergent Language-Based Coordination In Deep Multi-Agent Systems
Marco Baroni | Roberto Dessi | Angeliki Lazaridou
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts

Large pre-trained deep networks are the standard building blocks of modern AI applications. This raises fundamental questions about how to control their behaviour and how to make them efficiently interact with each other. Deep net emergent communication tackles these challenges by studying how to induce communication protocols between neural network agents, and how to include humans in the communication loop. Traditionally, this research has focussed on relatively small-scale experiments where two networks had to develop a discrete code from scratch for referential communication. However, with the rise of large pre-trained language models that can work well on many tasks, the emphasis is now shifting to how to let these models interact through a language-like channel to engage in more complex behaviours. By reviewing several representative papers, we will provide an introduction to deep net emergent communication, cover various central topics from the present and recent past, discuss current shortcomings, and suggest future directions. The presentation is complemented by a hands-on section where participants will implement and analyze two emergent communication setups from the literature. The tutorial should be of interest to researchers wanting to develop more flexible AI systems, but also to cognitive scientists and linguists interested in the evolution of communication systems.


Can Transformers Jump Around Right in Natural Language? Assessing Performance Transfer from SCAN
Rahma Chaabouni | Roberto Dessì | Eugene Kharitonov
Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

Despite their failure to solve the compositional SCAN dataset, seq2seq architectures still achieve astonishing success on more practical tasks. This observation pushes us to question the usefulness of SCAN-style compositional generalization in realistic NLP tasks. In this work, we study the benefit that such compositionality brings to several machine translation tasks. We present several focused modifications of the Transformer that greatly improve generalization capabilities on SCAN and select one that remains on par with a vanilla Transformer on a standard machine translation (MT) task. Next, we study its performance in low-resource settings and on a newly introduced distribution-shifted English-French translation task. Overall, we find that improvements of a SCAN-capable model do not directly transfer to the resource-rich MT setup. In contrast, in the low-resource setup, general modifications lead to an improvement of up to 13.1% BLEU score w.r.t. a vanilla Transformer. Similarly, an improvement of 14% in an accuracy-based metric is achieved in the introduced compositional English-French translation task. This provides experimental evidence that the compositional generalization assessed in SCAN is particularly useful in resource-starved and domain-shifted scenarios.


CNNs found to jump around more skillfully than RNNs: Compositional Generalization in Seq2seq Convolutional Networks
Roberto Dessì | Marco Baroni
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Lake and Baroni (2018) introduced the SCAN dataset probing the ability of seq2seq models to capture compositional generalizations, such as inferring the meaning of “jump around” zero-shot from the component words. Recurrent networks (RNNs) were found to completely fail the most challenging generalization cases. We test here a convolutional network (CNN) on these tasks, reporting hugely improved performance with respect to RNNs. Despite the big improvement, the CNN has not, however, induced systematic rules, suggesting that the difference between compositional and non-compositional behaviour is not clear-cut.
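
As an illustration of the kind of compositional rule this abstract refers to, the sketch below interprets a tiny fragment of SCAN-style commands. It is a simplified assumption-laden toy, not the dataset's full grammar and not the model from the paper: it covers only bare primitives and "PRIMITIVE around DIRECTION", and the function name `interpret` is hypothetical.

```python
from typing import List

# Output action tokens for a small subset of primitive commands.
PRIMITIVES = {"jump": "I_JUMP", "walk": "I_WALK", "run": "I_RUN", "look": "I_LOOK"}
TURNS = {"left": "I_TURN_LEFT", "right": "I_TURN_RIGHT"}


def interpret(command: str) -> List[str]:
    """Map a command from the toy fragment to its action sequence."""
    words = command.split()
    # A bare primitive maps to a single action.
    if len(words) == 1 and words[0] in PRIMITIVES:
        return [PRIMITIVES[words[0]]]
    # "X around DIR": turn in direction DIR, then do X, repeated four times.
    if (
        len(words) == 3
        and words[0] in PRIMITIVES
        and words[1] == "around"
        and words[2] in TURNS
    ):
        return [TURNS[words[2]], PRIMITIVES[words[0]]] * 4
    raise ValueError(f"command outside the toy fragment: {command!r}")


actions = interpret("jump around right")
```

A learner that has induced the rule for "around" from "walk around right" can apply it to "jump" without ever having seen "jump around right"; that is the zero-shot generalization the dataset probes.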

Enhancing Transformer for End-to-end Speech-to-Text Translation
Mattia Antonino Di Gangi | Matteo Negri | Roldano Cattoni | Roberto Dessi | Marco Turchi
Proceedings of Machine Translation Summit XVII: Research Track


Fine-tuning on Clean Data for End-to-End Speech Translation: FBK @ IWSLT 2018
Mattia Antonino Di Gangi | Roberto Dessì | Roldano Cattoni | Matteo Negri | Marco Turchi
Proceedings of the 15th International Conference on Spoken Language Translation

This paper describes FBK’s submission to the end-to-end English-German speech translation task at IWSLT 2018. Our system relies on a state-of-the-art model based on LSTMs and CNNs, where the CNNs are used to reduce the temporal dimension of the audio input, which is in general much longer than the text input used in machine translation. Our model was trained only on the audio-to-text parallel data released for the task, and fine-tuned on cleaned subsets of the original training corpus. The addition of weight normalization and label smoothing improved the baseline system by 1.0 BLEU point on our validation set. The final submission also featured checkpoint averaging within a training run and ensemble decoding of models trained during multiple runs. On test data, our best single model obtained a BLEU score of 9.7, while the ensemble obtained a BLEU score of 10.24.
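
The checkpoint averaging mentioned in this abstract can be sketched as an element-wise mean over the parameters saved at several points of one training run. The version below is a framework-free illustration under stated assumptions: checkpoints are modelled as plain dictionaries of flat float lists, and `average_checkpoints` is a hypothetical helper, not the submission's code.

```python
from typing import Dict, List

# Assumed representation: parameter name -> flat list of values.
Checkpoint = Dict[str, List[float]]


def average_checkpoints(checkpoints: List[Checkpoint]) -> Checkpoint:
    """Return the element-wise mean of parameters across checkpoints."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    averaged: Checkpoint = {}
    for name in checkpoints[0]:
        # Group the i-th value of this parameter from every checkpoint.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(col) / len(checkpoints) for col in columns]
    return averaged


# Two toy checkpoints taken from the same (hypothetical) training run.
ckpt_a = {"w": [1.0, 2.0], "b": [0.0]}
ckpt_b = {"w": [3.0, 4.0], "b": [1.0]}
merged = average_checkpoints([ckpt_a, ckpt_b])
```

Averaging yields a single model, whereas the ensemble decoding described in the abstract combines the predictions of several separately trained models at decoding time.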