Tinne Tuytelaars


Decoding Language Spatial Relations to 2D Spatial Arrangements
Gorjan Radevski | Guillem Collell | Marie-Francine Moens | Tinne Tuytelaars
Findings of the Association for Computational Linguistics: EMNLP 2020

We address the problem of multimodal spatial understanding by decoding a set of language-expressed spatial relations to a set of 2D spatial arrangements in a multi-object and multi-relationship setting. We frame the task as arranging a scene of clip-arts given a textual description. We propose a simple and effective model architecture, Spatial-Reasoning Bert (SR-Bert), trained to decode text to 2D spatial arrangements in a non-autoregressive manner. SR-Bert can decode both explicit and implicit language to 2D spatial arrangements, generalizes to out-of-sample data to a reasonable extent, and can generate complete abstract scenes if paired with a clip-arts predictor. Finally, we qualitatively evaluate our method with a user study, validating that our generated spatial arrangements align with human expectations.

Self-supervised context-aware COVID-19 document exploration through atlas grounding
Dusan Grujicic | Gorjan Radevski | Tinne Tuytelaars | Matthew Blaschko
Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020

In this paper, we aim to develop a self-supervised grounding of Covid-related medical text based on the actual spatial relationships between the referred anatomical concepts. More specifically, we learn to project sentences into a physical space defined by a three-dimensional anatomical atlas, allowing for a visual approach to navigating Covid-related literature. We design a straightforward and empirically effective training objective to reduce the curated data dependency issue. We use BERT as the main building block of our model and perform a quantitative analysis that demonstrates that the model learns a context-aware mapping. We illustrate two potential use-cases for our approach, one in interactive, 3D data exploration, and the other in document retrieval. To accelerate research in this direction, we make public all trained models, codebase and the developed tools, which can be accessed at https://github.com/gorjanradevski/macchina/.

Learning to ground medical text in a 3D human atlas
Dusan Grujicic | Gorjan Radevski | Tinne Tuytelaars | Matthew Blaschko
Proceedings of the 24th Conference on Computational Natural Language Learning

In this paper, we develop a method for grounding medical text into a physically meaningful and interpretable space corresponding to a human atlas. We build on text embedding architectures such as BERT and introduce a loss function that allows us to reason about the semantic and spatial relatedness of medical texts by learning a projection of the embedding into a 3D space representing the human body. We quantitatively and qualitatively demonstrate that our proposed method learns a context-sensitive and spatially aware mapping, in both the inter-organ and intra-organ sense, using a large-scale medical text dataset from the “Large-scale online biomedical semantic indexing” track of the 2020 BioASQ challenge. We extend our approach to a self-supervised setting and find it to be competitive with a classification-based method and a fully supervised variant of the approach.


Learning to Recognize Animals by Watching Documentaries: Using Subtitles as Weak Supervision
Aparna Nurani Venkitasubramanian | Tinne Tuytelaars | Marie-Francine Moens
Proceedings of the Sixth Workshop on Vision and Language

We investigate animal recognition models learned from wildlife video documentaries by using the weak supervision of the textual subtitles. This is a particularly challenging setting, since i) the animals occur in their natural habitat and are often largely occluded, and ii) subtitles are to a large degree complementary to the visual content, providing a very weak supervisory signal. This is in contrast to most work on integrated vision and language in the literature, where textual descriptions are tightly linked to the image content and often generated in a curated fashion for the task at hand. In particular, we investigate different image representations and models, including a support vector machine on top of activations of a pretrained convolutional neural network, as well as a Naive Bayes framework on a ‘bag-of-activations’ image representation, where each element of the bag is considered separately. This representation allows key components in the image to be isolated, in spite of largely varying backgrounds and image clutter, without an object detection or image segmentation step. The methods are evaluated based on how well they transfer to unseen camera-trap images captured across diverse topographical regions under different environmental conditions and illumination settings, involving a large domain shift.