Ece Takmaz


2022

pdf
Less Descriptive yet Discriminative: Quantifying the Properties of Multimodal Referring Utterances via CLIP
Ece Takmaz | Sandro Pezzelle | Raquel Fernández
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics

In this work, we use a transformer-based pre-trained multimodal model, CLIP, to shed light on the mechanisms employed by human speakers when referring to visual entities. In particular, we use CLIP to quantify the degree of descriptiveness (how well an utterance describes an image in isolation) and discriminativeness (to what extent an utterance is effective in picking out a single image among similar images) of human referring utterances within multimodal dialogues. Overall, our results show that utterances become less descriptive over time while their discriminativeness remains unchanged. Through analysis, we propose that this trend could be due to participants relying on the previous mentions in the dialogue history, as well as being able to distill the most discriminative information from the visual context. In general, our study opens up the possibility of using this and similar models to quantify patterns in human data and shed light on the underlying cognitive mechanisms.

pdf
Team DMG at CMCL 2022 Shared Task: Transformer Adapters for the Multi- and Cross-Lingual Prediction of Human Reading Behavior
Ece Takmaz
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics

In this paper, we present the details of our approaches that attained the second place in the shared task of the ACL 2022 Cognitive Modeling and Computational Linguistics Workshop. The shared task is focused on multi- and cross-lingual prediction of eye movement features in human reading behavior, which could provide valuable information regarding language processing. To this end, we train ‘adapters’ inserted into the layers of frozen transformer-based pretrained language models. We find that multilingual models equipped with adapters perform well in predicting eye-tracking features. Our results suggest that utilizing language- and task-specific adapters is beneficial and translating test sets into similar languages that exist in the training set could help with zero-shot transferability in the prediction of human reading behavior.

2021

pdf
Word Representation Learning in Multimodal Pre-Trained Transformers: An Intrinsic Evaluation
Sandro Pezzelle | Ece Takmaz | Raquel Fernández
Transactions of the Association for Computational Linguistics, Volume 9

Abstract This study carries out a systematic intrinsic evaluation of the semantic representations learned by state-of-the-art pre-trained multimodal Transformers. These representations are claimed to be task-agnostic and shown to help on many downstream language-and-vision tasks. However, the extent to which they align with human semantic intuitions remains unclear. We experiment with various models and obtain static word representations from the contextualized ones they learn. We then evaluate them against the semantic judgments provided by human speakers. In line with previous evidence, we observe a generalized advantage of multimodal representations over language- only ones on concrete word pairs, but not on abstract ones. On the one hand, this confirms the effectiveness of these models to align language and vision, which results in better semantic representations for concepts that are grounded in images. On the other hand, models are shown to follow different representation learning patterns, which sheds some light on how and when they perform multimodal integration.

pdf bib
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop
Ionut-Teodor Sorodoc | Madhumita Sushil | Ece Takmaz | Eneko Agirre
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop

2020

pdf
Refer, Reuse, Reduce: Generating Subsequent References in Visual and Conversational Contexts
Ece Takmaz | Mario Giulianelli | Sandro Pezzelle | Arabella Sinclair | Raquel Fernández
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Dialogue participants often refer to entities or situations repeatedly within a conversation, which contributes to its cohesiveness. Subsequent references exploit the common ground accumulated by the interlocutors and hence have several interesting properties, namely, they tend to be shorter and reuse expressions that were effective in previous mentions. In this paper, we tackle the generation of first and subsequent references in visually grounded dialogue. We propose a generation model that produces referring utterances grounded in both the visual and the conversational context. To assess the referring effectiveness of its output, we also implement a reference resolution system. Our experiments and analyses show that the model produces better, more effective referring utterances than a model not grounded in the dialogue context, and generates subsequent references that exhibit linguistic patterns akin to humans.

pdf
Generating Image Descriptions via Sequential Cross-Modal Alignment Guided by Human Gaze
Ece Takmaz | Sandro Pezzelle | Lisa Beinborn | Raquel Fernández
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

When speakers describe an image, they tend to look at objects before mentioning them. In this paper, we investigate such sequential cross-modal alignment by modelling the image description generation process computationally. We take as our starting point a state-of-the-art image captioning system and develop several model variants that exploit information from human gaze patterns recorded during language production. In particular, we propose the first approach to image description generation where visual processing is modelled sequentially. Our experiments and analyses confirm that better descriptions can be obtained by exploiting gaze-driven attention and shed light on human cognitive processes by comparing different ways of aligning the gaze modality with language production. We find that processing gaze data sequentially leads to descriptions that are better aligned to those produced by speakers, more diverse, and more natural—particularly when gaze is encoded with a dedicated recurrent component.

2019

pdf
Evaluating the Representational Hub of Language and Vision Models
Ravi Shekhar | Ece Takmaz | Raquel Fernández | Raffaella Bernardi
Proceedings of the 13th International Conference on Computational Semantics - Long Papers

The multimodal models used in the emerging field at the intersection of computational linguistics and computer vision implement the bottom-up processing of the “Hub and Spoke” architecture proposed in cognitive science to represent how the brain processes and combines multi-sensory inputs. In particular, the Hub is implemented as a neural network encoder. We investigate the effect on this encoder of various vision-and-language tasks proposed in the literature: visual question answering, visual reference resolution, and visually grounded dialogue. To measure the quality of the representations learned by the encoder, we use two kinds of analyses. First, we evaluate the encoder pre-trained on the different vision-and-language tasks on an existing “diagnostic task” designed to assess multimodal semantic understanding. Second, we carry out a battery of analyses aimed at studying how the encoder merges and exploits the two modalities.

pdf
The PhotoBook Dataset: Building Common Ground through Visually-Grounded Dialogue
Janosch Haber | Tim Baumgärtner | Ece Takmaz | Lieke Gelderloos | Elia Bruni | Raquel Fernández
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

This paper introduces the PhotoBook dataset, a large-scale collection of visually-grounded, task-oriented dialogues in English designed to investigate shared dialogue history accumulating during conversation. Taking inspiration from seminal work on dialogue analysis, we propose a data-collection task formulated as a collaborative game prompting two online participants to refer to images utilising both their visual context as well as previously established referring expressions. We provide a detailed description of the task setup and a thorough analysis of the 2,500 dialogues collected. To further illustrate the novel features of the dataset, we propose a baseline model for reference resolution which uses a simple method to take into account shared information accumulated in a reference chain. Our results show that this information is particularly important to resolve later descriptions and underline the need to develop more sophisticated models of common ground in dialogue interaction.