Özge Alaçam

Also published as: Ozge Alacam, Özge Alacam

2021

Situation-Specific Multimodal Feature Adaptation
Özge Alacam
Proceedings of the First Workshop on Bridging Human–Computer Interaction and Natural Language Processing

In the next decade, we will see a considerable need for NLP models for situated settings where the diversity of situations and also different modalities including eye-movements should be taken into account in order to grasp the intention of the user. However, language comprehension in situated settings cannot be handled in isolation, as different multimodal cues are inherently present and essential parts of the situations. In this research proposal, we aim to quantify the influence of each modality in interaction with various referential complexities. We propose to encode the referential complexity of the situated settings in the embeddings during pre-training to implicitly guide the model to the most plausible situation-specific deviations. We summarize the challenges of intention extraction and propose a methodological approach to investigate a situation-specific feature adaptation to improve crossmodal mapping and meaning recovery from noisy communication settings.

Towards Multi-Modal Text-Image Retrieval to improve Human Reading
Florian Schneider | Özge Alaçam | Xintong Wang | Chris Biemann
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop

In primary school, children’s books, as well as in modern language learning apps, multi-modal learning strategies like illustrations of terms and phrases are used to support reading comprehension. Also, several studies in educational psychology suggest that integrating cross-modal information will improve reading comprehension. We claim that state-of-the-art multi-modal transformers, which could be used in a language learner context to improve human reading, will perform poorly because of the short and relatively simple textual data those models are trained with. To prove our hypotheses, we collected a new multi-modal image-retrieval dataset based on data from Wikipedia. In an in-depth data analysis, we highlight the differences between our dataset and other popular datasets. Additionally, we evaluate several state-of-the-art multi-modal transformers on text-image retrieval on our dataset and analyze their meager results, which verify our claims.

2020

Eye4Ref: A Multimodal Eye Movement Dataset of Referentially Complex Situations
Özge Alacam | Eugen Ruppert | Amr Rekaby Salama | Tobias Staron | Wolfgang Menzel
Proceedings of the 12th Language Resources and Evaluation Conference

Eye4Ref is a rich multimodal dataset of eye-movement recordings collected from referentially complex situated settings where the linguistic utterances and their visual referential world were available to the listener. It consists of not only fixation parameters but also saccadic movement parameters that are time-locked to accompanying German utterances (with English translations). It also contains symbolic knowledge (contextual) representations of the images to map the referring expressions onto the objects in corresponding images. Overall, the data was collected from 62 participants in three different experimental setups (86 systematically controlled sentence–image pairs and 1844 eye-movement recordings). Referential complexity was controlled by visual manipulations (e.g., number of objects, visibility of the target items, etc.), and by linguistic manipulations (e.g., the position of the disambiguating word in a sentence). This multimodal dataset, in which the three different sources of information, namely eye-tracking, language, and visual environment, are aligned, offers a test of various research questions not only from a language perspective but also from a computer vision perspective.

2019

Enhancing Natural Language Understanding through Cross-Modal Interaction: Meaning Recovery from Acoustically Noisy Speech
Ozge Alacam
Proceedings of the 22nd Nordic Conference on Computational Linguistics

Cross-modality between vision and language is a key component for effective and efficient communication, and the human language processing mechanism successfully integrates information from various modalities to extract the intended meaning. However, incomplete linguistic input, e.g., due to a noisy environment, is one of the challenges for successful communication. In that case, an incompleteness in one channel can be compensated by information from another one. In this paper, using the visual-world paradigm, we investigated the dynamics between syntactically possible gap fillers and the visual arrangements in incomplete German sentences and their effect on overall sentence interpretation.

2018

Text Completion using Context-Integrated Dependency Parsing
Amr Rekaby Salama | Özge Alaçam | Wolfgang Menzel
Proceedings of The Third Workshop on Representation Learning for NLP

Incomplete linguistic input, e.g., due to a noisy environment, is one of the challenges that a successful communication system has to deal with. In this paper, we study text completion with a data set composed of sentences with gaps where a successful completion cannot be achieved through a uni-modal (language-based) approach. We present a solution based on a context-integrating dependency parser incorporating an additional non-linguistic modality. An incompleteness in one channel is compensated by information from another one, and the parser learns the association between the two modalities from a multi-level knowledge representation. We examined several model variations by adjusting the degree of influence of different modalities in the decision making on possible filler words and their exact reference to a non-linguistic context element. Our model is able to fill the gap with 95.4% word and 95.2% exact reference accuracy; hence, successful prediction can be achieved not only on the word level (such as mug) but also with respect to the correct identification of its context reference (such as mug 2 among several mug instances).

Incorporating Contextual Information for Language-Independent, Dynamic Disambiguation Tasks
Tobias Staron | Özge Alaçam | Wolfgang Menzel
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
