Özge Alaçam

Also published as: Ozge Alacam, Özge Alacam


MOTIF: Contextualized Images for Complex Words to Improve Human Reading
Xintong Wang | Florian Schneider | Özge Alacam | Prateek Chaudhury | Chris Biemann
Proceedings of the Thirteenth Language Resources and Evaluation Conference

MOTIF (MultimOdal ConTextualized Images For Language Learners) is a multimodal dataset that consists of 1125 comprehension texts retrieved from Wikipedia Simple Corpus. Allowing multimodal processing or enriching the context with multimodal information has proven imperative for many learning tasks, specifically for second language (L2) learning. In this respect, several traditional NLP approaches can assist L2 readers in text comprehension processes, such as simplifying text or giving dictionary descriptions for complex words. As nicely stated in the well-known proverb, sometimes “a picture is worth a thousand words” and an image can successfully complement the verbal message by enriching the representation, like in Pictionary books. This multimodal support can also assist on-the-fly text reading experience by providing a multimodal tool that chooses and displays the most relevant images for the difficult words, given the text context. This study mainly focuses on one of the key components to achieving this goal; collecting a multimodal dataset enriched with complex word annotation and validated image match.

The Why and The How: A Survey on Natural Language Interaction in Visualization
Henrik Voigt | Ozge Alacam | Monique Meuschke | Kai Lawonn | Sina Zarrieß
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Natural language as a modality of interaction is becoming increasingly popular in the field of visualization. In addition to the popular query interfaces, other language-based interactions such as annotations, recommendations, explanations, or documentation experience growing interest. In this survey, we provide an overview of natural language-based interaction in the research area of visualization. We discuss a renowned taxonomy of visualization tasks and classify 119 related works to illustrate the state-of-the-art of how current natural language interfaces support their performance. We examine applied NLP methods and discuss human-machine dialogue structures with a focus on initiative, duration, and communicative functions in recent visualization-oriented dialogue interfaces. Based on this overview, we point out interesting areas for the future application of NLP methods in the field of visualization.

Modeling Referential Gaze in Task-oriented Settings of Varying Referential Complexity
Özge Alacam | Eugen Ruppert | Sina Zarrieß | Ganeshan Malhotra | Chris Biemann | Sina Zarrieß
Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022

Referential gaze is a fundamental phenomenon for psycholinguistics and human-human communication. However, modeling referential gaze for real-world scenarios, e.g. for task-oriented communication, is lacking the well-deserved attention from the NLP community. In this paper, we address this challenging issue by proposing a novel multimodal NLP task; namely predicting when the gaze is referential. We further investigate how to model referential gaze and transfer gaze features to adapt to unseen situated settings that target different referential complexities than the training environment. We train (i) a sequential attention-based LSTM model and (ii) a multivariate transformer encoder architecture to predict whether the gaze is on a referent object. The models are evaluated on the three complexity datasets. The results indicate that the gaze features can be transferred not only among various similar tasks and scenes but also across various complexity levels. Taking the referential complexity of a scene into account is important for successful target prediction using gaze parameters especially when there is not much data for fine-tuning.

Exploring Semantic Spaces for Detecting Clustering and Switching in Verbal Fluency
Özge Alacam | Simeon Schüz | Martin Wegrzyn | Johanna Kißler | Sina Zarrieß
Proceedings of the 29th International Conference on Computational Linguistics

In this work, we explore the fitness of various word/concept representations in analyzing an experimental verbal fluency dataset providing human responses to 10 different category enumeration tasks. Based on human annotations of so-called clusters and switches between sub-categories in the verbal fluency sequences, we analyze whether lexical semantic knowledge represented in word embedding spaces (GloVe, fastText, ConceptNet, BERT) is suitable for detecting these conceptual clusters and switches within and across different categories. Our results indicate that ConceptNet embeddings, a distributional semantics method enriched with taxonomical relations, outperforms other semantic representations by a large margin. Moreover, category-specific analysis suggests that individual thresholds per category are more suited for the analysis of clustering and switching in particular embedding sub-space instead of a one-fits-all cross-category solution. The results point to interesting directions for future work on probing word embedding models on the verbal fluency task.


Towards Multi-Modal Text-Image Retrieval to improve Human Reading
Florian Schneider | Özge Alaçam | Xintong Wang | Chris Biemann
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop

In primary school, children’s books, as well as in modern language learning apps, multi-modal learning strategies like illustrations of terms and phrases are used to support reading comprehension. Also, several studies in educational psychology suggest that integrating cross-modal information will improve reading comprehension. We claim that state-of- he-art multi-modal transformers, which could be used in a language learner context to improve human reading, will perform poorly because of the short and relatively simple textual data those models are trained with. To prove our hypotheses, we collected a new multi-modal image-retrieval dataset based on data from Wikipedia. In an in-depth data analysis, we highlight the differences between our dataset and other popular datasets. Additionally, we evaluate several state-of-the-art multi-modal transformers on text-image retrieval on our dataset and analyze their meager results, which verify our claims.

Situation-Specific Multimodal Feature Adaptation
Özge Alacam
Proceedings of the First Workshop on Bridging Human–Computer Interaction and Natural Language Processing

In the next decade, we will see a considerable need for NLP models for situated settings where diversity of situations and also different modalities including eye-movements should be taken into account in order to grasp the intention of the user. However, language comprehension in situated settings can not be handled in isolation, where different multimodal cues are inherently present and essential parts of the situations. In this research proposal, we aim to quantify the influence of each modality in interaction with various referential complexities. We propose to encode the referential complexity of the situated settings in the embeddings during pre-training to implicitly guide the model to the most plausible situation-specific deviations. We summarize the challenges of intention extraction and propose a methodological approach to investigate a situation-specific feature adaptation to improve crossmodal mapping and meaning recovery from noisy communication settings.


Eye4Ref: A Multimodal Eye Movement Dataset of Referentially Complex Situations
Özge Alacam | Eugen Ruppert | Amr Rekaby Salama | Tobias Staron | Wolfgang Menzel
Proceedings of the Twelfth Language Resources and Evaluation Conference

Eye4Ref is a rich multimodal dataset of eye-movement recordings collected from referentially complex situated settings where the linguistic utterances and their visual referential world were available to the listener. It consists of not only fixation parameters but also saccadic movement parameters that are time-locked to accompanying German utterances (with English translations). Additionally, it also contains symbolic knowledge (contextual) representations of the images to map the referring expressions onto the objects in corresponding images. Overall, the data was collected from 62 participants in three different experimental setups (86 systematically controlled sentence–image pairs and 1844 eye-movement recordings). Referential complexity was controlled by visual manipulations (e.g. number of objects, visibility of the target items, etc.), and by linguistic manipulations (e.g., the position of the disambiguating word in a sentence). This multimodal dataset, in which the three different sources of information namely eye-tracking, language, and visual environment are aligned, offers a test of various research questions not from only language perspective but also computer vision.


Enhancing Natural Language Understanding through Cross-Modal Interaction: Meaning Recovery from Acoustically Noisy Speech
Ozge Alacam
Proceedings of the 22nd Nordic Conference on Computational Linguistics

Cross-modality between vision and language is a key component for effective and efficient communication, and human language processing mechanism successfully integrates information from various modalities to extract the intended meaning. However, incomplete linguistic input, i.e. due to a noisy environment, is one of the challenges for a successful communication. In that case, an incompleteness in one channel can be compensated by information from another one. In this paper, by conducting visual-world paradigm, we investigated the dynamics between syntactically possible gap fillers and the visual arrangements in incomplete German sentences and their effect on overall sentence interpretation.


Incorporating Contextual Information for Language-Independent, Dynamic Disambiguation Tasks
Tobias Staron | Özge Alaçam | Wolfgang Menzel
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Text Completion using Context-Integrated Dependency Parsing
Amr Rekaby Salama | Özge Alaçam | Wolfgang Menzel
Proceedings of the Third Workshop on Representation Learning for NLP

Incomplete linguistic input, i.e. due to a noisy environment, is one of the challenges that a successful communication system has to deal with. In this paper, we study text completion with a data set composed of sentences with gaps where a successful completion cannot be achieved through a uni-modal (language-based) approach. We present a solution based on a context-integrating dependency parser incorporating an additional non-linguistic modality. An incompleteness in one channel is compensated by information from another one and the parser learns the association between the two modalities from a multiple level knowledge representation. We examined several model variations by adjusting the degree of influence of different modalities in the decision making on possible filler words and their exact reference to a non-linguistic context element. Our model is able to fill the gap with 95.4% word and 95.2% exact reference accuracy hence the successful prediction can be achieved not only on the word level (such as mug) but also with respect to the correct identification of its context reference (such as mug 2 among several mug instances).