2025
ScanEZ: Integrating Cognitive Models with Self-Supervised Learning for Spatiotemporal Scanpath Prediction
Ekta Sood | Prajit Dhar | Enrica Troiano | Rosy Southwell | Sidney K. D'Mello
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Accurately predicting human scanpaths during reading is vital for diverse fields and downstream tasks, from educational technologies to automatic question answering. To date, however, progress in this direction remains limited by scarce gaze data. We overcome the issue with ScanEZ, a self-supervised framework grounded in cognitive models of reading. ScanEZ jointly models the spatial and temporal dimensions of scanpaths by leveraging synthetic data and a 3-D gaze objective inspired by masked language modeling. With this framework, we provide evidence that two key factors in scanpath prediction during reading are: the use of masked modeling of both spatial and temporal patterns of eye movements, and cognitive model simulations as an inductive bias to kick-start training. Our approach achieves state-of-the-art results on established datasets (e.g., up to 31.4% negative log-likelihood improvement on CELER L1), and proves portable across different experimental conditions.
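A minimal sketch of a masked spatiotemporal gaze objective of the kind the abstract describes: fixations are (x, y, duration) triples, a fraction are masked, and a small transformer reconstructs them. The architecture, dimensions, masking ratio, and MSE loss here are illustrative assumptions, not the ScanEZ implementation (which reports a negative log-likelihood objective).

```python
import torch
import torch.nn as nn

class MaskedScanpathModel(nn.Module):
    """Toy masked-modeling objective over (x, y, duration) fixation triples."""
    def __init__(self, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(3, d_model)                    # 3-D fixation -> hidden state
        self.mask_token = nn.Parameter(torch.zeros(d_model))  # learned placeholder for hidden fixations
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 3)                     # reconstruct the 3-D fixation

    def forward(self, fixations, mask):
        # fixations: (B, T, 3); mask: (B, T) bool, True where the fixation is hidden
        h = self.embed(fixations)
        h = torch.where(mask.unsqueeze(-1), self.mask_token, h)
        return self.head(self.encoder(h))

model = MaskedScanpathModel()
fix = torch.rand(8, 50, 3)            # e.g. synthetic scanpaths from a cognitive model
mask = torch.rand(8, 50) < 0.15       # hide roughly 15% of fixations
loss = nn.functional.mse_loss(model(fix, mask)[mask], fix[mask])
loss.backward()
```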
2024
InteRead: An Eye Tracking Dataset of Interrupted Reading
Francesca Zermiani | Prajit Dhar | Ekta Sood | Fabian Kögel | Andreas Bulling | Maria Wirzberger
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Eye movements during reading offer a window into cognitive processes and language comprehension, but the scarcity of reading data with interruptions – which learners frequently encounter in their everyday learning environments – hampers advances in the development of intelligent learning technologies. We introduce InteRead – a novel 50-participant dataset of gaze data recorded during self-paced reading of real-world text. InteRead further offers fine-grained annotations of interruptions interspersed throughout the text as well as resumption lags incurred by these interruptions. Interruptions were triggered automatically once readers reached predefined target words. We validate our dataset by reporting interdisciplinary analyses on different measures of gaze behavior. In line with prior research, our analyses show that interruptions, as well as word length and word frequency effects, significantly impact eye movements during reading. We also explore individual differences within our dataset, shedding light on the potential for tailored educational solutions. InteRead is accessible from our datasets webpage: https://www.ife.uni-stuttgart.de/en/llis/research/datasets/.
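One quantity the abstract highlights is the resumption lag incurred by an interruption. Below is a hedged sketch of how such a lag could be computed from fixation timestamps; the column names, toy numbers, and the exact operational definition are assumptions for illustration, not taken from the InteRead release.

```python
import pandas as pd

# Toy fixation log: onset times and whether each fixation landed on the text.
fixations = pd.DataFrame({
    "t_start_ms": [0, 250, 480, 5480, 5750],
    "on_text":    [True, True, False, False, True],
})
interruption_end_ms = 5400  # moment the interrupting task was dismissed (assumed)

# Resumption lag: time from the end of the interruption until the first
# fixation that lands back on the text.
back_on_text = fixations[(fixations.t_start_ms >= interruption_end_ms) & fixations.on_text]
resumption_lag_ms = back_on_text.t_start_ms.iloc[0] - interruption_end_ms
print(resumption_lag_ms)    # 350 in this toy example
```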
2022
Video Language Co-Attention with Multimodal Fast-Learning Feature Fusion for VideoQA
Adnen Abdessaied | Ekta Sood | Andreas Bulling
Proceedings of the 7th Workshop on Representation Learning for NLP
We propose the Video Language Co-Attention Network (VLCN) – a novel memory-enhanced model for Video Question Answering (VideoQA). Our model combines two original contributions: a multi-modal fast-learning feature fusion (FLF) block and a mechanism that uses self-attended language features to separately guide neural attention on both static and dynamic visual features extracted from individual video frames and short video clips. When trained from scratch, VLCN achieves competitive results with the state of the art on both MSVD-QA and MSRVTT-QA with 38.06% and 36.01% test accuracies, respectively. Through an ablation study, we further show that FLF improves generalization across different VideoQA datasets and performance for question types that are notoriously challenging in current datasets, such as long questions that require deeper reasoning as well as questions with rare answers.
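The guidance mechanism described above can be pictured as attention over visual features scored against a question representation. The sketch below assumes a pooled, self-attended question vector and generic frame/clip features; the dimensions, scoring function, and pooling are illustrative, not the VLCN architecture or its FLF block.

```python
import torch
import torch.nn as nn

class LanguageGuidedAttention(nn.Module):
    """Scores visual features against a question vector and pools them."""
    def __init__(self, d=512):
        super().__init__()
        self.score = nn.Linear(2 * d, 1)

    def forward(self, q, v):
        # q: (B, d) pooled self-attended question; v: (B, N, d) visual features
        q_exp = q.unsqueeze(1).expand(-1, v.size(1), -1)
        alpha = torch.softmax(self.score(torch.cat([q_exp, v], dim=-1)).squeeze(-1), dim=1)
        return (alpha.unsqueeze(-1) * v).sum(dim=1)   # (B, d) attended visual summary

# Separate attention modules for static (frame) and dynamic (clip) features.
attend_static, attend_dynamic = LanguageGuidedAttention(), LanguageGuidedAttention()
q = torch.rand(4, 512)
frames, clips = torch.rand(4, 20, 512), torch.rand(4, 5, 512)
static_ctx, dynamic_ctx = attend_static(q, frames), attend_dynamic(q, clips)
```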
2021
VQA-MHUG: A Gaze Dataset to Study Multimodal Neural Attention in Visual Question Answering
Ekta Sood | Fabian Kögel | Florian Strohm | Prajit Dhar | Andreas Bulling
Proceedings of the 25th Conference on Computational Natural Language Learning
We present VQA-MHUG – a novel 49-participant dataset of multimodal human gaze on both images and questions during visual question answering (VQA), collected using a high-speed eye tracker. We use our dataset to analyze the similarity between human and neural attentive strategies learned by five state-of-the-art VQA models: Modular Co-Attention Network (MCAN) with either grid or region features, Pythia, Bilinear Attention Network (BAN), and the Multimodal Factorized Bilinear Pooling Network (MFB). While prior work has focused on studying the image modality, our analyses show – for the first time – that for all models, higher correlation with human attention on text is a significant predictor of VQA performance. This finding points to a potential for improving VQA performance and, at the same time, calls for further research on neural text attention mechanisms and their integration into architectures for vision and language tasks, including, but potentially also beyond, VQA.
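The analysis reported above hinges on quantifying how similar a model's attention over question tokens is to human gaze. A hedged sketch follows; the toy numbers are invented and Spearman's rank correlation is one plausible measure chosen for illustration, not necessarily the statistic used in the paper.

```python
import numpy as np
from scipy.stats import spearmanr

# Per-token attention over the question "what is the man holding ?"
human_attention = np.array([0.05, 0.05, 0.10, 0.30, 0.45, 0.05])  # e.g. normalized fixation durations
model_attention = np.array([0.08, 0.04, 0.12, 0.25, 0.43, 0.08])  # e.g. normalized attention weights

rho, p = spearmanr(human_attention, model_attention)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```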
2020
Interpreting Attention Models with Human Visual Attention in Machine Reading Comprehension
Ekta Sood | Simon Tannert | Diego Frassinelli | Andreas Bulling | Ngoc Thang Vu
Proceedings of the 24th Conference on Computational Natural Language Learning
While neural networks with attention mechanisms have achieved superior performance on many natural language processing tasks, it remains unclear to which extent learned attention resembles human visual attention. In this paper, we propose a new method that leverages eye-tracking data to investigate the relationship between human visual attention and neural attention in machine reading comprehension. To this end, we introduce a novel 23-participant eye-tracking dataset, MQA-RC, in which participants read movie plots and answered pre-defined questions. We compare state-of-the-art networks based on long short-term memory (LSTM), convolutional neural models (CNN) and XLNet Transformer architectures. We find that, for the LSTM and CNN models, higher similarity to human attention significantly correlates with performance. However, we show this relationship does not hold true for the XLNet models, despite the fact that XLNet performs best on this challenging task. Our results suggest that different architectures learn rather different neural attention strategies, and that similarity of neural to human attention does not guarantee the best performance.
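A hedged sketch of one way to compare human and neural attention on a passage: per-token reading times are turned into a human attention distribution and compared with a model's attention distribution. The token list, durations, and the KL divergence measure are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def to_distribution(weights, eps=1e-8):
    """Smooth a non-negative weight vector and normalize it to sum to one."""
    w = np.asarray(weights, dtype=float) + eps
    return w / w.sum()

tokens          = ["the", "detective", "finds", "the", "hidden", "letter"]
fixation_ms     = [80, 420, 260, 0, 390, 310]        # total gaze duration per token (toy values)
model_attention = [0.05, 0.30, 0.20, 0.02, 0.23, 0.20]

p = to_distribution(fixation_ms)      # human attention
q = to_distribution(model_attention)  # neural attention
kl = float(np.sum(p * np.log(p / q)))
print(f"KL(human || model) = {kl:.3f}")
```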
2018
Comparing Attention-Based Convolutional and Recurrent Neural Networks: Success and Limitations in Machine Reading Comprehension
Matthias Blohm | Glorianna Jagfeld | Ekta Sood | Xiang Yu | Ngoc Thang Vu
Proceedings of the 22nd Conference on Computational Natural Language Learning
We propose a machine reading comprehension model based on the compare-aggregate framework with two-staged attention that achieves state-of-the-art results on the MovieQA question answering dataset. To investigate the limitations of our model as well as the behavioral difference between convolutional and recurrent neural networks, we generate adversarial examples to confuse the model and compare to human performance. Furthermore, we assess the generalizability of our model by analyzing its differences to human inference, drawing upon insights from cognitive science.
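A hedged sketch of black-box adversarial example generation by word substitution, in the spirit of the probing the abstract mentions: single words are swapped for near-synonyms and the swap that most lowers the model's confidence in the correct answer is kept. The synonym table, scoring stub, and greedy search are illustrative assumptions, not necessarily the procedure used in the paper.

```python
import random

def perturb(tokens, synonyms, score_fn, trials=20, seed=0):
    """Return the single-word substitution that lowers score_fn the most."""
    rng = random.Random(seed)
    best_tokens, best_score = tokens, score_fn(tokens)
    for _ in range(trials):
        i = rng.randrange(len(tokens))
        if tokens[i] not in synonyms:
            continue
        candidate = tokens.copy()
        candidate[i] = rng.choice(synonyms[tokens[i]])
        s = score_fn(candidate)
        if s < best_score:
            best_tokens, best_score = candidate, s
    return best_tokens, best_score

# Toy usage with a stand-in scorer; a real run would query the trained MRC model.
synonyms = {"movie": ["film", "picture"], "kills": ["murders", "slays"]}
score = lambda toks: 1.0 - 0.1 * sum(t in {"film", "murders"} for t in toks)
print(perturb("the movie villain kills the hero".split(), synonyms, score))
```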