Proceedings of the Second Workshop on Beyond Vision and LANguage: inTEgrating Real-world kNowledge (LANTERN)
In this paper, we explore the potential benefits of leveraging eye-tracking information for dependency parsing on the English part of the Dundee corpus. To achieve this, we cast dependency parsing as a sequence labelling task and then augment the neural model for sequence labelling with eye-tracking features. We also augment a graph-based parser with eye-tracking features and parse the Dundee Corpus to corroborate our findings from the sequence labelling parser. We then experiment with a variety of parser setups ranging from parsing with all features to a delexicalized parser. Our experiments show that for a parser with all features, although the improvements are positive for the LAS score they are not significant whereas our delexicalized parser significantly outperforms the baseline we established. We also analyze the contribution of various eye-tracking features towards the different parser setups and find that eye-tracking features contain information which is complementary in nature, thus implying that augmenting the parser with various gaze features grouped together provides better performance than any individual gaze feature.
Generating images from textual descriptions has recently attracted a lot of interest. While current models can generate photo-realistic images of individual objects such as birds and human faces, synthesising images with multiple objects is still very difficult. In this paper, we propose an effective way to combine Text-to-Image (T2I) synthesis with Visual Question Answering (VQA) to improve the image quality and image-text alignment of generated images by leveraging the VQA 2.0 dataset. We create additional training samples by concatenating question and answer (QA) pairs and employ a standard VQA model to provide the T2I model with an auxiliary learning signal. We encourage images generated from QA pairs to look realistic and additionally minimize an external VQA loss. Our method lowers the FID from 27.84 to 25.38 and increases the R-prec. from 83.82% to 84.79% when compared to the baseline, which indicates that T2I synthesis can successfully be improved using a standard VQA model.
Despite recent achievements in natural language understanding, reasoning over commonsense knowledge still represents a big challenge to AI systems. As the name suggests, common sense is related to perception and as such, humans derive it from experience rather than from literary education. Recent works in the NLP and the computer vision field have made the effort of making such knowledge explicit using written language and visual inputs, respectively. Our premise is that the latter source fits better with the characteristics of commonsense acquisition. In this work, we explore to what extent the descriptions of real-world scenes are sufficient to learn common sense about different daily situations, drawing upon visual information to answer script knowledge questions.
Evaluations of image description systems are typically domain-general: generated descriptions for the held-out test images are either compared to a set of reference descriptions (using automated metrics), or rated by human judges on one or more Likert scales (for fluency, overall quality, and other quality criteria). While useful, these evaluations do not tell us anything about the kinds of image descriptions that systems are able to produce. Or, phrased differently, these evaluations do not tell us anything about the cognitive capabilities of image description systems. This paper proposes a different kind of assessment, that is able to quantify the extent to which these systems are able to describe humans. This assessment is based on a manual characterisation (a context-free grammar) of English entity labels in the PEOPLE domain, to determine the range of possible outputs. We examined 9 systems to see what kinds of labels they actually use. We found that these systems only use a small subset of at most 13 different kinds of modifiers (e.g. tall and short modify HEIGHT, sad and happy modify MOOD), but 27 kinds of modifiers are never used. Future research could study these semantic dimensions in more detail.