David Schlangen

2024

pdf abs
Conceptual Pacts for Reference Resolution Using Small, Dynamically Constructed Language Models: A Study in Puzzle Building Dialogues
Julian Hough | Sina Zarrieß | Casey Kennington | David Schlangen | Massimo Poesio
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Using Brennan and Clark’s theory of a Conceptual Pact, that when interlocutors agree on a name for an object, they are forming a temporary agreement on how to conceptualize that object, we present an extension to a simple reference resolver which simulates this process over time with different conversation pairs. In a puzzle construction domain, we model pacts with small language models for each referent which update during the interaction. When features from these pact models are incorporated into a simple bag-of-words reference resolver, the accuracy increases compared to using a standard pre-trained model. The model performs equally to a competitor using the same data but with exhaustive re-training after each prediction, while also being more transparent, faster and less resource-intensive. We also experiment with reducing the number of training interactions, and can still achieve reference resolution accuracies of over 80% in testing from observing a single previous interaction, over 20% higher than a pre-trained baseline. While this is a limited domain, we argue the model could be applicable to larger real-world applications in human and human-robot interaction and is an interpretable and transparent model.

pdf abs
Sharing the Cost of Success: A Game for Evaluating and Learning Collaborative Multi-Agent Instruction Giving and Following Policies
Philipp Sadler | Sherzod Hakimov | David Schlangen
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

In collaborative goal-oriented settings, the participants are not only interested in achieving a successful outcome, but do also implicitly negotiate the effort they put into the interaction (by adapting to each other). In this work, we propose a challenging interactive reference game that requires two players to coordinate on vision and language observations. The learning signal in this game is a score (given after playing) that takes into account the achieved goal and the players’ assumed efforts during the interaction. We show that a standard Proximal Policy Optimization (PPO) setup achieves a high success rate when bootstrapped with heuristic partner behaviors that implement insights from the analysis of human-human interactions. And we find that a pairing of neural partners indeed reduces the measured joint effort when playing together repeatedly. However, we observe that in comparison to a reasonable heuristic pairing there is still room for improvement—which invites further research in the direction of cost-sharing in collaborative interactions.

pdf abs
Evaluating Modular Dialogue System for Form Filling Using Large Language Models
Sherzod Hakimov | Yan Weiser | David Schlangen
Proceedings of the 1st Workshop on Simulating Conversational Intelligence in Chat (SCI-CHAT 2024)

This paper introduces a novel approach to form-filling and dialogue system evaluation by leveraging Large Language Models (LLMs). The proposed method establishes a setup wherein multiple modules collaborate on addressing the form-filling task. The dialogue system is constructed on top of LLMs, focusing on defining specific roles for individual modules. We show that using multiple independent sub-modules working cooperatively on this task can improve performance and handle the typical constraints of using LLMs, such as context limitations. The study involves testing the modular setup on four selected forms of varying topics and lengths, employing commercial and open-access LLMs. The experimental results demonstrate that the modular setup consistently outperforms the baseline, showcasing the effectiveness of this approach. Furthermore, our findings reveal that open-access models perform comparably to commercial models for the specified task.

pdf bib abs
Taking Action Towards Graceful Interaction: The Effects of Performing Actions on Modelling Policies for Instruction Clarification Requests
Brielen Madureira | David Schlangen
Proceedings of the Third Workshop on Understanding Implicit and Underspecified Language

Clarification requests are a mechanism to help solve communication problems, e.g. due to ambiguity or underspecification, in instruction-following interactions. Despite their importance, even skilful models struggle with producing or interpreting such repair acts. In this work, we test three hypotheses concerning the effects of action taking as an auxiliary task in modelling iCR policies. Contrary to initial expectations, we conclude that its contribution to learning an iCR policy is limited, but some information can still be extracted from prediction uncertainty. We present further evidence that even well-motivated, Transformer-based models fail to learn good policies for when to ask Instruction CRs (iCRs), while the task of determining what to ask about can be more successfully modelled. Considering the implications of these findings, we further discuss the shortcomings of the data-driven paradigm for learning meta-communication acts.

2023

pdf bib
Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue
Svetlana Stoyanchev | Shafiq Joty | David Schlangen | Ondrej Dusek | Casey Kennington | Malihe Alikhani
Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue

pdf abs
The Road to Quality is Paved with Good Revisions: A Detailed Evaluation Methodology for Revision Policies in Incremental Sequence Labelling
Brielen Madureira | Patrick Kahardipraja | David Schlangen
Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Incremental dialogue model components produce a sequence of output prefixes based on incoming input. Mistakes can occur due to local ambiguities or to wrong hypotheses, making the ability to revise past outputs a desirable property that can be governed by a policy. In this work, we formalise and characterise edits and revisions in incremental sequence labelling and propose metrics to evaluate revision policies. We then apply our methodology to profile the incremental behaviour of three Transformer-based encoders in various tasks, paving the road for better revision policies.

pdf abs
Revising with a Backward Glance: Regressions and Skips during Reading as Cognitive Signals for Revision Policies in Incremental Processing
Brielen Madureira | Pelin Çelikkol | David Schlangen
Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL)

In NLP, incremental processors produce output in instalments, based on incoming prefixes of the linguistic input. Some tokens trigger revisions, causing edits to the output hypothesis, but little is known about why models revise when they revise. A policy that detects the time steps where revisions should happen can improve efficiency. Still, retrieving a suitable signal to train a revision policy is an open problem, since it is not naturally available in datasets. In this work, we investigate the appropriateness of regressions and skips in human reading eye-tracking data as signals to inform revision policies in incremental sequence labelling. Using generalised mixed-effects models, we find that the probability of regressions and skips by humans can potentially serve as useful predictors for revisions in BiLSTMs and Transformer models, with consistent results for various languages.

pdf abs
Pento-DIARef: A Diagnostic Dataset for Learning the Incremental Algorithm for Referring Expression Generation from Examples
Philipp Sadler | David Schlangen
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

NLP tasks are typically defined extensionally through datasets containing example instantiations (e.g., pairs of image _i_ and text _t_), but motivated intensionally through capabilities invoked in verbal descriptions of the task (e.g., “_t_ is a description of _i_, for which the content of _i_ needs to be recognised and understood”).We present Pento-DIARef, a diagnostic dataset in a visual domain of puzzle pieces where referring expressions are generated by a well-known symbolic algorithm (the “Incremental Algorithm”),which itself is motivated by appeal to a hypothesised capability (eliminating distractors through application of Gricean maxims). Our question then is whether the extensional description (the dataset) is sufficient for a neural model to pick up the underlying regularity and exhibit this capability given the simple task definition of producing expressions from visual inputs. We find that a model supported by a vision detection step and a targeted data generation scheme achieves an almost perfect BLEU@1 score and sentence accuracy, whereas simpler baselines do not.

pdf abs
Instruction Clarification Requests in Multimodal Collaborative Dialogue Games: Tasks, and an Analysis of the CoDraw Dataset
Brielen Madureira | David Schlangen
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

In visual instruction-following dialogue games, players can engage in repair mechanisms in face of an ambiguous or underspecified instruction that cannot be fully mapped to actions in the world. In this work, we annotate Instruction Clarification Requests (iCRs) in CoDraw, an existing dataset of interactions in a multimodal collaborative dialogue game. We show that it contains lexically and semantically diverse iCRs being produced self-motivatedly by players deciding to clarify in order to solve the task successfully. With 8.8k iCRs found in 9.9k dialogues, CoDraw-iCR (v1) is a large spontaneous iCR corpus, making it a valuable resource for data-driven research on clarification in dialogue. We then formalise and provide baseline models for two tasks: Determining when to make an iCR and how to recognise them, in order to investigate to what extent these tasks are learnable from data.

pdf abs
TAPIR: Learning Adaptive Revision for Incremental Natural Language Understanding with a Two-Pass Model
Patrick Kahardipraja | Brielen Madureira | David Schlangen
Findings of the Association for Computational Linguistics: ACL 2023

Language is by its very nature incremental in how it is produced and processed. This property can be exploited by NLP systems to produce fast responses, which has been shown to be beneficial for real-time interactive applications. Recent neural network-based approaches for incremental processing mainly use RNNs or Transformers. RNNs are fast but monotonic (cannot correct earlier output, which can be necessary in incremental processing). Transformers, on the other hand, consume whole sequences, and hence are by nature non-incremental. A restart-incremental interface that repeatedly passes longer input prefixes can be used to obtain partial outputs, while providing the ability to revise. However, this method becomes costly as the sentence grows longer. In this work, we propose the Two-pass model for AdaPtIve Revision (TAPIR) and introduce a method to obtain an incremental supervision signal for learning an adaptive revision policy. Experimental results on sequence labelling show that our model has better incremental performance and faster inference speed compared to restart-incremental Transformers, while showing little degradation on full sequences.

pdf abs
Yes, this Way! Learning to Ground Referring Expressions into Actions with Intra-episodic Feedback from Supportive Teachers
Philipp Sadler | Sherzod Hakimov | David Schlangen
Findings of the Association for Computational Linguistics: ACL 2023

The ability to pick up on language signals in an ongoing interaction is crucial for future machine learning models to collaborate and interact with humans naturally. In this paper, we present an initial study that evaluates intra-episodic feedback given in a collaborative setting. We use a referential language game as a controllable example of a task-oriented collaborative joint activity. A teacher utters a referring expression generated by a well-known symbolic algorithm (the “Incremental Algorithm”) as an initial instruction and then monitors the follower’s actions to possibly intervene with intra-episodic feedback (which does not explicitly have to be requested). We frame this task as a reinforcement learning problem with sparse rewards and learn a follower policy for a heuristic teacher. Our results show that intra-episodic feedback allows the follower to generalize on aspects of scene complexity and performs better than providing only the initial statement.

pdf abs
Images in Language Space: Exploring the Suitability of Large Language Models for Vision & Language Tasks
Sherzod Hakimov | David Schlangen
Findings of the Association for Computational Linguistics: ACL 2023

Large language models have demonstrated robust performance on various language tasks using zero-shot or few-shot learning paradigms. While being actively researched, multimodal models that can additionally handle images as input have yet to catch up in size and generality with language-only models. In this work, we ask whether language-only models can be utilised for tasks that require visual input – but also, as we argue, often require a strong reasoning component. Similar to some recent related work, we make visual information accessible to the language model using separate verbalisation models. Specifically, we investigate the performance of open-source, open-access language models against GPT-3 on five vision-language tasks when given textually-encoded visual information. Our results suggest that language models are effective for solving vision-language tasks even with limited samples. This approach also enhances the interpretability of a model’s output by providing a means of tracing the output back through the verbalised image content.

pdf abs
On General Language Understanding
David Schlangen
Findings of the Association for Computational Linguistics: EMNLP 2023

Natural Language Processing prides itself to be an empirically-minded, if not outright empiricist field, and yet lately it seems to get itself into essentialist debates on issues of meaning and measurement (“Do Large Language Models Understand Language, And If So, How Much?”). This is not by accident: Here, as everywhere, the evidence underspecifies the understanding. As a remedy, this paper sketches the outlines of a model of understanding, which can ground questions of the adequacy of current methods of measurement of model quality. The paper makes three claims: A) That different language use situation types have different characteristics, B) That language understanding is a multifaceted phenomenon, bringing together individualistic and social processes, and C) That the choice of Understanding Indicator marks the limits of benchmarking, and the beginnings of considerations of the ethics of NLP use.

pdf abs
clembench: Using Game Play to Evaluate Chat-Optimized Language Models as Conversational Agents
Kranti Chalamalasetti | Jana Götze | Sherzod Hakimov | Brielen Madureira | Philipp Sadler | David Schlangen
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Recent work has proposed a methodology for the systematic evaluation of “Situated Language Understanding Agents” — agents that operate in rich linguistic and non-linguistic contexts — through testing them in carefully constructed interactive settings. Other recent work has argued that Large Language Models (LLMs), if suitably set up, can be understood as (simulators of) such agents. A connection suggests itself, which this paper explores: Can LLMs be evaluated meaningfully by exposing them to constrained game-like settings that are built to challenge specific capabilities? As a proof of concept, this paper investigates five interaction settings, showing that current chat-optimised LLMs are, to an extent, capable of following game-play instructions. Both this capability and the quality of the game play, measured by how well the objectives of the different games are met, follows the development cycle, with newer models generally performing better. The metrics even for the comparatively simple example games are far from being saturated, suggesting that the proposed instrument will remain to have diagnostic value.

2022

pdf abs
Can Visual Dialogue Models Do Scorekeeping? Exploring How Dialogue Representations Incrementally Encode Shared Knowledge
Brielen Madureira | David Schlangen
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Cognitively plausible visual dialogue models should keep a mental scoreboard of shared established facts in the dialogue context. We propose a theory-based evaluation method for investigating to what degree models pretrained on the VisDial dataset incrementally build representations that appropriately do scorekeeping. Our conclusion is that the ability to make the distinction between shared and privately known statements along the dialogue is moderately present in the analysed models, but not always incrementally consistent, which may partially be due to the limited need for grounding interactions in the original task.

pdf abs
The slurk Interaction Server Framework: Better Data for Better Dialog Models
Jana Götze | Maike Paetzel-Prüsmann | Wencke Liermann | Tim Diekmann | David Schlangen
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This paper presents the slurk software, a lightweight interaction server for setting up dialog data collections and running experiments. slurk enables a multitude of settings including text-based, speech and video interaction between two or more humans or humans and bots, and a multimodal display area for presenting shared or private interactive context. The software is implemented in Python with an HTML and JavaScript frontend that can easily be adapted to individual needs. It also provides a setup for pairing participants on common crowdworking platforms such as Amazon Mechanical Turk and some example bot scripts for common interaction scenarios.

pdf
Generating Coherent and Informative Descriptions for Groups of Visual Objects and Categories: A Simple Decoding Approach
Nazia Attari | David Schlangen | Martin Heckmann | Heiko Wersing | Sina Zarrieß
Proceedings of the 15th International Conference on Natural Language Generation

pdf
Generating Landmark-based Manipulation Instructions from Image Pairs
Sina Zarrieß | Henrik Voigt | David Schlangen | Philipp Sadler
Proceedings of the 15th International Conference on Natural Language Generation

pdf abs
Anaphoric Phenomena in Situated dialog: A First Round of Annotations
Sharid Loáiciga | Simon Dobnik | David Schlangen
Proceedings of the Fifth Workshop on Computational Models of Reference, Anaphora and Coreference

We present a first release of 500 documents from the multimodal corpus Tell-me-more (Ilinykh et al., 2019) annotated with coreference information according to the ARRAU guidelines (Poesio et al., 2021). The corpus consists of images and short texts of five sentences. We describe the annotation process and present the adaptations to the original guidelines in order to account for the challenges of grounding the annotations to the image. 50 documents from the 500 available are annotated by two people and used to estimate inter-annotator agreement (IAA) relying on Krippendorff’s alpha.

pdf abs
New or Old? Exploring How Pre-Trained Language Models Represent Discourse Entities
Sharid Loáiciga | Anne Beyer | David Schlangen
Proceedings of the 29th International Conference on Computational Linguistics

Recent research shows that pre-trained language models, built to generate text conditioned on some context, learn to encode syntactic knowledge to a certain degree. This has motivated researchers to move beyond the sentence-level and look into their ability to encode less studied discourse-level phenomena. In this paper, we add to the body of probing research by investigating discourse entity representations in large pre-trained language models in English. Motivated by early theories of discourse and key pieces of previous work, we focus on the information-status of entities as discourse-new or discourse-old. We present two probing models, one based on binary classification and another one on sequence labeling. The results of our experiments show that pre-trained language models do encode information on whether an entity has been introduced before or not in the discourse. However, this information alone is not sufficient to find the entities in a discourse, opening up interesting questions about the definition of entities for future work.

pdf abs
Norm Participation Grounds Language
David Schlangen
Proceedings of the 2022 CLASP Conference on (Dis)embodiment

The striking recent advances in eliciting seemingly meaningful language behaviour from language-only machine learning models have only made more apparent, through the surfacing of clear limitations, the need to go beyond the language-only mode and to ground these models “in the world”. Proposals for doing so vary in the details, but what unites them is that the solution is sought in the addition of non-linguistic data types such as images or video streams, while largely keeping the mode of learning constant. I propose a different, and more wide-ranging conception of how grounding should be understood: What grounds language is its normative nature. There are standards for doing things right, these standards are public and authoritative, while at the same time acceptance of authority can and must be disputed and negotiated, in interactions in which only bearers of normative status can rightfully participate. What grounds language, then, is the determined use that language users make of it, and what it is grounded in is the community of language users. I sketch this idea, and draw some conclusions for work on computational modelling of meaningful language use.

2021

pdf abs
Is Incoherence Surprising? Targeted Evaluation of Coherence Prediction from Language Models
Anne Beyer | Sharid Loáiciga | David Schlangen
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Coherent discourse is distinguished from a mere collection of utterances by the satisfaction of a diverse set of constraints, for example choice of expression, logical relation between denoted events, and implicit compatibility with world-knowledge. Do neural language models encode such constraints? We design an extendable set of test suites addressing different aspects of discourse and dialogue coherence. Unlike most previous coherence evaluation studies, we address specific linguistic devices beyond sentence order perturbations, which allow for a more fine-grained analysis of what constitutes coherence and what neural models trained on a language modelling objective are capable of encoding. Extending the targeted evaluation paradigm for neural language models (Marvin and Linzen, 2018) to phenomena beyond syntax, we show that this paradigm is equally suited to evaluate linguistic qualities that contribute to the notion of coherence.

pdf abs
Towards Incremental Transformers: An Empirical Analysis of Transformer Models for Incremental NLU
Patrick Kahardipraja | Brielen Madureira | David Schlangen
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Incremental processing allows interactive systems to respond based on partial inputs, which is a desirable property e.g. in dialogue agents. The currently popular Transformer architecture inherently processes sequences as a whole, abstracting away the notion of time. Recent work attempts to apply Transformers incrementally via restart-incrementality by repeatedly feeding, to an unchanged model, increasingly longer input prefixes to produce partial outputs. However, this approach is computationally costly and does not scale efficiently for long sequences. In parallel, we witness efforts to make Transformers more efficient, e.g. the Linear Transformer (LT) with a recurrence mechanism. In this work, we examine the feasibility of LT for incremental NLU in English. Our results show that the recurrent LT model has better incremental performance and faster inference speed compared to the standard Transformer and LT with restart-incrementality, at the cost of part of the non-incremental (full sequence) quality. We show that the performance drop can be mitigated by training the model to wait for right context before committing to an output and that training with input prefixes is beneficial for delivering correct partial outputs.

pdf abs
Reference and coreference in situated dialogue
Sharid Loáiciga | Simon Dobnik | David Schlangen
Proceedings of the Second Workshop on Advances in Language and Vision Research

In recent years several corpora have been developed for vision and language tasks. We argue that there is still significant room for corpora that increase the complexity of both visual and linguistic domains and which capture different varieties of perceptual and conversational contexts. Working with two corpora approaching this goal, we present a linguistic perspective on some of the challenges in creating and extending resources combining language and vision while preserving continuity with the existing best practices in the area of coreference annotation.

pdf abs
Space Efficient Context Encoding for Non-Task-Oriented Dialogue Generation with Graph Attention Transformer
Fabian Galetzka | Jewgeni Rose | David Schlangen | Jens Lehmann
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

To improve the coherence and knowledge retrieval capabilities of non-task-oriented dialogue systems, recent Transformer-based models aim to integrate fixed background context. This often comes in the form of knowledge graphs, and the integration is done by creating pseudo utterances through paraphrasing knowledge triples, added into the accumulated dialogue context. However, the context length is fixed in these architectures, which restricts how much background or dialogue context can be kept. In this work, we propose a more concise encoding for background context structured in the form of knowledge graphs, by expressing the graph connections through restrictions on the attention weights. The results of our human evaluation show that this encoding reduces space requirements without negative effects on the precision of reproduction of knowledge and perceived consistency. Further, models trained with our proposed context encoding generate dialogues that are judged to be more comprehensive and interesting.

pdf abs
Targeting the Benchmark: On Methodology in Current Natural Language Processing Research
David Schlangen
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

It has become a common pattern in our field: One group introduces a language task, exemplified by a dataset, which they argue is challenging enough to serve as a benchmark. They also provide a baseline model for it, which then soon is improved upon by other groups. Often, research efforts then move on, and the pattern repeats itself. What is typically left implicit is the argumentation for why this constitutes progress, and progress towards what. In this paper, we try to step back for a moment from this pattern and work out possible argumentations and their parts.

pdf abs
Annotating anaphoric phenomena in situated dialogue
Sharid Loáiciga | Simon Dobnik | David Schlangen
Proceedings of the 1st Workshop on Multimodal Semantic Representations (MMSR)

In recent years several corpora have been developed for vision and language tasks. With this paper, we intend to start a discussion on the annotation of referential phenomena in situated dialogue. We argue that there is still significant room for corpora that increase the complexity of both visual and linguistic domains and which capture different varieties of perceptual and conversational contexts. In addition, a rich annotation scheme covering a broad range of referential phenomena and compatible with the textual task of coreference resolution is necessary in order to take the most advantage of these corpora. Consequently, there are several open questions regarding the semantics of reference and annotation, and the extent to which standard textual coreference accounts for the situated dialogue genre. Working with two corpora on situated dialogue, we present our extension to the ARRAU (Uryupina et al., 2020) annotation scheme in order to start this discussion.

pdf abs
Incremental Unit Networks for Multimodal, Fine-grained Information State Representation
Casey Kennington | David Schlangen
Proceedings of the 1st Workshop on Multimodal Semantic Representations (MMSR)

We offer a fine-grained information state annotation scheme that follows directly from the Incremental Unit abstract model of dialogue processing when used within a multimodal, co-located, interactive setting. We explain the Incremental Unit model and give an example application using the Localized Narratives dataset, then offer avenues for future research.

2020

pdf abs
A Corpus of Controlled Opinionated and Knowledgeable Movie Discussions for Training Neural Conversation Models
Fabian Galetzka | Chukwuemeka Uchenna Eneh | David Schlangen
Proceedings of the Twelfth Language Resources and Evaluation Conference

Fully data driven Chatbots for non-goal oriented dialogues are known to suffer from inconsistent behaviour across their turns, stemming from a general difficulty in controlling parameters like their assumed background personality and knowledge of facts. One reason for this is the relative lack of labeled data from which personality consistency and fact usage could be learned together with dialogue behaviour. To address this, we introduce a new labeled dialogue dataset in the domain of movie discussions, where every dialogue is based on pre-specified facts and opinions. We thoroughly validate the collected dialogue for adherence of the participants to their given fact and opinion profile, and find that the general quality in this respect is high. This process also gives us an additional layer of annotation that is potentially useful for training models. We introduce as a baseline an end-to-end trained self-attention decoder model trained on this data and show that it is able to generate opinionated responses that are judged to be natural and knowledgeable and show attentiveness.

pdf abs
From “Before” to “After”: Generating Natural Language Instructions from Image Pairs in a Simple Visual Domain
Robin Rojowiec | Jana Götze | Philipp Sadler | Henrik Voigt | Sina Zarrieß | David Schlangen
Proceedings of the 13th International Conference on Natural Language Generation

While certain types of instructions can be com-pactly expressed via images, there are situations where one might want to verbalise them, for example when directing someone. We investigate the task of Instruction Generation from Before/After Image Pairs which is to derive from images an instruction for effecting the implied change. For this, we make use of prior work on instruction following in a visual environment. We take an existing dataset, the BLOCKS data collected by Bisk et al. (2016) and investigate whether it is suitable for training an instruction generator as well. We find that it is, and investigate several simple baselines, taking these from the related task of image captioning. Through a series of experiments that simplify the task (by making image processing easier or completely side-stepping it; and by creating template-based targeted instructions), we investigate areas for improvement. We find that captioning models get some way towards solving the task, but have some difficulty with it, and future improvements must lie in the way the change is detected in the instruction.

pdf abs
Incremental Processing in the Age of Non-Incremental Encoders: An Empirical Assessment of Bidirectional Models for Incremental NLU
Brielen Madureira | David Schlangen
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

While humans process language incrementally, the best language encoders currently used in NLP do not. Both bidirectional LSTMs and Transformers assume that the sequence that is to be encoded is available in full, to be processed either forwards and backwards (BiLSTMs) or as a whole (Transformers). We investigate how they behave under incremental interfaces, when partial output must be provided based on partial input seen up to a certain time step, which may happen in interactive systems. We test five models on various NLU datasets and compare their performance using three incremental evaluation metrics. The results support the possibility of using bidirectional encoders in incremental mode while retaining most of their non-incremental quality. The “omni-directional” BERT model, which achieves better non-incremental performance, is impacted more by the incremental access. This can be alleviated by adapting the training regime (truncated training), or the testing procedure, by delaying the output until some right context is available or by incorporating hypothetical right contexts generated by a language model like GPT-2.

pdf bib
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Qun Liu | David Schlangen
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

2019

pdf abs
Natural Language Semantics With Pictures: Some Language & Vision Datasets and Potential Uses for Computational Semantics
David Schlangen
Proceedings of the 13th International Conference on Computational Semantics - Long Papers

Propelling, and propelled by, the “deep learning revolution”, recent years have seen the introduction of ever larger corpora of images annotated with natural language expressions. We survey some of these corpora, taking a perspective that reverses the usual directionality, as it were, by viewing the images as semantic annotation of the natural language expressions. We discuss datasets that can be derived from the corpora, and tasks of potential interest for computational semanticists that can be defined on those. In this, we make use of relations provided by the corpora (namely, the link between expression and image, and that between two expressions linked to the same image) and relations that we can add (similarity relations between expressions, or between images). Specifically, we show that in this way we can create data that can be used to learn and evaluate lexical and compositional grounded semantics, and we show that the “linked to same image” relation tracks a semantic implication relation that is recognisable to annotators even in the absence of the linking image as evidence. Finally, as an example of possible benefits of this approach, we show that an exemplar-model-based approach to implication beats a (simple) distributional space-based one on some derived datasets, while lending itself to explainability.

pdf abs
From Explainability to Explanation: Using a Dialogue Setting to Elicit Annotations with Justifications
Nazia Attari | Martin Heckmann | David Schlangen
Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue

Despite recent attempts in the field of explainable AI to go beyond black box prediction models, typically already the training data for supervised machine learning is collected in a manner that treats the annotator as a “black box”, the internal workings of which remains unobserved. We present an annotation method where a task is given to a pair of annotators who collaborate on finding the best response. With this we want to shed light on the questions if the collaboration increases the quality of the responses and if this “thinking together” provides useful information in itself, as it at least partially reveals their reasoning steps. Furthermore, we expect that this setting puts the focus on explanation as a linguistic act, vs. explainability as a property of models. In a crowd-sourcing experiment, we investigated three different annotation tasks, each in a collaborative dialogical (two annotators) and monological (one annotator) setting. Our results indicate that our experiment elicits collaboration and that this collaboration increases the response accuracy. We see large differences in the annotators’ behavior depending on the task. Similarly, we also observe that the dialog patterns emerging from the collaboration vary significantly with the task.

pdf abs
Tell Me More: A Dataset of Visual Scene Description Sequences
Nikolai Ilinykh | Sina Zarrieß | David Schlangen
Proceedings of the 12th International Conference on Natural Language Generation

We present a dataset consisting of what we call image description sequences, which are multi-sentence descriptions of the contents of an image. These descriptions were collected in a pseudo-interactive setting, where the describer was told to describe the given image to a listener who needs to identify the image within a set of images, and who successively asks for more information. As we show, this setup produced nicely structured data that, we think, will be useful for learning models capable of planning and realising such description discourses.

pdf abs
Can Neural Image Captioning be Controlled via Forced Attention?
Philipp Sadler | Tatjana Scheffler | David Schlangen
Proceedings of the 12th International Conference on Natural Language Generation

Learned dynamic weighting of the conditioning signal (attention) has been shown to improve neural language generation in a variety of settings. The weights applied when generating a particular output sequence have also been viewed as providing a potentially explanatory insight in the internal workings of the generator. In this paper, we reverse the direction of this connection and ask whether through the control of the attention of the model we can control its output. Specifically, we take a standard neural image captioning model that uses attention, and fix the attention to predetermined areas in the image. We evaluate whether the resulting output is more likely to mention the class of the object in that area than the normally generated caption. We introduce three effective methods to control the attention and find that these are producing expected results in up to 27.43% of the cases.

pdf abs
Know What You Don’t Know: Modeling a Pragmatic Speaker that Refers to Objects of Unknown Categories
Sina Zarrieß | David Schlangen
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Zero-shot learning in Language & Vision is the task of correctly labelling (or naming) objects of novel categories. Another strand of work in L&V aims at pragmatically informative rather than “correct” object descriptions, e.g. in reference games. We combine these lines of research and model zero-shot reference games, where a speaker needs to successfully refer to a novel object in an image. Inspired by models of “rational speech acts”, we extend a neural generator to become a pragmatic speaker reasoning about uncertain object categories. As a result of this reasoning, the generator produces fewer nouns and names of distractor categories as compared to a literal speaker. We show that this conversational strategy for dealing with novel objects often improves communicative success, in terms of resolution accuracy of an automatic listener.

2018

pdf abs
The Task Matters: Comparing Image Captioning and Task-Based Dialogical Image Description
Nikolai Ilinykh | Sina Zarrieß | David Schlangen
Proceedings of the 11th International Conference on Natural Language Generation

Image captioning models are typically trained on data that is collected from people who are asked to describe an image, without being given any further task context. As we argue here, this context independence is likely to cause problems for transferring to task settings in which image description is bound by task demands. We demonstrate that careful design of data collection is required to obtain image descriptions which are contextually bounded to a particular meta-level task. As a task, we use MeetUp!, a text-based communication game where two players have the goal of finding each other in a visual environment. To reach this goal, the players need to describe images representing their current location. We analyse a dataset from this domain and show that the nature of image descriptions found in MeetUp! is diverse, dynamic and rich with phenomena that are not present in descriptions obtained through a simple image captioning task, which we ran for comparison.

pdf abs
Decoding Strategies for Neural Referring Expression Generation
Sina Zarrieß | David Schlangen
Proceedings of the 11th International Conference on Natural Language Generation

RNN-based sequence generation is now widely used in NLP and NLG (natural language generation). Most work focusses on how to train RNNs, even though also decoding is not necessarily straightforward: previous work on neural MT found seq2seq models to radically prefer short candidates, and has proposed a number of beam search heuristics to deal with this. In this work, we assess decoding strategies for referring expression generation with neural models. Here, expression length is crucial: output should neither contain too much or too little information, in order to be pragmatically adequate. We find that most beam search heuristics developed for MT do not generalize well to referring expression generation (REG), and do not generally outperform greedy decoding. We observe that beam search heuristics for termination seem to override the model’s knowledge of what a good stopping point is. Therefore, we also explore a recent approach called trainable decoding, which uses a small network to modify the RNN’s hidden state for better decoding results. We find this approach to consistently outperform greedy decoding for REG.

pdf abs
Being data-driven is not enough: Revisiting interactive instruction giving as a challenge for NLG
Sina Zarrieß | David Schlangen
Proceedings of the Workshop on NLG for Human–Robot Interaction

Modeling traditional NLG tasks with data-driven techniques has been a major focus of research in NLG in the past decade. We argue that existing modeling techniques are mostly tailored to textual data and are not sufficient to make NLG technology meet the requirements of agents which target fluid interaction and collaboration in the real world. We revisit interactive instruction giving as a challenge for datadriven NLG and, based on insights from previous GIVE challenges, propose that instruction giving should be addressed in a setting that involves visual grounding and spoken language. These basic design decisions will require NLG frameworks that are capable of monitoring their environment as well as timing and revising their verbal output. We believe that these are core capabilities for making NLG technology transferrable to interactive systems.

pdf
A Corpus of Natural Multimodal Spatial Scene Descriptions
Ting Han | David Schlangen
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf abs
Deriving continous grounded meaning representations from referentially structured multimodal contexts
Sina Zarrieß | David Schlangen
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Corpora of referring expressions paired with their visual referents are a good source for learning word meanings directly grounded in visual representations. Here, we explore additional ways of extracting from them word representations linked to multi-modal context: through expressions that refer to the same object, and through expressions that refer to different objects in the same scene. We show that continuous meaning representations derived from these contexts capture complementary aspects of similarity, , even if not outperforming textual embeddings trained on very large amounts of raw text when tested on standard similarity benchmarks. We propose a new task for evaluating grounded meaning representations—detection of potentially co-referential phrases—and show that it requires precise denotational representations of attribute meanings, which our method provides.

pdf abs
Joint, Incremental Disfluency Detection and Utterance Segmentation from Speech
Julian Hough | David Schlangen
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

We present the joint task of incremental disfluency detection and utterance segmentation and a simple deep learning system which performs it on transcripts and ASR results. We show how the constraints of the two tasks interact. Our joint-task system outperforms the equivalent individual task systems, provides competitive results and is suitable for future use in conversation agents in the psychiatric domain.

pdf abs
Is this a Child, a Girl or a Car? Exploring the Contribution of Distributional Similarity to Learning Referential Word Meanings
Sina Zarrieß | David Schlangen
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

There has recently been a lot of work trying to use images of referents of words for improving vector space meaning representations derived from text. We investigate the opposite direction, as it were, trying to improve visual word predictors that identify objects in images, by exploiting distributional similarity information during training. We show that for certain words (such as entry-level nouns or hypernyms), we can indeed learn better referential word meanings by taking into account their semantic similarity to other words. For other words, there is no or even a detrimental effect, compared to a learning setup that presents even semantically related objects as negative instances.

pdf abs
Grounding Language by Continuous Observation of Instruction Following
Ting Han | David Schlangen
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

Grounded semantics is typically learnt from utterance-level meaning representations (e.g., successful database retrievals, denoted objects in images, moves in a game). We explore learning word and utterance meanings by continuous observation of the actions of an instruction follower (IF). While an instruction giver (IG) provided a verbal description of a configuration of objects, IF recreated it using a GUI. Aligning these GUI actions to sub-utterance chunks allows a simple maximum entropy model to associate them as chunk meaning better than just providing it with the utterance-final configuration. This shows that semantics useful for incremental (word-by-word) application, as required in natural dialogue, might also be better acquired from incremental settings.

pdf abs
Refer-iTTS: A System for Referring in Spoken Installments to Objects in Real-World Images
Sina Zarrieß | M. Soledad López Gambino | David Schlangen
Proceedings of the 10th International Conference on Natural Language Generation

Current referring expression generation systems mostly deliver their output as one-shot, written expressions. We present on-going work on incremental generation of spoken expressions referring to objects in real-world images. This approach extends upon previous work using the words-as-classifier model for generation. We implement this generator in an incremental dialogue processing framework such that we can exploit an existing interface to incremental text-to-speech synthesis. Our system generates and synthesizes referring expressions while continuously observing non-verbal user reactions.

pdf abs
Beyond On-hold Messages: Conversational Time-buying in Task-oriented Dialogue
Soledad López Gambino | Sina Zarrieß | David Schlangen
Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue

A common convention in graphical user interfaces is to indicate a “wait state”, for example while a program is preparing a response, through a changed cursor state or a progress bar. What should the analogue be in a spoken conversational system? To address this question, we set up an experiment in which a human information provider (IP) was given their information only in a delayed and incremental manner, which systematically created situations where the IP had the turn but could not provide task-related information. Our data analysis shows that 1) IPs bridge the gap until they can provide information by re-purposing a whole variety of task- and grounding-related communicative actions (e.g. echoing the user’s request, signaling understanding, asserting partially relevant information), rather than being silent or explicitly asking for time (e.g. “please wait”), and that 2) IPs combined these actions productively to ensure an ongoing conversation. These results, we argue, indicate that natural conversational interfaces should also be able to manage their time flexibly using a variety of conversational resources.

pdf abs
Natural Language Informs the Interpretation of Iconic Gestures: A Computational Approach
Ting Han | Julian Hough | David Schlangen
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

When giving descriptions, speakers often signify object shape or size with hand gestures. Such so-called ‘iconic’ gestures represent their meaning through their relevance to referents in the verbal content, rather than having a conventional form. The gesture form on its own is often ambiguous, and the aspect of the referent that it highlights is constrained by what the language makes salient. We show how the verbal content guides gesture interpretation through a computational model that frames the task as a multi-label classification task that maps multimodal utterances to semantic categories, using annotated human-human data.

pdf abs
Draw and Tell: Multimodal Descriptions Outperform Verbal- or Sketch-Only Descriptions in an Image Retrieval Task
Ting Han | David Schlangen
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

While language conveys meaning largely symbolically, actual communication acts typically contain iconic elements as well: People gesture while they speak, or may even draw sketches while explaining something. Image retrieval prima facie seems like a task that could profit from combined symbolic and iconic reference, but it is typically set up to work either from language only, or via (iconic) sketches with no verbal contribution. Using a model of grounded language semantics and a model of sketch-to-image mapping, we show that adding even very reduced iconic information to a verbal image description improves recall. Verbal descriptions paired with fully detailed sketches still perform better than these sketches alone. We see these results as supporting the assumption that natural user interfaces should respond to multimodal input, where possible, rather than just language alone.

pdf abs
Obtaining referential word meanings from visual and distributional information: Experiments on object naming
Sina Zarrieß | David Schlangen
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We investigate object naming, which is an important sub-task of referring expression generation on real-world images. As opposed to mutually exclusive labels used in object recognition, object names are more flexible, subject to communicative preferences and semantically related to each other. Therefore, we investigate models of referential word meaning that link visual to lexical information which we assume to be given through distributional word embeddings. We present a model that learns individual predictors for object names that link visual and distributional aspects of word meaning during training. We show that this is particularly beneficial for zero-shot learning, as compared to projecting visual objects directly into the distributional space. In a standard object naming task, we find that different ways of combining lexical and visual information achieve very similar performance, though experiments on model combination suggest that they capture complementary aspects of referential meaning.

2016

pdf
Real-Time Understanding of Complex Discriminative Scene Descriptions
Ramesh Manuvinakurike | Casey Kennington | David DeVault | David Schlangen
Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue

pdf
Supporting Spoken Assistant Systems with a Graphical User Interface that Signals Incremental Understanding and Prediction State
Casey Kennington | David Schlangen
Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue

pdf
Toward incremental dialogue act segmentation in fast-paced interactive dialogue systems
Ramesh Manuvinakurike | Maike Paetzel | Cheng Qu | David Schlangen | David DeVault
Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue

pdf
Investigating Fluidity for Human-Robot Interaction with Real-time, Real-world Grounding Strategies
Julian Hough | David Schlangen
Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue

pdf
Towards Generating Colour Terms for Referents in Photographs: Prefer the Expected or the Unexpected?
Sina Zarrieß | David Schlangen
Proceedings of the 9th International Natural Language Generation conference

pdf
Easy Things First: Installments Improve Referring Expression Generation for Objects in Photographs
Sina Zarrieß | David Schlangen
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf
Resolving References to Objects in Photographs using the Words-As-Classifiers Model
David Schlangen | Sina Zarrieß | Casey Kennington
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

PentoRef is a corpus of task-oriented dialogues collected in systematically manipulated settings. The corpus is multilingual, with English and German sections, and overall comprises more than 20000 utterances. The dialogues are fully transcribed and annotated with referring expressions mapped to objects in corresponding visual scenes, which makes the corpus a rich resource for research on spoken referring expressions in generation and resolution. The corpus includes several sub-corpora that correspond to different dialogue situations where parameters related to interactivity, visual access, and verbal channel have been manipulated in systematic ways. The corpus thus lends itself to very targeted studies of reference in spontaneous dialogue.

pdf abs
DUEL: A Multi-lingual Multimodal Dialogue Corpus for Disfluency, Exclamations and Laughter
Julian Hough | Ye Tian | Laura de Ruiter | Simon Betz | Spyros Kousidis | David Schlangen | Jonathan Ginzburg
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present the DUEL corpus, consisting of 24 hours of natural, face-to-face, loosely task-directed dialogue in German, French and Mandarin Chinese. The corpus is uniquely positioned as a cross-linguistic, multimodal dialogue resource controlled for domain. DUEL includes audio, video and body tracking data and is transcribed and annotated for disfluency, laughter and exclamations.

In order to explore intuitive verbal and non-verbal interfaces in smart environments we recorded user interactions with an intelligent apartment. Besides offering various interactive capabilities itself, the apartment is also inhabited by a social robot that is available as a humanoid interface. This paper presents a multi-modal corpus that contains goal-directed actions of naive users in attempts to solve a number of predefined tasks. Alongside audio and video recordings, our data-set consists of large amount of temporally aligned sensory data and system behavior provided by the environment and its interactive components. Non-verbal system responses such as changes in light or display contents, as well as robot and apartment utterances and gestures serve as a rich basis for later in-depth analysis. Manual annotations provide further information about meta data like the current course of study and user behavior including the incorporated modality, all literal utterances, language features, emotional expressions, foci of attention, and addressees.

David Schlangen

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2005

2004

2003

Co-authors

Venues