David M. Howcroft

Also published as: David Howcroft

2025

pdf bib abs
Participatory Design for Positive Impact: Behind the Scenes of Three NLP Projects
Marianne Wilson | David M. Howcroft | Ioannis Konstas | Dimitra Gkatzia | Gavin Abercrombie
Proceedings of the Fourth Workshop on NLP for Positive Impact (NLP4PI)

Researchers in Natural Language Processing (NLP) are increasingly adopting participatory design (PD) principles to better achieve positive outcomes for stakeholders. This paper evaluates two PD perspectives proposed by Delgado et al. (2023) and Caselli et al. (2021) as interpretive and planning tools for NLP research. We reflect on our experiences adopting PD practices in three NLP projects that aim to create positive impact for different communities, and that span different domains and stages of NLP research. We assess how our projects align with PD goals and use these perspectives to identify the benefits and challenges of PD in NLP research. Our findings suggest that, while Caselli et al. (2021) and Delgado et al. (2023) provide valuable guidance, their application in research can be hindered by existing NLP practices, funding structures, and limited access to stakeholders. We propose that researchers adapt their PD praxis to the circumstances of specific projects and communities, using them as flexible guides rather than rigid prescriptions.

2024

pdf bib abs
Exploring the impact of data representation on neural data-to-text generation
David M. Howcroft | Lewis N. Watson | Olesia Nedopas | Dimitra Gkatzia
Proceedings of the 17th International Natural Language Generation Conference

A relatively under-explored area in research on neural natural language generation is the impact of the data representation on text quality. Here we report experiments on two leading input representations for data-to-text generation: attribute-value pairs and Resource Description Framework (RDF) triples. Evaluating the performance of encoder-decoder seq2seq models as well as recent large language models (LLMs) with both automated metrics and human evaluation, we find that the input representation does not seem to have a large impact on the performance of either purpose-built seq2seq models or LLMs. Finally, we present an error analysis of the texts generated by the LLMs and provide some insights into where these models fail.

Automatic metrics are extensively used to evaluate Natural Language Processing systems. However, there has been increasing focus on how the are used and reported by practitioners within the field. In this paper, we have conducted a survey on the use of automatic metrics, focusing particularly on natural language generation tasks. We inspect which metrics are used as well as why they are chosen and how their use is reported. Our findings from this survey reveal significant shortcomings, including inappropriate metric usage, lack of implementation details and missing correlations with human judgements. We conclude with recommendations that we believe authors should follow to enable more rigour within the field.

pdf bib
Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue
Tatsuya Kawahara | Vera Demberg | Stefan Ultes | Koji Inoue | Shikib Mehri | David Howcroft | Kazunori Komatani
Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue

2023

pdf bib abs
Building a dual dataset of text- and image-grounded conversations and summarisation in Gàidhlig (Scottish Gaelic)
David M. Howcroft | William Lamb | Anna Groundwater | Dimitra Gkatzia
Proceedings of the 16th International Natural Language Generation Conference

Gàidhlig (Scottish Gaelic; gd) is spoken by about 57k people in Scotland, but remains an under-resourced language with respect to natural language processing in general and natural language generation (NLG) in particular. To address this gap, we developed the first datasets for Scottish Gaelic NLG, collecting both conversational and summarisation data in a single setting. Our task setup involves dialogues between a pair of speakers discussing museum exhibits, grounding the conversation in images and texts. Then, both interlocutors summarise the dialogue resulting in a secondary dialogue summarisation dataset. This paper presents the dialogue and summarisation corpora, as well as the software used for data collection. The corpus consists of 43 conversations (13.7k words) and 61 summaries (2.0k words), and will be released along with the data collection interface.

pdf bib abs
enunlg: a Python library for reproducible neural data-to-text experimentation
David M. Howcroft | Dimitra Gkatzia
Proceedings of the 16th International Natural Language Generation Conference: System Demonstrations

Over the past decade, a variety of neural architectures for data-to-text generation (NLG) have been proposed. However, each system typically has its own approach to pre- and post-processing and other implementation details. Diversity in implementations is desirable, but it also confounds attempts to compare model performance: are the differences due to the proposed architectures or are they a byproduct of the libraries used or a result of pre- and post-processing decisions made? To improve reproducibility, we re-implement several pre-Transformer neural models for data-to-text NLG within a single framework to facilitate direct comparisons of the models themselves and better understand the contributions of other design choices. We release our library at https://github.com/NapierNLP/enunlg to serve as a baseline for ongoing work in this area including research on NLG for low-resource languages where transformers might not be optimal.

Most languages in the world do not have sufficient data available to develop neural-network-based natural language generation (NLG) systems. To alleviate this resource scarcity, we propose a novel challenge for the NLG community: low-resource language corpus development (LOWRECORP). We present an innovative framework to collect a single dataset with dual tasks to maximize the efficiency of data collection efforts and respect language consultant time. Specifically, we focus on a text-chat-based interface for two generation tasks – conversational response generation grounded in a source document and/or image and dialogue summarization (from the former task). The goal of this shared task is to collectively develop grounded datasets for local and low-resourced languages. To enable data collection, we make available web-based software that can be used to collect these grounded conversations and summaries. Submissions will be assessed for the size, complexity, and diversity of the corpora to ensure quality control of the datasets as well as any enhancements to the interface or novel approaches to grounding conversations.

2022

pdf bib abs
Most NLG is Low-Resource: here’s what we can do about it
David M. Howcroft | Dimitra Gkatzia
Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

Many domains and tasks in natural language generation (NLG) are inherently ‘low-resource’, where training data, tools and linguistic analyses are scarce. This poses a particular challenge to researchers and system developers in the era of machine-learning-driven NLG. In this position paper, we initially present the challenges researchers & developers often encounter when dealing with low-resource settings in NLG. We then argue that it is unsustainable to collect large aligned datasets or build large language models from scratch for every possible domain due to cost, labour, and time constraints, so researching and developing methods and resources for low-resource settings is vital. We then discuss current approaches to low-resource NLG, followed by proposed solutions and promising avenues for future work in NLG for low-resource settings.

2021

pdf bib abs
OTTers: One-turn Topic Transitions for Open-Domain Dialogue
Karin Sevegnani | David M. Howcroft | Ioannis Konstas | Verena Rieser
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Mixed initiative in open-domain dialogue requires a system to pro-actively introduce new topics. The one-turn topic transition task explores how a system connects two topics in a cooperative and coherent manner. The goal of the task is to generate a “bridging” utterance connecting the new topic to the topic of the previous conversation turn. We are especially interested in commonsense explanations of how a new topic relates to what has been mentioned before. We first collect a new dataset of human one-turn topic transitions, which we callOTTers. We then explore different strategies used by humans when asked to complete such a task, and notice that the use of a bridging utterance to connect the two topics is the approach used the most. We finally show how existing state-of-the-art text generation models can be adapted to this task and examine the performance of these baselines on different splits of the OTTers data.

pdf bib abs
What happens if you treat ordinal ratings as interval data? Human evaluations in NLP are even more under-powered than you think
David M. Howcroft | Verena Rieser
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Previous work has shown that human evaluations in NLP are notoriously under-powered. Here, we argue that there are two common factors which make this problem even worse: NLP studies usually (a) treat ordinal data as interval data and (b) operate under high variance settings while the differences they are hoping to detect are often subtle. We demonstrate through simulation that ordinal mixed effects models are better able to detect small differences between models, especially in high variance settings common in evaluations of generated texts. We release tools for researchers to conduct their own power analysis and test their assumptions. We also make recommendations for improving statistical power.

2020

Human assessment remains the most trusted form of evaluation in NLG, but highly diverse approaches and a proliferation of different quality criteria used by researchers make it difficult to compare results and draw conclusions across papers, with adverse implications for meta-evaluation and reproducibility. In this paper, we present (i) our dataset of 165 NLG papers with human evaluations, (ii) the annotation scheme we developed to label the papers for different aspects of evaluations, (iii) quantitative analyses of the annotations, and (iv) a set of recommendations for improving standards in evaluation reporting. We use the annotations as a basis for examining information included in evaluation reports, and levels of consistency in approaches, experimental design and terminology, focusing in particular on the 200+ different terms that have been used for evaluated aspects of quality. We conclude that due to a pervasive lack of clarity in reports and extreme diversity in approaches, human evaluation in NLG presents as extremely confused in 2020, and that the field is in urgent need of standard methods and terminology.

pdf bib abs
Disentangling the Properties of Human Evaluation Methods: A Classification System to Support Comparability, Meta-Evaluation and Reproducibility Testing
Anya Belz | Simon Mille | David M. Howcroft
Proceedings of the 13th International Conference on Natural Language Generation

Current standards for designing and reporting human evaluations in NLP mean it is generally unclear which evaluations are comparable and can be expected to yield similar results when applied to the same system outputs. This has serious implications for reproducibility testing and meta-evaluation, in particular given that human evaluation is considered the gold standard against which the trustworthiness of automatic metrics is gauged. %and merging others, as well as deciding which evaluations should be able to reproduce each other’s results. Using examples from NLG, we propose a classification system for evaluations based on disentangling (i) what is being evaluated (which aspect of quality), and (ii) how it is evaluated in specific (a) evaluation modes and (b) experimental designs. We show that this approach provides a basis for determining comparability, hence for comparison of evaluations across papers, meta-evaluation experiments, reproducibility testing.

2019

pdf bib abs
Semantic Noise Matters for Neural Natural Language Generation
Ondřej Dušek | David M. Howcroft | Verena Rieser
Proceedings of the 12th International Conference on Natural Language Generation

Neural natural language generation (NNLG) systems are known for their pathological outputs, i.e. generating text which is unrelated to the input specification. In this paper, we show the impact of semantic noise on state-of-the-art NNLG models which implement different semantic control mechanisms. We find that cleaned data can improve semantic correctness by up to 97%, while maintaining fluency. We also find that the most common error is omitting information, rather than hallucination.

2018

pdf bib abs
Toward Bayesian Synchronous Tree Substitution Grammars for Sentence Planning
David M. Howcroft | Dietrich Klakow | Vera Demberg
Proceedings of the 11th International Conference on Natural Language Generation

Developing conventional natural language generation systems requires extensive attention from human experts in order to craft complex sets of sentence planning rules. We propose a Bayesian nonparametric approach to learn sentence planning rules by inducing synchronous tree substitution grammars for pairs of text plans and morphosyntactically-specified dependency trees. Our system is able to learn rules which can be used to generate novel texts after training on small datasets.

2017

pdf bib abs
Psycholinguistic Models of Sentence Processing Improve Sentence Readability Ranking
David M. Howcroft | Vera Demberg
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

While previous research on readability has typically focused on document-level measures, recent work in areas such as natural language generation has pointed out the need of sentence-level readability measures. Much of psycholinguistics has focused for many years on processing measures that provide difficulty estimates on a word-by-word basis. However, these psycholinguistic measures have not yet been tested on sentence readability ranking tasks. In this paper, we use four psycholinguistic measures: idea density, surprisal, integration cost, and embedding depth to test whether these features are predictive of readability levels. We find that psycholinguistic features significantly improve performance by up to 3 percentage points over a standard document-level readability metric baseline.

pdf bib abs
G-TUNA: a corpus of referring expressions in German, including duration information
David Howcroft | Jorrig Vogels | Vera Demberg
Proceedings of the 10th International Conference on Natural Language Generation

Corpora of referring expressions elicited from human participants in a controlled environment are an important resource for research on automatic referring expression generation. We here present G-TUNA, a new corpus of referring expressions for German. Using the furniture stimuli set developed for the TUNA and D-TUNA corpora, our corpus extends on these corpora by providing data collected in a simulated driving dual-task setting, and additionally provides exact duration annotations for the spoken referring expressions. This corpus will hence allow researchers to analyze the interaction between referring expression length and speech rate, under conditions where the listener is under high vs. low cognitive load.

2016

pdf bib abs
From OpenCCG to AI Planning: Detecting Infeasible Edges in Sentence Generation
Maximilian Schwenger | Álvaro Torralba | Joerg Hoffmann | David M. Howcroft | Vera Demberg
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

The search space in grammar-based natural language generation tasks can get very large, which is particularly problematic when generating long utterances or paragraphs. Using surface realization with OpenCCG as an example, we show that we can effectively detect partial solutions (edges) which cannot ultimately be part of a complete sentence because of their syntactic category. Formulating the completion of an edge into a sentence as finding a solution path in a large state-transition system, we demonstrate a connection to AI Planning which is concerned with this kind of problem. We design a compilation from OpenCCG into AI Planning allowing the detection of infeasible edges via AI Planning dead-end detection methods (proving the absence of a solution to the compilation). Our experiments show that this can filter out large fractions of infeasible edges in, and thus benefit the performance of, complex realization processes.