Emiel Krahmer

Also published as: Emiel J. Krahmer

2024

To What Extent Are Large Language Models Capable of Generating Substantial Reflections for Motivational Interviewing Counseling Chatbots? A Human Evaluation
Erkan Basar | Iris Hendrickx | Emiel Krahmer | Gert-Jan Bruijn | Tibor Bosse
Proceedings of the 1st Human-Centered Large Language Modeling Workshop

Motivational Interviewing is a counseling style that requires skillful usage of reflective listening and engaging in conversations about sensitive and personal subjects. In this paper, we investigate to what extent we can use generative large language models in motivational interviewing chatbots to generate precise and variable reflections on user responses. We conduct a two-step human evaluation where we first independently assess the generated reflections based on four criteria essential to health counseling: appropriateness, specificity, naturalness, and engagement. In the second step, we compare the overall quality of generated and human-authored reflections via a ranking evaluation. We use GPT-4, BLOOM, and FLAN-T5 models to generate motivational interviewing reflections, based on real conversational data collected via chatbots designed to provide support for smoking cessation and sexual health. We discover that GPT-4 can produce reflections of a quality comparable to human-authored reflections. Finally, we conclude that large language models have the potential to enhance and expand reflections in predetermined health counseling chatbots, but a comprehensive manual review is advised.

ReproHum: #0033-03: How Reproducible Are Fluency Ratings of Generated Text? A Reproduction of August et al. 2022
Emiel van Miltenburg | Anouck Braggaar | Nadine Braun | Martijn Goudbeek | Emiel Krahmer | Chris van der Lee | Steffen Pauws | Frédéric Tomas
Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024

In earlier work, August et al. (2022) evaluated three different Natural Language Generation systems on their ability to generate fluent, relevant, and factual scientific definitions. As part of the ReproHum project (Belz et al., 2023), we carried out a partial reproduction study of their human evaluation procedure, focusing on human fluency ratings. Following the standardised ReproHum procedure, our reproduction study follows the original study as closely as possible, with two raters providing 300 ratings each. In addition to this, we carried out a second study where we collected ratings from eight additional raters and analysed the variability of the ratings. We successfully reproduced the inferential statistics from the original study (i.e. the same hypotheses were supported), albeit with a lower inter-annotator agreement. The remainder of our paper shows significant variation between different raters, raising questions about what it really means to reproduce human evaluation studies.

Eliciting Motivational Interviewing Skill Codes in Psychotherapy with LLMs: A Bilingual Dataset and Analytical Study
Xin Sun | Jiahuan Pei | Jan de Wit | Mohammad Aliannejadi | Emiel Krahmer | Jos T.P. Dobber | Jos A. Bosch
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Behavioral coding (BC) in motivational interviewing (MI) holds great potential for enhancing the efficacy of MI counseling. However, manual coding is labor-intensive, and automation efforts are hindered by the lack of data due to the privacy of psychotherapy. To address these challenges, we introduce BiMISC, a bilingual dataset of MI conversations in English and Dutch, sourced from real counseling sessions. Expert annotations in BiMISC adhere strictly to the motivational interviewing skills code (MISC) scheme, offering a pivotal resource for MI research. Additionally, we present a novel approach to elicit MISC expertise from large language models (LLMs) for MI coding. Through the in-depth analysis of BiMISC and the evaluation of our proposed approach, we demonstrate that the LLM-based approach yields results closely aligned with expert annotations and maintains consistent performance across different languages. Our contributions not only furnish the MI community with a valuable bilingual dataset but also spotlight the potential of LLMs in MI coding, laying the foundation for future MI research.

2023

Neural Data-to-Text Generation Based on Small Datasets: Comparing the Added Value of Two Semi-Supervised Learning Approaches on Top of a Large Language Model
Chris van der Lee | Thiago Castro Ferreira | Chris Emmery | Travis J. Wiltshire | Emiel Krahmer
Computational Linguistics, Volume 49, Issue 3 - September 2023

This study discusses the effect of semi-supervised learning in combination with pretrained language models for data-to-text generation. It is not known whether semi-supervised learning is still helpful when a large-scale language model is also supplemented. This study aims to answer this question by comparing a data-to-text system only supplemented with a language model, to two data-to-text systems that are additionally enriched by a data augmentation or a pseudo-labeling semi-supervised learning approach. Results show that semi-supervised learning results in higher scores on diversity metrics. In terms of output quality, extending the training set of a data-to-text system with a language model using the pseudo-labeling approach did increase text quality scores, but the data augmentation approach yielded similar scores to the system without training set extension. These results indicate that semi-supervised learning approaches can bolster output quality and diversity, even when a language model is also present.

This paper is part of the larger ReproHum project, where different teams of researchers aim to reproduce published experiments from the NLP literature. Specifically, ReproHum focuses on the reproducibility of human evaluation studies, where participants indicate the quality of different outputs of Natural Language Generation (NLG) systems. This is necessary because without reproduction studies, we do not know how reliable earlier results are. This paper aims to reproduce the second human evaluation study of Puduppully & Lapata (2021), while another lab is attempting to do the same. This experiment uses best-worst scaling to determine the relative performance of different NLG systems. We found that the worst performing system in the original study is now in fact the best performing system across the board. This means that we cannot fully reproduce the original results. We also carry out alternative analyses of the data, and discuss how our results may be combined with the other reproduction study that is carried out in parallel with this paper.

We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible. We present our results and findings, which include that just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction, and that all but one of the experiments we selected for reproduction was discovered to have flaws that made the meaningfulness of conducting a reproduction questionable. As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach. Our overall (negative) finding that the great majority of human evaluations in NLP is not repeatable and/or not reproducible and/or too flawed to justify reproduction, paints a dire picture, but presents an opportunity for a rethink about how to design and report human evaluations in NLP.

2022

In this paper, we describe our reproduction effort of the paper: Towards Best Experiment Design for Evaluating Dialogue System Output by Santhanam and Shaikh (2019) for the 2022 ReproGen shared task. We aim to produce the same results, using different human evaluators, and a different implementation of the automatic metrics used in the original paper. Although overall the study posed some challenges to reproduce (e.g. difficulties with reproduction of automatic metrics and statistics), in the end we did find that the results generally replicate the findings of Santhanam and Shaikh (2019) and seem to follow similar trends.

Proceedings of the First Workshop on Natural Language Generation in Healthcare
Emiel Krahmer | Kathy McCoy | Ehud Reiter
Proceedings of the First Workshop on Natural Language Generation in Healthcare

2021

Preregistering NLP research
Emiel van Miltenburg | Chris van der Lee | Emiel Krahmer
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Preregistration refers to the practice of specifying what you are going to do, and what you expect to find in your study, before carrying out the study. This practice is increasingly common in medicine and psychology, but is rarely discussed in NLP. This paper discusses preregistration in more detail, explores how NLP researchers could preregister their work, and presents several preregistration questions for different kinds of studies. Finally, we argue in favour of registered reports, which could provide firmer grounds for slow science in NLP research. The goal of this paper is to elicit a discussion in the NLP community, which we hope to synthesise into a general NLP preregistration form in future research.

2020

Evaluation rules! On the use of grammars and rule-based systems for NLG evaluation
Emiel van Miltenburg | Chris van der Lee | Thiago Castro-Ferreira | Emiel Krahmer
Proceedings of the 1st Workshop on Evaluating NLG Evaluation

NLG researchers often use uncontrolled corpora to train and evaluate their systems, using textual similarity metrics, such as BLEU. This position paper argues in favour of two alternative evaluation strategies, using grammars or rule-based systems. These strategies are particularly useful to identify the strengths and weaknesses of different systems. We contrast our proposals with the (extended) WebNLG dataset, which is revealed to have a skewed distribution of predicates. We predict that this distribution affects the quality of the predictions for systems trained on this data. However, this hypothesis can only be thoroughly tested (without any confounds) once we are able to systematically manipulate the skewness of the data, using a rule-based approach.

The CACAPO Dataset: A Multilingual, Multi-Domain Dataset for Neural Pipeline and End-to-End Data-to-Text Generation
Chris van der Lee | Chris Emmery | Sander Wubben | Emiel Krahmer
Proceedings of the 13th International Conference on Natural Language Generation

This paper describes the CACAPO dataset, built for training both neural pipeline and end-to-end data-to-text language generation systems. The dataset is multilingual (Dutch and English), and contains almost 10,000 sentences from human-written news texts in the sports, weather, stocks, and incidents domain, together with aligned attribute-value paired data. The dataset is unique in that the linguistic variation and indirect ways of expressing data in these texts reflect the challenges of real world NLG tasks.

Earlier research has shown that evaluation metrics based on textual similarity (e.g., BLEU, CIDEr, Meteor) do not correlate well with human evaluation scores for automatically generated text. We carried out an experiment with Chinese speakers, where we systematically manipulated image descriptions to contain different kinds of errors. Because our manipulated descriptions form minimal pairs with the reference descriptions, we are able to assess the impact of different kinds of errors on the perceived quality of the descriptions. Our results show that different kinds of errors elicit significantly different evaluation scores, even though all erroneous descriptions differ in only one character from the reference descriptions. Evaluation metrics based solely on textual similarity are unable to capture these differences, which (at least partially) explains their poor correlation with human judgments. Our work provides the foundations for future work, where we aim to understand why different errors are seen as more or less severe.

2019

Neural data-to-text generation: A comparison between pipeline and end-to-end architectures
Thiago Castro Ferreira | Chris van der Lee | Emiel van Miltenburg | Emiel Krahmer
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Traditionally, most data-to-text applications have been designed using a modular pipeline architecture, in which non-linguistic input data is converted into natural language through several intermediate transformations. By contrast, recent neural models for data-to-text generation have been proposed as end-to-end approaches, where the non-linguistic input is rendered in natural language with much less explicit intermediate representations in between. This study introduces a systematic comparison between neural pipeline and end-to-end data-to-text approaches for the generation of text from RDF triples. Both architectures were implemented making use of the encoder-decoder Gated-Recurrent Units (GRU) and Transformer, two state-of-the-art deep learning methods. Automatic and human evaluations together with a qualitative analysis suggest that having explicit intermediate steps in the generation process results in better texts than the ones generated by end-to-end approaches. Moreover, the pipeline models generalize better to unseen inputs. Data and code are publicly available.

Automatic identification of writers’ intentions: Comparing different methods for predicting relationship goals in online dating profile texts
Chris van der Lee | Tess van der Zanden | Emiel Krahmer | Maria Mos | Alexander Schouten
Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)

Psychologically motivated, lexicon-based text analysis methods such as LIWC (Pennebaker et al., 2015) have been criticized by computational linguists for their lack of adaptability, but they have not often been systematically compared with either human evaluations or machine learning approaches. The goal of the current study was to assess the effectiveness and predictive ability of LIWC on a relationship goal classification task. In this paper, we compared the outcomes of (1) LIWC, (2) machine learning, and (3) a human baseline. A newly collected corpus of online dating profile texts (a genre not explored before in the ACL anthology) was used, accompanied by the profile writers’ self-selected relationship goal (long-term versus date). These three approaches were tested by comparing their performance on identifying both the intended relationship goal and content-related text labels. Results show that LIWC and machine learning models correlate with human evaluations in terms of content-related labels. LIWC’s content-related labels corresponded more strongly to humans than those of the classifier. Moreover, all approaches were similarly accurate in predicting the relationship goal.

Surface Realization Shared Task 2019 (MSR19): The Team 6 Approach
Thiago Castro Ferreira | Emiel Krahmer
Proceedings of the 2nd Workshop on Multilingual Surface Realisation (MSR 2019)

This study describes the approach developed by the Tilburg University team to the shallow track of the Multilingual Surface Realization Shared Task 2019 (SR’19) (Mille et al., 2019). Based on Ferreira et al. (2017) and on our 2018 submission (Ferreira et al., 2018), the approach generates texts by first preprocessing an input dependency tree into an ordered linearized string, which is then realized using a rule-based and a statistical machine translation (SMT) model. This year our submission is able to realize texts in the 11 languages proposed for the task, unlike last year's submission, which covered only 6 Indo-European languages. The model is publicly available.

Question Similarity in Community Question Answering: A Systematic Exploration of Preprocessing Methods and Models
Florian Kunneman | Thiago Castro Ferreira | Emiel Krahmer | Antal van den Bosch
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Community Question Answering forums are popular among Internet users, and a basic problem they encounter is trying to find out if their question has already been posed before. To address this issue, NLP researchers have developed methods to automatically detect question-similarity, which was one of the shared tasks in SemEval. The best performing systems for this task made use of Syntactic Tree Kernels or the SoftCosine metric. However, it remains unclear why these methods seem to work, whether their performance can be improved by better preprocessing methods and what kinds of errors they (and other methods) make. In this paper, we therefore systematically combine and compare these two approaches with the more traditional BM25 and translation-based models. Moreover, we analyze the impact of preprocessing steps (lowercasing, suppression of punctuation and stop words removal) and word meaning similarity based on different distributions (word translation probability, Word2Vec, fastText and ELMo) on the performance of the task. We conduct an error analysis to gain insight into the differences in performance between the system set-ups. The implementation is made publicly available from https://github.com/fkunneman/DiscoSumo/tree/master/ranlp.

Best practices for the human evaluation of automatically generated text
Chris van der Lee | Albert Gatt | Emiel van Miltenburg | Sander Wubben | Emiel Krahmer
Proceedings of the 12th International Conference on Natural Language Generation

Currently, there is little agreement as to how Natural Language Generation (NLG) systems should be evaluated. While there is some agreement regarding automatic metrics, there is a high degree of variation in the way that human evaluation is carried out. This paper provides an overview of how human evaluation is currently conducted, and presents a set of best practices, grounded in the literature. With this paper, we hope to contribute to the quality and consistency of human evaluations in NLG.

On task effects in NLG corpus elicitation: a replication study using mixed effects modeling
Emiel van Miltenburg | Merel van de Kerkhof | Ruud Koolen | Martijn Goudbeek | Emiel Krahmer
Proceedings of the 12th International Conference on Natural Language Generation

Task effects in NLG corpus elicitation recently started to receive more attention, but are usually not modeled statistically. We present a controlled replication of the study by Van Miltenburg et al. (2018b), contrasting spoken with written descriptions. We collected additional written Dutch descriptions to supplement the spoken data from the DIDEC corpus, and analyzed the descriptions using mixed effects modeling to account for variation between participants and items. Our results show that the effects of modality largely disappear in a controlled setting.

In this paper, we present a novel data-to-text system for cancer patients, providing information on quality of life implications after treatment, which can be embedded in the context of shared decision making. Currently, information on quality of life implications is often not discussed, partly because (until recently) data has been lacking. In our work, we rely on a newly developed prediction model, which assigns patients to scenarios. Furthermore, we use data-to-text techniques to explain these scenario-based predictions in personalized and understandable language. We highlight the possibilities of NLG for personalization, discuss ethical implications and also present the outcomes of a first evaluation with clinicians.

2018

Evaluating the text quality, human likeness and tailoring component of PASS: A Dutch data-to-text system for soccer
Chris van der Lee | Bart Verduijn | Emiel Krahmer | Sander Wubben
Proceedings of the 27th International Conference on Computational Linguistics

We present an evaluation of PASS, a data-to-text system that generates Dutch soccer reports from match statistics which are automatically tailored towards fans of one club or the other. The evaluation in this paper consists of two studies. An intrinsic human-based evaluation of the system’s output is described in the first study. In this study it was found that compared to human-written texts, computer-generated texts were rated slightly lower on style-related text components (fluency and clarity) and slightly higher in terms of the correctness of given information. Furthermore, results from the first study showed that tailoring was accurately recognized in most cases, and that participants struggled with correctly identifying whether a text was written by a human or computer. The second study investigated if tailoring affects perceived text quality, for which no results were garnered. This lack of results might be due to negative preconceptions about computer-generated texts which were found in the first study.

Aspect-based summarization of pros and cons in unstructured product reviews
Florian Kunneman | Sander Wubben | Antal van den Bosch | Emiel Krahmer
Proceedings of the 27th International Conference on Computational Linguistics

We developed three systems for generating pros and cons summaries of product reviews. Automating this task eases the writing of product reviews, and offers readers quick access to the most important information. We compared SynPat, a system based on syntactic phrases selected on the basis of valence scores, against a neural-network-based system trained to map bag-of-words representations of reviews directly to pros and cons, and the same neural system trained on clusters of word-embedding encodings of similar pros and cons. We evaluated the systems in two ways: first on held-out reviews with gold-standard pros and cons, and second by asking human annotators to rate the systems’ output on relevance and completeness. In the second evaluation, the gold-standard pros and cons were assessed along with the system output. We find that the human-generated summaries are not deemed as significantly more relevant or complete than the SynPat systems; the latter are scored higher than the human-generated summaries on a precision metric. The neural approaches yield a lower performance in the human assessment, and are outperformed by the baseline.

DIDEC: The Dutch Image Description and Eye-tracking Corpus
Emiel van Miltenburg | Ákos Kádár | Ruud Koolen | Emiel Krahmer
Proceedings of the 27th International Conference on Computational Linguistics

We present a corpus of spoken Dutch image descriptions, paired with two sets of eye-tracking data: Free viewing, where participants look at images without any particular purpose, and Description viewing, where we track eye movements while participants produce spoken descriptions of the images they are viewing. This paper describes the data collection procedure and the corpus itself, and provides an initial analysis of self-corrections in image descriptions. We also present two studies showing the potential of this data. Though these studies mainly serve as an example, we do find two interesting results: (1) the eye-tracking data for the description viewing task is more coherent than for the free-viewing task; (2) variation in image descriptions (also called ‘image specificity’; Jas and Parikh, 2015) is only moderately correlated across different languages. Our corpus can be used to gain a deeper understanding of the image description task, particularly how visual attention is correlated with the image description process.

NeuralREG: An end-to-end approach to referring expression generation
Thiago Castro Ferreira | Diego Moussallem | Ákos Kádár | Sander Wubben | Emiel Krahmer
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Traditionally, Referring Expression Generation (REG) models first decide on the form and then on the content of references to discourse entities in text, typically relying on features such as salience and grammatical function. In this paper, we present a new approach (NeuralREG), relying on deep neural networks, which makes decisions about form and content in one go without explicit feature extraction. Using a delexicalized version of the WebNLG corpus, we show that the neural model substantially improves over two strong baselines.

Surface Realization Shared Task 2018 (SR18): The Tilburg University Approach
Thiago Castro Ferreira | Sander Wubben | Emiel Krahmer
Proceedings of the First Workshop on Multilingual Surface Realisation

This study describes the approach developed by the Tilburg University team to the shallow task of the Multilingual Surface Realization Shared Task 2018 (SR18). Based on Castro Ferreira et al. (2017), the approach works by first preprocessing an input dependency tree into an ordered linearized string, which is then realized using a statistical machine translation model. Our approach shows promising results, with BLEU scores above 50 for 5 different languages (English, French, Italian, Portuguese and Spanish) and above 35 for the Dutch language.

Varying image description tasks: spoken versus written descriptions
Emiel van Miltenburg | Ruud Koolen | Emiel Krahmer
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

Automatic image description systems are commonly trained and evaluated on written image descriptions. At the same time, these systems are often used to provide spoken descriptions (e.g. for visually impaired users) through apps like TapTapSee or Seeing AI. This is not a problem, as long as spoken and written descriptions are very similar. However, linguistic research suggests that spoken language often differs from written language. These differences are not regular, and vary from context to context. Therefore, this paper investigates whether there are differences between written and spoken image descriptions, even if they are elicited through similar tasks. We compare descriptions produced in two languages (English and Dutch), and in both languages observe substantial differences between spoken and written descriptions. Future research should see if users prefer the spoken over the written style and, if so, aim to emulate spoken descriptions.

Proceedings of the 11th International Conference on Natural Language Generation
Emiel Krahmer | Albert Gatt | Martijn Goudbeek
Proceedings of the 11th International Conference on Natural Language Generation

Automated learning of templates for data-to-text generation: comparing rule-based, statistical and neural methods
Chris van der Lee | Emiel Krahmer | Sander Wubben
Proceedings of the 11th International Conference on Natural Language Generation

The current study investigated novel techniques and methods for trainable approaches to data-to-text generation. Neural Machine Translation was explored for the conversion from data to text as well as the addition of extra templatization steps of the data input and text output in the conversion process. Evaluation using BLEU did not find the Neural Machine Translation technique to perform any better compared to rule-based or Statistical Machine Translation, and the templatization method seemed to perform similarly or sometimes worse compared to direct data-to-text conversion. However, the human evaluation metrics indicated that Neural Machine Translation yielded the highest quality output and that the templatization method was able to increase text quality in multiple situations.

Enriching the WebNLG corpus
Thiago Castro Ferreira | Diego Moussallem | Emiel Krahmer | Sander Wubben
Proceedings of the 11th International Conference on Natural Language Generation

This paper describes the enrichment of the WebNLG corpus (Gardent et al., 2017a,b), with the aim to further extend its usefulness as a resource for evaluating common NLG tasks, including Discourse Ordering, Lexicalization and Referring Expression Generation. We also produce a silver-standard German translation of the corpus to enable the exploitation of NLG approaches to languages other than English. The enriched corpus is publicly available.

Context-sensitive Natural Language Generation for robot-assisted second language tutoring
Bram Willemsen | Jan de Wit | Emiel Krahmer | Mirjam de Haas | Paul Vogt
Proceedings of the Workshop on NLG for Human–Robot Interaction

This paper describes the L2TOR intelligent tutoring system (ITS), focusing primarily on its output generation module. The L2TOR ITS is developed for the purpose of investigating the efficacy of robot-assisted second language tutoring in early childhood. We explain the process of generating contextually-relevant utterances, such as task-specific feedback messages, and discuss challenges regarding multimodality and multilingualism for situated natural language generation from a robot tutoring perspective.

2017

Generating flexible proper name references in text: Data, models and evaluation
Thiago Castro Ferreira | Emiel Krahmer | Sander Wubben
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

This study introduces a statistical model able to generate variations of a proper name by taking into account the person to be mentioned, the discourse context and variation. The model relies on the REGnames corpus, a dataset with 53,102 proper name references to 1,000 people in different discourse contexts. We evaluate the versions of our model from the perspective of how human writers produce proper names, and also how human readers process them. The corpus and the model are publicly available.

Linguistic realisation as machine translation: Comparing different MT models for AMR-to-text generation
Thiago Castro Ferreira | Iacer Calixto | Sander Wubben | Emiel Krahmer
Proceedings of the 10th International Conference on Natural Language Generation

In this paper, we study AMR-to-text generation, framing it as a translation task and comparing two different MT approaches (Phrase-based and Neural MT). We systematically study the effects of 3 AMR preprocessing steps (Delexicalisation, Compression, and Linearisation) applied before the MT phase. Our results show that preprocessing indeed helps, although the benefits differ for the two MT models.

PASS: A Dutch data-to-text system for soccer, targeted towards specific audiences
Chris van der Lee | Emiel Krahmer | Sander Wubben
Proceedings of the 10th International Conference on Natural Language Generation

We present PASS, a data-to-text system that generates Dutch soccer reports from match statistics. One of the novel elements of PASS is the fact that the system produces corpus-based texts tailored towards fans of one club or the other, which can most prominently be observed in the tone of voice used in the reports. Furthermore, the system is open source and uses a modular design, which makes it relatively easy for people to add extensions. Human-based evaluation shows that people are generally positive towards PASS with regard to its clarity and fluency, and that the tailoring is accurately recognized in most cases.

In this paper we investigate the automatic generation of paraphrases by using machine translation techniques. Three contributions we make are the construction of a large paraphrase corpus for English and Dutch, a re-ranking heuristic to use machine translation for paraphrase generation and a proper evaluation methodology. A large parallel corpus is constructed by aligning clustered headlines that are scraped from a news aggregator site. To generate sentential paraphrases we use a standard phrase-based machine translation (PBMT) framework modified with a re-ranking component (henceforth PBMT-R). We demonstrate this approach for Dutch and English and evaluate by using human judgements collected from 76 participants. The judgments are compared to two automatic machine translation evaluation metrics. We observe that as the paraphrases deviate more from the source sentence, the performance of the PBMT-R system degrades less than that of the word substitution baseline system.

2013

pdf
Graphs and Spatial Relations in the Generation of Referring Expressions
Jette Viethen | Margaret Mitchell | Emiel Krahmer
Proceedings of the 14th European Workshop on Natural Language Generation

pdf bib
Using character overlap to improve language transformation
Sander Wubben | Emiel Krahmer | Antal van den Bosch
Proceedings of the 7th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

2012

pdf
Computational Generation of Referring Expressions: A Survey
Emiel Krahmer | Kees van Deemter
Computational Linguistics, Volume 38, Issue 1 - March 2012

pdf
Sentence Simplification by Monolingual Machine Translation
Sander Wubben | Antal van den Bosch | Emiel Krahmer
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf
Learning Preferences for Referring Expression Generation: Effects of Domain, Language and Algorithm
Ruud Koolen | Emiel Krahmer | Mariët Theune
INLG 2012 Proceedings of the Seventh International Natural Language Generation Conference

2011

pdf
Does Size Matter – How Much Data is Required to Train a REG Algorithm?
Mariët Theune | Ruud Koolen | Emiel Krahmer | Sander Wubben
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

pdf
Comparing Phrase-based and Syntax-based Paraphrase Generation
Sander Wubben | Erwin Marsi | Antal van den Bosch | Emiel Krahmer
Proceedings of the Workshop on Monolingual Text-To-Text Generation

2010

pdf
Automatic analysis of semantic similarity in comparable text through syntactic tree matching
Erwin Marsi | Emiel Krahmer
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

pdf
Last Words: What Computational Linguists Can Learn from Psychologists (and Vice Versa)
Emiel Krahmer
Computational Linguistics, Volume 36, Number 2, June 2010

pdf abs
The D-TUNA Corpus: A Dutch Dataset for the Evaluation of Referring Expression Generation Algorithms
Ruud Koolen | Emiel Krahmer
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We present the D-TUNA corpus, which is the first semantically annotated corpus of referring expressions in Dutch. Its primary function is to evaluate and improve the performance of REG algorithms. Such algorithms are computational models that automatically generate referring expressions by computing how a specific target can be identified to an addressee by distinguishing it from a set of distractor objects. We performed a large-scale production experiment, in which participants were asked to describe furniture items and people, and provided all descriptions with semantic information regarding the target and the distractor objects. Besides being useful for evaluating REG algorithms, the corpus addresses several other research goals. Firstly, the corpus contains both written and spoken referring expressions uttered in the direction of an addressee, which enables systematic analyses of how modality (text or speech) influences the human production of referring expressions. Secondly, due to its comparability with the English TUNA corpus, our Dutch corpus can be used to explore the differences between Dutch and English speakers regarding the production of referring expressions.

pdf abs
Human Language Technology and Communicative Disabilities: Requirements and Possibilities for the Future
Marina B. Ruiter | Toni C. M. Rietveld | Catia Cucchiarini | Emiel J. Krahmer | Helmer Strik
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

For some years now, the Nederlandse Taalunie (Dutch Language Union) has been active in promoting the development of human language technology (HLT) applications for users of Dutch with communication disabilities. The reason is that HLT products and services may enable these users to improve their verbal autonomy and communication skills. We sought to identify a minimum common set of HLT resources that is required to develop tools for a wide range of communication disabilities. In order to reach this goal, we investigated the specific HLT needs of communicatively disabled people and related these needs to the underlying HLT software components. By analysing the availability and quality of these essential HLT resources, we were able to identify which of the crucial elements need further research and development to become usable for developing applications for communicatively disabled users of Dutch. The results obtained in the current survey can be used to inform policy institutions on how they can stimulate the development of HLT resources for this target group. In the current study results were obtained for Dutch, but a similar approach can also be used for other languages.

pdf
Preferences versus Adaptation during Referring Expression Generation
Martijn Goudbeek | Emiel Krahmer
Proceedings of the ACL 2010 Conference Short Papers

pdf
Cross-linguistic Attribute Selection for REG: Comparing Dutch and English
Mariët Theune | Ruud Koolen | Emiel Krahmer
Proceedings of the 6th International Natural Language Generation Conference

pdf
Paraphrase Generation as Monolingual Translation: Data and Evaluation
Sander Wubben | Antal van den Bosch | Emiel Krahmer
Proceedings of the 6th International Natural Language Generation Conference

2009

pdf bib
Proceedings of the 12th European Workshop on Natural Language Generation (ENLG 2009)
Emiel Krahmer | Mariët Theune
Proceedings of the 12th European Workshop on Natural Language Generation (ENLG 2009)

pdf
Is Sentence Compression an NLG task?
Erwin Marsi | Emiel Krahmer | Iris Hendrickx | Walter Daelemans
Proceedings of the 12th European Workshop on Natural Language Generation (ENLG 2009)

pdf
Clustering and Matching Headlines for Automatic Paraphrase Acquisition
Sander Wubben | Antal van den Bosch | Emiel Krahmer | Erwin Marsi
Proceedings of the 12th European Workshop on Natural Language Generation (ENLG 2009)

pdf
Realizing the Costs: Template-Based Surface Realisation in the GRAPH Approach to Referring Expression Generation
Ivo Brugman | Mariët Theune | Emiel Krahmer | Jette Viethen
Proceedings of the 12th European Workshop on Natural Language Generation (ENLG 2009)

pdf
Reducing Redundancy in Multi-document Summarization Using Lexical Semantic Similarity
Iris Hendrickx | Walter Daelemans | Erwin Marsi | Emiel Krahmer
Proceedings of the 2009 Workshop on Language Generation and Summarisation (UCNLG+Sum 2009)

2008

pdf abs
Controlling Redundancy in Referring Expressions
Jette Viethen | Robert Dale | Emiel Krahmer | Mariët Theune | Pascal Touset
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Krahmer et al.'s (2003) graph-based framework provides an elegant and flexible approach to the generation of referring expressions. In this paper, we present the first reported study that systematically investigates how to tune the parameters of the graph-based framework on the basis of a corpus of human-generated descriptions. We focus in particular on replicating the redundant nature of human referring expressions, whereby properties not strictly necessary for identifying a referent are nonetheless included in descriptions. We show how statistics derived from the corpus data can be integrated to boost the framework's performance over a non-stochastic baseline.

pdf
Query-based Sentence Fusion is Better Defined and Leads to More Preferred Results than Generic Sentence Fusion
Emiel Krahmer | Erwin Marsi | Paul van Pelt
Proceedings of ACL-08: HLT, Short Papers

pdf
GRAPH: The Costs of Redundancy in Referring Expressions
Emiel Krahmer | Mariët Theune | Jette Viethen | Iris Hendrickx
Proceedings of the Fifth International Natural Language Generation Conference