Marieke van Erp


2020

Towards Olfactory Information Extraction from Text: A Case Study on Detecting Smell Experiences in Novels
Ryan Brate | Paul Groth | Marieke van Erp
Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

Environmental factors determine the smells we perceive, but societal factors shape the importance, sentiment and biases we give to them. Descriptions of smells in text, or as we call them ‘smell experiences’, offer a window into these factors, but they must first be identified. To the best of our knowledge, no tool exists to extract references to smell experiences from text. In this paper, we present two variations on a semi-supervised approach to identify smell experiences in English literature. The combined set of patterns from both implementations offers significantly better performance than a keyword-based baseline.

Mining Wages in Nineteenth-Century Job Advertisements. The Application of Language Resources and Language Technology to study Economic and Social Inequality
Ruben Ros | Marieke van Erp | Auke Rijpma | Richard Zijdeman
Proceedings of the Workshop about Language Resources for the SSH Cloud

For the analysis of historical wage development, no structured data is available. Job advertisements, as found in newspapers, can provide insights into what different types of jobs paid, but require language technology to structure them in a format conducive to quantitative analysis. In this paper, we report on our experiments to mine wages from 19th century newspaper advertisements and detail the challenges that need to be overcome to perform a socio-economic analysis of textual data sources.

Towards Entity Spaces
Marieke van Erp | Paul Groth
Proceedings of the 12th Language Resources and Evaluation Conference

Entities are a central element of knowledge bases and are important input to many knowledge-centric tasks including text analysis. For example, they allow us to find documents relevant to a specific entity irrespective of the underlying syntactic expression within a document. However, the entities that are commonly represented in knowledge bases are often a simplification of what is truly being referred to in text. For example, in a knowledge base, we may have an entity for Germany as a country but not for the fuzzier concept of Germany that covers notions of German Population, German Drivers, and the German Government. Inspired by recent advances in contextual word embeddings, we introduce the concept of entity spaces - specific representations of a set of associated entities with near-identity. Thus, these entity spaces provide a handle to an amorphous grouping of entities. We developed a proof-of-concept for English showing how, through the introduction of entity spaces in the form of disambiguation pages, the recall of entity linking can be improved.

2018

Proceedings of the Workshop Events and Stories in the News 2018
Tommaso Caselli | Ben Miller | Marieke van Erp | Piek Vossen | Martha Palmer | Eduard Hovy | Teruko Mitamura | David Caswell | Susan W. Brown | Claire Bonial
Proceedings of the Workshop Events and Stories in the News 2018

2017

Proceedings of the Events and Stories in the News Workshop
Tommaso Caselli | Ben Miller | Marieke van Erp | Piek Vossen | Martha Palmer | Eduard Hovy | Teruko Mitamura | David Caswell
Proceedings of the Events and Stories in the News Workshop

Results of the WNUT2017 Shared Task on Novel and Emerging Entity Recognition
Leon Derczynski | Eric Nichols | Marieke van Erp | Nut Limsopatham
Proceedings of the 3rd Workshop on Noisy User-generated Text

This shared task focuses on identifying unusual, previously unseen entities in the context of emerging discussions. Named entities form the basis of many modern approaches to other tasks (like event clustering and summarization), but recall on them is a real problem in noisy text - even among annotators. This drop tends to be due to novel entities and surface forms. Take for example the tweet “so.. kktny in 30 mins?!” – even human experts find the entity ‘kktny’ hard to detect and resolve. The goal of this task is to provide a definition of emerging and of rare entities and, based on that, datasets for detecting these entities. The task as described in this paper evaluated the ability of participating entries to detect and classify novel and emerging named entities in noisy text.

2016

Proceedings of the 2nd Workshop on Computing News Storylines (CNS 2016)
Tommaso Caselli | Ben Miller | Marieke van Erp | Piek Vossen | David Caswell
Proceedings of the 2nd Workshop on Computing News Storylines (CNS 2016)

Moving away from semantic overfitting in disambiguation datasets
Marten Postma | Filip Ilievski | Piek Vossen | Marieke van Erp
Proceedings of the Workshop on Uphill Battles in Language Processing: Scaling Early Achievements to Robust Methods

Context-enhanced Adaptive Entity Linking
Filip Ilievski | Giuseppe Rizzo | Marieke van Erp | Julien Plu | Raphaël Troncy
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

More and more knowledge bases are publicly available as linked data. Since these knowledge bases contain structured descriptions of real-world entities, they can be exploited by entity linking systems that anchor entity mentions from text to the most relevant resources describing those entities. In this paper, we investigate adaptation of the entity linking task using contextual knowledge. The key intuition is that entity linking can be customized depending on the textual content, as well as on the application that would make use of the extracted information. We present an adaptive approach that relies on contextual knowledge from text to enhance the performance of ADEL, a hybrid linguistic and graph-based entity linking system. We evaluate our approach on a domain-specific corpus consisting of annotated WikiNews articles.

Evaluating Entity Linking: An Analysis of Current Benchmark Datasets and a Roadmap for Doing a Better Job
Marieke van Erp | Pablo Mendes | Heiko Paulheim | Filip Ilievski | Julien Plu | Giuseppe Rizzo | Joerg Waitelonis
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Entity linking has become a popular task in both natural language processing and semantic web communities. However, we find that the benchmark datasets for entity linking tasks do not accurately evaluate entity linking systems. In this paper, we aim to chart the strengths and weaknesses of current benchmark datasets and sketch a roadmap for the community to devise better benchmark datasets.

MEANTIME, the NewsReader Multilingual Event and Time Corpus
Anne-Lyse Minard | Manuela Speranza | Ruben Urizar | Begoña Altuna | Marieke van Erp | Anneleen Schoen | Chantal van Son
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper, we present the NewsReader MEANTIME corpus, a semantically annotated corpus of Wikinews articles. The corpus consists of 480 news articles, i.e. 120 English news articles and their translations in Spanish, Italian, and Dutch. MEANTIME contains annotations at different levels. The document-level annotation includes markables (e.g. entity mentions, event mentions, time expressions, and numerical expressions), relations between markables (modeling, for example, temporal information and semantic role labeling), and entity and event intra-document coreference. The corpus-level annotation includes entity and event cross-document coreference. Semantic annotation on the English section was performed manually. For the annotation in Italian, Spanish, and (partially) Dutch, a procedure was devised to automatically project the annotations on the English texts onto the translated texts, based on the manual alignment of the annotated elements. This not only sped up the annotation process but also provided cross-lingual coreference. The English section of the corpus was extended with timeline annotations for the SemEval 2015 TimeLine shared task. The “First CLIN Dutch Shared Task” at CLIN26 was based on the Dutch section, while the EVALITA 2016 FactA (Event Factuality Annotation) shared task, based on the Italian section, is currently being organized.

2015

Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)
Kalliopi Zervanou | Marieke van Erp | Beatrice Alex
Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)

Proceedings of the First Workshop on Computing News Storylines
Tommaso Caselli | Marieke van Erp | Anne-Lyse Minard | Mark Finlayson | Ben Miller | Jordi Atserias | Alexandra Balahur | Piek Vossen
Proceedings of the First Workshop on Computing News Storylines

SemEval-2015 Task 4: TimeLine: Cross-Document Event Ordering
Anne-Lyse Minard | Manuela Speranza | Eneko Agirre | Itziar Aldabe | Marieke van Erp | Bernardo Magnini | German Rigau | Rubén Urizar
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

2014

Benchmarking the Extraction and Disambiguation of Named Entities on the Semantic Web
Giuseppe Rizzo | Marieke van Erp | Raphaël Troncy
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Named entity recognition and disambiguation are of primary importance for extracting information and for populating knowledge bases. Detecting and classifying named entities has traditionally been taken on by the natural language processing community, whilst linking of entities to external resources, such as those in DBpedia, has been tackled by the Semantic Web community. As these tasks are treated in different communities, there is as yet no overview of the performance of these tasks combined. We present an approach that combines the state of the art from named entity recognition in the natural language processing domain and named entity linking from the semantic web community. We report on experiments and results to gain more insights into the strengths and limitations of current approaches on these tasks. Our approach relies on the numerous web extractors supported by the NERD framework, which we combine with a machine learning algorithm to optimize recognition and linking of named entities. We test our approach on four standard data sets that are composed of two diverse text types, namely newswire and microposts.

Hope and Fear: How Opinions Influence Factuality
Chantal van Son | Marieke van Erp | Antske Fokkens | Piek Vossen
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Both sentiment and event factuality are fundamental information levels for our understanding of events mentioned in news texts. Most research so far has focused on either modeling opinions or factuality. In this paper, we propose a model that combines the two for the extraction and interpretation of perspectives on events. By doing so, we can explain the way people perceive changes in (their belief of) the world as a function of their fears of changes to the bad or their hopes of changes to the good. This study seeks to examine the effectiveness of this approach by applying factuality annotations, based on FactBank, on top of the MPQA Corpus, a corpus containing news texts annotated for sentiments and other private states. Our findings suggest that this approach can be valuable for the understanding of perspectives, but that there is still some work to do on the refinement of the integration.

Discovering and Visualising Stories in News
Marieke van Erp | Gleb Satyukov | Piek Vossen | Marit Nijsen
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Daily news streams often revolve around topics that span a longer period of time, such as the global financial crisis or the healthcare debate in the US. The length and depth of these stories can be such that they become difficult to track for information specialists who need to reconstruct exactly what happened for policy makers and companies. We present a framework to model stories from news: we describe the characteristics that make up interesting stories and how these translate to filters on our data, and we present a first use case in which we detail the steps to visualising story lines extracted from news articles about the global automotive industry.

Proceedings of the Third Workshop on Semantic Web and Information Extraction
Diana Maynard | Marieke van Erp | Brian Davis
Proceedings of the Third Workshop on Semantic Web and Information Extraction

2013

GAF: A Grounded Annotation Framework for Events
Antske Fokkens | Marieke van Erp | Piek Vossen | Sara Tonelli | Willem Robert van Hage | Luciano Serafini | Rachele Sprugnoli | Jesper Hoeksema
Workshop on Events: Definition, Detection, Coreference, and Representation

Proceedings of the Joint Workshop on NLP&LOD and SWAIE: Semantic Web, Linked Open Data and Information Extraction
Diana Maynard | Marieke van Erp | Brian Davis | Petya Osenova | Kiril Simov | Georgi Georgiev | Preslav Nakov
Proceedings of the Joint Workshop on NLP&LOD and SWAIE: Semantic Web, Linked Open Data and Information Extraction

Offspring from Reproduction Problems: What Replication Failure Teaches Us
Antske Fokkens | Marieke van Erp | Marten Postma | Ted Pedersen | Piek Vossen | Nuno Freire
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2009

Instance-Driven Discovery of Ontological Relation Labels
Marieke van Erp | Antal van den Bosch | Sander Wubben | Steve Hunt
Proceedings of the EACL 2009 Workshop on Language Technology and Resources for Cultural Heritage, Social Sciences, Humanities, and Education (LaTeCH – SHELT&R 2009)

Comparing Alternative Data-Driven Ontological Vistas of Natural History (short paper)
Marieke van Erp | Piroska Lendvai | Antal van den Bosch
Proceedings of the Eighth International Conference on Computational Semantics

2007

Retrieving Lost Information from Textual Databases: Rediscovering Expeditions from an Animal Specimen Database
Marieke van Erp
Proceedings of the Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2007)

2006

Identifying Named Entities in Text Databases from the Natural History Domain
Caroline Sporleder | Marieke van Erp | Tijn Porcelijn | Antal van den Bosch | Pim Arntzen
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper, we investigate whether it is possible to bootstrap a named entity tagger for textual databases by exploiting the database structure to automatically generate domain and database-specific gazetteer lists. We compare three tagging strategies: (i) using the extracted gazetteers in a look-up tagger, (ii) using the gazetteers to automatically extract training data to train a database-specific tagger, and (iii) using a generic named entity tagger. Our results suggest that automatically built gazetteers in combination with a look-up tagger lead to a relatively good performance and that generic taggers do not perform particularly well on this type of data.

Spotting the ‘Odd-one-out’: Data-Driven Error Detection and Correction in Textual Databases
Caroline Sporleder | Marieke van Erp | Tijn Porcelijn | Antal van den Bosch
Proceedings of the Workshop on Adaptive Text Extraction and Mining (ATEM 2006)