Hessel Haagsma

2020

pdf abs
MAGPIE: A Large Corpus of Potentially Idiomatic Expressions
Hessel Haagsma | Johan Bos | Malvina Nissim
Proceedings of the Twelfth Language Resources and Evaluation Conference

Given the limited size of existing idiom corpora, we aim to enable progress in automatic idiom processing and linguistic analysis by creating the largest-to-date corpus of idioms for English. Using a fixed idiom list, automatic pre-extraction, and a strictly controlled crowdsourced annotation procedure, we show that it is feasible to build a high-quality corpus comprising more than 50K instances, an order of a magnitude larger than previous resources. Crucial ingredients of crowdsourcing were the selection of crowdworkers, clear and comprehensive instructions, and an interface that breaks down the task in small, manageable steps. Analysis of the resulting corpus revealed strong effects of genre on idiom distribution, providing new evidence for existing theories on what influences idiom usage. The corpus also contains rich metadata, and is made publicly available.

2019

pdf bib
Proceedings of the IWCS Shared Task on Semantic Parsing
Lasha Abzianidze | Rik van Noord | Hessel Haagsma | Johan Bos
Proceedings of the IWCS Shared Task on Semantic Parsing

pdf bib abs
The First Shared Task on Discourse Representation Structure Parsing
Lasha Abzianidze | Rik van Noord | Hessel Haagsma | Johan Bos
Proceedings of the IWCS Shared Task on Semantic Parsing

The paper presents the IWCS 2019 shared task on semantic parsing where the goal is to produce Discourse Representation Structures (DRSs) for English sentences. DRSs originate from Discourse Representation Theory and represent scoped meaning representations that capture the semantics of negation, modals, quantification, and presupposition triggers. Additionally, concepts and event-participants in DRSs are described with WordNet synsets and the thematic roles from VerbNet. To measure similarity between two DRSs, they are represented in a clausal form, i.e. as a set of tuples. Participant systems were expected to produce DRSs in this clausal form. Taking into account the rich lexical information, explicit scope marking, a high number of shared variables among clauses, and highly-constrained format of valid DRSs, all these makes the DRS parsing a challenging NLP task. The results of the shared task displayed improvements over the existing state-of-the-art parser.

2018

pdf abs
The Other Side of the Coin: Unsupervised Disambiguation of Potentially Idiomatic Expressions by Contrasting Senses
Hessel Haagsma | Malvina Nissim | Johan Bos
Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018)

Disambiguation of potentially idiomatic expressions involves determining the sense of a potentially idiomatic expression in a given context, e.g. determining that make hay in ‘Investment banks made hay while takeovers shone.’ is used in a figurative sense. This enables automatic interpretation of idiomatic expressions, which is important for applications like machine translation and sentiment analysis. In this work, we present an unsupervised approach for English that makes use of literalisations of idiom senses to improve disambiguation, which is based on the lexical cohesion graph-based method by Sporleder and Li (2009). Experimental results show that, while literalisation carries novel information, its performance falls short of that of state-of-the-art unsupervised methods.

pdf abs
PickleTeam! at SemEval-2018 Task 2: English and Spanish Emoji Prediction from Tweets
Daphne Groot | Rémon Kruizinga | Hennie Veldthuis | Simon de Wit | Hessel Haagsma
Proceedings of the 12th International Workshop on Semantic Evaluation

We present a system for emoji prediction on English and Spanish tweets, prepared for the SemEval-2018 task on Multilingual Emoji Prediction. We compared the performance of an SVM, LSTM and an ensemble of these two. We found the SVM performed best on our development set with an accuracy of 61.3% for English and 83% for Spanish. The features used for the SVM are lowercased word n-grams in the range of 1 to 20, tokenised by a TweetTokenizer and stripped of stop words. On the test set, our model achieved an accuracy of 34% on English, with a slightly lower score of 29.7% accuracy on Spanish.

pdf
Evaluating Scoped Meaning Representations
Rik van Noord | Lasha Abzianidze | Hessel Haagsma | Johan Bos
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf abs
The Parallel Meaning Bank: Towards a Multilingual Corpus of Translations Annotated with Compositional Meaning Representations
Lasha Abzianidze | Johannes Bjerva | Kilian Evang | Hessel Haagsma | Rik van Noord | Pierre Ludmann | Duc-Duy Nguyen | Johan Bos
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

The Parallel Meaning Bank is a corpus of translations annotated with shared, formal meaning representations comprising over 11 million words divided over four languages (English, German, Italian, and Dutch). Our approach is based on cross-lingual projection: automatically produced (and manually corrected) semantic annotations for English sentences are mapped onto their word-aligned translations, assuming that the translations are meaning-preserving. The semantic annotation consists of five main steps: (i) segmentation of the text in sentences and lexical items; (ii) syntactic parsing with Combinatory Categorial Grammar; (iii) universal semantic tagging; (iv) symbolization; and (v) compositional semantic analysis based on Discourse Representation Theory. These steps are performed using statistical models trained in a semi-supervised manner. The employed annotation models are all language-neutral. Our first results are promising.

Hessel Haagsma

2020

2019

2018

2017

2016

Co-authors

Venues