Lonneke van der Plas

Also published as: Lonneke Van Der Plas


2021

On the Language-specificity of Multilingual BERT and the Impact of Fine-tuning
Marc Tanti | Lonneke van der Plas | Claudia Borg | Albert Gatt
Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

Recent work has shown evidence that the knowledge acquired by multilingual BERT (mBERT) has two components: a language-specific and a language-neutral one. This paper analyses the relationship between them, in the context of fine-tuning on two tasks – POS tagging and natural language inference – which require the model to bring to bear different degrees of language-specific knowledge. Visualisations reveal that mBERT loses the ability to cluster representations by language after fine-tuning, a result that is supported by evidence from language identification experiments. However, further experiments on ‘unlearning’ language-specific representations using gradient reversal and iterative adversarial learning are shown not to add further improvement to the language-independent component over and above the effect of fine-tuning. The results presented here suggest that the process of fine-tuning causes a reorganisation of the model’s limited representational capacity, enhancing language-independent representations at the expense of language-specific ones.

2020

Word Probability Findings in the Voynich Manuscript
Colin Layfield | Lonneke van der Plas | Michael Rosner | John Abela
Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages

The Voynich Manuscript has baffled scholars for centuries. Some believe the elaborate 15th century codex to be a hoax whilst others believe it is a real medieval manuscript whose contents are as yet unknown. In this paper, we provide additional evidence that the text of the manuscript displays the hallmarks of a proper natural language with respect to the relationship between word probabilities and (i) average information per subword segment and (ii) the relative positioning of consecutive subword segments necessary to uniquely identify words of different probabilities.

Annotating for Hate Speech: The MaNeCo Corpus and Some Input from Critical Discourse Analysis
Stavros Assimakopoulos | Rebecca Vella Muskat | Lonneke van der Plas | Albert Gatt
Proceedings of the 12th Language Resources and Evaluation Conference

This paper presents a novel scheme for the annotation of hate speech in corpora of Web 2.0 commentary. The proposed scheme is motivated by the critical analysis of posts made in reaction to news reports on the Mediterranean migration crisis and LGBTIQ+ matters in Malta, which was conducted under the auspices of the EU-funded C.O.N.T.A.C.T. project. Based on the realisation that hate speech is not a clear-cut category to begin with, appears to belong to a continuum of discriminatory discourse and is often realised through the use of indirect linguistic means, it is argued that annotation schemes for its detection should refrain from directly including the label ‘hate speech,’ as different annotators might have different thresholds as to what constitutes hate speech and what does not. In view of this, we propose a multi-layer annotation scheme, which is pilot-tested against a binary ±hate speech classification and appears to yield higher inter-annotator agreement. Motivating the postulation of our scheme, we then present the MaNeCo corpus on which it will eventually be used: a substantial corpus of on-line newspaper comments spanning 10 years.

MASRI-HEADSET: A Maltese Corpus for Speech Recognition
Carlos Daniel Hernandez Mena | Albert Gatt | Andrea DeMarco | Claudia Borg | Lonneke van der Plas | Amanda Muscat | Ian Padovani
Proceedings of the 12th Language Resources and Evaluation Conference

Maltese, the national language of Malta, is spoken by approximately 500,000 people. Speech processing for Maltese is still in its early stages of development. In this paper, we present the first spoken Maltese corpus designed purposely for Automatic Speech Recognition (ASR). The MASRI-HEADSET corpus was developed by the MASRI project at the University of Malta. It consists of 8 hours of speech paired with text, recorded by using short text snippets in a laboratory environment. The speakers were recruited from different geographical locations all over the Maltese islands, and were roughly evenly distributed by gender. This paper also presents some initial results achieved in baseline experiments for Maltese ASR using Sphinx and Kaldi. The MASRI-HEADSET corpus is publicly available for research/academic purposes.

2019

Measuring the Compositionality of Noun-Noun Compounds over Time
Prajit Dhar | Janis Pagel | Lonneke van der Plas
Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change

We present work in progress on the temporal progression of compositionality in noun-noun compounds. Previous work has proposed computational methods for determining the compositionality of compounds. These methods try to automatically determine how transparent the meaning of the compound as a whole is with respect to the meaning of its parts. We hypothesize that such a property might change over time. We use the time-stamped Google Books corpus for our diachronic investigations, and first examine whether the vector-based semantic spaces extracted from this corpus are able to predict compositionality ratings, despite their inherent limitations. We find that using temporal information helps predict the ratings, although correlation with the ratings is lower than reported for other corpora. Finally, we show changes in compositionality over time for a selection of compounds.

Learning to Predict Novel Noun-Noun Compounds
Prajit Dhar | Lonneke van der Plas
Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019)

We introduce temporally and contextually-aware models for the novel task of predicting unseen but plausible concepts, as conveyed by noun-noun compounds in a time-stamped corpus. We train compositional models on observed compounds, more specifically the composed distributed representations of their constituents across a time-stamped corpus, while giving them corrupted instances (where the head or modifier is replaced by a random constituent) as negative evidence. The models capture generalisations over this data and learn which combinations give rise to plausible compounds and which ones do not. After training, we query the model for the plausibility of automatically generated novel combinations and verify whether the classifications are accurate. For our best model, we find that in around 85% of the cases, the novel compounds generated are attested in previously unseen data. An additional estimated 5% are plausible despite not being attested in the recent corpus, based on judgments from independent human raters.

2018

Face2Text: Collecting an Annotated Image Description Corpus for the Generation of Rich Face Descriptions
Albert Gatt | Marc Tanti | Adrian Muscat | Patrizia Paggio | Reuben A Farrugia | Claudia Borg | Kenneth P Camilleri | Michael Rosner | Lonneke van der Plas
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Survey: Multiword Expression Processing: A Survey
Mathieu Constant | Gülşen Eryiǧit | Johanna Monti | Lonneke van der Plas | Carlos Ramisch | Michael Rosner | Amalia Todirascu
Computational Linguistics, Volume 43, Issue 4 - December 2017

Multiword expressions (MWEs) are a class of linguistic forms spanning conventional word boundaries that are both idiosyncratic and pervasive across different languages. The structure of linguistic processing that depends on the clear distinction between words and phrases has to be re-thought to accommodate MWEs. The issue of MWE handling is crucial for NLP applications, where it raises a number of challenges. The emergence of solutions in the absence of guiding principles motivates this survey, whose aim is not only to provide a focused review of MWE processing, but also to clarify the nature of interactions between MWE processing and downstream applications. We propose a conceptual framework within which challenges and research contributions can be positioned. It offers a shared understanding of what is meant by “MWE processing,” distinguishing the subtasks of MWE discovery and identification. It also elucidates the interactions between MWE processing and two use cases: Parsing and machine translation. Many of the approaches in the literature can be differentiated according to how MWE processing is timed with respect to underlying use cases. We discuss how such orchestration choices affect the scope of MWE-aware systems. For each of the two MWE processing subtasks and for each of the two use cases, we conclude on open issues and research perspectives.

Evaluating Compound Splitters Extrinsically with Textual Entailment
Glorianna Jagfeld | Patrick Ziering | Lonneke van der Plas
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Traditionally, compound splitters are evaluated intrinsically on gold-standard data or extrinsically on the task of statistical machine translation. We explore a novel way for the extrinsic evaluation of compound splitters, namely recognizing textual entailment. Compound splitting has great potential for this novel task that is both transparent and well-defined. Moreover, we show that it addresses certain aspects that are either ignored in intrinsic evaluations or compensated for by task-internal mechanisms in statistical machine translation. We show significant improvements using different compound splitting methods on a German textual entailment dataset.

LCT-MALTA’s Submission to RepEval 2017 Shared Task
Hoa Trong Vu | Thuong-Hai Pham | Xiaoyu Bai | Marc Tanti | Lonneke van der Plas | Albert Gatt
Proceedings of the 2nd Workshop on Evaluating Vector Space Representations for NLP

System using BiLSTM and max pooling. Embedding is enhanced by POS, character and dependency info.

2016

Towards Unsupervised and Language-independent Compound Splitting using Inflectional Morphological Transformations
Patrick Ziering | Lonneke van der Plas
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Top a Splitter: Using Distributional Semantics for Improving Compound Splitting
Patrick Ziering | Stefan Müller | Lonneke van der Plas
Proceedings of the 12th Workshop on Multiword Expressions

The Grammar of English Deverbal Compounds and their Meaning
Gianina Iordăchioaia | Lonneke van der Plas | Glorianna Jagfeld
Proceedings of the Workshop on Grammar and Lexicon: interactions and interfaces (GramLex)

We present an interdisciplinary study on the interaction between the interpretation of noun-noun deverbal compounds (DCs; e.g., task assignment) and the morphosyntactic properties of their deverbal heads in English. Underlying hypotheses from theoretical linguistics are tested with tools and resources from computational linguistics. We start with Grimshaw’s (1990) insight that deverbal nouns are ambiguous between argument-supporting nominal (ASN) readings, which inherit verbal arguments (e.g., the assignment of the tasks), and the less verbal and more lexicalized Result Nominal and Simple Event readings (e.g., a two-page assignment). Following Grimshaw, our hypothesis is that the former will realize object arguments in DCs, while the latter will receive a wider range of interpretations like root compounds headed by non-derived nouns (e.g., chocolate box). Evidence from a large corpus assisted by machine learning techniques confirms this hypothesis, by showing that, besides other features, the realization of internal arguments by deverbal heads outside compounds (i.e., the most distinctive ASN-property in Grimshaw 1990) is a good predictor for an object interpretation of non-heads in DCs.

2015

From a Distance: Using Cross-lingual Word Alignments for Noun Compound Bracketing
Patrick Ziering | Lonneke van der Plas
Proceedings of the 11th International Conference on Computational Semantics

Predicting Pronouns across Languages with Continuous Word Spaces
Ngoc-Quan Pham | Lonneke van der Plas
Proceedings of the Second Workshop on Discourse in Machine Translation

One Tree is not Enough: Cross-lingual Accumulative Structure Transfer for Semantic Indeterminacy
Patrick Ziering | Lonneke van der Plas
Proceedings of the International Conference Recent Advances in Natural Language Processing

Towards a Better Semantic Role Labeling of Complex Predicates
Glorianna Jagfeld | Lonneke van der Plas
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop

2014

What good are ‘Nominalkomposita’ for ‘noun compounds’: Multilingual Extraction and Structure Analysis of Nominal Compositions using Linguistic Restrictors
Patrick Ziering | Lonneke van der Plas
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

Global Methods for Cross-lingual Semantic Role and Predicate Labelling
Lonneke van der Plas | Marianna Apidianaki | Chenhua Chen
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

Cross-lingual Word Sense Disambiguation for Predicate Labelling of French
Lonneke van der Plas | Marianna Apidianaki
Proceedings of TALN 2014 (Volume 1: Long Papers)

2013

Multilingual Lexicon Bootstrapping - Improving a Lexicon Induction System Using a Parallel Corpus
Patrick Ziering | Lonneke van der Plas | Hinrich Schütze
Proceedings of the Sixth International Joint Conference on Natural Language Processing

Bootstrapping Semantic Lexicons for Technical Domains
Patrick Ziering | Lonneke van der Plas | Hinrich Schütze
Proceedings of the Sixth International Joint Conference on Natural Language Processing

2011

Scaling up Automatic Cross-Lingual Semantic Role Annotation
Lonneke van der Plas | Paola Merlo | James Henderson
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2010

Cross-Lingual Validity of PropBank in the Manual Annotation of French
Lonneke van der Plas | Tanja Samardžić | Paola Merlo
Proceedings of the Fourth Linguistic Annotation Workshop

Finding Medical Term Variations using Parallel Corpora and Distributional Similarity
Lonneke van der Plas | Jörg Tiedemann
Proceedings of the 6th Workshop on Ontologies and Lexical Resources

2009

Domain Adaptation with Artificial Data for Semantic Parsing of Speech
Lonneke van der Plas | James Henderson | Paola Merlo
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers

Abstraction and Generalisation in Semantic Role Labels: PropBank, VerbNet or both?
Paola Merlo | Lonneke Van Der Plas
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

Combining Syntactic Co-occurrences and Nearest Neighbours in Distributional Methods to Remedy Data Sparseness.
Lonneke van der Plas
Proceedings of the Workshop on Unsupervised and Minimally Supervised Learning of Lexical Semantics

2008

Using Lexico-Semantic Information for Query Expansion in Passage Retrieval for Question Answering
Lonneke van der Plas | Jörg Tiedemann
Coling 2008: Proceedings of the 2nd workshop on Information Retrieval for Question Answering

2006

Finding Synonyms Using Automatic Word Alignment and Measures of Distributional Similarity
Lonneke van der Plas | Jörg Tiedemann
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

2005

Automatic Acquisition of Lexico-semantic Knowledge for QA
Lonneke van der Plas | Gosse Bouma
Proceedings of OntoLex 2005 - Ontologies and Lexical Resources

2004

Automatic Keyword Extraction from Spoken Text. A Comparison of Two Lexical Resources: EDR and WordNet
Lonneke van der Plas | Vincenzo Pallotta | Martin Rajman | Hatem Ghorbel
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)