Evaluating automatic cross-domain Dutch semantic role annotation
Orphée De Clercq
Veronique Hoste
Paola Monachesi
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
In this paper we present the first corpus where one million Dutch words from a variety of text genres have been annotated with semantic roles. 500K have been completely manually verified and used as training material to automatically label another 500K. All data has been annotated following an adapted version of the PropBank guidelines. The corpus's rich text type diversity and the availability of manually verified syntactic dependency structures allowed us to experiment with an existing semantic role labeler for Dutch. In order to test the system's portability across various domains, we experimented with training on individual domains and compared this with training on multiple domains by adding more data. Our results show that training on large data sets is necessary but that including genre-specific training material is also crucial to optimize classification. We observed that a small amount of in-domain training data is already sufficient to improve our semantic role labeler.
An Examination of Cross-Cultural Similarities and Differences from Social Media Data with respect to Language Use
Mohammad Fazleh Elahi
Paola Monachesi
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
We present a methodology for analyzing cross-cultural similarities and differences using language as a medium, love as a domain, social media as a data source and 'Terms' and 'Topics' as cultural features. We discuss the techniques necessary for the creation of the social data corpus from which emotion terms have been extracted using NLP techniques. Topics of love discussion were then extracted from the corpus by means of Latent Dirichlet Allocation (LDA). Finally, on the basis of these features, a cross-cultural comparison was carried out. For the purpose of cross-cultural analysis, the experimental focus was on comparing data from a culture from the East (India) with a culture from the West (United States of America). Similarities and differences between these cultures have been analyzed with respect to the usage of emotions, their intensities and the topics used during love discussion in social media.
Interacting Semantic Layers of Annotation in SoNaR, a Reference Corpus of Contemporary Written Dutch
Ineke Schuurman
Véronique Hoste
Paola Monachesi
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
This paper reports on the annotation of a corpus of 1 million words with four semantic annotation layers, including named entities, co-reference relations, semantic roles and spatial and temporal expressions. These semantic annotation layers can benefit from the manually verified part of speech tagging, lemmatization and syntactic analysis (dependency tree) information layers which resulted from an earlier project (Van Noord et al., 2006) and will thus result in a deeply syntactically and semantically annotated corpus. This annotation effort is carried out in the framework of a larger project which aims at the collection of a 500-million word corpus of contemporary Dutch, covering the variants used in the Netherlands and Flanders, the Dutch speaking part of Belgium. All the annotation schemes used were (co-)developed by the authors within the Flemish-Dutch STEVIN-programme as no previous schemes for Dutch were available. They were created taking into account standards (either de facto or official (like ISO)) used elsewhere.
Socially Driven Ontology Enrichment for eLearning
Paola Monachesi
Thomas Markus
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
One of the objectives of the Language Technologies for Life-Long Learning (LTfLL) project is to develop a knowledge sharing system that connects learners to resources and learners to other learners. To this end, we complement the formal knowledge represented by existing domain ontologies with the informal knowledge emerging from social tagging. More specifically, we crawl data from social media applications such as Delicious, Slideshare and YouTube. Similarity measures are employed to select possible lexicalizations of concepts that are related to the ones present in the given ontology and which are assumed to be socially relevant with respect to the input lexicalization. In order to identify the appropriate relationships which exist between the extracted related terms and the existing domain ontology, we employ several heuristics that rely on the use of a large background knowledge base, such as DBpedia. An evaluation of the resulting ontology has been carried out. The methodology proposed allows for an appropriate enrichment process and produces a complementary vocabulary to that of a domain expert.
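As a rough illustration of how a similarity measure can select related tags from social tagging data, the sketch below compares tags by the cosine similarity of their co-occurrence vectors. The abstract does not say which measures were used, so the choice of cosine similarity, the function names, the threshold and the toy data are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(tagged_resources):
    """Map each tag to a Counter of the tags it co-occurs with."""
    vectors = defaultdict(Counter)
    for tags in tagged_resources:
        unique = set(tags)
        for tag in unique:
            for other in unique:
                if other != tag:
                    vectors[tag][other] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    norm = norm_u * norm_v
    return dot / norm if norm else 0.0


def related_tags(seed, tagged_resources, threshold=0.5):
    """Return tags whose similarity to the seed reaches the (assumed) threshold."""
    vectors = cooccurrence_vectors(tagged_resources)
    if seed not in vectors:
        return []
    scores = {
        tag: cosine(vectors[seed], vec)
        for tag, vec in vectors.items()
        if tag != seed
    }
    return sorted(tag for tag, score in scores.items() if score >= threshold)


# Toy data: each inner list holds the tags one user gave to one resource.
resources = [
    ["html", "css", "web"],
    ["html", "web"],
    ["css", "web"],
    ["cooking", "recipes"],
]
print(related_tags("html", resources))
```

In this toy data "css" is selected for the seed "html" because both tags co-occur with "web" in the same proportions. Deciding how a selected tag relates to the ontology would be a separate step, which the abstract attributes to heuristics over a background knowledge base.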
Ontology Engineering and Knowledge Extraction for Cross-Lingual Retrieval
Jantine Trapman
Paola Monachesi
Proceedings of the International Conference RANLP-2009
From D-Coi to SoNaR: a reference corpus for Dutch
Nelleke Oostdijk
Martin Reynaert
Paola Monachesi
Gertjan Van Noord
Roeland Ordelman
Ineke Schuurman
Vincent Vandeghinste
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
The computational linguistics community in the Netherlands and Belgium has long recognized the dire need for a major reference corpus of written Dutch. In part to answer this need, the STEVIN programme was established. To pave the way for the effective building of a 500-million-word reference corpus of written Dutch, a pilot project was established. The Dutch Corpus Initiative project or D-Coi was highly successful in that it not only realized about 10% of the projected large reference corpus, but also established the best practices and developed all the protocols and the necessary tools for building the larger corpus within the confines of a necessarily limited budget. We outline the steps involved in an endeavour of this kind, including the major highlights and possible pitfalls. Once converted to a suitable XML format, further linguistic annotation based on the state-of-the-art tools developed either before or during the pilot by the consortium partners proved easily and fruitfully applicable. Linguistic enrichment of the corpus includes PoS tagging, syntactic parsing and semantic annotation, involving both semantic role labeling and spatiotemporal annotation. D-Coi is expected to be followed by SoNaR, during which the 500-million-word reference corpus of Dutch should be built.
Extraction and Evaluation of Keywords from Learning Objects: a Multilingual Approach
Lothar Lemnitzer
Paola Monachesi
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
We report on a project which brings together Natural Language Processing and eLearning. One of the functionalities developed within this project is the possibility to annotate learning objects semi-automatically with keywords. To this end, a keyword extractor has been created which is able to handle documents in 8 languages. The approach employed is based on a linguistic processing step which is followed by a filtering step of candidate keywords and their subsequent ranking based on frequency criteria. Three tests have been carried out to provide a rough evaluation of the performance of the tool, to measure inter-annotator agreement in order to determine the complexity of the task and to evaluate the acceptance of the proposed keywords by users.
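The three stages named in the abstract (linguistic processing, candidate filtering, frequency-based ranking) can be sketched as follows. This is a minimal illustration and not the project's tool: plain tokenization stands in for the linguistic processing, and the stop list, length cut-off and raw-frequency ranking are assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical stop list; the filters used by the original tool are not described here.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "for", "on", "with"}


def extract_keywords(text, top_n=3):
    """Return the top_n candidate keywords of a text, ranked by frequency."""
    # Stand-in for linguistic processing: lowercase and split into word tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Filtering step: drop stop words and very short tokens.
    candidates = [t for t in tokens if t not in STOPWORDS and len(t) > 2]
    # Ranking step: order by raw frequency, breaking ties alphabetically.
    counts = Counter(candidates)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_n]]


text = (
    "Ontologies describe concepts. Ontologies link concepts to learning "
    "objects, and learning objects use ontologies."
)
print(extract_keywords(text))
```

A multilingual version would need at least a language-specific stop list and lemmatizer per language, which is where a real linguistic processing step would replace the regular expression used here.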
Creating Glossaries Using Pattern-Based and Machine Learning Techniques
Eline Westerhout
Paola Monachesi
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
One of the aims of the Language Technology for eLearning project is to show that Natural Language Processing techniques can be employed to enhance the learning process. To this end, one of the functionalities that has been developed is a pattern-based glossary candidate detector which is capable of extracting definitions in eight languages. In order to improve the results obtained with the pattern-based approach, machine learning techniques are applied to the Dutch results to filter out incorrectly extracted definitions. In this paper, we discuss the machine learning techniques used and we present the results of the quantitative evaluation. We also discuss the integration of the tool into the Learning Management System ILIAS.
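To illustrate what a pattern-based definition detector does in its simplest form, the sketch below matches sentences of the shape "X is a Y". The single English pattern and the function name are assumptions made for this example; the abstract does not describe the patterns actually used, and the machine learning filter it mentions is not reproduced here.

```python
import re

# Hypothetical pattern for sentences of the form "<term> is a/an/the <description>."
IS_A_PATTERN = re.compile(
    r"^(?P<term>[A-Z][\w\- ]*?) is (?:a|an|the) (?P<gloss>.+)\.$"
)


def find_definitions(sentences):
    """Return (term, gloss) pairs for sentences that match the pattern."""
    found = []
    for sentence in sentences:
        match = IS_A_PATTERN.match(sentence.strip())
        if match:
            found.append((match.group("term"), match.group("gloss")))
    return found


sentences = [
    "XML is a markup language.",
    "We ran three tests.",
    "An ontology is a formal description of concepts.",
]
for term, gloss in find_definitions(sentences):
    print(term, "->", gloss)
```

A surface pattern like this also matches sentences that are not definitions (for example "This is a problem."), which is the kind of incorrect extraction that a subsequent filtering step would aim to remove.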
Adding Semantic Role Annotation to a Corpus of Written Dutch
Paola Monachesi
Gerwert Stevens
Jantine Trapman
Proceedings of the Linguistic Annotation Workshop
A pilot study for a Corpus of Dutch Aphasic Speech (CoDAS)
Eline Westerhout
Paola Monachesi
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
In this paper, a pilot study for the development of a corpus of Dutch Aphasic Speech (CoDAS) is presented. Given the lack of resources of this kind not only for Dutch but also for other languages, CoDAS will be able to set standards and will contribute to future research in this area. Given the special character of the speech contained in CoDAS, we cannot simply carry over the design and annotation protocols of existing corpora, such as the Corpus Gesproken Nederlands (CGN) or CHILDES. However, they have been taken as a starting point. We have investigated whether and how the procedures and protocols for the annotation (part-of-speech tagging) and transcription (orthographic and phonetic) used for the CGN should be adapted in order to annotate and transcribe aphasic speech properly. In addition, we have established the basic requirements with respect to text types, metadata, and annotation levels that CoDAS should fulfill.
A unified system for accessing typological databases
Paola Monachesi
Alexis Dimitriadis
Rob Goedemans
Anne-Marie Mineur
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)
Object clitics and clitic climbing in Italian HPSG grammar
Paola Monachesi
Sixth Conference of the European Chapter of the Association for Computational Linguistics