Menno Van Zaanen

Also published as: Menno van Zaanen, Menno van Zannen


2024

Proceedings of the Fifth Workshop on Resources for African Indigenous Languages @ LREC-COLING 2024
Rooweither Mabuya | Muzi Matfunjwa | Mmasibidi Setaka | Menno van Zaanen
Proceedings of the Fifth Workshop on Resources for African Indigenous Languages @ LREC-COLING 2024

Adapting Nine Traditional Text Readability Measures into Sesotho
Johannes Sibeko | Menno van Zaanen
Proceedings of the Fifth Workshop on Resources for African Indigenous Languages @ LREC-COLING 2024

This article discusses the adaptation of traditional English readability measures into Sesotho, a Southern African indigenous low-resource language. We use a translated readability corpus to extract textual features from the Sesotho texts and readability levels from the English translations. We examine the correlations between the features to ensure that only non-competing features are used in the readability metrics. Next, through linear regression analyses, we examine the impact of the text features from the Sesotho texts on the overall readability levels (which are gauged from the English translations). Starting from the structure of the traditional English readability measures, linear regression models identify coefficients and intercepts for the different variables considered in the readability formulas for Sesotho. In the end, we propose ten readability formulas for Sesotho (one more than the initial nine, because we provide two formulas based on the structure of the Gunning Fog index). In the Sesotho formulas, we also introduce intercepts for the Gunning Fog index, the Läsbarhets index and the Readability index, which do not have intercepts in the English variants.
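The regression step described in this abstract can be illustrated with a minimal sketch. It assumes a formula with two text features (words per sentence and syllables per word, chosen here only as an example of a traditional formula structure) plus an intercept. The function names and the toy numbers are invented for illustration and are not taken from the paper or its corpus.

```python
def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_formula(features, levels):
    """Least-squares fit of level ~ w1*f1 + ... + wk*fk + intercept (normal equations)."""
    rows = [list(f) + [1.0] for f in features]  # constant column yields the intercept
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, levels)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical toy data: (words per sentence, syllables per word) for five texts.
toy_features = [(10, 1.5), (14, 1.8), (20, 2.1), (8, 1.3), (25, 2.0)]
# Toy levels generated from level = 0.5 * wps + 4.0 * spw - 3.0 (an invented formula).
toy_levels = [0.5 * wps + 4.0 * spw - 3.0 for wps, spw in toy_features]

w_wps, w_spw, intercept = fit_formula(toy_features, toy_levels)
print(f"level = {w_wps:.2f} * wps + {w_spw:.2f} * spw + {intercept:.2f}")
```

In the setting the abstract describes, the feature values would come from the Sesotho texts and the target levels from the English translations. Which features enter each formula depends on the structure of the original English measure.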

2023

Proceedings of the Fourth workshop on Resources for African Indigenous Languages (RAIL 2023)
Rooweither Mabuya | Don Mthobela | Mmasibidi Setaka | Menno Van Zaanen
Proceedings of the Fourth workshop on Resources for African Indigenous Languages (RAIL 2023)

2022

Detecting Multiple Transitions in Literary Texts
Nuette Heyns | Menno van Zaanen
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Identifying the high-level structure of texts provides important information when performing distant reading analysis. The structure of texts is not necessarily linear, as transitions, such as changes in the scenery or flashbacks, can be present. As a first step in identifying this structure, we aim to identify transitions in texts. Previous work (Heyns and van Zaanen, 2021) proposed a system that can successfully identify one transition in literary texts. The text is split into snippets and LDA is applied, resulting in a sequence of topics. A transition is introduced at the point that best separates the topics before and after it. In this article, we extend the existing system such that it can detect multiple transitions. Additionally, we introduce a new system that inherently handles multiple transitions in texts. The new system also relies on LDA information, but is more robust than the previous system. We apply these systems to texts with known transitions (as they are constructed by concatenating text snippets stemming from different source texts) and evaluate both systems on texts with one transition and texts with two transitions. As both systems rely on LDA to identify transitions between snippets, we also show the impact of varying the number of LDA topics on the results. The new system consistently outperforms the previous system, not only on texts with multiple transitions, but also on single-boundary texts.
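The single-transition idea in this abstract (pick the point that best separates the topic sequence) can be sketched as follows. The abstract does not state the separation criterion, so the majority-count score used here is an assumption. The function name and the example topic sequence are hypothetical, and the per-snippet topic labels are taken as already produced by LDA.

```python
from collections import Counter


def best_transition(topics):
    """Return the index i that best splits topics into topics[:i] and topics[i:].

    Assumed criterion (not from the paper): a split scores the number of snippets
    that carry the most frequent topic of their own side. Needs at least two snippets.
    """
    def majority_count(segment):
        return Counter(segment).most_common(1)[0][1]

    return max(
        range(1, len(topics)),
        key=lambda i: majority_count(topics[:i]) + majority_count(topics[i:]),
    )


# Hypothetical dominant topic per snippet, with one noisy label at position 3.
snippet_topics = [2, 2, 2, 1, 2, 5, 5, 5, 5]
print(best_transition(snippet_topics))  # prints 5
```

Handling several transitions, as the abstract describes, needs more than this sketch, for example applying the split repeatedly or scoring all boundaries jointly.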

2020

A Process-oriented Dataset of Revisions during Writing
Rianne Conijn | Emily Dux Speltz | Menno van Zaanen | Luuk Van Waes | Evgeny Chukharev-Hudilainen
Proceedings of the Twelfth Language Resources and Evaluation Conference

Revision plays a major role in writing and the analysis of writing processes. Revisions can be analyzed using a product-oriented approach (focusing on a finished product, the text that has been produced) or a process-oriented approach (focusing on the process that the writer followed to generate this product). Although several language resources exist for the product-oriented approach to revisions, there are hardly any resources available yet for an in-depth analysis of the process of revisions. Therefore, we provide an extensive dataset on revisions made during writing (accessible via https://hdl.handle.net/10411/VBDYGX). This dataset is based on keystroke data and eye tracking data of 65 students from a variety of backgrounds (undergraduate and graduate English as a first language and English as a second language students) and a variety of tasks (argumentative text and academic abstract). In total, 7,120 revisions were identified in the dataset. For each revision, 18 features have been manually annotated and 31 features have been automatically extracted. As a case study, we show two potential use cases of the dataset. In addition, future uses of the dataset are described.

Proceedings of the first workshop on Resources for African Indigenous Languages
Rooweither Mabuya | Phathutshedzo Ramukhadi | Mmasibidi Setaka | Valencia Wagner | Menno van Zaanen
Proceedings of the first workshop on Resources for African Indigenous Languages

2018

The Influence of Context on the Learning of Metrical Stress Systems Using Finite-State Machines
Cesko Voeten | Menno van Zaanen
Computational Linguistics, Volume 44, Issue 2 - June 2018

Languages vary in the way stress is assigned to syllables within words. This article investigates the learnability of stress systems in a wide range of languages. The stress systems can be described using finite-state automata with symbols indicating levels of stress (primary, secondary, or no stress). Finite-state automata have been the focus of research in the area of grammatical inference for some time now. It has been shown that finite-state machines are learnable from examples using state-merging. One such approach, which aims to learn k-testable languages, has been applied to stress systems with some success. The family of k-testable languages has been shown to be efficiently learnable (in polynomial time). Here, we extend this approach to k, l-local languages by taking not only left context, but also right context, into account. We consider empirical results testing the performance of our learner using various amounts of context (corresponding to varying definitions of phonological locality). Our results show that our approach of learning stress patterns using state-merging is more reliant on left context than on right context. Additionally, some stress systems fail to be learned by our learner using either the left-context k-testable or the left-and-right-context k, l-local learning system. A more complex merging strategy, and hence grammar representation, is required for these stress systems.
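As a rough illustration of learning a k-testable stress language from examples, the sketch below records every boundary-padded k-gram seen in the training words and accepts a new word only if all of its k-grams were observed. This is a simplified textbook-style formulation, not the authors' state-merging learner, and it does not cover the k, l-local extension. The stress encoding (`1` primary, `2` secondary, `0` unstressed), the boundary markers, and the function names are assumptions made for this example.

```python
def kgrams(word, k):
    """Return the set of k-grams of a word padded with boundary markers '<' and '>'."""
    padded = "<" * (k - 1) + word + ">" * (k - 1)
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}


def learn(samples, k):
    """Collect all k-grams observed in the positive examples."""
    allowed = set()
    for word in samples:
        allowed |= kgrams(word, k)
    return allowed


def accepts(allowed, word, k):
    """A word is accepted if every one of its k-grams was seen during learning."""
    return kgrams(word, k) <= allowed


# Hypothetical training data: words with primary stress on the first syllable.
training = ["1", "10", "100", "1000"]
grammar = learn(training, k=2)

print(accepts(grammar, "10000", 2))  # True: generalises to a longer word
print(accepts(grammar, "010", 2))    # False: unstressed initial syllable never seen
print(accepts(grammar, "101", 2))    # False: stress after an unstressed syllable never seen
```

A larger k gives the learner more context per decision, which is the kind of variation in context size the abstract investigates.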

A Multilingual Wikified Data Set of Educational Material
Iris Hendrickx | Eirini Takoulidou | Thanasis Naskos | Katia Lida Kermanidis | Vilelmini Sosoni | Hugo de Vos | Maria Stasimioti | Menno van Zaanen | Panayota Georgakopoulou | Valia Kordoni | Maja Popovic | Markus Egg | Antal van den Bosch
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Translation Crowdsourcing: Creating a Multilingual Corpus of Online Educational Content
Vilelmini Sosoni | Katia Lida Kermanidis | Maria Stasimioti | Thanasis Naskos | Eirini Takoulidou | Menno van Zaanen | Sheila Castilho | Panayota Georgakopoulou | Valia Kordoni | Markus Egg
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Improving Machine Translation of Educational Content via Crowdsourcing
Maximiliana Behnke | Antonio Valerio Miceli Barone | Rico Sennrich | Vilelmini Sosoni | Thanasis Naskos | Eirini Takoulidou | Maria Stasimioti | Menno van Zaanen | Sheila Castilho | Federico Gaspari | Panayota Georgakopoulou | Valia Kordoni | Markus Egg | Katia Lida Kermanidis
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2016

TraMOOC (Translation for Massive Open Online Courses): providing reliable MT for MOOCs
Valia Kordoni | Lexi Birch | Ioana Buliga | Kostadin Cholakov | Markus Egg | Federico Gaspari | Yota Georgakopolou | Maria Gialama | Iris Hendrickx | Mitja Jermol | Katia Kermanidis | Joss Moorkens | Davor Orlic | Michael Papadopoulos | Maja Popović | Rico Sennrich | Vilelmini Sosoni | Dimitrios Tsoumakos | Antal van den Bosch | Menno van Zaanen | Andy Way
Proceedings of the 19th Annual Conference of the European Association for Machine Translation: Projects/Products

2015

TraMOOC: Translation for Massive Open Online Courses
Valia Kordoni | Kostadin Cholakov | Markus Egg | Andy Way | Lexi Birch | Katia Kermanidis | Vilelmini Sosoni | Dimitrios Tsoumakos | Antal van den Bosch | Iris Hendrickx | Michael Papadopoulos | Panayota Georgakopoulou | Maria Gialama | Menno van Zaanen | Ioana Buliga | Mitja Jermol | Davor Orlic
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

2014

The Development of Dutch and Afrikaans Language Resources for Compound Boundary Analysis
Menno van Zaanen | Gerhard van Huyssteen | Suzanne Aussems | Chris Emmery | Roald Eiselen
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In most languages, new words can be created through the process of compounding, which combines two or more words into a new lexical unit. Whereas in languages such as English the components that make up a compound are separated by a space, in languages such as Finnish, German, Afrikaans and Dutch these components are concatenated into one word. Compounding is very productive and leads to practical problems in developing machine translators and spelling checkers, as newly formed compounds cannot be found in existing lexicons. The Automatic Compound Processing (AuCoPro) project deals with the analysis of compounds in two closely-related languages, Afrikaans and Dutch. In this paper, we present the development and evaluation of two datasets, one for each language, that contain compound words with annotated compound boundaries. Such datasets can be used to train classifiers to identify the compound components in novel compounds. We describe the process of annotation and provide an overview of the annotation guidelines as well as global properties of the datasets. The inter-rater agreements between the annotators are considered highly reliable. Furthermore, we show the usability of these datasets by building an initial automatic compound boundary detection system, which assigns compound boundaries with approximately 90% accuracy.

Proceedings of the First Workshop on Computational Approaches to Compound Analysis (ComAComA 2014)
Ben Verhoeven | Walter Daelemans | Menno van Zaanen | Gerhard van Huyssteen
Proceedings of the First Workshop on Computational Approaches to Compound Analysis (ComAComA 2014)

Automatic Compound Processing: Compound Splitting and Semantic Analysis for Afrikaans and Dutch
Ben Verhoeven | Menno van Zaanen | Walter Daelemans | Gerhard van Huyssteen
Proceedings of the First Workshop on Computational Approaches to Compound Analysis (ComAComA 2014)

OpenSoNaR: user-driven development of the SoNaR corpus interfaces
Martin Reynaert | Matje van de Camp | Menno van Zaanen
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: System Demonstrations

2011

Formal and Empirical Grammatical Inference
Jeffrey Heinz | Colin de la Higuera | Menno van Zannen
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

2009

Proceedings of the EACL 2009 Workshop on Computational Linguistic Aspects of Grammatical Inference
Menno van Zaanen | Colin de la Higuera
Proceedings of the EACL 2009 Workshop on Computational Linguistic Aspects of Grammatical Inference

Grammatical Inference and Computational Linguistics
Menno van Zaanen | Colin de la Higuera
Proceedings of the EACL 2009 Workshop on Computational Linguistic Aspects of Grammatical Inference

Language Models for Contextual Error Detection and Correction
Herman Stehouwer | Menno van Zaanen
Proceedings of the EACL 2009 Workshop on Computational Linguistic Aspects of Grammatical Inference

2007

Named Entity Recognition in Question Answering of Speech Data
Diego Mollá | Menno van Zaanen | Steve Cassidy
Proceedings of the Australasian Language Technology Workshop 2007

2006

Named Entity Recognition for Question Answering
Diego Mollá | Menno van Zaanen | Daniel Smith
Proceedings of the Australasian Language Technology Workshop 2006

2005

DEMOCRAT: Deciding between Multiple Outputs Created by Automatic Translation
Menno van Zaanen | Harold Somers
Proceedings of Machine Translation Summit X: Papers

Proceedings of the Australasian Language Technology Workshop 2005
Timothy Baldwin | James Curran | Menno van Zaanen
Proceedings of the Australasian Language Technology Workshop 2005

Learning of Graph Rules for Question Answering
Diego Molla | Menno van Zaanen
Proceedings of the Australasian Language Technology Workshop 2005

2000

ABL: Alignment-Based Learning
Menno van Zaanen
COLING 2000 Volume 2: The 18th International Conference on Computational Linguistics