Roald Eiselen


2018

pdf bib
NLP Web Services for Resource-Scarce Languages
Martin Puttkammer | Roald Eiselen | Justin Hocking | Frederik Koen
Proceedings of ACL 2018, System Demonstrations

In this paper, we present a project where existing text-based core technologies were ported to Java-based web services from various architectures. These technologies were developed over a period of eight years through various government funded projects for 10 resource-scarce languages spoken in South Africa. We describe the API and a simple web front-end capable of completing various predefined tasks.

2016

pdf bib
South African Language Resources: Phrase Chunking
Roald Eiselen
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Phrase chunking remains an important natural language processing (NLP) technique for intermediate syntactic processing. This paper describes the development of protocols, annotated phrase chunking data sets and automatic phrase chunkers for ten South African languages. Various problems with adapting the existing annotation protocols of English are discussed as well as an overview of the annotated data sets. Based on the annotated sets, CRF-based phrase chunkers are created and tested with a combination of different features, including part of speech tags and character n-grams. The results of the phrase chunking evaluation show that disjunctively written languages can achieve notably better results for phrase chunking with a limited data set than conjunctive languages, but that the addition of character n-grams improve the results for conjunctive languages.

pdf bib
Government Domain Named Entity Recognition for South African Languages
Roald Eiselen
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper describes the named entity language resources developed as part of a development project for the South African languages. The development efforts focused on creating protocols and annotated data sets with at least 15,000 annotated named entity tokens for ten of the official South African languages. The description of the protocols and annotated data sets provide an overview of the problems encountered during the annotation of the data sets. Based on these annotated data sets, CRF named entity recognition systems are developed that leverage existing linguistic resources. The newly created named entity recognisers are evaluated, with F-scores of between 0.64 and 0.77, and error analysis is performed to identify possible avenues for improving the quality of the systems.

2014

pdf bib
Developing Text Resources for Ten South African Languages
Roald Eiselen | Martin Puttkammer
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The development of linguistic resources for use in natural language processing is of utmost importance for the continued growth of research and development in the field, especially for resource-scarce languages. In this paper we describe the process and challenges of simultaneously developing multiple linguistic resources for ten of the official languages of South Africa. The project focussed on establishing a set of foundational resources that can foster further development of both resources and technologies for the NLP industry in South Africa. The development efforts during the project included creating monolingual unannotated corpora, of which a subset of the corpora for each language was annotated on token, orthographic, morphological and morphosyntactic layers. The annotated subsets includes both development and test sets and were used in the creation of five core-technologies, viz. a tokeniser, sentenciser, lemmatiser, part of speech tagger and morphological decomposer for each language. We report on the quality of these tools for each language and discuss the importance of the resources within the South African context.

pdf bib
The Development of Dutch and Afrikaans Language Resources for Compound Boundary Analysis.
Menno van Zaanen | Gerhard van Huyssteen | Suzanne Aussems | Chris Emmery | Roald Eiselen
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In most languages, new words can be created through the process of compounding, which combines two or more words into a new lexical unit. Whereas in languages such as English the components that make up a compound are separated by a space, in languages such as Finnish, German, Afrikaans and Dutch these components are concatenated into one word. Compounding is very productive and leads to practical problems in developing machine translators and spelling checkers, as newly formed compounds cannot be found in existing lexicons. The Automatic Compound Processing (AuCoPro) project deals with the analysis of compounds in two closely-related languages, Afrikaans and Dutch. In this paper, we present the development and evaluation of two datasets, one for each language, that contain compound words with annotated compound boundaries. Such datasets can be used to train classifiers to identify the compound components in novel compounds. We describe the process of annotation and provide an overview of the annotation guidelines as well as global properties of the datasets. The inter-rater agreements between the annotators are considered highly reliable. Furthermore, we show the usability of these datasets by building an initial automatic compound boundary detection system, which assigns compound boundaries with approximately 90% accuracy.