Laurette Pretorius


Missed opportunities in translation memory matching
Friedel Wolff | Laurette Pretorius | Paul Buitelaar
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

A translation memory system stores a data set of source-target pairs of translations. It attempts to respond to a query in the source language with a useful target text from the data set to assist a human translator. Such systems estimate the usefulness of a target text suggestion according to the similarity of its associated source text to the source text query. This study analyses two data sets in two language pairs each to find highly similar target texts, which would be useful mutual suggestions. We further investigate which of these useful suggestions cannot be selected through source text similarity, and we do a thorough analysis of these cases to categorise and quantify them. This analysis provides insight into areas where the recall of translation memory systems can be improved. Specifically, source texts with an omission and semantically very similar source texts are among the more frequent cases in which useful target text suggestions are not selected by the baseline approach of simple edit distance between the source texts.
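The baseline named in the abstract, selecting a suggestion by edit distance between source texts, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the character-level Levenshtein distance, the length-normalised similarity score, the 0.7 threshold, and the names `edit_distance`, `similarity` and `best_match` are all assumptions made for the example, and the memory entries are made-up placeholders.

```python
from typing import List, Optional, Tuple


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; the normalisation by length is an assumption."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def best_match(query: str,
               memory: List[Tuple[str, str]],
               threshold: float = 0.7) -> Optional[Tuple[float, str, str]]:
    """Return (score, source, target) for the most similar source text,
    or None if no source text reaches the (assumed) threshold."""
    best = None
    for source, target in memory:
        score = similarity(query, source)
        if score >= threshold and (best is None or score > best[0]):
            best = (score, source, target)
    return best


# Made-up placeholder memory: (source text, target text) pairs.
memory = [
    ("Open the file", "<target text 1>"),
    ("Save the file", "<target text 2>"),
]
match = best_match("Open the files", memory)
```

Because the score depends only on the source side, a stored pair whose target text would be useful is never returned when its source text falls below the threshold, which is the kind of missed suggestion the study quantifies.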


Base Concepts in the African Languages Compared to Upper Ontologies and the WordNet Top Ontology
Winston Anderson | Laurette Pretorius | Albert Kotzé
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Ontologies, and in particular upper ontologies, are foundational to the establishment of the Semantic Web. Upper ontologies are used as equivalence formalisms between domain specific ontologies. Multilingualism brings one of the key challenges to the development of these ontologies. Fundamental to the challenges of defining upper ontologies is the assumption that concepts are universally shared. The approach to developing linguistic ontologies aligned to upper ontologies, particularly in the non-Indo-European language families, has highlighted these challenges. Previously two approaches to developing new linguistic ontologies and the influence of these approaches on the upper ontologies have been well documented. These approaches are examined in a unique new context: the African, and in particular, the Bantu languages. In particular, we address the following two questions: Which approach is better for the alignment of the African languages to upper ontologies? Can the concepts that are linguistically shared amongst the African languages be aligned easily with upper ontology concepts claimed to be universally shared?

Work on Spoken (Multimodal) Language Corpora in South Africa
Jens Allwood | Harald Hammarström | Andries Hendrikse | Mtholeni N. Ngcobo | Nozibele Nomdebevana | Laurette Pretorius | Mac van der Merwe
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper describes past, ongoing and planned work on the collection and transcription of spoken language samples for all the South African official languages and, as part of this, the training of researchers in corpus linguistic research skills. More specifically, the work has involved (and still involves) establishing an international corpus linguistic network linked to a network hub at a UNISA website and the development of research tools, a corpus research guide and a workbook for multimodal communication and spoken language corpus research. As an example of the work we are doing and hope to do more of in the future, we present a small pilot study of the influence of English and Afrikaans on the 100 most frequent words in spoken Xhosa, as evidenced in the corpus of spoken interaction we have gathered so far. Other planned work, besides work on spoken language phenomena, involves comparison of spoken and written language and work on communicative body movements (gestures) and their relation to speech.


Setswana Tokenisation and Computational Verb Morphology: Facing the Challenge of a Disjunctive Orthography
Rigardt Pretorius | Ansu Berg | Laurette Pretorius | Biffie Viljoen
Proceedings of the First Workshop on Language Technologies for African Languages

Exploiting Cross-Linguistic Similarities in Zulu and Xhosa Computational Morphology
Laurette Pretorius | Sonja Bosch
Proceedings of the First Workshop on Language Technologies for African Languages


Experimental Fast-Tracking of Morphological Analysers for Nguni Languages
Sonja Bosch | Laurette Pretorius | Kholisa Podile | Axel Fleisch
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The development of natural language processing (NLP) components is resource-intensive, which justifies exploring ways of reducing development time and effort. This paper addresses the experimental fast-tracking of the development of finite-state morphological analysers for Xhosa, Swati and (Southern) Ndebele by using an existing morphological analyser prototype for Zulu. The research question is whether fast-tracking is feasible across the language boundaries between these closely related varieties. The objective is a thorough assessment of recognition rates yielded by the Zulu morphological analyser for the three related languages. The strategy is to use techniques comprising several cycles of the following steps: applying the analyser to corpus data from all languages, identifying failures, and implementing the respective changes in the analyser. Tests show that the high degree of shared typological properties and formal similarities among the Nguni varieties warrants a modular fast-tracking approach. Word forms recognized by the Zulu analyser were mostly adequately interpreted. Therefore, the focus lies on providing adaptations based on failure output analysis for each language. As a result, the development of analysers for Xhosa, Swati and Ndebele is considerably faster than the creation of the Zulu prototype. The paper concludes with comments on the feasibility of the experiment, and the results of the evaluation.
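One cycle of the strategy described above (apply the analyser to corpus data, then identify failures) can be sketched as follows. This is only an illustration under stated assumptions: the analyser is modelled as a hypothetical Python callable that returns a list of analyses (empty when a word form is not recognised), whereas the paper's analysers are finite-state; the names `Analyser`, `evaluate` and `toy_analyser` and the placeholder word forms are invented for the example.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical interface: an analyser maps a word form to its analyses,
# returning an empty list when the form is not recognised.
Analyser = Callable[[str], List[str]]


def evaluate(analyser: Analyser,
             word_forms: Iterable[str]) -> Tuple[float, List[str]]:
    """Return the recognition rate and the list of unrecognised forms."""
    forms = list(word_forms)
    failures = [w for w in forms if not analyser(w)]
    rate = (len(forms) - len(failures)) / len(forms) if forms else 0.0
    return rate, failures


# Toy stand-in for a real analyser: recognises two placeholder forms.
def toy_analyser(word: str) -> List[str]:
    known = {"w1", "w2"}
    return ["<analysis>"] if word in known else []


rate, failures = evaluate(toy_analyser, ["w1", "w2", "w3", "w4"])
```

The failure list is what drives the next cycle: each unrecognised form is inspected, the analyser is adapted for the language in question, and the evaluation is repeated.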


Towards machine-readable lexicons for South African Bantu languages
Sonja E. Bosch | Laurette Pretorius | Jackie Jones
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Lexical information for South African Bantu languages is not readily available in the form of machine-readable lexicons. At present the availability of lexical information is restricted to a variety of paper dictionaries. These dictionaries display considerable diversity in the organisation and representation of data. In order to proceed towards the development of reusable and suitably standardised machine-readable lexicons for these languages, a data model for lexical entries becomes a prerequisite. In this study the general purpose model as developed by Bell & Bird (2000) is used as a point of departure. Firstly, the extent to which the Bell & Bird (2000) data model may be applied to and modified for the above-mentioned languages is investigated. Initial investigations indicate that modification of this data model is necessary to make provision for the specific requirements of lexical entries in these languages. Secondly, a data model in the form of an XML DTD for the languages in question, based on our findings regarding Bell & Bird (2000) and Weber (2002), is presented. Included in this model are additional particular requirements for complete and appropriate representation of linguistic information as identified in the study of available paper dictionaries.


Software Tools for Morphological Tagging of Zulu Corpora and Lexicon Development
Sonja E. Bosch | Laurette Pretorius
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

The aim of this paper is to discuss aspects of an on-going project on the development of grammatical and lexical resources for Zulu with sufficient coverage for unrestricted text. We explain how the basic software tools of computational morphology are used in linguistic processing, more specifically for automatic word form recognition and morphological tagging of the growing stock of electronic text corpora of a Bantu language such as Zulu. It is also shown how a machine-readable lexicon is in turn enhanced with the information acquired and extracted by means of such corpus analysis.