This is an internal, incomplete preview of a proposed change to the ACL Anthology.
For efficiency reasons, we generate only three BibTeX files per volume, and the preview may be incomplete in other ways, or contain mistakes.
Do not treat this content as an official publication.
This contribution describes a free and open mobile dictionary app based on open dictionary data. A specific focus is on usability and user-adequate presentation of data. This includes, in addition to the alphabetical lemma ordering, other vocabulary selection, grouping, and access criteria. Beyond search functionality for stems or roots – required due to the morphological complexity of Bantu languages – grouping of lemmas by subject area of varying difficulty allows customization. A dictionary profile defines available presentation options of the dictionary data in the app and can be specified according to the needs of the respective user group. Word embeddings and similar approaches are used to link to semantically similar or related words. The underlying data structure is open for monolingual, bilingual or multilingual dictionaries and also supports the connection to complex external resources like Wordnets. The application in its current state focuses on Xhosa and Zulu dictionary data but more resources will be integrated soon.
Creating a new wordnet is by no means a trivial task and when the target language is under-resourced as is the case for the languages currently included in the multilingual African Wordnet (AfWN), developers need to rely heavily on human expertise. During the different phases of development of the AfWN, we incorporated various methods of fast-tracking to ease the tedious and time-consuming work. Some methods have proven effective while others seem to have little positive impact on the work rate. As in the case of many other under-resourced languages, the expand model was implemented throughout, thus depending on English source data such as the English Princeton Wordnet (PWN) which is then translated into the target language with the assumption that the new language shares an underlying structure with the PWN. The paper discusses some problems encountered along the way and points out various possibilities of (semi) automated quality assurance measures and further refinement of the AfWN to ensure accelerated growth. In this paper we aim to highlight some of the lessons learnt from hands-on experience in order to facilitate similar projects, in particular for languages from other African countries.
The African Wordnet Project (AWN) includes all nine indigenous South African languages, namely isiZulu, isiXhosa, Setswana, Sesotho sa Leboa, Tshivenda, Siswati, Sesotho, isiNdebele and Xitsonga. The AWN currently includes 61 000 synsets as well as definitions and usage examples for a large part of the synsets. The project recently received extended funding from the South African Centre for Digital Language Resources (SADiLaR) and aims to update all aspects of the current resource, including the seed list used for new development, software tools used and mapping the AWN to the latest version of PWN 3.1. As with any resource development project, it is essential to also include phases of focused quality assurance and updating of the basis on which the resource is built. The African languages remain under-resourced. This paper describes progress made in the development of the AWN as well as recent technical improvements.
The development of the African Wordnet (AWN) has reached a stage of maturity where the first steps towards an application can be attempted. The AWN is based on the expand method, and to compensate for the general resource scarceness of the African languages, various development strategies were used. The aim of this paper is to investigate the usefulness of the current isiZulu Wordnet in an application such as language learning. The advantage of incorporating the wordnet of a language into a language learning system is that it provides learners with an integrated application to enhance their learning experience by means of the unique sense identification features of wordnets. In this paper it will be demonstrated by means of a variety of examples within the context of a basic free online course how the isiZulu Wordnet can offer the language learner improved decision support.
In this paper we discuss noun compounding, a highly generative, productive process, in three distinct languages: Czech, English and Zulu. Derivational morphology presents a large grey area between regular, compositional and idiosyncratic, non-compositional word forms. The structural properties of compounds in each of the languages are reviewed and contrasted. Whereas English compounds are head-final and thus left-branching, Czech and Zulu compounds usually consist of a leftmost governing head and a rightmost dependent element. Semantic properties of compounds are discussed with special reference to semantic relations between compound members which cross-linguistically show universal patterns, but idiosyncratic, language specific compounds are also identified. The integration of compounds into lexical resources, and WordNets in particular, remains a challenge that needs to be considered in terms of the compounds syntactic idiosyncrasy and semantic compositionality. Experiments with processing compounds in Czech, English and Zulu are reported and partly evaluated. The obtained partial lists of the Czech, English and Zulu compounds are also described.
The development of natural language processing (NLP) components is resource-intensive and therefore justifies exploring ways of reducing development time and effort when building NLP components. This paper addresses the experimental fast-tracking of the development of finite-state morphological analysers for Xhosa, Swati and (Southern) Ndebele by using an existing morphological analyser prototype for Zulu. The research question is whether fast-tracking is feasible across the language boundaries between these closely related varieties. The objective is a thorough assessment of recognition rates yielded by the Zulu morphological analyser for the three related languages. The strategy is to use techniques comprising several cycles of the following steps: applying the analyser to corpus data from all languages, identifying failures, and implementing the respective changes in the analyser. Tests show that the high degree of shared typological properties and formal similarities among the Nguni varieties warrants a modular fast-tracking approach. Word forms recognized by the Zulu analyser were mostly adequately interpreted. Therefore, the focus lies on providing adaptations based on failure output analysis for each language. As a result, the development of analysers for Xhosa, Swati and Ndebele is considerably faster than the creation of the Zulu prototype. The paper concludes with comments on the feasibility of the experiment, and the results of the evaluation.
Lexical information for South African Bantu languages is not readily available in the form of machine-readable lexicons. At present the availability of lexical information is restricted to a variety of paper dictionaries. These dictionaries display considerable diversity in the organisation and representation of data. In order to proceed towards the development of reusable and suitably standardised machine-readable lexicons for these languages, a data model for lexical entries becomes a prerequisite. In this study the general purpose model as developed by Bell & Bird (2000) is used as a point of departure. Firstly, the extent to which the Bell & Bird (2000) data model may be applied to and modified for the above-mentioned languages is investigated. Initial investigations indicate that modification of this data model is necessary to make provision for the specific requirements of lexical entries in these languages. Secondly, a data model in the form of an XML DTD for the languages in question, based on our findings regarding (Bell & Bird, 2000) and (Weber, 2002) is presented. Included in this model are additional particular requirements for complete and appropriate representation of linguistic information as identified in the study of available paper dictionnaries.
The aim of this paper is to discuss aspects of an on-going project on the development of grammatical and lexical resources for Zulu with sufficient coverage for unrestricted text. We explain how the basic software tools of computational morphology are used in linguistic processing, more specifically for automatic word form recognition and morphological tagging of the growing stock of electronic text corpora of a Bantu language such as Zulu. It is also shown how a machine-readable lexicon is in turn enhanced with the information acquired and extracted by means of such corpus analysis.