International Conference on Language Resources and Evaluation (2004)



Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

A word class often neglected in the field of NLP resources, namely adverbs, has lately been described in a computational lexicon produced at CST as one of the results of a Ph.D. project. The adverb lexicon, which is integrated in the Danish STO lexicon, gives detailed syntactic information on the type of modification and position, as well as on other syntactic properties of approximately 800 Danish adverbs. One of the aims of the lexicon has been to establish a clear distinction between syntactic and semantic information: where other lexicons often generalize over the syntactic behavior of semantic classes of adverbs, every adverb here is described with respect to its own syntactic behavior in a text corpus, revealing very individual syntactic properties. Syntactic information on adverbs is needed in NLP systems generating text to ensure correct placement within the phrase they modify. In systems analyzing text, this information is also needed in order to attach the adverbs to the right node in the syntactic parse trees. Within the field of linguistic research, several results can be deduced from the lexicon, e.g. knowledge of syntactic classes of Danish adverbs.
The objective of the Cross-Language Evaluation Forum (CLEF) is to promote research in the multilingual information access domain. In this short paper, we list the achievements of CLEF during its first four years of activity and describe how the range of tasks has been considerably expanded during this period. The aim of the paper is to demonstrate the importance of evaluation initiatives with respect to system research and development and to show how essential it is for such initiatives to keep abreast of and even anticipate the emerging needs of both system developers and application communities if they are to have a future.
The SALA II project comprises mobile telephone recordings according to the SpeechDat (II) paradigm for several languages in North and Latin America. Each database contains the recordings of 1000 speakers, with the exception of US Spanish (2000 speakers) and US English (4000 speakers). A quarter of the recordings of each database are made respectively in a quiet environment (home/office), in the street, in a public place, and in a moving vehicle. This paper presents an evaluation of the project. The paper details experiences with respect to the implementation of design specifications, speaker recruitment, on-site data recording, data processing, orthographic transcription and lexicon generation. Furthermore, the validation procedure and its results are documented. Finally, the availability and distribution of the databases are addressed.
In this paper, we present the methodology developed by the SLI (Computational Linguistics Group of the University of Vigo) for the building and processing of the CLUVI Corpus, showing the TMX-based XML specification designed to encode both morphosyntactic features and translation alignments in parallel corpora, and the solutions adopted for making the CLUVI parallel corpora freely available over the WWW (http://sli.uvigo.es/CLUVI/).
We have developed a toolkit in which an annotation tool, a syntactic tree editor, and an extraction rule editor interact dynamically. Its output can be stored in a database for further use. In the field of biomedicine, there is a critical need for automatic text processing. However, current language processing approaches suffer from insufficient basic data incorporating both human domain expertise and domain-specific language processing capabilities. With the annotation tool presented here, a set of “gold standards” can be collected, representing what should be extracted. At the same time, any change in annotation can be viewed on an associated syntactic tree. These facilities provide a clear picture of the relationship between the extraction target and the syntactic tree. Underlying sentences can be analyzed with a parser which can be plugged in, or a set of parsed sentences can be used to generate the tree. Extraction rules written with the integrated editor can be applied at once, and their validity can immediately be verified both on the syntactic tree and on the sentence string by coloring the corresponding segments. Thus our toolkit enables the user to efficiently construct parse-based extraction rules. PBIE2 works under Windows 2000/XP and requires Microsoft Internet Explorer 6.0 or higher. The data can be stored in Microsoft Access.
This paper describes the creation of a syntactically annotated Tibetan corpus. This corpus forms a part of the TUSNELDA collection of corpora and databases for linguistic research. It will ultimately comprise spoken and written Tibetan texts originating from different regions and historical epochs. These texts are annotated with several kinds of linguistic information, in particular POS tags, phrases, argument structures of verbs, clauses and sentences, as well as several kinds of discourse units and textual segments. The annotation is done in XML. The primary research interest which guides the development of the corpus is the investigation of cross-clausal references, especially the relation between empty arguments (i.e. arguments not overtly realised in a clause) and their antecedents in previous clauses. For this purpose, such references are explicitly encoded so that they can be qualitatively and quantitatively evaluated with the help of standard XML techniques such as XPath search and XSLT transformations. Apart from this primary research interest, we expect that our corpus will be useful for other projects concerning Tibetan and related languages. Like other data in TUSNELDA, it will be made accessible via a WWW query interface.
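To make the XPath-based querying concrete, here is a small hypothetical sketch of how empty arguments and their antecedents could be retrieved from such an XML annotation. The element and attribute names (clause, arg, empty, antecedent) and the toy sentence are invented for illustration, since the actual TUSNELDA schema is not reproduced in the abstract; a full XPath/XSLT toolchain would normally be used rather than the limited XPath subset built into Python's ElementTree.

    # Hypothetical sketch: the markup below is NOT the TUSNELDA schema, only an
    # illustration of the kind of query the abstract describes.
    import xml.etree.ElementTree as ET

    sample = """
    <text>
      <clause id="c1"><arg id="a1" role="AGT">bkra shis</arg></clause>
      <clause id="c2"><arg id="a2" role="AGT" empty="yes" antecedent="a1"/></clause>
    </text>
    """

    root = ET.fromstring(sample)
    by_id = {e.get("id"): e for e in root.iter() if e.get("id") is not None}

    # ElementTree understands this restricted XPath expression directly.
    for arg in root.findall(".//arg[@empty='yes']"):
        antecedent = by_id[arg.get("antecedent")]
        print(arg.get("id"), "refers back to", antecedent.get("id"), "=", antecedent.text)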
We report on the development and employment of lexical entry templates in a large-coverage unification-based grammar of Spanish. The aim of the work reported in this paper is to provide robust deep linguistic processing in order to make the grammar more adequate for industrial NLP applications.
In this paper we describe a new baseline tagset induction algorithm which, unlike the one described in previous work, is fully automatic and produces tagsets with better performance than before. The algorithm is an information-lossless transformation of the MULTEXT-EAST compliant lexical tags (MSDs) into a reduced tagset that can be mapped back onto the lexicon tagset fully deterministically. Starting from the baseline tagsets, a corpus linguist who is an expert in the language in question may further reduce the tagsets, taking into account the language's distributional properties. As any further reduction of the baseline tagsets implies losing information, adequate recovery rules should be designed to ensure that the final tagging can be expressed in terms of the lexicon encoding. The algorithm is described in detail and the generated baseline tagsets for Czech, English, Estonian, Hungarian, Romanian and Slovene are evaluated. They are much smaller and systematically ensure better tagging accuracy than the corresponding MSD tagsets.
The paper describes the methodology and the tools we developed for the purpose of building a Romanian wordnet. The work is carried out within the BalkaNet European project and is concerned with wordnets for Bulgarian, Czech, Greek, Romanian, Serbian and Turkish, all of them aligned via an interlingual index (ILI) to the Princeton WordNet. The structuring of the wordnets follows the principles adopted in EuroWordNet. In order to ensure maximal cross-lingual lexical coverage, the consortium decided to implement the same concepts, represented by a common set of ILI concepts. We describe the selection of the concepts to be implemented in all the monolingual wordnets. The methodologies adopted by each partner were different and depended on the language resources and personnel available. For the Romanian wordnet, we decided that it should be based on the reference lexicographic descriptions of Romanian which we had in electronic form: EXPD, a heavily XML-annotated explanatory dictionary (developed in the previous CONCEDE project and based on the standard Explanatory Dictionary of Romanian); SYND, a published dictionary of synonyms which we keyboarded, encoded and completed with more than 4000 new synonymy sets extracted from EXPD; and EnRoD, a Romanian-English dictionary, most of which was extracted automatically from parallel corpora and further hand-validated and extended. Besides these monolingual resources, like all the other members of the consortium, we had at our disposal the interlingual mapping of the Princeton WordNet. All the above-mentioned resources have been incorporated into a user-friendly system, WnBuilder, which allows for cooperative work by a large number of lexicographers. When the distributed work is put together, the synsets are validated. Several errors show up, the most frequent and difficult to solve being the case of a literal with the same sense number appearing in different synsets. We discuss the reasons for such conflicts as well as their correction, supported by another utility program called WnCorrector. The full paper presents WnBuilder and WnCorrector, as well as the status of the Romanian wordnet development.
Parallel corpora are considered an important resource for the development of linguistic tools. In this paper our main goal is the development of a bilingual lexicon of verbs. The construction of this lexicon is possible using two main resources: I) a parallel corpus (through the alignment); II) the linguistic tools developed for Spanish (which serve as a starting point for developing tools for the Arabic language). In the end, aligned equivalent verbs are detected automatically from a Spanish-Arabic parallel corpus. To achieve this goal, we had to pass through different preparatory stages concerning the assessment of the parallel corpus, the monolingual tokenization of each corpus, a preliminary sentence alignment and, finally, the application of the model for automatic extraction of equivalent verbs. Our method is hybrid, since it combines both statistical and linguistic approaches.
This project combines linguistic and statistical information to develop a term extraction tool for Basque. Since Basque is an agglutinative and highly inflected language, the treatment of morphosyntactic information is vital. In addition, due to the late unification process of the language, texts present a higher degree of term dispersion than in a highly normalized language. The result is a semi-automatic, XML-based terminology extraction tool for use in technical and scientific information management.
In the present work, we introduce and evaluate a novel Bayesian syntactic shallow parser that performs robust detection of subject-object pairs and subject-direct object-indirect object triples for a given verb in a natural language sentence. The shallow parser infers the correct subject-object pairs based on knowledge provided by a Bayesian network learned from annotated text corpora. The DELOS corpus, a collection of economic-domain texts that has been automatically annotated using various morphological and syntactic tools, was used as training material. Our shallow parser makes use of limited linguistic input. More specifically, we consider only part-of-speech tagging, the voice and the mood of the verb, as well as the head word of a noun phrase. For the task of detecting the head word of a phrase we used a sentence boundary detector. Identifying the head word of a noun phrase, i.e. the word that holds the morphological information (case, number) of the whole phrase, also proves to be very helpful for our task, as its morphological tag is all the information that is needed regarding the phrase. The proposed method was evaluated against three other machine learning techniques, namely naive Bayes, k-Nearest Neighbour and Support Vector Machines, methods that have previously been applied to natural language processing tasks with satisfactory results. The experimental outcomes show a satisfactory performance of our proposed shallow parser, which reaches almost 92 per cent precision.
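As a toy illustration of classifying grammatical roles from such limited linguistic input, the sketch below trains a plain Naive Bayes classifier (one of the comparison methods mentioned above, not the authors' Bayesian network) on invented feature vectors of the kind the abstract describes (verb voice and mood, case of the NP head word); all feature names, values and examples are assumptions made for the example.

    # Toy Naive Bayes over invented shallow features; not the DELOS setup.
    from collections import Counter, defaultdict

    train = [
        ({"voice": "active",  "mood": "indicative", "np_case": "nominative"}, "subject"),
        ({"voice": "active",  "mood": "indicative", "np_case": "accusative"}, "object"),
        ({"voice": "passive", "mood": "indicative", "np_case": "nominative"}, "object"),
    ]

    label_counts = Counter(label for _, label in train)
    feature_counts = defaultdict(Counter)          # (label, feature) -> value counts
    for features, label in train:
        for name, value in features.items():
            feature_counts[(label, name)][value] += 1

    def classify(features):
        def score(label):
            p = label_counts[label] / len(train)
            for name, value in features.items():
                counts = feature_counts[(label, name)]
                # crude add-one smoothing so unseen values do not zero the score
                p *= (counts[value] + 1) / (sum(counts.values()) + len(counts) + 1)
            return p
        return max(label_counts, key=score)

    print(classify({"voice": "active", "mood": "indicative", "np_case": "nominative"}))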
This paper presents some aspects of the first Portuguese frequency lexicon extracted from a corpus of large dimensions. The Multifunctional Computational Lexicon of Contemporary Portuguese (henceforth MCL) arose from the need to fill an existing gap in the study of contemporary Portuguese. Until recently, the frequency lexicons of Portuguese were of very small dimensions, such as Português Fundamental, which comprises 2,217 words extracted from a 700,000-word corpus, and the Frequency Dictionary of Portuguese Words, based on a literary corpus of 500,000 words. We describe here the main steps taken to collect the lexical and frequency data and some of the major problems that arose in the process. The resulting lexicon is a freely available, reliable resource for several types of applications.
In the development of annotations for a spoken database, an important issue is whether the annotations can be generated automatically with sufficient precision, or whether expensive manual annotations are needed. In this paper, the case of prosodic annotations is discussed, which was investigated on the CGN database (Spoken Dutch Corpus). The main conclusions of this work are as follows. First, it was found that the available amount of manual prosodic annotations is sufficient for the development of our (baseline, decision-tree based) prosodic models. In other words, more manual annotations do not improve the models. Second, the developed prosodic models for prominence are insufficiently accurate to produce automatic prominence annotations that are as good as the manual ones. On the other hand, the consistency between manual and automatic break annotations is as high as the inter-transcriber consistency for breaks. Thus, given the current amount of manual break annotations, annotations for the remainder of the CGN database can be generated automatically with the same quality as the manual annotations.
Official travel warnings published regularly on the internet by the ministries for foreign affairs of France, Germany, and the UK provide a useful resource for assessing the risks associated with travelling to some countries. The shallow IE system SProUT has been extended to meet the specific needs of delivering language-neutral output for English, French, or German input texts. A shared type hierarchy, a feature-enhanced gazetteer resource, and generic techniques for merging chunk analyses into larger results are the major reusable results of this work.
People, when processing human-to-human communication, utilize everything they can in order to understand that communication, including speech and information such as the time and location of an interlocutor's gesture and gaze. Speech and gesture are known to exhibit a synchronous relationship in human communication; however, the precise nature of that relationship requires further investigation. The construction of computer models of multimodal human communication would be enabled by the availability of multimodal communication corpora annotated with synchronized gesture and speech features. To investigate the temporal relationships of these knowledge sources, we have collected and are annotating several multimodal corpora with time-aligned features. Forced alignment between a speech file and its transcription is a crucial part of multimodal corpus production. This paper investigates a number of factors that may contribute to highly accurate forced alignments in support of the rapid production of these multimodal corpora: the acoustic model, the match between the speech used for training the system and that to be force-aligned, the amount of data used to train the ASR system, the availability of speaker adaptation, and the duration of alignment segments.
The present study focuses on the automatic processing of sibling resources of audio and written documents, such as those available in audio archives or for parliamentary debates: written texts are close but not exact audio transcripts. Such resources deserve attention for several reasons: they represent an interesting testbed for studying differences between written and spoken material, and they yield low-cost resources for acoustic model training. When automatically transcribing the audio data, regions of agreement between automatic transcripts and written sources allow time-codes to be transferred to the written documents: this may be helpful in an audio archive or audio information retrieval environment. Regions of disagreement can be automatically selected for further correction by human transcribers. This study makes use of 10 hours of French radio interview archives with corresponding press-oriented transcripts. The audio corpus was transcribed using the LIMSI speech recognizer, resulting in automatic transcripts with an average word error rate of 12%. 80% of the text corpus (in word chunks of at least five words) can be exactly aligned with the automatic transcripts of the audio data. The residual word error rate on these 80% is less than 1%.
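The chunk-matching step can be sketched with standard library tools: the toy example below (not the LIMSI pipeline) uses Python's difflib to find runs of at least five identical words between a written source and an ASR hypothesis, i.e. the regions to which time-codes would be transferred; both word sequences are invented.

    # Find exactly matching word chunks of length >= 5 between a written source
    # and an (invented) ASR hypothesis; in a real setting the ASR words would
    # carry time-codes to be copied onto the matched written words.
    from difflib import SequenceMatcher

    written = "the minister said that the new budget will be presented next week".split()
    asr     = "minister said that the new budget will be uh presented next week".split()

    matcher = SequenceMatcher(a=written, b=asr, autojunk=False)
    for block in matcher.get_matching_blocks():
        if block.size >= 5:
            chunk = " ".join(written[block.a:block.a + block.size])
            print(f"written[{block.a}] <-> asr[{block.b}]: {chunk}")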
In this paper we present the evaluation of a spoken phonetic corpus designed to train acoustic models for speech recognition applications in the Basque language. A complete set of acoustic-phonetic decoding experiments was carried out on the proposed database. Context-dependent and context-independent phoneme units were used in these experiments with two different approaches to acoustic modeling, namely discrete and continuous Hidden Markov Models (HMMs). A complete set of HMMs was trained and tested with the database. Experimental results reveal that the database is large and phonetically rich enough to obtain good acoustic models for integration into continuous speech recognition systems.
We inspect the possibility of creating new linguistic utterances (small sentences) similar to those already present in an existing linguistic resource. Using paradigm tables ensures that the new generated sentences resemble previous data, while being of course different. We report an experiment in which 1,201 new correct sentences were generated starting from only 22 seed sentences.
This paper describes the infrastructure of a basic language resources set for Bulgarian in the context of the BLARK initiative requirements. We focus on the treebanking task as a trigger for the compilation of basic language resources. Two strategies have been applied in this respect: (1) implementing the main pre-processing modules before the treebank compilation and (2) creating more elaborate types of resources in parallel to the treebank compilation. The description of language resources within the BulTreeBank project is divided into two parts: language technology, which includes tokenization, a morphosyntactic analyzer, morphosyntactic disambiguation and partial grammars, and language data, which includes the layers of the BulTreeBank corpus and a variety of lexicons. The advantages of our approach for a less-spoken language (like Bulgarian) are as follows: it triggers the creation of the basic set of language resources which are lacking for certain languages, and it raises the question of how language resources should be created.
This paper deals with databases that combine different aspects: children's speech, emotional speech, human-robot communication, cross-linguistics, and read vs. spontaneous speech: in a Wizard-of-Oz scenario, German and English children had to instruct Sony's AIBO robot to fulfil specific tasks. In one experimental condition, strictly parallel for German and English, the AIBO behaved ‘disobediently’ by following its own script irrespective of the child's commands. In this way, the reactions of different children to the same sequence of AIBO actions could be obtained. In addition, both the German and the English children were recorded reading texts. The data are transliterated orthographically; emotional user states and some other phenomena will be annotated. We report preliminary word recognition rates and classification results.
One of the major obstacles for knowledge management remains MultiWord Terminology (MWT). This paper explores the difficulties that arise and describes real world solutions implemented as part of the Parmenides project. Parmenides is being built as an integrated knowledge management package that combines information, MWT and ontology extraction methods in a semi-automated framework. The focus of this paper is on eliciting ontological fragments based on dedicated MWT processing.
The OPUS corpus is a growing collection of translated documents collected from the internet. The current version contains about 30 million words in 60 languages. The entire corpus is sentence aligned and it also contains linguistic markup for certain languages.
This work tries to enrich the Spanish WordNet using a Spanish taxonomy as a knowledge source. The Spanish taxonomy is composed of Spanish senses, while the Spanish WordNet is composed of synsets, mostly linked to the English WordNet. A set of weighted associations between Spanish words and WordNet synsets is used for inferring associations between both taxonomies.
The goal of this project (LILA) is the collection of a large number of spoken databases for training automatic speech recognition systems for telephone applications in the Asian-Pacific area. Specifications follow those of SpeechDat-like databases. Utterances will be recorded directly from calls made from either fixed or cellular telephones and are composed of read text and answers to specific questions. The project is driven by a consortium composed of a large number of industrial companies. Each company is in charge of the production of two databases. The consortium shares the databases produced in the project. The goal of the project should be reached within the year 2005.
When a text in any language is submitted to morphological analysis, there always remain some unrecognized words. We can lower their number by adding new words to the dictionary used by the morphological analyzer, but we can never cover the whole of the language. The system described in this paper (we call it the "derivation module") deals with unknown derived words. It aims not only at analyzing but also at synthesizing Czech derived words. Such a system is of particular value for the automatic processing of languages where derivational morphology plays an important role in regular word formation.
This paper deals with the quality evaluation (validation) of Spoken Language Resources (SLR). The current situation in terms of relevant validation criteria and procedures is briefly presented. Next, a number of validation issues related to new data formats (XML-based annotations, UTF-16 encoding) are discussed. Further, new validation cycles that were introduced in a series of new projects like SpeeCon and OrienTel are addressed: prompt sheet validation, lexicon validation and pre-release validation. Finally, SPEX's current and future validation activities are outlined.
We present a first approach to the application of a data mining technique, Multiple Sequence Alignment, to the systematization of a polemic aspect of discourse, namely the expression of contrast, concession, counterargument and semantically similar discursive relations. The representation of the phenomena under study is carried out by very simple techniques, mostly pattern matching, but the results allow us to draw insightful conclusions about the organization of this aspect of discourse: equivalence classes of discourse markers are established, and systematic patterns are discovered, which will be applied in enhancing a discourse parser.
This paper proposes a methodology for obtaining sentences containing discourse markers from the World Wide Web. The proposed methodology is particularly suitable for collecting large numbers of discourse marker tokens. It relies on the automatic identification of discourse markers, and we show that this can be done with an accuracy within 9% of that of human performance. We also show that the distribution of discourse markers on the web correlates highly with their distribution in a conventional balanced corpus.
In this paper an interactive pattern extraction workbench, I*Pex, is presented. The workbench comes with a graphical environment and is designed to be used in an incremental and interactive fashion. Patterns can be constructed to work in combination, involving specifications on several linguistic levels simultaneously: from the character level (using regular expressions), through parts of speech and dependency relations, to semantic roles. The input text format is based on the XCES XML format.
The paper describes an annotation scheme for coreference developed within the application context of text-to-hypertext conversion. In this context coreference is used (1) for generating document-internal and cross-document hyperlinks, and (2) for resolving anaphoric expressions in order to achieve cohesive closedness in hypertext nodes. We argue that for the purpose of cross-document linking it is necessary to separate the annotation of coreference relations from the annotation of anaphoric relations. To account for this requirement, we developed a knowledge-based annotation scheme that relates referential expressions in the text to entities in a knowledge representation, which is modeled using XML Topic Maps.
In the automatic summarisation of written texts, direct speech is usually deemed unsuitable for inclusion in important sentences. This is due to the fact that humans do not usually include such quotations when they create summaries. In this paper, we argue that despite generally negative attitudes, direct speech can be useful for summarisation and ignoring it can result in the omission of important and relevant information. We present an analysis of a corpus of annotated newswire texts in which a substantial amount of speech is marked by different annotators, and describe when and why direct speech can be included in summaries. In an attempt to make direct speech more appropriate for summaries, we also describe rules currently being developed to transform it into a more summary-acceptable format.
This paper describes how Human Language Technologies and linguistic resources are used to support the construction of components of a knowledge organisation system. In particular we focus on methodologies and resources for building a corpus-based domain ontology and extracting relevant metadata information for text chunks from domain-specific corpora.
In this paper we describe the development, means of evaluation and applications of the Russian-English Sociopolitical Thesaurus, specially developed as a linguistic resource for automatic text processing applications. The sociopolitical domain is not a domain of social research but a broad domain of social relations including economic, political, military, cultural, sports and other subdomains. Knowledge of this domain is necessary for the automatic processing of such important documents as official documents, legislative acts and newspaper articles.
In this paper we describe our approach to the development of ontologies with a small number of relation types. Non-taxonomic relations in our ontologies are based on the conception of ontological dependence described in formal ontology. This minimal set of relations does not depend on a domain or a task and makes it possible to begin ontology construction at once, as soon as a task is set and a domain is determined, and to obtain a first version of an ontology in a short time. Such an initial ontology can be used for information-retrieval applications and can serve as a structural basis for further development of the ontology.
Several Language Resources (LRs) for Portuguese, developed at the Center of Linguistics of the Lisbon University (CLUL), are available on-line at CLUL's webpage: www.clul.ul.pt/english/sectores/projecto_rld.html. These LRs have been extracted from or developed based on the Reference Corpus of Contemporary Portuguese (CRPC), a monitor corpus containing, at present, more than 300 million words, taken by sampling from several types of written text (literary, newspaper, technical, didactic, juridical, parliamentary, etc.) and spoken text (informal and formal), pertaining to national and regional varieties of Portuguese (including European, Brazilian, African and Asian Portuguese). The LRs available for on-line queries include: a) several subcorpora (written and spoken, tagged and untagged) compiled and extracted from the CRPC for specific CLUL projects and now available for on-line queries; b) a published sample of "Português Fundamental", a spoken CRPC subcorpus, available for text download; c) a frequency lexicon extracted from a CRPC subcorpus, available for both on-line queries and download. Other LRs available for Portuguese are also mentioned: C-ORAL-ROM - Integrated Reference Corpora for Spoken Romance Languages, a CD-ROM edition of a spoken corpus with text-to-sound alignment; the LE-PAROLE corpus; the LE-PAROLE Lexicon and the SIMPLE Lexicon.
This article describes the use and development of a tool for grammar and terminology control (FLAG), for the purposes of automating the verification of terminology for a large-scale user of multilingual terminology. It describes the various advantages of the tool and shows a process for transforming a traditional terminology list into a list of inflected forms as well as patterns which can be used to find possible morpho-syntactic derivations of terms.
Some approaches to automatic terminology extraction from corpora imply the use of existing semantic resources for guiding the detection of terms. Most of these systems exploit specialised resources, like UMLS in the medical domain, while a few try to benefit from general-purpose semantic resources, like EuroWordNet (EWN). As the term extraction task is clearly domain dependent, when a general-purpose resource without specific domain information is used, we need a way of attaching domain information to the units of the resource. For big resources it is desirable that this semantic enrichment can be carried out automatically. Given a specific domain, our proposal aims to detect in EWN those units that can be considered as domain markers (DM). We define a DM as an EWN entry whose attached strings belong to the domain, as do the variants of all its descendants through the hyponymy relation. The procedure we propose in this paper is fully automatic and, a priori, domain-independent. The only external knowledge it uses is a set of terms (an external vocabulary), each of which is considered to have at least one sense belonging to the domain.
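The domain-marker test can be roughed out as follows, using Princeton WordNet through NLTK as a stand-in for EuroWordNet and a tiny invented domain vocabulary; the actual resource, vocabulary and procedure in the paper differ, so this is only a sketch of the idea that a marker's variants and those of its whole hyponym closure must belong to the domain.

    # Sketch only: NLTK's Princeton WordNet stands in for EWN here.
    # Requires the WordNet data: import nltk; nltk.download("wordnet")
    from nltk.corpus import wordnet as wn

    domain_terms = {"enzyme", "protein", "peptide", "polypeptide"}   # toy vocabulary

    def variants(synset):
        """Lemma strings of a synset and of its whole hyponym closure."""
        out = {l.name().replace("_", " ") for l in synset.lemmas()}
        for hypo in synset.hyponyms():
            out |= variants(hypo)
        return out

    def is_domain_marker(synset):
        # every variant below the candidate must be covered by the domain terms
        return variants(synset) <= domain_terms

    for s in wn.synsets("protein"):
        print(s.name(), is_domain_marker(s))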
This paper presents EASY (Evaluation of Analyzers of SYntax), an ongoing evaluation campaign of syntactic parsing of French, a subproject of EVALDA in the French TECHNOLANGUE program. After presenting the elaboration of the annotation formalism, we describe the corpus building steps, the annotation tools, the evaluation measures and, finally, plans to produce a validated large linguistic resource, syntactically annotated.
The annotation of the Prague Dependency Treebank (PDT) is conceived of as a multilayered scenario that also comprises dependency representations (tectogrammatical tree structures, TGTSs) of the underlying structure of the sentences. TGTSs capture three basic aspects of the underlying structure of sentences: (a) the dependency tree structure, (b) the kinds of dependency syntactic relations, and (c) the basic characteristics of the topic-focus articulation (TFA). Since the PDT is a large collection and the annotations on the deepest layer are to a large extent performed by several human annotators (based on an automatic preprocessing module), it is essential to monitor the consistency of the annotators and the agreement among them. In the present paper, we summarize the results of the evaluation of parallel annotations of several samples taken from the PDT and the measures adopted to improve the consistency of annotations.
Semantic lexical resources play an important part in both linguistic study and natural language engineering. In Lancaster, a large semantic lexical resource has been built over the past 14 years, which provides a knowledge base for the USAS semantic tagger. Capturing semantic lexicological theory and empirical lexical usage information extracted from corpora, the Lancaster semantic lexicon provides a valuable resource for the corpus research and NLP community. In this paper, we evaluate the lexical coverage of the semantic lexicon both in terms of genres and time periods. We conducted the evaluation on test corpora including the BNC sampler, the METER Corpus of law/court journalism reports and some corpora of Newsbooks, prose and fictional works published between the 17th and 19th centuries. In the evaluation, the semantic lexicon achieved a lexical coverage of 98.49% on the BNC sampler, 95.38% on the METER Corpus and 92.76% to 97.29% on the historical data. Our evaluation reveals that the Lancaster semantic lexicon has a remarkably high coverage of the modern English lexicon, but needs expansion with domain-specific terms and historical words. Our evaluation also shows that, in order to make claims about the lexical coverage of annotation systems as well as to render them ‘future proof’, we need to evaluate their potential both synchronically and diachronically across genres.
A unified language for the communicative acts between agents is essential for the design of multi-agent architectures. Whatever the type of interaction (linguistic, multimodal, including particular aspects such as force feedback) and whatever the type of application (command dialogue, request dialogue, database querying), the concepts are common and we need a generic meta-model. In order to move towards task-independent systems, we need to clarify the procedures for parameterizing modules. In this paper, we focus on the characteristics of a meta-model designed to represent meaning in linguistic and multimodal applications. This meta-model is called MMIL, for MultiModal Interface Language, and was first specified in the framework of the IST MIAMM European project. What we want to test here is how relevant MMIL is for a completely different context (a different task, a different interaction type, a different linguistic domain). We detail the exploitation of MMIL in the framework of the IST OZONE European project, and we draw conclusions on the role of MMIL in the parameterization of task-independent dialogue managers.
This paper deals with the STO lexicon, the most comprehensive computational lexicon of Danish developed for NLP/HLT applications, which is now ready for use. Danish was one of the 12 EU languages participating in the LE-PAROLE and SIMPLE projects; it was therefore natural to continue this work, building on the experience obtained from these projects. The material for Danish produced within these projects, further enriched with language-specific information, is incorporated into the STO lexicon. First, we describe the main characteristics of the lexical coverage and linguistic content of the STO lexicon; second, we present some recent uses and point to some prospective exploitations of the material. Finally, we outline an internet-based user interface, which allows for browsing through the complex information content of the STO lexical database and some other selected WRLs for Danish.
A key element for the extraction of information from a natural language document is a set of shallow text analysis rules, which are typically based on pre-defined linguistic patterns. Current Information Extraction research aims at the automatic or semi-automatic acquisition of these rules. Within this research framework, we consider in this paper the potential for acquiring generic extraction patterns. Our research is based on the hypothesis that terms (the linguistic representation of concepts in a specialised domain) and Named Entities (the names of persons, organisations and dates of importance in the text) can together be considered as the basic semantic entities of textual information and can therefore be used as a basis for the conceptual representation of domain-specific texts and the definition of what constitutes an information extraction template in linguistic terms. The extraction patterns discovered by this approach involve significant associations of these semantic entities with verbs, and they can subsequently be translated into the grammar formalism of choice.
The aim of the MEDIA project is to design and test a methodology for the evaluation of context-dependent and context-independent spoken dialogue systems. We propose an evaluation paradigm based on the use of test suites from real-world corpora, a common semantic representation and common metrics. This paradigm should allow us to diagnose the context-sensitive understanding capability of dialogue systems. It will be used within an evaluation campaign involving several sites, all of which will carry out the task of querying information from a database.
The C-ORAL-ROM project has delivered a multilingual corpus of spontaneous speech for the main Romance languages (Italian, French, Portuguese and Spanish). The collection aims to represent the variety of speech acts performed in everyday language and to enable the description of prosodic and syntactic structures in the four Romance languages. Sampling criteria are defined in a corpus design scheme. C-ORAL-ROM adopts two different sampling strategies, one for the formal and one for the informal part: while a set of typical domains of application is selected to document the formal use of language, the informal part documents speech variation using parameters referring to the event’s structure (dialogue vs. monologue) and the sociological domain of use (family/private vs. public). The four Romance corpora are tagged with respect to terminal and non-terminal prosodic breaks. Terminal breaks are assumed to be the most relevant cues for the identification of relevant linguistic domains in spontaneous speech (utterances). Relations with other concurrent criteria are discussed. The multimedia storage of the C-ORAL-ROM corpus is based on this principle; each textual string ending with a terminal break is aligned, through the Win Pitch speech software, to its acoustic counterpart, generating the database of all utterances.
We are working on a project called CAOS - Computer-Aided Ontology Structuring - whose aim is to develop a computer system designed to enable semi-automatic construction of concept systems, or ontologies. The system is intended to be interactive and presupposes an end-user with a terminological background (terminologist or professional translator). CAOS supports terminological concept modelling. The backbone of this concept modelling is constituted by characteristics modelled by formal feature specifications, i.e. attribute-value pairs. Our use of feature specifications is subject to a number of principles and constraints. In this paper we want to demonstrate some of these principles and to show why they are necessary in order to permit the construction of an interactive tool for building terminological ontologies. We will also show how they contribute to determine the structuring of the ontologies in CAOS and to facilitate the work of the terminologist user.
In the past, fundamental linguistic research was typically conducted on small data sets that were handcrafted for the specific research at hand. However, from the eighties onwards, many large spoken language corpora have become available. This study investigates the usefulness of large multi-purpose spoken language corpora for fundamental linguistic research. A research task was designed in which we tried to capture the major pronunciation differences between three speech styles in context-sensitive re-write rules at the phone level. These re-write rules were extracted from the alignments of both a manual phonetic transcription and an automatic phonetic transcription with a canonical reference transcription of the same material.
The West African Language Archive (WALA) initiative has emerged from a number of concurrent projects, and aims to encourage local scholars to create high quality decentralised repositories documenting West African languages, and to make these repositories available to language communities, language planners, educationalists and scientists via an internet metadata portal such as OLAC (Open Language Archive Community). A wide range of criteria has to be met in designing and implementing this kind of archive. We discuss these criteria with reference to experiences in documentation work in three very different ongoing language documentation projects, on designing an encyclopaedia, on documenting an endangered language, and on creating a speech synthesiser. We pay special attention to the provision of metadata, a formal variety of catalogue or housekeeping information, without which resources are doomed to remain inaccessible.
A novel thesaurus named a “word-sense association network” is proposed for the first time. It consists of nodes representing word senses, each of which is defined as a set consisting of a word and its translation equivalents, and edges connecting topically associated word senses. This word-sense association network is produced from a bilingual dictionary and comparable corpora by means of a newly developed fully automatic method. The feasibility and effectiveness of the method were demonstrated experimentally by using the EDR English-Japanese dictionary together with Wall Street Journal and Nihon Keizai Shimbun corpora. The word-sense association networks were applied to word-sense disambiguation as well as to a query interface for information retrieval.
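The underlying data structure can be pictured with a minimal sketch: nodes are word senses, each represented by a source word plus its translation equivalents, and weighted edges link topically associated senses. The identifiers, equivalents and weights below are invented placeholders, not values produced by the method.

    # Toy picture of a word-sense association network; all values are invented.
    senses = {
        "bank#1": {"word": "bank", "equivalents": {"ginkou"}},         # financial sense
        "bank#2": {"word": "bank", "equivalents": {"dote", "kishi"}},  # river-bank sense
        "loan#1": {"word": "loan", "equivalents": {"yuushi"}},
    }
    edges = {("bank#1", "loan#1"): 0.8}   # topically associated senses with a weight

    def neighbours(sense_id):
        return [(b if a == sense_id else a, w)
                for (a, b), w in edges.items() if sense_id in (a, b)]

    print(neighbours("bank#1"))    # -> [('loan#1', 0.8)]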
This paper reports on an ongoing project that uses varied language resources and advanced NLP tools for a linguistic classification task in discourse semantics. The system we present is designed to assign a "situation entity" class label to each predicator in English text. The project goal is to achieve the best-possible identification of situation entities in naturally-occurring written texts by implementing a robust system that will deal with real corpus material, rather than just with constructed textbook examples of discourse. In this paper we focus on the combination of multiple information sources, which we see as being vital for a robust classification system. We use a deep syntactic grammar of English to identify morphological, syntactic, and discourse clues, and we use various lexical databases for fine-grained semantic properties of the predicators. Experiments performed to date show that enhancing the output of the grammar with information from lexical resources improves recall but lowers precision in the situation entity classification task.
This paper introduces a tool, Bonsai, which supports humans in annotating corpora with morphosyntactic information and in retrieving syntactic structures stored in the database. Integrating annotation and retrieval enables users to annotate a new instance while looking back at already annotated sentences which share a similar morphosyntactic structure. We focus on the retrieval part of the system and describe a method to decompose a large input query into smaller ones in order to gain retrieval efficiency. The proposed method is evaluated on the Penn Treebank corpus, showing significant improvements.
We have previously proposed a framework to represent a location in terms of both symbolic and numeric aspects. In order to deal with vague linguistic expressions of a location, the representation adopts a potential function mapping a location to its plausibility. This paper proposes a classification of Japanese spatial nouns and potential functions corresponding to each class. We focus on a common Japanese spatial expression “X no Y (Y of X)” where X is a reference object and Y is a spatial noun. For example, “tukue no migi (the right of the desk)” denotes a location with reference to the desk. These expressions were collected from corpora, and the spatial nouns appearing in the Y position were classified into two major classes: those designating a part of the reference object and those designating a location apart from the reference object. The latter class was further classified into two subclasses: direction-oriented and distance-oriented. For each class, a potential function was designed to provide the meaning of the spatial nouns.
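To give a concrete feel for a direction-oriented potential function, the toy sketch below assigns a plausibility to a point being "to the right of" a reference object: the value peaks directly to the right and decays with angular deviation and with distance. This is our own illustrative formulation with arbitrary parameters, not the functions defined in the paper.

    # Toy direction-oriented potential function; parameters are arbitrary.
    import math

    def right_of(ref_x, ref_y, x, y, sigma_angle=0.5, dist_scale=5.0):
        dx, dy = x - ref_x, y - ref_y
        dist = math.hypot(dx, dy)
        if dist == 0.0:
            return 0.0
        angle = math.atan2(dy, dx)            # 0 rad = exactly to the right
        angular = math.exp(-(angle ** 2) / (2 * sigma_angle ** 2))
        return angular * math.exp(-dist / dist_scale)

    print(right_of(0, 0, 2, 0))   # directly to the right: high plausibility
    print(right_of(0, 0, 0, 2))   # directly above: low plausibility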
We present an approach to the disambiguation of cluster labels that capitalizes on the notion of semantic similarity to assign WordNet senses to cluster labels. The approach provides interesting insights on how document clustering can provide the basis for developing a novel approach to word sense disambiguation.
This paper describes a novel clustering-based text summarization system that uses Multiple Sequence Alignment to improve the alignment of sentences within topic clusters. While most current clustering-based summarization systems base their summaries only on the common information contained in a collection of highly-related sentences, our system constructs more informative summaries that incorporate both the redundant and unique contributions of the sentences in the cluster. When evaluated using ROUGE, the summaries produced by our system represent a substantial improvement over the baseline, which is at 63% of the human performance.
Multi-document summaries produced via sentence extraction often suffer from a number of cohesion problems, including dangling anaphora, sudden shifts in topic and incorrect or awkward chronological ordering. Therefore, the development of an automated revision process to correct such problems is a research area of current interest. We present the RevisionBank, a corpus of 240 extractive, multi-document summaries that have been manually revised to promote cohesion. The summaries were revised by six linguistics students using a constrained set of revision operations that we previously developed. In the current paper, we describe the process of developing a taxonomy of cohesion problems and corrective revision operators that address such problems, as well as an annotation schema for our corpus. Finally, we discuss how our taxonomy and corpus can be used for the study of revision-based multi-document summarization as well as for summary evaluation.
In this paper we discuss the five requirements for building large publicly available corpora which guided the construction of the Lácio-Web corpora and their environments: 1) a comprehensive text typology; 2) text copyright clearance, compilation and annotation scheme; 3) a friendly and didactic interface; 4) the need to serve as support for several types of research; 5) the need to offer an array of associated tools. Also, we present the features that make the Lácio-Web corpora interesting and novel, as well as the limitations of this project, such as corpora size and balance, and the non-inclusion of spoken texts in the project’s reference corpus.
Clusters of multiple news stories related to the same topic exhibit a number of interesting properties. For example, when documents have been published at various points in time or by different authors or news agencies, one finds many instances of paraphrasing, information overlap and even contradiction. The current paper presents the Cross-document Structure Theory (CST) Bank, a collection of multi-document clusters in which pairs of sentences from different documents have been annotated for cross-document structure theory relationships. We will describe how we built the corpus, including our method for reducing the number of sentence pairs to be annotated by our hired judges, using lexical similarity measures. Finally, we will describe how CST and the CST Bank can be applied to different research areas such as multi-document summarization.
This paper reports on a number of experiments in which we applied standard techniques from NLP in the context of documentation of endangered languages. We concentrated on the use of existing, freely available toolkits. Specifically, we explore the use of Finite-State Morphological Analysis, Maximum Entropy Part-of-Speech Tagging, and N-Gram Language Modeling.
When relevance feedback, one of the most popular information retrieval models, is used in an information retrieval system, related words are extracted based on the first retrieval result. These words are then added to the original query, and retrieval is performed again using the updated query. In general, however, retrieval performance with such query expansion falls in comparison with the performance of the original query. One cause is that there are few synonyms in the thesaurus, and even when some synonyms are added to the query, the same documents are retrieved as a result. In this paper, to solve this problem with related words, we propose latent context relevance, which takes into consideration the relevance between the query and each index word in the document set.
This paper proposes a wide-range anaphora resolution system toward text understanding. This system resolves zero, direct and indirect anaphors in Japanese texts by integrating two sorts of linguistic resources: a hand-annotated corpus with various relations and automatically constructed case frames. The corpus has relevance tags which consist of predicate-argument relations, relations between nouns and coreferences, and is utilized for learning parameters of the system and testing it. The case frames are indispensable knowledge both for detecting zero/indirect anaphors and estimating appropriate antecedents. Our preliminary experiments showed promising results.
This paper presents a simple method for performing the lexical analysis of agglutinative languages like Korean, which have a rich morphology. In particular, for nouns and adverbs with regular morphological modifications and/or high productivity, we do not need to artificially construct huge dictionaries of all inflected forms of lemmas. To construct a dictionary of lemmas and lexical transducers, we first automatically construct a dictionary of all inflected forms from the KAIST POS-Tagged Corpus. Second, we separate the part consisting of lemmas from that consisting of sequences of inflectional suffixes. Third, we describe their lexical transducers (i.e., morphological rules) to recognize all inflected forms of lemmas for nouns and adverbs according to the combinatorial restrictions between lemmas and their inflectional suffixes. Finally, we evaluate the advantages of this method.
Style guides or writing recommendations play an important role in the field of technical documentation production, e.g. in industrial contexts. Writing recommendations are also used in technical contexts together with machine translation (MT) in order to circumvent the MT system's weaknesses. This paper describes the evaluation and adaptation of a language checker deployed in the project int.unity. In this project, both MT and a specialised language checker were adapted to the requirements of non-expert users and a non-technical domain. The language technology was integrated with the groupware platform BSCW to support the multilingual communication of geographically distributed teams concerned with trade union work. The users' languages were either German or English, i.e. the users were monolingual. We chose linguatec's server version of the Personal Translator 2004 MT system for the German<->English translations. The language checker CLAT for German and English has been developed at IAI. It is used by technical authors to support the production of high-quality technical documentation. The CLAT core system was adapted and extended in order to match the new requirements imposed by both the user profile and the subsequent MT application. In this paper, the focus is on the assessment and adaptation of style rules for German.
We survey the evaluation methodology adopted in Information Extraction (IE), as defined in the MUC conferences and in later independent efforts applying machine learning to IE. We point out a number of problematic issues that may hamper the comparison between results obtained by different researchers. Some of them are common to other NLP tasks: e.g., the difficulty of exactly identifying the effects on performance of the data (sample selection and sample size), of the domain theory (features selected), and of algorithm parameter settings. Issues specific to IE evaluation include: how leniently to assess inexact identification of filler boundaries, the possibility of multiple fillers for a slot, and how the counting is performed. We argue that, when specifying an information extraction task, a number of characteristics should be clearly defined. However, in the papers only a few of them are usually explicitly specified. Our aim is to elaborate a clear and detailed experimental methodology and propose it to the IE community. The goal is to reach a widespread agreement on such proposal so that future IE evaluations will adopt the proposed methodology, making comparisons between algorithms fair and reliable. In order to achieve this goal, we will develop and make available to the community a set of tools and resources that incorporate a standardized IE methodology.
Metonymy is a creative process that establishes relationships based on contiguity or semantic relatedness between concepts. We outline a mechanism for deriving new concepts from WordNet using metonymy. We argue that by exploiting polysemy in WordNet we can take advantage of the metonymic relations between concepts. The focus of our metonymy generation work has been the creation of noun-noun compounds that do not already exist in WordNet and which can be profitably added to WordNet. The mechanism of metonymy generation we outline takes a source compound and creates new compounds by exploiting the polysemy associated with hyponyms of the head of the source compound. We argue that metonymy generation is a sound basis for concept creation as the newly created compounds are semantically related to the source concept. We demonstrate that metonymy generation based on polysemy is superior to a method of metonymy generation that ignores polysemy. These new concepts can be used to augment WordNet.
This paper describes some important modifications to the Celex morphological database in the context of the FLaVoR project. FLaVoR aims to develop a novel modular framework for speech recognition, enabling the integration of complex linguistic knowledge sources, such as a morphological model. Morphology is a fairly unexploited linguistic information source speech recognizers could benefit from. This is especially true for languages which allow for a rich set of morphological operations, such as our target language Dutch. In this paper we focus on the exploitation of the Celex Dutch morphological database as the information source underlying two different morphological analyzers being developed within the project. Although the Celex database provides a valuable source of morphological information for Dutch, many modifications were necessary before it could be practically applied. We identify major problems, discuss the implemented solutions and finally experimentally evaluate the effect of our modifications to the database.
In order to investigate the effect of the source language on translations, we compare two variants of a Korean translation corpus. The first variant consists of Korean translations of 162,308 Japanese sentences from the ATR BTEC (Basic Travel Expression Corpus). The second variant was made by translating the English translations of the Japanese sentences into Korean. We show that the source language text has a large influence on the target text. Even after normalizing orthographic differences, fewer than 8.3% of the sentences in the two variants were identical. We describe in general terms which phenomena differ and then discuss how our analysis can be used in natural language processing.
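The comparison step can be illustrated with a minimal sketch: normalize superficial differences and count sentence pairs that end up identical. The normalization here (stripping final punctuation and collapsing whitespace) is deliberately simplistic compared with the orthographic normalization described above, and the example sentences are invented, not taken from the BTEC data.

    # Minimal sketch: fraction of identical sentence pairs after a crude
    # normalization; sentences are invented examples, not BTEC data.
    import re

    def normalize(sentence):
        sentence = re.sub(r"[.!?]", "", sentence)       # drop final punctuation
        return " ".join(sentence.split())               # collapse whitespace

    variant_a = ["오늘 날씨가 좋다.", "표를 두 장 주세요."]
    variant_b = ["오늘 날씨가  좋다", "표 두 장 부탁합니다."]

    identical = sum(normalize(a) == normalize(b) for a, b in zip(variant_a, variant_b))
    print(identical / len(variant_a))                   # -> 0.5 for this toy pair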
We present a supervised method for training a sentence-level confidence measure on translation output using a human-annotated corpus. We evaluate a variety of machine learning methods. The resulting measure, while trained on a very small dataset, correlates well with human judgments and proves to be effective in one task-based evaluation. Although the experiments have only been run on one MT system, we believe the features gathered are general enough that the approach will also work well on other systems.
The aim of this paper is to discuss aspects of an on-going project on the development of grammatical and lexical resources for Zulu with sufficient coverage for unrestricted text. We explain how the basic software tools of computational morphology are used in linguistic processing, more specifically for automatic word form recognition and morphological tagging of the growing stock of electronic text corpora of a Bantu language such as Zulu. It is also shown how a machine-readable lexicon is in turn enhanced with the information acquired and extracted by means of such corpus analysis.
The purpose of this paper is to introduce an alternative word association measure aimed at addressing the under-extraction of collocations that contain high-frequency words. While measures such as MI make the important contribution of filtering out sheer word frequency in the detection of collocations in large corpora, one side effect of this filtering is that it becomes correspondingly difficult for such measures to detect true collocations involving high-frequency words. As an alternative, we propose normalizing the MI measure by dividing the frequency of a candidate lexeme by the number of senses of that lexeme. We base this alternative approach on the one-sense-per-collocation assumption of Yarowsky (1992; 1995). Ten verb-noun collocations involving three high-frequency verbs (make, take, run) are used to compare the extraction results of traditional MI and the proposed normalized MI. Results show that the ranking of these high-frequency verbs as candidate collocates with the target focal nouns is raised by normalizing MI as proposed. Side effects of these improved rankings are discussed, such as an increase in false positives resulting from higher recall. It is found that overall rank precision remains quite stable even with the increased recall of normalized MI.
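The contrast can be made concrete with a rough sketch; the exact MI variant and the normalization used in the paper may differ in detail, and the counts and sense numbers below are hypothetical (a sense count would typically come from an inventory such as WordNet).

    # Pointwise MI from joint and marginal corpus frequencies, and a sense-normalised variant.
    import math

    def pointwise_mi(f_xy, f_x, f_y, n):
        """MI(x, y) = log2( P(x, y) / (P(x) P(y)) ) for a corpus of n tokens."""
        return math.log2((f_xy * n) / (f_x * f_y))

    def sense_normalised_mi(f_xy, f_x, f_y, n, senses_x):
        # Divide the frequency of the high-frequency lexeme x by its sense count,
        # so its sheer frequency counts against the candidate pair less.
        return pointwise_mi(f_xy, f_x / senses_x, f_y, n)

    # hypothetical counts for a frequent, highly polysemous verb
    print(pointwise_mi(50, 120_000, 300, 1_000_000))
    print(sense_normalised_mi(50, 120_000, 300, 1_000_000, senses_x=40))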
In this paper, we propose an annotation scheme that can be used not only for annotating coreference relations between linguistic expressions, but also coreference relations between linguistic expressions and images, in scientific texts such as biomedical articles. Images in the biomedical domain often contain important information for analyses and diagnoses, and we consider that linking images to textual descriptions of their semantic contents in terms of coreference relations is useful for multimodal access to the information. We present our annotation scheme and the concept of a "coreference pool," which plays a central role in the scheme. We also introduce Open Ontology Forge, a support tool for text annotation that we have already developed, and the additional functions (ImageOF), now under development, that extend the software to cover image annotation.
The development of the evaluation of domain-specific cross-language information retrieval (CLIR) is traced in the context of the Cross-Language Evaluation Forum (CLEF) campaigns from 2000 to 2003. The preconditions, the usable data, and the additionally available instruments are described. The main goals of this CLEF task are to allow the evaluation of CLIR systems on structured data in a domain-specific area (rather than in the more general context of free-running journalistic text), with the additional possibility of making use of thesauri that were used for the intellectual indexing of the documents and are provided with the data. The parallel German-English GIRT4 corpus is described and some of the results of the CLEF 2004 campaign are discussed.
This paper shows, on the basis of a corpus study, how a model of the context should be structured for the generation of coreferring descriptions in French. We show that this way of structuring the context can help to generate more paraphrases and a particular kind of referring expression used to add information about the referent.
An innovative way of integrating Translation Memory (TM) and Machine Translation (MT) processing is presented which goes beyond the traditional "cascade" integration of Translation Memory and Machine Translation. The new method aims to automatically post-edit similar TM matches by means of an MT module, thus improving the TM fuzzy (similarity) scores as well as enabling the utilisation of low-score TM fuzzy matches. This leads to substantial translation cost reduction. The suggested method, which can be classified as an Example-Based Machine Translation application, is analysed and examples are provided for clarification. It is evaluated through test results that involve human interaction. The method has been implemented within the ESTeam Translator (ET) Language Toolbox and is already in use in the various commercial installations of ET.
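The abstract stays at a high level; one naive way to realise such post-editing of a fuzzy match is sketched below. This is not the ESTeam implementation: mt_translate() is a hypothetical MT call, and locating the mismatching fragments with a word-level diff and patching their translations into the stored target is an assumption made for illustration.

    # Naive sketch: splice MT output for the mismatching fragments into the TM target.
    import difflib

    def post_edit_fuzzy_match(source, tm_source, tm_target, mt_translate):
        src, tm_src = source.split(), tm_source.split()
        target = tm_target
        for op, i1, i2, j1, j2 in difflib.SequenceMatcher(a=tm_src, b=src).get_opcodes():
            if op != "replace":
                continue
            old_fragment = " ".join(tm_src[i1:i2])       # differs in the stored source
            new_fragment = " ".join(src[j1:j2])          # corresponding new source words
            old_translation = mt_translate(old_fragment)
            if old_translation in target:                # only patch what we can locate
                target = target.replace(old_translation, mt_translate(new_fragment), 1)
        return target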
After the successful completion of the Spoken Dutch Corpus (1998-2003), the time is ripe to sit back and reflect on our achievements and the procedures underlying them, in order to learn from our experiences. In this paper we pay particular attention to issues affecting the levels of linguistic annotation, but some more general issues deserve to be treated as well (bug reporting, consistency). We try to come up with solutions, but in some cases we would rather invite further discussion from other researchers.
Highly inflectional/agglutinative languages like Hungarian typically exhibit possible word forms in such numbers that automatic methods providing morphosyntactic annotation on the basis of a training corpus often face the problem of data sparseness. A possible solution to this problem is to apply a comprehensive morphological analyser, which is able to analyse almost all word forms, alleviating the problem of unseen tokens. However, there will still remain forms, although fewer in number, that are unknown even to the morphological analyser and have to be handled by some guesser mechanism. The paper describes a hybrid method that combines symbolic and statistical information to provide lemmatization and suffix analyses for unknown word forms. Evaluation is carried out with respect to the induction of possible analyses and their respective lexical probabilities for unknown word forms in a part-of-speech tagging system.
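The statistical half of such a hybrid approach might look roughly like the suffix-based guesser sketched below (not the authors' system); the maximum suffix length and the longest-suffix backoff are assumptions for illustration.

    # Sketch: collect tag distributions for word-final character sequences and back off
    # from the longest known suffix when guessing analyses for an unknown word form.
    from collections import defaultdict, Counter

    class SuffixGuesser:
        def __init__(self, max_suffix=5):
            self.max_suffix = max_suffix
            self.dist = defaultdict(Counter)      # suffix -> Counter of tags

        def train(self, tagged_words):
            for word, tag in tagged_words:
                for k in range(1, min(self.max_suffix, len(word)) + 1):
                    self.dist[word[-k:]][tag] += 1

        def guess(self, word):
            # back off from the longest known suffix to shorter ones
            for k in range(min(self.max_suffix, len(word)), 0, -1):
                counts = self.dist.get(word[-k:])
                if counts:
                    total = sum(counts.values())
                    return {tag: c / total for tag, c in counts.items()}
            return {}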
We applied data-driven methods to carry out automatic acquisition of Dutch prepositional support verb constructions (SVCs) from corpora (e.g., iets in de gaten houden ("keep an eye on something")). This paper addresses the questions whether linguistic diagnostics help to discard noise from the n-best lists and how to (semi-)automatically apply such linguistic diagnostics to parsed corpora. We show that some of the linguistic diagnostics proposed in Hollebrandse (1993) effectively identify SVCs and contribute a modest decrease in error rate.
A majority of Machine Aided Translation systems are based on comparisons between a source sentence and reference sentences stored in Translation Memories (TMs). The translation search is done by looking for sentences in a database that are similar to the source sentence. TMs have two basic limitations: the dependency on the repetition of complete sentences and the high cost of building a TM. As human translators not only remember sentences from their previous translations but also decompose the sentence to be translated and work with smaller units, it would be desirable to enrich the TM database with smaller translation units. This enrichment should also be automatic in order not to increase the cost of building a TM. We propose the application of two automatic bilingual segmentation techniques based on statistical translation methods in order to create new, shorter bilingual segments to be included in a TM database. An evaluation of the two techniques is carried out on a bilingual Basque-Spanish task.
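One generic way to obtain such sub-sentential bilingual segments from word-aligned sentence pairs is sketched below; it uses a standard alignment-consistency heuristic and is not necessarily either of the two statistical segmentation techniques evaluated in the paper.

    # Sketch: extract shorter bilingual segments whose source and target spans are
    # consistent with the word alignment (no link leaves the chosen spans).
    def extract_segments(src_words, tgt_words, alignment, max_src_len=4):
        """alignment: set of (i, j) pairs linking src_words[i] to tgt_words[j]."""
        segments = []
        for i1 in range(len(src_words)):
            for i2 in range(i1, min(i1 + max_src_len, len(src_words))):
                tgt_positions = [j for (i, j) in alignment if i1 <= i <= i2]
                if not tgt_positions:
                    continue
                j1, j2 = min(tgt_positions), max(tgt_positions)
                if all(i1 <= i <= i2 for (i, j) in alignment if j1 <= j <= j2):
                    segments.append((" ".join(src_words[i1:i2 + 1]),
                                     " ".join(tgt_words[j1:j2 + 1])))
        return segments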
This paper reflects on the recently completed African Speech Technology (AST) Project. The AST Project successfully developed eleven annotated telephone speech databases for five languages spoken in South Africa, namely Xhosa, Southern Sotho, Zulu, English and Afrikaans. These databases were used to train and test speech recognition systems applied in a multilingual telephone-based prototype hotel booking system. An overview is given of the database design and contents. The acquisition of the data is discussed with regard to the telephony interface, as well as speaker recruitment and briefing. Particular attention is given to some of the practical implications of acquiring appropriate data in under-developed communities. Database management processes such as transcription, quality control and validation are explained. This is followed by information on the development of the prototype. Results of usability tests are discussed, followed by an assessment of the Project as a whole.
The CGN corpus (Corpus Gesproken Nederlands/Corpus Spoken Dutch) is a large speech corpus of contemporary Dutch as spoken in Belgium (3.3 million words) and in the Netherlands (5.6 million words). Due to its size, manual phonemic annotation was limited to 10% of the data and automatic systems were used to complement this data. This paper describes the automatic generation of the phonemic annotations and the corresponding segmentations. First, we detail the processes used to generate possible pronunciations for each sentence and to select the most likely one. Next, we identify the remaining difficulties when handling the CGN data and explain how we solved them. We conclude with an evaluation of the quality of the resulting transcriptions and segmentations.
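A highly simplified illustration of the generate-then-select step follows; it is not the CGN pipeline. The rewrite rules shown are invented toy examples, and score() stands in for the acoustic (forced-alignment) likelihood that would drive the selection in practice.

    # Sketch: generate pronunciation variants by optionally applying rewrite rules to a
    # canonical phoneme string, then keep the variant with the best score.
    from itertools import combinations

    OPTIONAL_RULES = [("@ n", "@"), ("t #", "#")]   # toy n-deletion / final t-deletion

    def pronunciation_variants(canonical):
        variants = {canonical}
        for r in range(1, len(OPTIONAL_RULES) + 1):
            for rules in combinations(OPTIONAL_RULES, r):
                v = canonical
                for old, new in rules:
                    v = v.replace(old, new)
                variants.add(v)
        return variants

    def select_pronunciation(canonical, score):
        return max(pronunciation_variants(canonical), key=score)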
In this paper, we investigate whether a dataset derived from a multi-purpose corpus such as the Spoken Dutch Corpus may be considered appropriate for developing a taxonomy of wh-questions, and a model of the way in which these questions are integrated in spoken discourse. We compare the results obtained from the Spoken Dutch Corpus with a similar analysis of a large random collection of FAQs from the internet. We find substantial differences between the questions in spoken discourse and FAQs. Therefore, it may not be trivial to use a general purpose corpus as a starting point for developing models for human-computer interaction.
For many years the Istituto di Teoria e Tecniche dell'Informazione Giuridica (ITTIG) of the Consiglio Nazionale delle Ricerche has studied the evolution of legal language, creating databases for the documentation and digital retrieval of law texts. The ITTIG is working to document legal language through information technology in order to provide as wide an access as possible to its findings. The Institute has recently created an on-line digital database that includes the full text of the most important Italian laws (Codes and Constitutions) from the 16th to the 20th century. The ITTIG is also in the process of preparing another database made up of contexts from original 10th- to 20th-century legal sources.
An architecture is presented that provides an integrated framework for managing, archiving and accessing language resources. This architecture was discussed in the DELAMAN network, a world-wide network of archives holding material about endangered languages. Such a framework will be built upon a metadata infrastructure, a mechanism to resolve unique resource identifiers, and user and access rights management components. These components are closely related and have to be based on redundant and distributed services. For all of these components existing middleware seems to be available; however, it remains to be checked how they can interact with each other.
This paper presents specifications and requirements for the creation and validation of large lexica that are needed in Automatic Speech Recognition (ASR), Text-to-Speech (TTS) and statistical Speech-to-Speech Translation (SST) systems. The language resources are created and validated within the scope of the EU project LC-STAR (Lexica and Corpora for Speech-to-Speech Translation Components) during the years 2002-2005. Large lexica consisting of phonetic, suprasegmental and morpho-syntactic content will be provided, with well-documented specifications, for 13 languages. A short summary of the LC-STAR project itself is given. An overview of the specifications for corpus collection and word extraction, as well as of the specification and format of the lexica, is presented. Particular attention is paid to the validation of the produced lexica and the lessons learnt during pre-validation. The created and validated language resources will be made available via ELRA/ELDA.
This paper presents experiments for enlarging the Croatian Morphological Lexicon by applying an automatic acquisition methodology. The basic sources of information for the system are a set of morphological rules and a raw corpus. The morphological rules have been automatically derived from the existing Croatian Morphological Lexicon, and in our experiments we have used a subset of the Croatian National Corpus. The methodology has proved to be efficient for languages that, like Croatian, present a rich and mainly concatenative morphology. The method can be applied to the creation of new resources as well as to the enrichment of existing ones. We also present an extension of the system that uses automatic querying of the Internet to acquire those entries for which there is not enough information in our corpus.
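A simplified sketch of corpus-driven acquisition of this kind is given below; it is not the authors' system, and the paradigm table, suffix rules, and attestation threshold are invented for illustration.

    # Sketch: accept a candidate stem for a paradigm if enough of the word forms predicted
    # by that paradigm's suffix rules are attested in the raw corpus.
    from collections import Counter

    def attested_paradigms(corpus_tokens, candidate_stems, paradigms, min_forms=3):
        freq = Counter(corpus_tokens)
        accepted = {}
        for stem in candidate_stems:
            for name, suffixes in paradigms.items():
                predicted = [stem + suffix for suffix in suffixes]
                attested = sum(1 for form in predicted if freq[form] > 0)
                if attested >= min_forms:
                    accepted.setdefault(stem, []).append(name)
        return accepted

Entries that fail the attestation threshold could then be retried against web counts, in the spirit of the Internet-querying extension mentioned above.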