Proceedings of Machine Translation Summit VII
What should we do next for MT system development?
For efficiency reasons, Machine Translation systems are generally designed to eliminate ambiguities as early as possible even if delaying the decision would make a more informed choice possible. This paper takes the contrary view, arguing that essentially all choices should be deferred so that large numbers of competing translations will be produced in typical cases. Representing all the data structures in a suitable packed form, much as alternative structures are represented in a chart parser, makes this practicable.
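The packed representation the abstract refers to can be illustrated with a small sketch. This is not the paper's implementation; it is a toy shared-forest encoding (all names are illustrative) showing how competing translations can be stored once per ambiguity, as in a chart parser, and expanded or counted lazily.

```python
# Toy "packed" representation of competing translations, analogous to a
# chart parser's shared forest. Each ambiguous span stores its alternatives
# once; full translations are enumerated (or merely counted) on demand.

from itertools import product

def expansions(node):
    """Lazily enumerate the strings encoded by a packed node."""
    if isinstance(node, str):              # leaf: an unambiguous word
        yield node
    elif node[0] == "SEQ":                 # concatenation of sub-nodes
        for parts in product(*(expansions(n) for n in node[1:])):
            yield " ".join(parts)
    elif node[0] == "ALT":                 # packed ambiguity: alternatives
        for alt in node[1:]:
            yield from expansions(alt)

def count(node):
    """Count encoded translations without expanding them."""
    if isinstance(node, str):
        return 1
    if node[0] == "SEQ":
        total = 1
        for n in node[1:]:
            total *= count(n)
        return total
    return sum(count(n) for n in node[1:])  # ALT node

# Two independent ambiguities encode four translations in
# space proportional to the number of alternatives, not products.
packed = ("SEQ", ("ALT", "the", "a"), "bank", ("ALT", "closed", "shut"))
assert count(packed) == 4
assert "the bank closed" in set(expansions(packed))
```

The point of the encoding is that deferring a choice costs only one extra `ALT` node, while eager disambiguation would force a commitment before later context is available.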
What can MT do for multilingualism on the Net?
The recent rapid spread of the Internet suggests that soon everybody on earth will be able to communicate freely across national borders. The 21st century will be the age of multilingualism and multiculturalism, when various languages and cultures are dynamically exchanged on a global scale. One may call it the New Great Age of Translation. In such an age, Machine Translation (MT) is obviously expected to play an essential role. What kind of technological efforts, then, are required in the 21st century? A new type of MT product will be sought, different from the application products for traditional translation or real-time interpretation. This paper first clarifies the coming need and introduces an interactive and evolutionary MT technology. Second, it addresses Han characters (Kanji), whose origins and functions are distinct from those of alphabets. As highly developed visual symbols, Han characters are expected to promote Internet communication.
SAIL: present and future
Retrospect and prospect in computer-based translation
At the last MT Summit conference this century, this paper looks back briefly at what has happened in the 50 years since MT began, reviews the present situation, and speculates on what the future may bring. Progress in the basic processes of computerized translation has not been as dramatic as developments in computer technology and software. There is still much scope for the improvement of the linguistic quality of MT output, which hopefully developments in both rule-based and corpus-based methods can bring. Greater impact on the future MT scenario will probably come from the expected huge increase in demand for on-line real-time communication in many languages, where quality may be less important than accessibility and usability.
Controlled languages for machine translation: state of the art
Controlled language – issues in checkers’ design
The present paper deals with several recurrent issues in the design and implementation of controlled language checkers. It is based on market analysis and on LANT’s experience in building and customizing controlled language checkers.
Controlled language for multilingual machine translation
Translation technology applications for the localization industry
MT and TM technologies in localization industry: the challenge of integration
The objective of this paper is to clarify certain technological aspects of the localization business process. An introduction to the Translation Memory (TM) technology is provided, followed by an analysis of how TM and Machine Translation (MT), when used together, can increase productivity in software localization workflow applications. A special section is devoted to the issue of standard exchange mechanisms to represent translation memory data so that they can be shared among users of different TM and MT tools.
Regional survey: M(A)T in North America
We examine two North American case studies, each of which illustrates a different strategy for coming to terms with high-volume, high-quality translation. The first eschews MT in favour of translation memory technology; the second employs a controlled language to simplify the input to an MT system. Both strategies betray a certain dissatisfaction with the current state of machine translation, although neither alternative, it turns out, fully lives up to its expectations.
Computer assisted translation system – an Indian perspective
Work in the area of Machine Translation has been going on for several decades, and it was only during the early 90s that a promising translation technology began to emerge from advanced research in the fields of Artificial Intelligence and Computational Linguistics. This held the promise of successfully developing usable Machine Translation systems in certain well-defined domains. C-DAC took up this challenge, as we felt that India, being a multi-lingual and multi-cultural country with a population of approximately 950 million people and 18 constitutionally recognized languages, needs a translation system for instant transfer of information and knowledge. The other groups working in the area of English to Hindi translation are the National Center for Software Technology (NCST), who are working on translation of news stories, and the Electronics Research & Development Center of India (ER&DCI), who have developed a Machine Assisted Translation System for the health domain. A major project on Indian Languages to Indian Languages translation (Anusaaraka) is also under development at the University of Hyderabad.
The research and development of machine translation in China
This survey of the R&D of machine translation in China concentrates on how the design and application of MT systems are shaped by the characteristics of the Chinese language. After a brief historical overview and an investigation of the R&D environment, some technical features are described, including commercialization, the Chinese-English language pair, and multi-level transfer in English-Chinese MT systems.
Report on machine translation market in Japan
This paper reports the current situation of the machine translation (MT) market in Japan, based on a survey conducted through questionnaires and interviews. The research targets three groups: MT manufacturers (including sales agents), professional translators and translation agencies, and general users. We completed the questionnaire on the first group and are now querying the second group through interviews and questionnaires. According to the survey of manufacturers and vendors, shipments and sales of MT systems plunged during 1996 to 1998, but respondents are expecting a slight recovery in 1999 and 2000. Most respondents believe the primary requirement to raise shipments and sales is improvement of translation quality. The survey of translation professionals started with the first interview on June 25. We plan to interview at least 20 people in the translation industry in four meetings. The results will be reported orally at the conference site. We are also designing the questionnaire for general users, which we plan to finish by the end of this year.
Machine translation in Korea
This report introduces the current situation of machine translation in Korea. Recently, the need for further developing machine translation has been generally recognized. Although a few machine translation packages for Korean have been released on the market, they do not sufficiently meet the needs of users. As a result, the Korean machine translation field is only a niche market. However, several projects are underway in Korea, including world-wide technical cooperation. This report surveys the history of machine translation in Korea and describes the current market, R&D status, and current technical difficulties.
Prospects for advanced speech translation
Speech communication raises many important natural language processing issues that bear directly on the design of advanced speech translation systems. Advanced systems need to handle the interactive nature of speech communication, pragmatics in speech, and the arbitrariness of speech usage. General characteristics of speech communication are discussed, along with various viewpoints regarding interaction, pragmatics, and arbitrary usage. Some present speech translation approaches are explained and new basic technologies are introduced. Finally, this paper proposes a synthetic NLP technology for speech communication and speech translation, akin to a composite art form.
Robust spoken translation at ITC-IRST
In this paper the ITC-irst research issues and approach to the spoken translation problem will be presented together with a description of the demonstration system developed in the framework of C-STAR II Consortium. The challenge of future applications in the e-commerce and e-service sectors will also be presented and discussed.
Translation systems under the C-STAR framework
This talk will review our work on Speech Translation under the recent worldwide C-STAR demonstration. C-STAR is the Consortium for Speech Translation Advanced Research and now includes 6 partners and 20 partner/affiliate laboratories around the world. The work demonstrated concludes the second phase of the consortium, which has focused on translating conversational spontaneous speech as opposed to well-formed, well-structured text. As such, much of the work has focused on exploiting semantic and pragmatic constraints derived from the task domain and dialog situation to produce an understandable translation. Six partners have connected their respective systems with each other, enabling travel-related spoken dialogs among them. A common Interlingua representation was developed and used between the partners to make this multilingual deployment possible. The systems were also complemented by the introduction of Web-based shared workspaces that allow a user in one country to communicate pictures, documents, sounds, tables, etc. to another over the Web while referring to these documents in the dialog. Some of the partners' systems were also deployed in wearable situations, such as a traveler exploring a foreign city. In this case speech and language technology was installed on a wearable computer with a small hand-held display. It was used to provide language translation as well as human-machine information access for the purpose of navigation (using GPS localization) and tour guidance. This combination of human-machine and human-machine-human dialogs could allow a user to explore a foreign environment more effectively by resorting to human-machine and human-human dialogs wherever most appropriate.
A research perspective on how to democratize machine translation and translation aids aiming at high quality final output
Machine Translation (MT) systems and Translation Aids (TA) aiming at cost-effective high quality final translation are not yet usable by small firms, departments and individuals, and handle only a few languages and language pairs. This is due to a variety of reasons, some of them not frequently mentioned. But commercial, technical and cultural reasons make it mandatory to find ways to democratize MT and TA. This goal could be attained by: (1) giving users, free of charge, TA client tools and server resources in exchange for the permission to store and refine on the server linguistic resources produced while using TA; (2) establishing a synergy between MT and TA, in particular by using them jointly in translation projects where translators codevelop the lexical resources specific to MT; (3) renouncing the illusion of fully automatic general purpose high quality MT (FAHQMT) and going for semi-automaticity (SAHQMT), where user participation, made possible by recent technical network-oriented advances, is used to solve ambiguities otherwise computationally unsolvable due to the impossibility, intractability or cost of accessing the necessary knowledge; (4) adopting a hybrid (symbolic & numerical) and "pivot" approach for MT, where pivot lexemes are UNL or UNL-inspired English-oriented denotations of (sets of) interlingual acceptions or word/term senses, and the rest of the representation of utterances is either fully abstract and interlingual as in UNL, or, less ambitiously but more realistically, obtained by adding to an abstract English multilevel structure features underspecified in English but essential for other languages, including minority languages.
From parallel grammar development towards machine translation – a project overview
We give an overview of a MT research project jointly undertaken by Xerox PARC and XRCE Grenoble. The project builds on insights and resources in large-scale development of parallel LFG grammars. The research approach towards translation focuses on innovative computational technologies which lead to a flexible translation architecture. Efficient processing of "packed" ambiguities not only enables ambiguity preserving transfer. It is at the heart of a flexible architectural design, open for various extensions which take the right decisions at the right time.
MT from the research perspective
There has been a wide range of research and development work on machine translation in terms of goals and objectives as well as emphasis. Some are fundamental, some are engineering based and some are product-driven. Different researchers may be motivated by different reasons, some by funding, some by potential commercial returns, some by challenges on the application of various technologies and some by sheer search of knowledge. Trends also tend to emerge in terms of the 'best' approach at a given point in time. This short discussion advocates the possibility of synergising among all types of research while working towards different goals as opposed to looking for the best direction(s) to follow.
FAMT is alive and well
This invited talk describes the use of fully automatic machine translation (FAMT) at the Pan American Health Organization. Statistics covering 1998 are presented and analyzed in terms of productivity and cost savings. Feedback from several outside users of PAHO's translation software is also reported. Problems encountered in implementing machine translation in an international organization are discussed from the points of view of managers, translators, and end users. The talk concludes with a quick glimpse at what PAHO's MT development staff has been working on this year.
Applications using multilinguality: IR, summarization and generalization
Among the three main applications using multilinguality, i.e., information retrieval, summarization and text generation, the first one could be considered to be the core and the other two its supporting technologies. Information retrieval using multilinguality often appears in the form of allowing a query specified in a language to be answered by documents or information in one or more different languages. Summarization supports information retrieval by producing a database of intermediate representation of original documents, which contains only central and essential information. Text generation with multilingual capability helps create retrieved information in a desirable natural language. This brief paper identifies some issues regarding these three applications with emphasis on information retrieval.
A scalable cross-language metasearch architecture for multilingual information access on the Web
This position paper for the special session on "Multilingual Information Access" comprises three parts. The first part reviews possible demands for Multilingual Information Access (hereafter, MLIA) on the Web, and examines required technical elements. Among those, we, in the second part, focus on Cross-Language Information Retrieval (hereafter, CLIR), particularly a scalable architecture which enables CLIR in a large number of language combinations. Such a distributed architecture, developed around the XIRCH project (an international joint experimental project currently involving NTT, KRDL, and KAIST), is then described in some detail. The final part discusses some NLP/MT-related issues associated with such a CLIR architecture.
Complementing dictionary-based query translations with corpus statistics for cross-language IR
Sung Hyon Myaeng
For cross-language information retrieval (CLIR), queries or documents are often translated into the other language to create a mono-lingual information retrieval situation. Having surveyed recent research results on translation-based CLIR, we have convinced ourselves that an effective query translation method is an essential element for a practical CLIR system of reasonable quality. After summarizing the arguments and methods for query translation and survey results for dictionary-based translation methods, this paper describes a relatively simple yet effective method of using mutual information to handle the ambiguity problem known to be the major factor for low performance compared to the monolingual situation. Our experimental results based on the TREC-6 collection show that this method can achieve up to 85% of the monolingual retrieval case and 96% of the manual disambiguation case.
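The mutual-information idea described in this abstract can be sketched briefly. This is a hedged illustration, not the paper's method: the dictionary, co-occurrence counts, and function names below are invented toy values, and the sketch scores each combination of candidate translations by summed pairwise pointwise mutual information estimated from corpus statistics.

```python
# Toy sketch: disambiguating dictionary-based query translations with
# pointwise mutual information (PMI) from corpus co-occurrence counts.

import math
from itertools import product

def pmi(x, y, pair_count, word_count, total):
    """PMI of two words estimated from (toy) co-occurrence counts."""
    pxy = pair_count.get((x, y), pair_count.get((y, x), 0)) / total
    px, py = word_count[x] / total, word_count[y] / total
    return math.log(pxy / (px * py)) if pxy > 0 else float("-inf")

def disambiguate(candidates, pair_count, word_count, total):
    """Pick one translation per source term, maximizing summed pairwise PMI."""
    best, best_score = None, float("-inf")
    for combo in product(*candidates):
        score = sum(pmi(a, b, pair_count, word_count, total)
                    for i, a in enumerate(combo) for b in combo[i + 1:])
        if score > best_score:
            best, best_score = combo, score
    return best

# Invented example: for the query "bank interest", the financial sense
# "banque" co-occurs with "intérêt" far more often than "rive" does.
word_count = {"banque": 50, "rive": 40, "intérêt": 60}
pair_count = {("banque", "intérêt"): 20, ("rive", "intérêt"): 1}
candidates = [["banque", "rive"], ["intérêt"]]
assert disambiguate(candidates, pair_count, word_count, 10_000) == ("banque", "intérêt")
```

A real system would estimate the counts from a large target-language corpus and would avoid the exhaustive product over combinations for long queries.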
Machine translation for the next century
The panel intends to pick up some of the issues discussed in the Summit and discuss them further in the final session from broader perspectives. Since the Summit has not even started yet, I will just enumerate in this paper a list of possible perspectives on MT that I hope are relevant to our discussion.
Otelo and the Domino translation object
Lotus is working with SAP and a number of MT vendors to make usage of MT easier. Most of this work has been done in the framework of the Otelo project. It is also part of Lotus' efforts to make development of multilingual web applications much simpler. Terminology interchange and text interchange formats as well as a Linguistic Services API are discussed. Also covered is the Domino Translation Object which enables use of these technologies on the Domino infrastructure.
Sharing dictionaries among MT users by common formats and social filtering framework
MT users have to build "user dictionaries" in order to obtain high-quality translation results. However, building dictionaries needs time and labor. In order to meet the speed of the information flow in the global network society, we need to have common formats for sharing dictionaries among different MT systems, and a new way of dictionary authorization, that is "social filtering".
A customizable, self-learning parameterized MT system: the next generation
In this paper, the major problems of current machine translation systems are first outlined. A new direction, highlighting the system's capability to be customizable and self-learning, is then proposed for attacking the described problems, which mainly result from the very complicated characteristics of natural languages. The proposed solution adopts an unsupervised two-way training mechanism and a parameterized architecture to acquire the required statistical knowledge, such that the system can be easily adapted to different domains and the various preferences of individual users.
Human language technologies for the information society: roles, plans and visions of funding agencies
Benjamin K. Tsou
This panel deals with the general topic of evaluation of machine translation systems. The first contribution sets out some recent work on creating standards for the design of evaluations. The second, by Eduard Hovy, takes up the particular issue of how metrics can be differentiated and systematized. Benjamin K. Tsou suggests that whilst men may evaluate machines, machines may also evaluate men. John S. White focuses on the question of the role of the user in evaluation design, and Yusoff Zaharin points out that circumstances and settings may have a major influence on evaluation design.
Applying TDMT to abstracts on science and technology
In this paper, we discuss applying a translation model, "Transfer Driven Machine Translation" (TDMT), to document abstracts on science and technology. TDMT is a machine translation model developed by ATR-ITL to deal with dialogues in the travel domain, and ATR-ITL has reported that the TDMT system efficiently translates multilingual spoken dialogs. However, little is known about TDMT's ability to translate written text; therefore, we examined TDMT on written text translation from English to Japanese, in particular abstracts on science and technology produced by the Japan Science and Technology Corporation (JST). The experimental results show that TDMT can produce written-text translations.
UNL-French deconversion as transfer & generation from an interlingua with possible quality enhancement through offline human interaction
We present the architecture of the UNL-French deconverter, which "generates" from the UNL interlingua by first "localizing" the UNL form for French, within UNL, and then applying slightly adapted but classical transfer and generation techniques, implemented in GETA's Ariane-G5 environment, supplemented by some UNL-specific tools. Online interaction can be used during deconversion to enhance output quality and is now used for development purposes. We show how interaction could be delayed and embedded in the postedition phase, which would then interact not directly with the output text, but indirectly with several components of the deconverter. Interacting online or offline can improve the quality not only of the utterance at hand, but also of the utterances processed later, as various preferences may be automatically changed to let the deconverter "learn".
Solutions to problems inherent in spoken-language translation: the ATR-MATRIX approach
ATR has built a multi-language speech translation system called ATR-MATRIX. It consists of a spoken-language translation subsystem, which is the focus of this paper, together with a highly accurate speech recognition subsystem and a high-definition speech synthesis subsystem. This paper gives a road map of solutions to the problems inherent in spoken-language translation. Spoken-language translation systems need to tackle difficult problems such as ungrammaticality, contextual phenomena, speech recognition errors, and the high speeds required for real-time use. We have made great strides towards solving these problems in recent years. Our approach mainly uses an example-based translation model called TDMT. We have added the use of extra-linguistic information, a decision tree learning mechanism, and methods dealing with recognition errors.
Portuguese-Chinese machine translation in Macao
There have been substantial changes in computing practices in cyberspace, mainly as a result of the proliferation of low-priced, under-utilized, powerful heterogeneous computers connected by high-speed links. In this paper we review the evolution of computing platforms and introduce our Portuguese-Chinese corpus-based machine translation (CBMT) system, which employs a statistical approach with automatic bilingual alignment support. Our improved algorithm for aligning bilingual parallel texts achieves 97% accuracy. At the same time, we introduce a "distributed translation computing" concept to construct a uniform distributed shared-object workstation for technical-term retrieval, and to balance the computing load across a network of heterogeneous, intermittently under-utilized computers. With it, we can rapidly retrieve technical terms from noisy bilingual web text and build up the Portuguese-Chinese corpus base.
Example-based machine translation based on the synchronous SSTC annotation schema
Mosleh H. Al-Adhaileh
Tang Enya Kong
In this paper, we describe an Example-Based Machine Translation (EBMT) system for English-Malay translation. Our approach relies solely on example translations kept in a Bilingual Knowledge Bank (BKB). In our approach, a flexible annotation schema called Structured String-Tree Correspondence (SSTC) is used to annotate both the source and target sentences of a translation pair. Each SSTC describes a sentence, a representation tree, and the correspondences between substrings in the sentence and subtrees in the representation tree. With both the source and target SSTCs established, a translation example in the BKB can then be represented effectively in terms of a pair of synchronous SSTCs. In the process of translation, we first build the representation tree for the source sentence (English) using an example-based parsing algorithm. By referring to the resultant source parse tree, we then synthesize the target sentence (Malay) based on the target SSTCs pointed to by the synchronous SSTCs, which encode the relationship between source and target SSTCs.
Inducing translation templates for example-based machine translation
This paper describes an example-based machine translation (EBMT) system which relies on various knowledge resources. Morphological analysis abstracts the surface forms of the languages to be translated. A shallow syntactic rule formalism is used to percolate features in derivation trees. Translation examples serve the decomposition of the text to be translated and determine the transfer of lexical values into the target language. Translation templates determine the word order of the target language and the type of phrases (e.g. noun phrase, prepositional phrase, ...) to be generated in the target language. An induction mechanism generalizes translation templates from translation examples. The paper outlines the basic idea underlying the EBMT system and investigates the possibilities and limits of the translation template induction process.
Development of an intranet MT system adapting to usage domain
Machine translation (MT) systems arrived with great promise. Today, however, the development of MT systems is losing the vigor it once had. The cause of the system's infrequent use became clear as a survey of usage patterns in our company progressed: the results of the MT system were not as expected. This study analyzed the usage patterns and characteristics of the translated documents, including technical documents of the copier business division's service department and specifications and drawings prepared in overseas factories. The conclusion drawn from the analysis was that any MT system should include an adequate dictionary and the ability to select appropriate adverbs and verbs by applying co-occurrence rules. Furthermore, an MT system should be able to translate fixed-form sentences that are used repeatedly. After the usability of the MT system was improved, the translation staff started using it more frequently in various sections of our company. Moreover, we developed an MT system incorporating the above functions. Accordingly, the machine-translated documents turned out as expected. In this paper, I report on the circumstances of our MT development and discuss the requirements for an MT system.
The next step: moving to an integrated MT system for high-volume environments
Walter K. Hartmann
Within the realm of small to medium-sized translation companies, the demands placed on MT in a high-volume production environment plagued with extremely demanding turn-around times and cost pressures are quite different from most other uses of MT. With the help of an analysis of a typical project, the author shows the need for MT to become an integrated part of a translation application which will reduce the amount of extraneous processing to a minimum. In conclusion, a system is proposed which will streamline all the ancillary processes in order to conform to customers' turn-around demands without jeopardizing post-editing quality.
TransRouter : a decision support tool for translation managers
Translation managers often have to decide on the most appropriate way to deal with a translation project. Possible options may include human translation, translation using a specific terminology resource, translation in interaction with a translation memory system, and machine translation. The decision making involved is complex, and it is not always easy to decide by inspection whether a specific text lends itself to certain kinds of treatment. TransRouter supports the decision making by offering a suite of computer based tools which can be used to analyse the text to be translated. Some tools, such as the word counter, the repetition detector, the sentence length estimator and the sentence simplicity checker look at characteristics of the text itself. A version comparison tool compares the new text to previously translated texts. Other tools, such as the unknown terms detector and the translation memory coverage estimator, estimate overlap between the text and a set of known resources. The information gained, combined with further information provided by the user, is input to a decision kernel which calculates possible routes towards achieving the translation together with their cost and consequences on translation quality. The user may influence the kernel by, for example, specifying particular resources or refining routes under investigation. The final decision on how to treat the project rests with the translation manager.
Deploying the SAE J2450 translation quality metric in MT projects
This paper provides a nutshell description of how the recently published proposal of a translation quality metric for automotive service information is applicable in an evaluation scenario that deploys multilingual human language technology (mHLT). This proposal is the result of the J2450 task force group of the Society of Automotive Engineers (SAE). The main focus of the developed metric is on the syntactic level of a translation product. Since it is our belief that any evaluation of a translation (human and machine) should also take into account the semantic level of a human language product, we have slightly reshaped the SAE J2450 metric. In addition, we have embedded the whole evaluation process into an object-oriented quality model approach to account for the established business processes in the acquisition, production, translation and dissemination of automotive service information in SGML/XML environments. This scenario then provides the solid grounding for the setup of a quality assurance process for all dimensions related to the processing (human and machine) of automotive service information. The work reported here is one part of the ongoing European Multidoc project, which has brought together several European automotive companies to tame the complexity of service information products in an integrated way. Within Multidoc, integration means first and foremost the coupling of advanced information technology and mHLT. These aspects will be further motivated and detailed in the context of the specification of an evaluation scenario.
Evaluation experiment for reading comprehension of machine translation outputs
This paper proposes evaluation methods for reading comprehension of English to Japanese translation outputs. The methods were designed not only to evaluate the performance of current systems, but to evaluate the performance of future systems had the current problems been solved. The experiments have shown that the proposed methods are capable of producing results that are statistically significant, and that improvement in certain linguistic aspects will result in significant improvement in the comprehension level.
Study on evaluation of WWW MT systems
Compared with off-line machine translation (MT), MT for the WWW has more evaluation factors, such as translation accuracy of text, interpretation of HTML tags, consistency with various protocols and browsers, and translation speed for net surfing. Moreover, the speed of technical innovation and its practical application is fast, including the appearance of new protocols. Improvement of MT software for the WWW will enable the sharing of information from around the world and make a great contribution to mankind. Despite the importance of general evaluation studies on MT software for the WWW, it appears that such studies have not yet been conducted. Since MT for the WWW will be a critical factor for future international communication, its study and evaluation is an important theme. This study aims at standardized evaluation of MT for the WWW, and suggests an evaluation method focusing on unique aspects of the WWW independent of text. This evaluation method is widely applicable without depending on specific languages. Twenty-four items specific to the WWW were evaluated across six MT software products for the WWW. This study clarified various issues which should be improved in the future regarding MT software for the WWW, as well as issues in the evaluation technology of MT on the Internet.
A new evaluation method for speech translation systems and a case study on ATR-MATRIX from Japanese to English
ATR-MATRIX is a multi-lingual speech-to-speech translation system designed to facilitate communications between two parties of different languages engaged in a spontaneous conversation in a travel arrangement domain. In this paper, we propose a new evaluation method for speech translation systems. Our current focus is on measuring the robustness of a language translation sub-system, with quick calculation and low cost. Therefore, we calculate the difference between the translation output from transcription texts and the translation output from input speech by a dynamic programming method. We present the first trial experiment of this method applied to our Japanese-to-English speech translation system. We also provide related discussions on such points as error analysis and the relationship between the proposed method and translation quality evaluation manually done by humans.
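The core of the proposed measure — a dynamic-programming comparison between the translation of the clean transcript and the translation of the recognized speech — can be sketched as follows (a simplified illustration, not ATR's implementation; the normalization choice is our assumption):

```python
# Sketch: robustness of a translation sub-system measured as the word-level
# edit distance, computed by dynamic programming, between the translation of
# a clean transcript and the translation of recognized speech.
# A score of 0 means the noisy input did not change the translation at all.

def edit_distance(ref_words, hyp_words):
    """Classic DP edit distance over word sequences."""
    m, n = len(ref_words), len(hyp_words)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n]

def robustness_score(transcript_translation, speech_translation):
    """Difference between the two outputs, normalized by reference length
    (the normalization is our assumption, for illustration only)."""
    ref = transcript_translation.split()
    hyp = speech_translation.split()
    return edit_distance(ref, hyp) / max(len(ref), 1)
```

Because both inputs are machine outputs, no human reference is needed, which is what makes the measure quick and cheap.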
Machine translation for information access across the language barrier: the MuST system
In this paper we describe the design and implementation of MuST, a multilingual information retrieval, summarization, and translation system. MuST integrates machine translation and other text processing services to enable users to perform cross-language information retrieval using available search services such as commercial Internet search engines. To handle non-standard languages, a new Internet indexing agent can be deployed, specialized local search services can be built, and shallow MT can be added to provide useful functionality. A case study of augmenting MuST with Indonesian is included. MuST adopts ubiquitous web browsers as its primary user interface, and provides tightly integrated automated shallow translation and user biased summarization to help users quickly judge the relevance of documents.
Multilingual document language recognition for creating corpora
In this paper we describe a language recognition algorithm for multilingual documents that is based on mixed-order n-grams, Markov chains, maximum likelihood, and dynamic programming. We present the results of an experimental study that showed that the performance of this algorithm has practical value.
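A minimal, hypothetical illustration of this family of techniques — fixed-order character n-gram models scored by maximum likelihood (a simplification of the paper's mixed-order Markov chains; the add-one smoothing and toy training strings are our assumptions):

```python
# Hypothetical sketch of n-gram language recognition: each language model
# scores a string by the log-likelihood of its character bigrams, and the
# maximum-likelihood language wins. Toy training data, add-one smoothing.

import math
from collections import Counter

def train_model(text, n=2):
    """Character n-gram counts; smoothing is applied at query time."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    return Counter(grams), len(grams), n

def log_likelihood(model, text):
    counts, total, n = model
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    vocab = len(counts) + 1
    return sum(math.log((counts[g] + 1) / (total + vocab)) for g in grams)

def identify(models, text):
    """Maximum-likelihood decision among candidate language models."""
    return max(models, key=lambda lang: log_likelihood(models[lang], text))

models = {
    "en": train_model("the quick brown fox jumps over the lazy dog and then the cat"),
    "de": train_model("der schnelle braune fuchs springt ueber den faulen hund und dann"),
}
```

For multilingual documents, the paper adds dynamic programming over the sequence of per-segment decisions; the per-segment scoring above is the building block.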
Interactive MT as support for non-native language authoring
The paper describes an approach to developing an interactive MT system for translating technical texts, using the translation of patent claims between Russian and English as an example. The approach conforms to the human-aided machine translation paradigm. The system is meant for a source language (SL) speaker who does not know the target language (TL). It consists of i) an analysis module, which includes a submodule for interactive syntactic analysis of the SL text and a submodule for fully automated morphological analysis; ii) an automatic module for transferring the lexical and partially the syntactic content of the SL text into a similar content of the TL text; and iii) a fully automated TL text generation module which relies on knowledge about the legal format of TL patent claims. The interactive analysis module guides the user through a sequence of SL analysis procedures, as a result of which the system produces a set of internal knowledge structures which serve as input to TL text generation. Both analysis and generation rely heavily on the analysis of the sublanguage of patent claims. The model has been developed for English and Russian as both SLs and TLs but is readily extensible to other languages.
Formalizing translation memories
The TELA structure, a set of layered and linked lattices, and the notion of Similarity between TELA structures, based on the Edit Distance, are introduced in order to formalize Translation Memories (TM). We show how this approach leads to a real gain in recall and precision, and allows extending TM towards a rudimentary, yet useful, Example-Based Machine Translation that we call Shallow Translation.
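As a rough sketch of the retrieval side of such a formalization (our construction, not the TELA structure itself), a translation memory can rank stored segments by token-sequence similarity; `difflib`'s ratio is a related, though not identical, edit-based measure:

```python
# Minimal translation-memory lookup: retrieve the stored source segment most
# similar to the query, with similarity defined over token sequences.
# The threshold value and toy memory are illustrative assumptions.

from difflib import SequenceMatcher

def similarity(a, b):
    """Token-sequence similarity in [0, 1]."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()

def tm_lookup(memory, query, threshold=0.6):
    """Return the best (source, target, score) above threshold, else None."""
    best_src = max(memory, key=lambda src: similarity(src, query))
    score = similarity(best_src, query)
    return (best_src, memory[best_src], score) if score >= threshold else None

memory = {
    "press the start button": "appuyez sur le bouton de démarrage",
    "close the cover": "fermez le couvercle",
}
```

The paper's contribution is to define this similarity over several linked layers (not just surface tokens), which is what buys the reported gain in recall and precision.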
The PARS family of MT systems : a 15-year love story
Michael S. Blekhman
The paper traces the history of the PARS family of commercial machine translation systems for Russian, Ukrainian, English, and German, developed by Lingvistica '98 Inc. It discusses three aspects: retrospective, technological, and linguistic. The main focus is on dictionary updating as one of the most important components of a commercial MT product. Each of the PARS systems features a unique tagging option, which makes it possible for the user to have grammatical data assigned automatically to Russian and Ukrainian words entered into the dictionaries. In addition, PARS dictionary officers make use of batch-mode tagging technology, thanks to which PARS features very large bidirectional Russian-English general and specialist dictionaries of more than 1,000,000 translations for each translation direction, as well as large bidirectional Ukrainian-English professional dictionaries. The PARS family was designed in the mid 1980s and has been in commercial use all over the world since 1989.
Parallel text collections at Linguistic Data Consortium
The Linguistic Data Consortium (LDC) is an open consortium of universities, companies and government research laboratories. It creates, collects and distributes speech and text databases, lexicons, and other resources for research and development purposes. This paper describes past and current work on creation of parallel text corpora, and reviews existing and upcoming collections at LDC.
The ELAN Slovene-English aligned corpus
Multilingual parallel corpora are a basic resource for research and development of MT. Such corpora are still scarce, especially for lower-diffusion languages. The paper presents a sentence-aligned tokenised Slovene-English corpus, developed in the scope of the EU ELAN project. The corpus contains 1 million words from fifteen recent terminology-rich texts and is encoded according to the Guidelines for Text Encoding and Interchange (TEI). Our document type definition is a parametrisation of the TEI which directly encodes translation units of the bi-texts, in a manner similar to that of translation memories. The corpus is intended as a widely-distributable dataset for language engineering and for translation and terminology studies. The paper describes the compilation of the corpus, its composition, encoding and availability. We highlight the corpus acquisition and distribution bottlenecks and present our solutions. These have to do with the workflow in the project and, not unrelatedly, with the encoding scheme for the corpus.
Harmonised large-scale syntactic/semantic lexicons: a European multilingual infrastructure
The paper aims at providing an overview of the situation of Language Resources (LR) in Europe, in particular as it emerges from a few European projects concerned with the construction of large-scale harmonised resources to be used for many application purposes, including multilingual ones. An important research aspect of the projects lies in the very fact that the large enterprise described is, to our knowledge, the first attempt at developing wide-coverage lexicons for so many languages (12 European languages), with a harmonised common model, and with encoding of structured "semantic types" and semantic (subcategorisation) frames on a large scale. Reaching a commonly agreed model grounded on sound theoretical approaches within a very large consortium is in itself a challenging task. The actual lexicons will then provide a framework for testing and evaluating the maturity of the current state of the art in lexical semantics grounded on, and connected to, a syntactic foundation. Another research aspect is the recognition of the necessity of accompanying these "static" lexicons with dynamic means of acquiring lexical information from large corpora. This is one of the challenging research aspects of a global strategy for building a large and useful multilingual LR infrastructure.
Developing knowledge bases for MT with linguistically motivated quality-based learning
In this paper we present a proposal to help bypass the bottleneck of knowledge-based systems working under the assumption that the knowledge sources are complete. We show how to create, on the fly, new lexicon entries using lexico-semantic rules and how to create new concepts for unknown words, investigating a new linguistically-motivated model to trigger concepts in context.
A pipelined multi-engine approach to Chinese-to-Korean machine translation: MATES/CK
This paper presents MATES/CK, a Chinese-to-Korean machine translation system. We introduce the design philosophy, component modules, implementation and other aspects of the MATES/CK system.
Machine translation system PENSÉE: system design and implementation
This paper describes a new version of our machine translation system PENSÉE. In light of its past systems, the new PENSÉE is designed to improve portability from the developers' point of view and translation quality from the users' point of view. The features of the new PENSÉE are: 1) Java implementation and 2) a pattern-based transfer approach. In addition, the new PENSÉE places great importance on the user interface, especially for building user dictionaries. We discuss why and how we resolve existing MT problems and present dictionary building tools to support user customization.
Rapid development of translation tools
The Computing Research Laboratory is currently developing technologies that allow rapid deployment of automatic translation capabilities. These technologies are designed to handle low-density languages for which resources, be they human informants or data in electronically readable form, are scarce. All tools are built in an incremental fashion, such that some simple tools (a bilingual dictionary or a glosser) can be delivered early in development to support initial analysis tasks. More complex applications can be fielded in successive functional versions. The technology we demonstrate was first applied to Persian-English machine translation within the Shiraz project and is currently being extended to cover languages such as Arabic, Japanese, Korean and others.
The use of abstracted knowledge from an automatically sense-tagged corpus for lexical transfer ambiguity resolution
Namwon Heo, Kyounghi Moon
This paper proposes a method for lexical transfer ambiguity resolution using corpus and conceptual information. Previous research has restricted the use of linguistic knowledge to the lexical level. Since the extracted knowledge is stored on the words themselves, such methods require a large amount of space and have a low recall rate. In contrast, we resolve word sense ambiguity by using concept co-occurrence information extracted from an automatically sense-tagged corpus. In an experiment, the method achieved, on average, a precision of 82.4% for nominal words and 83% for verbal words. Considering that the test corpus is completely independent of the learning corpus, this is a promising result.
Towards the automatic acquisition of lexical selection rules
This paper is a study of a certain type of collocation and of its implications for, and application to, the acquisition of lexical selection rules in transfer-based MT systems. Collocations reveal the co-occurrence possibilities of linguistic units in one language, which often require lexical selection rules to enhance the natural flow and clarity of MT output. The study presents an automatic acquisition and human verification process to acquire collocations and suggest possible candidates for lexical selection rules. The mechanism has been used in the development and enhancement of Chinese-English and Japanese-English MT systems, and can easily be adapted to other language pairs. Future work includes expanding its usage to more language pairs and furthering its application to MT customers.
A bootstrap approach to automatically generating lexical transfer rules
We describe a method for automatically generating Lexical Transfer Rules (LTRs) from word equivalences using transfer rule templates. Templates are skeletal LTRs, unspecified for words. New LTRs are created by instantiating a template with words, provided that the words belong to the appropriate lexical categories required by the template. We define two methods for creating an inventory of templates and using them to generate new LTRs. A simpler method consists of extracting a finite set of templates from a sample of hand coded LTRs and directly using them in the generation process. A further method consists of abstracting over the initial finite set of templates to define higher level templates, where bilingual equivalences are defined in terms of correspondences involving phrasal categories. Phrasal templates are then mapped onto sets of lexical templates with the aid of grammars. In this way an infinite set of lexical templates is recursively defined. New LTRs are created by parsing input words, matching a template at the phrasal level and using the corresponding lexical categories to instantiate the lexical template. The definition of an infinite set of templates enables the automatic creation of LTRs for multi-word, non-compositional word equivalences of any cardinality.
Target word selection with co-occurrence and translation information
Tong Loong Cheong
Using a target language model for domain independent lexical disambiguation
In this paper we describe a lexical disambiguation algorithm based on a statistical language model we call maximum likelihood disambiguation. The maximum likelihood method depends solely on the target language. The model was trained on a corpus of American English newspaper texts. Its performance was tested using output from a transfer based translation system between Turkish and English. The method is source language independent, and can be used for systems translating from any language into English.
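The idea can be sketched with a toy bigram target-language model (a stand-in of ours for the paper's newspaper-trained model, with add-one smoothing as an assumed detail): each candidate target word is inserted into the output sentence, and the candidate whose sentence the model scores highest wins.

```python
# Sketch of maximum-likelihood lexical disambiguation: given several
# candidate English translations for a source word, pick the one that the
# target-language model assigns the highest probability in context.
# The model depends only on the target language, as in the paper.

import math
from collections import Counter

def train_bigrams(corpus_sentences):
    """Unigram and bigram counts from a (toy) target-language corpus."""
    uni, bi = Counter(), Counter()
    for s in corpus_sentences:
        words = ["<s>"] + s.split()
        uni.update(words)
        bi.update(zip(words, words[1:]))
    return uni, bi

def score(model, sentence):
    """Add-one-smoothed bigram log-likelihood of a sentence."""
    uni, bi = model
    words = ["<s>"] + sentence.split()
    v = len(uni) + 1
    return sum(math.log((bi[(a, b)] + 1) / (uni[a] + v))
               for a, b in zip(words, words[1:]))

def disambiguate(model, template, candidates):
    """Fill the '_' slot with the candidate the language model likes best."""
    return max(candidates, key=lambda w: score(model, template.replace("_", w)))
```

Because only the target side is modeled, the same disambiguator serves any source language translating into English, which is the property the abstract emphasizes.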
Article selection using probabilistic sense disambiguation
A probabilistic method is used for word sense disambiguation, where the features taken are the six surrounding words. As their surface forms are used, no syntactic or semantic analysis is required. Despite its simplicity, this method is able to disambiguate the noun "interest" accurately. Using the common data set of (Bruce & Wiebe 94), we have obtained an average accuracy of 86.6%, compared with their reported figure of 78%. This portable technique can be applied to the task of English article selection, a problem that arises in machine translation into English from any source language without articles. Using texts from the Wall Street Journal, we achieved an overall accuracy of 83.1% for the 1,500 most commonly used head nouns.
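A naive-Bayes formulation over surface-form context features, in the spirit of the method described (the toy data and the smoothing details are our assumptions, not the paper's):

```python
# Sketch of probabilistic sense disambiguation: the features are the surface
# forms of the surrounding words (the paper uses a window of six), and the
# most probable sense given those features is chosen via naive Bayes.

import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (context_words, sense) pairs."""
    sense_counts = Counter()
    feat_counts = defaultdict(Counter)
    for words, sense in examples:
        sense_counts[sense] += 1
        feat_counts[sense].update(words)
    return sense_counts, feat_counts

def classify(model, context_words):
    """Pick the sense maximizing P(sense) * prod P(word | sense)."""
    sense_counts, feat_counts = model
    total = sum(sense_counts.values())
    vocab = len({w for c in feat_counts.values() for w in c}) + 1
    def logp(sense):
        n = sum(feat_counts[sense].values())
        return (math.log(sense_counts[sense] / total)
                + sum(math.log((feat_counts[sense][w] + 1) / (n + vocab))
                      for w in context_words))
    return max(sense_counts, key=logp)
```

For article selection, the "senses" would simply be the article choices (a/an, the, none) conditioned on the head noun's context.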
Compound noun decomposition using a Markov model
Yung Taek Kim
A statistical method for compound noun decomposition is presented. Previous studies of this problem showed that some statistical information is helpful, but the application of statistical information was not systematic, so that performance depended heavily on the algorithm, and some algorithms involved many separate steps. In our work, statistical information is collected from a manually decomposed compound noun corpus to build a Markov model of composition. Two Markov chains representing the statistical information are assumed independent: one for the sequence of participants' lengths and another for the sequence of participants' features. Besides the Markov assumptions, a least-participants preference assumption is also used. These two assumptions turn the decomposition algorithm into a kind of conditional dynamic programming, so that efficient and systematic computation can be performed. When applied to test data of size 5,027, we obtained a precision of 98.4%.
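A stripped-down sketch of the decomposition search (our illustration): dynamic programming over split points, keeping only the least-participants preference and omitting the two Markov chains over lengths and features.

```python
# Sketch of compound-noun decomposition by dynamic programming: among all
# ways of splitting the compound into dictionary segments ("participants"),
# prefer the decomposition with the fewest segments. In the full model the
# DP score would also include the length and feature Markov chains.

from functools import lru_cache

def decompose(word, dictionary):
    """Return the least-participants decomposition as a list, or None."""
    @lru_cache(maxsize=None)
    def best(i):
        # best(i) = (segment count, segments) for the suffix word[i:]
        if i == len(word):
            return (0, ())
        candidates = []
        for j in range(i + 1, len(word) + 1):
            if word[i:j] in dictionary:
                rest = best(j)
                if rest is not None:
                    candidates.append((1 + rest[0], (word[i:j],) + rest[1]))
        return min(candidates) if candidates else None
    result = best(0)
    return list(result[1]) if result else None
```

With Markov-chain scores added, `min` would compare DP scores instead of raw segment counts, but the search structure stays the same.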
English-to-Korean Web translator : “FromTo/Web-EK”
The previous English-Korean MT systems developed in Korea have dealt only with written text as their translation object. Most of them enumerated the following list of problems that did not seem easy to solve in the near future: 1) processing of non-continuous idiomatic expressions; 2) reduction of excessive POS or structural ambiguities; 3) robust processing of long sentences and parsing failures; 4) selection of the correct word correspondence among several alternatives. These problems can be considered important factors influencing the translation quality of a machine translation system. This paper describes not only our solutions to the problems of previous English-to-Korean machine translation systems but also the management of HTML tags between two structurally different languages, English and Korean. Through these solutions we successfully translate English web documents into Korean in the English-to-Korean web translator "FromTo/Web-EK", which has been under development since 1997.
ALTFLASH: a Japanese-to-English machine translation system for market flash reports
We have developed a Japanese-to-English machine translation system for market flash reports called ALTFLASH. ALTFLASH is a hybrid system based on a combination of rule-based and template-based translation. Experimental results showed that the system could achieve good translations for 90% of source sentences (70% of articles) in reports on the foreign section of the Tokyo Stock Exchange. In addition, we focused on account settlement flashes, which follow fixed patterns, and developed a new system to translate them. This system was installed by Nihon Keizai Shimbun (Nikkei) in March 1998 in their English translation service for news flashes on settlements of accounts. It is a fully automatic translation system that enables news flashes to be broadcast to the world without human intervention.
ALT-J/M a prototype Japanese-to-Malay translation system
In this report we introduce ALT-J/M: a prototype Japanese-to-Malay translation system. The system is a semantic transfer based system that uses the same translation engine as ALT-J/E, a Japanese-to-English system.
An automated mandarin document revision system using both phonetic and radical approaches
The pitfalls and complexities of Chinese to Chinese conversion
From To K/E: a Korean-English machine translation system based on idiom recognition and fail softening
In this paper we describe and experimentally evaluate FromTo K/E, a rule-based Korean-English machine translation system adopting the transfer methodology. In accordance with the view that a successful Korean-English machine translation system presupposes a highly efficient, robust Korean parser, we develop a parser reinforced with "fail softening", i.e. long-sentence segmentation and the recovery of failed parse trees. To overcome the typological differences between Korean and English, we adopt a powerful module for processing Korean multi-word lexemes and Korean idiomatic expressions. Furthermore, prior to parsing Korean sentences, we try to resolve the ambiguity of words with unknown grammatical functions on the basis of collocation and subcategorization information. The results of the experimental evaluation show that the degree of understandability for a sample of 2,000 sentences amounts to 2.67, indicating that the meaning of the translated English sentences is almost clear to users, although the sentences still include minor grammatical or stylistic errors affecting up to 30% of the words.
Transfer-based Japanese-Chinese translation implemented on an e-mail system
A Cantonese-English machine translation system PolyU-MT-99
WEBTRAN: a controlled language machine translation system for building multilingual services on Internet
Improvement of translation quality of English newspaper headlines by automatic preediting
Since the headlines of English news articles have a characteristic style, different from the styles that prevail in ordinary sentences, it is difficult for MT systems to generate high-quality translations of headlines. We try to solve this problem by adding to an existing system a preediting module which rewrites headlines as ordinary expressions. Rewriting headlines makes it possible to generate better translations that would not otherwise be generated, with little or no change to the existing parts of the system. Focusing on the absence of forms of the verb 'be', we have written rewriting rules for properly inserting 'be' into headlines.
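A toy version of such preediting rules (illustrative patterns of our own, not the authors' rule set) might look like:

```python
# Sketch of headline preediting: rewrite the verbless headline style into an
# ordinary sentence by inserting a form of 'be', so that a downstream MT
# system sees a well-formed input. The two rules below are hypothetical
# examples of the rule format, not the rules from the paper.

import re

RULES = [
    # plural noun phrase + past participle: "Two boys arrested"
    (re.compile(r"^(\w+ \w+s) (\w+ed)\b"), r"\1 were \2"),
    # noun + predicative adjective: "President ready to talk"
    (re.compile(r"^(\w+) (ready|set|due)\b"), r"\1 is \2"),
]

def preedit(headline):
    """Apply the first matching rewriting rule; pass through otherwise."""
    for pattern, repl in RULES:
        new, n = pattern.subn(repl, headline)
        if n:
            return new
    return headline
```

The real module would need POS information rather than surface patterns, but the pipeline position is the point: the MT engine itself stays unchanged.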
Transfer in experience-guided machine translation
Experience-Guided Machine Translation (EGMT) seeks to represent translators' knowledge of translation as experiences and to translate by analogy. The transfer in EGMT finds the experiences most similar to a new text and its parts, segments the text into units of translation, translates them by analogy to the experiences, and then assembles them into a whole. A research prototype of analogical transfer from Chinese to English has been built to prove the viability of the approach in the exploration of a new architecture for machine translation. The paper discusses how experiences are represented and selected with respect to a new text. It describes how units of translation are defined, and how partial translations are derived and composed into a whole.
Example-based machine translation of part-of-speech tagged sentences by recursive division
Example-Based Machine Translation can be applied to languages for which resources such as dictionaries and reliable syntactic analyzers are hardly available, because it can learn from new translation examples. However, difficulties remain in the translation of sentences which are not fully covered by the matching sentence. To solve this problem, we present a translation method which recursively divides a sentence and translates each part separately. In addition, we evaluate an analogy-based word-level alignment method which predicts word correspondences between the source and translation sentences of new translation examples. The translation method was implemented in a French-Japanese machine translation system, and spoken-language texts were used as examples. Promising translation results were obtained, and the effectiveness of the alignment method in translation was confirmed.
A new way to conceptual meaning representation
Using computational semantics for Chinese translations
Sources of linguistic knowledge for minority languages
Harold L. Somers
Language Engineering (LE) products and resources for the world’s “major” languages are steadily increasing, but there remains a major gap as regards less widely-used languages. This paper considers the current situation regarding LE resources for some of the languages in question, and some proposals for rectifying this situation are made, including techniques based on adapting existing resources and “knowledge extraction” techniques from machine-readable corpora.
BITS: a method for bilingual text search over the Web
Mark Y. Liberman
Parallel corpora are a valuable resource for machine translation, multilingual text retrieval, language education and other applications, but for various reasons their availability is very limited at present. Noticing that the World Wide Web is a potential source of parallel text, researchers are making efforts to mine the Web in order to build large collections of bitext. This paper presents BITS (Bilingual Internet Text Search), a system which harvests multilingual texts from the World Wide Web with virtually no human intervention. The technique is simple, easy to port to any language pair, and highly accurate. The results of experiments on the German-English pair show that the method is very successful.
Sharing syntactic structures
Bracketed corpora are a very useful resource for natural language processing, but they are hard to build efficiently, leading to quantitative insufficiency for practical use. Disparities in morphological information, such as word segmentation and part-of-speech tag sets, are also troublesome: an application specific to a particular corpus often cannot be applied to another corpus. In this paper, we sketch a method to build a corpus that has a fixed syntactic structure but varying morphological annotation, based on the different tag set schemes utilized. Our system uses a two-layered grammar, one layer of which is made up of replaceable tag-set-dependent rules while the other has no such tag set dependency. The input sentences to our system are bracketed according to the structural information of the corpus. The parser can work with any tag set and grammar, and, using the same input bracketing, we obtain corpora that share partial syntactic structure.
A deterministic dependency parser for Japanese
We present a rule-based, deterministic dependency parser for Japanese. It was implemented in C++, using object classes that reflect linguistic concepts and thus facilitate the transfer of linguistic intuitions into code. The parser first chunks morphemes into one-word phrases and then parses from the right to the left. The average parsing accuracy is 83.6%.
A new approach to the translating telephone
The Translating Telephone has been a major goal of speech translation for many years. Previous approaches have attempted to work from limited-domain, fully-automatic translation towards broad-coverage, fully-automatic translation. We are approaching the problem from a different direction: starting with a broad-coverage but not fully-automatic system, and working towards full automation. We believe that working in this direction will provide us with better feedback, by observing users and collecting language data under realistic conditions, and thus may allow more rapid progress towards the same ultimate goal. Our initial approach relies on the wide-spread availability of Internet connections and web browsers to provide a user interface. We describe our initial work, which is an extension of the Diplomat wearable speech translator.
A method of evaluation of the quality of translated text
In this paper, I present a method for the evaluation of the quality of translated text, namely a translation ability index, which shows the relative position of the translation ability of a Machine Translation (MT) system on a measurement scale. The measurements are made with an analysis ratio, which permits absolute measurements, and a conversion and magnitude scale (CGMS), which indicates the relation of the machine-translated text to the same text translated by a professional human translator. The translation ability index has been confirmed by the evaluation of two MT systems, which serves as a clear demonstration of the method.
Quantitative evaluation of machine translation using two-way MT
One of the most important issues in the field of machine translation is the evaluation of translated sentences. This paper proposes a quantitative method of evaluation for machine translation systems. The method is as follows. First, an example sentence in Japanese is machine translated into English using several Japanese-English machine translation systems. Second, the output English sentences are machine translated into Japanese using several English-Japanese machine translation systems (different from the Japanese-English systems). Then, each output Japanese sentence is compared with the original Japanese sentence in terms of word identification, correctness of modification, syntactic dependency, and parataxis. An average score is calculated, and this becomes the total evaluation of the machine translation of the sentence. From this two-way machine translation and the calculation of the score, we can quantitatively evaluate English machine translation. For the present study, we selected 100 Japanese sentences from the abstracts of scientific articles. Each of these sentences has an English translation performed by a human. Approximately half of these sentences are evaluated and the results are given. In addition, a comparison of human and machine translations is performed and the trade-off between the two methods of translation is discussed.
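The round-trip procedure can be sketched as follows, with the MT systems stubbed out as functions and simple word overlap standing in for the paper's four comparison criteria (word identification, modification, dependency, parataxis):

```python
# Sketch of two-way (round-trip) MT evaluation: translate J->E with one
# system, translate the result E->J back with a different system, and score
# how much of the original sentence survives. The overlap score below is a
# crude stand-in for the paper's linguistically informed comparison.

def overlap_score(original, round_trip):
    """Fraction of the original's words preserved in the round-trip output."""
    orig = original.split()
    back = set(round_trip.split())
    return sum(1 for w in orig if w in back) / max(len(orig), 1)

def two_way_evaluate(sentence, je_translate, ej_translate):
    """Round-trip a sentence through two MT systems and score the result."""
    english = je_translate(sentence)   # Japanese -> English system
    back = ej_translate(english)       # a *different* English -> Japanese system
    return overlap_score(sentence, back)
```

Using distinct systems for the two directions, as the paper stipulates, avoids the trivial case where a system's own inverse hides its errors.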
Task-based evaluation for machine translation
Jennifer B. Doyon
Kathryn B. Taylor
John S. White
In an effort to reduce the subjectivity, cost, and complexity of evaluation methods for machine translation (MT) and other language technologies, task-based assessment is examined as an alternative to metrics based on human judgments about MT, i.e., the previously applied adequacy, fluency, and informativeness measures. For task-based evaluation strategies to be employed effectively to evaluate language-processing technologies in general, certain key elements must be known. Most importantly, the objectives the technology's use is expected to accomplish must be known, the objectives must be expressed as tasks that accomplish them, and successful outcomes must then be defined for the tasks. For MT, task-based evaluation is correlated with a scale of tasks, and has as its premise that certain tasks are more forgiving of errors than others. In other words, a poor translation may suffice to determine the general topic of a text, but may not permit accurate identification of the participants or the specific event. The ordering of tasks according to their tolerance for errors, as determined by the actual task outcomes provided in this paper, is the basis of a scale and repeatable process by which to measure MT systems that has advantages over previous methods.
Experiment report of a commercial machine translation in a manufacturing industry domain
The aim of this paper is to report on an experiment using commercial Machine Translation (CMT) software in a manufacturing company in the UK, with particular reference to Japanese/English machine translation. It presents the main difficulties involved in the translation of industrial documents from Japanese to English, and discusses how the productivity and quality of translation can be improved through the use of commercial Machine Translation (MT) software. The data is empirical and is presented from the translators' point of view in a manufacturing factory. The survey focuses on a manufacturing organisation which does not have the resources needed to develop its own MT system. The globalization of the Japanese manufacturing industry makes it necessary for the translation of manuals and other documents to be as rapid as possible. In this paper, linguistic features of both English and Japanese are discussed on the basis of the evaluation experiment in order to draw up writing rules for members of staff at Makita Manufacturing Europe. The paper also discusses the viewpoint of British engineers reading the translated manuals.
Collection of dictionary data through Internet translation service
We have developed an Internet translation service, which we began to provide in 1997 for English-to-Japanese translation and in 1998 for Japanese-to-English. In this service, users send a translation request from a web page and receive by e-mail the result of the translation output by Toshiba's machine translation system. As in other similar services, users can specify English-Japanese word pairs (dictionary data) when making a translation request. What distinguishes our service from others is that our system constructs users' own dictionaries on the server and helps them with this work by extracting words which the system expects will improve its translation quality if included in the dictionaries. With this function, users can efficiently add new word pairs to upgrade their own dictionaries when requesting re-translation. The dictionary data thus obtained from users can also be utilized to improve the system dictionary on the server.
Term Builder: a lexical knowledge acquisition tool for the Logos machine translation system
Logos 8, the next generation of the Logos Machine Translation (MT) system, is a client-server application which realizes the latest advances in system design and architecture. A multi-user, networkable application, Logos 8 allows Internet or intranet use of its applications, with client interfaces that communicate with dictionaries and translation servers through a common gateway. The new Logos 8 technology is based on a relational database for storage and organization of the lexical data. In this paper, we present Term Builder, the lexical knowledge acquisition tool developed for Logos 8. The new automatic coding functionality within Term Builder significantly improves the process of acquiring new lexicons for MT and other applications.
Computer-aided translation tools for Russian, Ukrainian, and English
Michael S. Blekhman
The paper presents a new development by Lingvistica ‘98 Inc.: the PG-PARS computer-assisted translation system and a series of professional bidirectional dictionaries. PG-PARS was designed as a Windows 95 and Windows 98 application supporting English-Russian-English and English-Ukrainian-English dictionaries. PG-PARS dictionaries are all bidirectional, i.e. they include, for example, both an English-Russian and a Russian-English part. Each part has its own alphabetical index of entries, displayed on a separate tab in the PG-PARS main window. Word entries are displayed for both parts of the dictionary, and translations of translations can be found easily, which is useful for a professional translator. One of the main features of PG-PARS is the Smart search mode, based on morphological analysis of the Slavic and English words looked up in the dictionaries. This is especially beneficial if the user is not a native speaker of Russian or Ukrainian. Another important feature is the Selection option, which allows the user to mark a portion of a dictionary entry and paste it into the text. The presentation will show professional applications of the PG-PARS system for translating Russian and English texts.
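The idea behind a morphology-aware lookup such as Smart search can be sketched as follows. This suffix-stripping sketch is an assumption for illustration, not the actual PG-PARS algorithm, which performs full morphological analysis: an inflected form is progressively shortened until a dictionary headword matches.

```python
def smart_lookup(word, headwords):
    """Find a dictionary headword for an inflected form by trying
    progressively shorter prefixes of the word (crude stand-in for
    real morphological analysis)."""
    for end in range(len(word), 0, -1):
        stem = word[:end]
        if stem in headwords:
            return stem
    return None

print(smart_lookup("translations", {"translation", "memory"}))  # translation
```

Real morphological analysis is essential for Slavic languages, where a single lemma can have dozens of inflected forms.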
In this paper, we propose a camera system which translates Japanese texts in a scene. The system is portable and consists of four components: digital camera, character image extraction process, character recognition process, and translation process. The system extracts character strings from a region which a user specifies, and translates them into English.
A new diagnostic system for J-E translation ILTS by global matching algorithm and POST parser
A new diagnostic system has been developed for an interactive, template-structured intelligent language tutoring system (ILTS) for Japanese-English translation, in which an efficient heaviest common sequence (HCS) matching algorithm and a ‘part-of-speech tagged (POST) parser’ play a key role. It is implemented by exploiting the system template, a complex transition network comprising both model (correct) translations and many typical erroneous translations characteristic of non-native beginners, all collected and extracted from the translations of about 200 monitors. By selecting, from among the many candidate paths in the system template, the path sharing a heaviest common sequence with the student’s input translation as the best-matched sentence, the template structure of the diagnostic system allows the potentially complicated bug-finding processes in natural language to be implemented by a much simpler and more efficient HCS string matching algorithm. To improve the precision of the parser, we have developed a ‘probabilistic POST parser’ in which part-of-speech ambiguity is eliminated by manually pre-assigning POS tags to all words in the potentially correct paths of the template. Combining the template-based diagnostic system and the parser, we found that the ILTS is capable of providing adequate diagnostic messages and a tutoring strategy with appropriate comments after analyzing the translated sentences keyed in by students.
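When all words carry equal weight, HCS matching reduces to the classical longest-common-subsequence dynamic program; the sketch below illustrates selecting a best-matched template path on that basis. The template paths are invented for illustration and are not from the paper's system.

```python
def lcs_length(a, b):
    """Longest common subsequence length via dynamic programming
    (the unweighted special case of heaviest common sequence)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def best_matched_path(student, template_paths):
    """Pick the template path sharing the longest common subsequence
    with the student's input translation."""
    return max(template_paths, key=lambda path: lcs_length(student, path))

student = "he go to school".split()
paths = ["he goes to school".split(), "she went home".split()]
print(" ".join(best_matched_path(student, paths)))  # he goes to school
```

Diagnosis then proceeds by comparing the student's input against the selected path, whose pre-assigned error annotations and POS tags are already known.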
Linking translation memories with example-based machine translation
The paper reports on experiments comparing the translation output of three corpus-based MT systems: a string-based translation memory (STM), a lexeme-based translation memory (LTM), and the example-based machine translation (EBMT) system EDGAR. We use a fully automatic evaluation method to compare the output of each system and discuss the results. We also investigate the benefits of linking different MT strategies such as TM systems and EBMT systems.
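A string-based TM, the simplest of the three systems compared, can be sketched as a fuzzy lookup over stored source-target pairs. The similarity measure, threshold, and memory entries below are illustrative assumptions, not details of the STM system evaluated in the paper.

```python
import difflib

def tm_lookup(source, memory, threshold=0.7):
    """Return the stored translation whose source segment is most
    similar to the input, if similarity reaches the threshold."""
    best, best_score = None, threshold
    for src, tgt in memory.items():
        score = difflib.SequenceMatcher(None, source, src).ratio()
        if score >= best_score:
            best, best_score = tgt, score
    return best

memory = {"Press the start button": "Drücken Sie die Starttaste"}
print(tm_lookup("Press the start button", memory))  # Drücken Sie die Starttaste
```

A lexeme-based TM would match on lemmatized tokens instead of raw strings, and an EBMT system would additionally recombine matched fragments, which is where linking the strategies becomes attractive.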
Towards an interlingual treatment of modality
Modality is an important but complex linguistic phenomenon that concerns all levels of language production. NLP research has largely refrained from addressing this subject, but we show that many errors in machine translation systems are directly related to the absence of a proper interlingual treatment of modality. We outline the traces of such a modal interlingua by presenting the “Module of Modality”, parts of which are currently being implemented in a Japanese-English system.
Resolving category ambiguity of non-text symbols in Mandarin text
Automatic domain recognition for machine translation
Elke D. Lange
This paper describes an ongoing project whose goal is to improve machine translation quality by increasing knowledge about the text to be translated. A basic piece of such knowledge is the domain, or subject field, of the text. When this is known, it is possible to improve meaning selection appropriately for that domain. Our current effort consists of automating both the recognition of a text’s domain and the assignment of domain-specific translations. Results of our implementation show that the approach of using the terminology categorization already existing in the machine translation system is very promising.
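Domain recognition from existing terminology categorization can be sketched as counting, per domain, how many of that domain's characteristic terms occur in the text and choosing the highest-scoring domain. The term lists below are invented for illustration and do not reflect the system's actual subject-field codes.

```python
from collections import Counter

def recognize_domain(words, domain_terms):
    """Score each domain by the number of its characteristic terms
    appearing in the text; return the best-scoring domain, or None."""
    counts = Counter()
    for w in words:
        for domain, terms in domain_terms.items():
            if w in terms:
                counts[domain] += 1
    return counts.most_common(1)[0][0] if counts else None

domain_terms = {
    "medicine": {"dose", "patient", "symptom"},
    "computing": {"server", "disk", "compiler"},
}
text = "the patient received a second dose".split()
print(recognize_domain(text, domain_terms))  # medicine
```

Once a domain is recognized, the translation lexicon can prefer that domain's meaning selections, e.g. rendering an ambiguous term with its medical rather than computing sense.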
A multilevel framework for incremental development of MT systems
We describe a Machine Translation framework aimed at the rapid development of large-scale, robust machine translation systems for assimilation purposes, where the MT system is incorporated as one of the tools in an analyst’s workstation. The multilevel architecture of the system is designed to enable early delivery of functional translation capabilities and incremental improvement of quality. A crucial aspect of the framework is the careful articulation of a software architecture, a linguistic architecture, and an incremental development process for linguistic knowledge.