Conference of the Association for Machine Translation in the Americas (2002)
- Proceedings of the 5th Conference of the Association for Machine Translation in the Americas: Tutorial Descriptions 2 papers
- Proceedings of the 5th Conference of the Association for Machine Translation in the Americas: Technical Papers 18 papers
- Proceedings of the 5th Conference of the Association for Machine Translation in the Americas: User Studies 3 papers
- Proceedings of the 5th Conference of the Association for Machine Translation in the Americas: System Descriptions 9 papers
Proceedings of the 5th Conference of the Association for Machine Translation in the Americas: Technical Papers
Machine Translation of minority languages presents unique challenges, including the paucity of bilingual training data and the unavailability of linguistically-trained speakers. This paper focuses on a machine learning approach to transfer-based MT, where data in the form of translations and lexical alignments are elicited from bilingual speakers, and a seeded version-space learning algorithm formulates and refines transfer rules. A rule-generalization lattice is defined based on LFG-style f-structures, permitting generalization operators in the search for the most general rules consistent with the elicited data. The paper presents these methods and illustrates examples.
In this paper we present a model for the future use of Machine Translation (MT) and Computer Assisted Translation. In order to accommodate the future needs in middle value translations, we discuss a number of MT techniques and architectures. We anticipate a hybrid environment that integrates data- and rule-driven approaches where translations will be routed through the available translation options and consumers will receive accurate information on the quality, pricing and time implications of their translation choice.
We present a new approach to the problem of aligning English and Chinese sentences in a bilingual corpus based on adaptive learning. While using length information alone produces surprisingly good results for aligning bilingual French and English sentences with success rates well over 95%, it does not fare as well for the alignment of English and Chinese sentences. The crux of the problem lies in greater variability of lengths and match types of the matched sentences. We propose to cope with such variability via a two-pass scheme under which model parameters can be learned from the data at hand. Experiments show that under the approach bilingual English-Chinese texts can be aligned effectively across diverse domains, genres and translation directions with accuracy rates approaching 99%.
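For readers unfamiliar with the length-only baseline that this abstract refers to, the following is a minimal sketch of length-based sentence alignment by dynamic programming. It is not the paper's adaptive two-pass scheme. The bead types, the penalty values, the `ratio` parameter and the function name `align_by_length` are all illustrative assumptions.

```python
# Illustrative bead types (source sentences, target sentences) with
# made-up penalties; a real system would estimate these from data.
BEADS = {(1, 1): 0.0, (1, 0): 10.0, (0, 1): 10.0, (2, 1): 2.0, (1, 2): 2.0}


def align_by_length(src_lens, tgt_lens, ratio=1.0):
    """Align sentences using only their lengths.

    src_lens, tgt_lens: sentence lengths (e.g. in characters).
    ratio: assumed expected target/source length ratio.
    Returns a list of (source_indices, target_indices) beads.
    """
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    # Forward pass: extend every reachable state by each bead type.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s = sum(src_lens[i:ni])
                t = sum(tgt_lens[j:nj])
                c = cost[i][j] + abs(s * ratio - t) + penalty
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (di, dj)

    # Trace back from the final state to recover the cheapest bead sequence.
    beads = []
    i, j = n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        beads.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    beads.reverse()
    return beads


# Toy example: source sentences 1 and 2 were merged into one target sentence.
alignment = align_by_length([20, 30, 10, 25], [21, 41, 24])
```

In the toy example the second and third source sentences are grouped into a 2-1 bead because their combined length matches the second target sentence. A fixed cost function like this is exactly what tends to break down when length ratios and match types vary widely, which is the situation the abstract describes for English-Chinese text.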
The frequent occurrence of divergences (structural differences between languages) presents a great challenge for statistical word-level alignment. In this paper, we introduce DUSTer, a method for systematically identifying common divergence types and transforming an English sentence structure to bear a closer resemblance to that of another language. Our ultimate goal is to enable more accurate alignment and projection of dependency trees in another language without requiring any training on dependency-tree data in that language. We present an empirical analysis comparing the complexities of performing word-level alignments with and without divergence handling. Our results suggest that our approach facilitates word-level alignment, particularly for sentence pairs containing divergences.
Text prediction is a form of interactive machine translation that is well suited to skilled translators. In recent work it has been shown that simple statistical translation models can be applied within a user-modeling framework to improve translator productivity by over 10% in simulated results. For the sake of efficiency in making real-time predictions, these models ignore the alignment relation between source and target texts. In this paper we introduce a new model that captures fuzzy alignments in a very simple way, and show that it gives modest improvements in predictive performance without significantly increasing the time required to generate predictions.
Maximum entropy (ME) models have been successfully applied to many natural language problems. In this paper, we show how to integrate ME models efficiently within a maximum likelihood training scheme of statistical machine translation models. Specifically, we define a set of context-dependent ME lexicon models and we present how to perform an efficient training of these ME models within the conventional expectation-maximization (EM) training of statistical translation models. Experimental results are also given in order to demonstrate how these ME models improve the results obtained with the traditional translation models. The results are presented by means of alignment quality comparing the resulting alignments with manually annotated reference alignments.
In the IBM LMT Machine Translation (MT) system, a built-in strategy provides lexical coverage of a particular subset of words that are not listed in its bilingual lexicons. The recognition and coding of these words and their transfer generation is based on a set of derivational morphological rules. A new utility extends unfound words of this type in an LMT-compatible format in an auxiliary bilingual lexical file to be subsequently merged into the core lexicons. What characterizes this approach is the use of morphological, semantic, and syntactic features for both analysis and transfer. The auxiliary lexical file (ALF) has to be revised before a merge into the core lexicons. This utility integrates a linguistics-based analysis and transfer rules with a corpus-based method of verifying or falsifying linguistic hypotheses against extensive document translation, which in addition yields statistics on frequencies of occurrence as well as local context.
One of the limitations of translation memory systems is that the smallest translation units currently accessible are aligned sentential pairs. We propose an example-based machine translation system which uses a ‘phrasal lexicon’ in addition to the aligned sentences in its database. These phrases are extracted from the Penn Treebank using the Marker Hypothesis as a constraint on segmentation. They are then translated by three on-line machine translation (MT) systems, and a number of linguistic resources are automatically constructed which are used in the translation of new input. We perform two experiments on testsets of sentences and noun phrases to demonstrate the effectiveness of our system. In so doing, we obtain insights into the strengths and weaknesses of the selected on-line MT systems. Finally, like many example-based machine translation systems, our approach also suffers from the problem of ‘boundary friction’. Where the quality of resulting translations is compromised as a result, we use a novel, post hoc validation procedure via the World Wide Web to correct imperfect translations prior to their being output to the user.
This paper describes a novel approach to handling translation divergences in a Generation-Heavy Hybrid Machine Translation (GHMT) system. The translation divergence problem is usually reserved for Transfer and Interlingual MT because it requires a large combination of complex lexical and structural mappings. A major requirement of these approaches is the accessibility of large amounts of explicit symmetric knowledge for both source and target languages. This limitation renders Transfer and Interlingual approaches ineffective in the face of structurally-divergent language pairs with asymmetric resources. GHMT addresses the more common form of this problem, source-poor/target-rich, by fully exploiting symbolic and statistical target-language resources. This non-interlingual non-transfer approach is accomplished by using target-language lexical semantics, categorial variations and subcategorization frames to overgenerate multiple lexico-structural variations from a target-glossed syntactic dependency of the source-language sentence. The symbolic overgeneration, which accounts for different possible translation divergences, is constrained by a statistical target-language model.
This paper describes our ongoing project “Korean-Chinese Machine Translation System”. The main knowledge of our system is verb patterns. Each verb can have several meanings and each meaning of a verb is represented by a verb pattern. A verb pattern consists of a source language pattern part for the analysis and the corresponding target language pattern part for the generation. Each pattern part, according to the degree of generality, contains lexical or semantic information for the arguments or adjuncts of each verb meaning. In this approach, accurate analysis can directly lead to natural and correct generation. Furthermore, as the transfer mainly depends upon verb patterns, the translation rate is expected to rise as the number of verb patterns grows.
Despite the exciting work accomplished over the past decade in the field of Statistical Machine Translation (SMT), we are still far from the point of being able to say that machine translation fully meets the needs of real-life users. In a previous study, we have shown how an SMT engine could benefit from terminological resources, especially when translating texts very different from those used to train the system. In the present paper, we discuss the opening of SMT to examples automatically extracted from a Translation Memory (TM). We report results on a fair-sized translation task using the database of a commercial bilingual concordancer.
We present a classification approach to building an English-Korean machine translation (MT) system. We attempt to build a word-based MT system from scratch using a set of parallel documents, online dictionary queries, and monolingual documents on the web. In our approach, the MT problem is decomposed into two sub-problems: the word selection problem and the word ordering problem for the selected words. In this paper, we will focus on the word selection problem and discuss some preliminary results.
One of the problems facing translation systems that automatically extract transfer mappings (rules or examples) from bilingual corpora is the trade-off between contextual specificity and general applicability of the mappings, which typically results in conflicting mappings without distinguishing context. We present a machine-learning approach to choosing between such mappings, using classifiers that, in effect, selectively expand the context for these mappings using features available in a linguistic representation of the source language input. We show that using these classifiers in our machine translation system significantly improves the quality of the translated output. Additionally, the set of distinguishing features selected by the classifiers provides insight into the relative importance of the various linguistic features in choosing the correct contextual translation.
We present a new method for aligning sentences with their translations in a parallel bilingual corpus. Previous approaches have generally been based either on sentence length or word correspondences. Sentence-length-based methods are relatively fast and fairly accurate. Word-correspondence-based methods are generally more accurate but much slower, and usually depend on cognates or a bilingual lexicon. Our method adapts and combines these approaches, achieving high accuracy at a modest computational cost, and requiring no knowledge of the languages or the corpus beyond division into words and sentences.
This paper describes the results of a feasibility study which focused on deriving semantic networks from descriptive texts using controlled language. The KANT system [3,6] was used to analyze input paragraphs, producing sentence-level interlingua representations. The interlinguas were merged to construct a paragraph-level representation, which was used to create a semantic network in Conceptual Graph (CG) format. The interlinguas are also translated (using the KANTOO generator) into OWL statements for entry into the Ontology Works electrical power factbase. The system was extended to allow simple querying in natural language.
The existence of a phrase in a large monolingual corpus is very useful information, and so is its frequency. We introduce an alternative approach to automatic translation of phrases/sentences that operationalizes this observation. We use a statistical machine translation system to produce alternative translations and a large monolingual corpus to (re)rank these translations. Our results show that this combination yields better translations, especially when translating out-of-domain phrases/sentences. Our approach can also be used to automatically construct parallel corpora from monolingual resources.
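The reranking idea in this abstract can be illustrated with a minimal sketch: count how often each candidate translation occurs verbatim in a monolingual corpus and sort the candidates by that count. The function names, the whitespace tokenization and the toy corpus are assumptions made for illustration; the abstract does not specify the paper's actual scoring.

```python
def phrase_frequency(phrase, corpus):
    """Count contiguous, case-insensitive occurrences of phrase in corpus."""
    target = phrase.lower().split()
    k = len(target)
    if k == 0:
        return 0
    total = 0
    for sentence in corpus:
        tokens = sentence.lower().split()
        total += sum(
            1 for i in range(len(tokens) - k + 1) if tokens[i:i + k] == target
        )
    return total


def rerank(candidates, corpus):
    """Order candidates by descending corpus frequency.

    sorted() is stable, so candidates with equal counts keep the order
    in which the (hypothetical) translation system produced them.
    """
    return sorted(candidates, key=lambda c: -phrase_frequency(c, corpus))


# Toy monolingual corpus and two hypothetical candidate translations.
corpus = [
    "he made a strong cup of tea",
    "she likes strong tea",
    "strong tea is popular",
]
ranked = rerank(["powerful tea", "strong tea"], corpus)
```

Here "strong tea" occurs twice in the toy corpus and "powerful tea" never does, so the attested phrase moves to the top. Breaking ties by the original order keeps the translation system's own preference as a fallback.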
To overcome the resource scarcity bottleneck in corpus-based translation knowledge acquisition research, this paper takes an approach of semi-automatically acquiring domain-specific translation knowledge from collections of bilingual news articles on WWW news sites. It presents results of applying standard co-occurrence-frequency-based techniques for estimating bilingual term correspondences from parallel corpora to relevant article pairs automatically collected from WWW news sites. The experimental evaluation results are very encouraging and show that many useful bilingual term correspondences can be efficiently discovered with little human intervention from relevant article pairs on WWW news sites.
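As an illustration of what a co-occurrence-frequency-based technique can look like, the sketch below scores term pairs with the Dice coefficient over a set of paired articles. The choice of the Dice coefficient, the function name `dice_scores` and the toy term lists are assumptions; the abstract does not say which measures the paper uses.

```python
from collections import Counter
from itertools import product


def dice_scores(article_pairs):
    """Score (source_term, target_term) pairs by the Dice coefficient.

    article_pairs: iterable of (source_terms, target_terms), one entry
    per pair of related articles. Each term counts once per article.
    """
    src_freq, tgt_freq, joint = Counter(), Counter(), Counter()
    for src_terms, tgt_terms in article_pairs:
        s, t = set(src_terms), set(tgt_terms)
        src_freq.update(s)
        tgt_freq.update(t)
        joint.update(product(s, t))
    return {
        (a, b): 2 * c / (src_freq[a] + tgt_freq[b])
        for (a, b), c in joint.items()
    }


# Toy data: made-up English terms paired with made-up romanized terms.
pairs = [
    (["election", "vote"], ["senkyo", "tohyo"]),
    (["election", "economy"], ["senkyo", "keizai"]),
    (["economy"], ["keizai"]),
]
scores = dice_scores(pairs)
best_for_vote = max(
    (b for (a, b) in scores if a == "vote"),
    key=lambda b: scores[("vote", b)],
)
```

Terms that always appear in the same article pairs score 1.0, while pairs that co-occur only some of the time score lower, so a threshold or a best-match rule can propose candidate correspondences for human review.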
The cumulative effort over the past few decades that has gone into developing linguistic resources for tasks ranging from machine readable dictionaries to translation systems is enormous. Such effort is prohibitively expensive for languages outside the (largely) European family. The possibility of building such resources automatically by accessing electronic corpora of such languages is therefore of great interest to those involved in studying these ‘new’ or ‘lesser known’ languages. The main stumbling block to applying these data-driven techniques directly is that most of them require large corpora rarely available for such ‘new’ languages. This paper describes an attempt at setting up a bootstrapping agenda to exploit the scarce corpus resources that may be available at the outset to a researcher concerned with such languages. In particular it reports on results of an experiment to use state-of-the-art data-driven techniques for building linguistic resources for Sinhala, a non-European language with virtually no electronic resources.
Proceedings of the 5th Conference of the Association for Machine Translation in the Americas: User Studies
This paper describes the process of implementing a machine translation system (MT system) and the problems and pitfalls encountered within this process at CLS Corporate Language Services AG, a language solutions provider for the Swiss financial services industry, in particular UBS AG and Zurich Financial Services. The implementation was based on the perceived requirements of large organizations, which is why the focus was more on practical rather than academic aspects. The paper can be roughly divided into three parts: (1) definition of the implementation process, co-ordination and execution, (2) implementation plan and customer/user management, (3) monitoring of the MT system and related maintenance after going live.
Most large companies are very good at “getting the message out” – publishing reams of announcements and documentation to their employees and customers. More challenging by far is “getting the message in” – ensuring that these messages are read, understood, and acted upon by the recipients. This paper describes NCR Corporation’s experience with the selection and implementation of a machine translation (MT) system in the Global Learning division of Human Resources. The author summarizes NCR’s vision for the use of MT, the competitive “fly-off” evaluation process he conducted in the spring of 2000, the current MT production environment, and the reactions of the MT users. Although the vision is not yet fulfilled, progress is being made. The author describes NCR’s plans to extend its current MT architecture to provide real-time translation of web pages and other intranet resources.
For over ten years, Ford Vehicle Operations has utilized an Artificial Intelligence (AI) system to assist in the creation and maintenance of process build instructions for our vehicle assembly plants. This system, known as the Direct Labor Management System, utilizes a restricted subset of English called Standard Language as a tool for the writing of process build instructions for the North American plants. The expansion of DLMS beyond North America as part of the Global Study Process Allocation System (GSPAS) required us to develop a method to translate these build instructions from English to other languages. This Machine Translation process, developed in conjunction with SYSTRAN, has allowed us to develop a system to automatically translate vehicle assembly build instructions for our plants in Europe and South America.
Proceedings of the 5th Conference of the Association for Machine Translation in the Americas: System Descriptions
This paper presents a generalized description of the characteristics and implications of two processes that enable Fluent Machines’ machine translation system, called EliMT (a term coined by Dr. Jaime Carbonell after the system’s inventor, Eli Abir). These two processes are (1) an automated cross-language database builder and (2) an n-gram connector.
LogoMedia Corporation offers a new multilingual machine translation system – LogoMedia Translate – based upon smaller applications, called “applets”, designed to perform a small group of related tasks and to provide services to other applets. Working together, applets provide comprehensive solutions that are more effective, easier to implement, and less costly to maintain. Version 2, released in 2002, provides a single set of cooperating user interfaces and translation engines from 6 vendors for English to and from Chinese (Simplified and Traditional), Japanese, Korean, French, Italian, German, Spanish, Portuguese, Russian, Polish, and Ukrainian.
Any-Language Communications has developed a novel semantics-oriented pre-market prototype system, based on the Theory of Universal Grammar, that uses the innate relationships of the words in a sensible sentence (the natural intelligence) to determine the true contextual meaning of all the words. The system is built on a class/category structure of language concepts and includes a weighted inheritance system, a number language word conversion, and a tailored genetic algorithm to select the best of the possible word meanings. By incorporating all of the language information within the dictionaries, the same semantic processing code is used to interpret any language. This approach is suitable for machine translation (MT), sophisticated text mining, and artificial intelligence applications. An MT system has been tested with English, French, German, Hindi, and Russian. Sentences for each of those languages have been successfully interpreted and proper translations generated.
Pre-market prototype - to be available commercially in the second or third quarter of 2003.
This paper presents a description of the well-known family of machine translation systems, PARS. PARS was developed in the USSR as long ago as 1989, and since then it has come a long and difficult way from a mainframe-based, somewhat bulky system to a modern PC-oriented product. At the same time, we understand full well that, like any machine translation software, PARS is not artificial intelligence, and it is only capable of generating what is called “draft translation”. It is certainly useful, but can by no means be considered a substitute for a human translator whenever high-quality translation is required.
MSR-MT is an advanced research MT prototype that combines rule-based and statistical techniques with example-based transfer. This hybrid, large-scale system is capable of learning all its knowledge of lexical and phrasal translations directly from data. MSR-MT has undergone rigorous evaluation showing that, trained on a corpus of technical data similar to the test corpus, its output surpasses the quality of best-of-breed commercial MT systems.
NESPOLE! is a speech-to-speech machine translation research system designed to provide fully functional speech-to-speech capabilities within real-world settings of common users involved in e-commerce applications. The project is funded jointly by the European Commission and the US NSF. The NESPOLE! system uses a client-server architecture to allow a common user, who is browsing web-pages on the internet, to connect seamlessly in real-time to an agent of the service provider, using a video-conferencing channel and with speech-to-speech translation services mediating the conversation. Shared web pages and annotated images supported via a Whiteboard application are available to enhance the communication.
We will present the KANTOO machine translation environment, a set of software servers and tools for multilingual document production. KANTOO includes modules for source language analysis, target language generation, source terminology management, target terminology management, and knowledge source development (see Figure 1).
The paper discusses a number of important issues in speech-to-speech translation, including the key issue of level of integration of all components of such systems, based on our experience in the field since 1990. Section 1 discusses dimensions of the spoken translation problem, while current and near term approaches to spoken translation are treated in Sections 2 and 3. Section 2 describes our current expectation-based, speaker-independent, two-way translation systems, and Section 3 presents the advanced translation engine under development for handling spontaneous dialogs.