Proceedings of the Third Conference of the Association for Machine Translation in the Americas: Technical Papers
We present two problems for statistically extracting bilingual lexicons: (1) How can noisy parallel corpora be used? (2) How can non-parallel yet comparable corpora be used? We describe our own work and contribution in relaxing the constraint of using only clean parallel corpora. DKvec is a method for extracting bilingual lexicons from noisy parallel corpora, based on the arrival distances of words in those corpora. Using DKvec on noisy parallel corpora in English/Japanese and English/Chinese, our evaluations show 55.35% precision from a small corpus and 89.93% precision from a larger corpus. Our major contribution is in the extraction of bilingual lexicons from non-parallel corpora. We present a first such result in this area, from a new method, Convec. Convec is based on context information of a word to be translated.
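As a rough illustration of the "arrival distances" mentioned in the abstract above, the following sketch computes the gaps between successive occurrences of a word in a tokenized text. The function name and the whitespace tokenization are assumptions made for this example; the abstract does not say how DKvec compares such vectors across languages, so that step is not shown.

```python
def arrival_distances(tokens, word):
    """Return the gaps between consecutive positions of `word` in `tokens`.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    # Positions at which the word occurs in the token sequence.
    positions = [i for i, token in enumerate(tokens) if token == word]
    # Distance from each occurrence to the next one.
    return [later - earlier for earlier, later in zip(positions, positions[1:])]


if __name__ == "__main__":
    text = "a b a c c a b".split()
    print(arrival_distances(text, "a"))  # [2, 3]
```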
This article reviews some recently invented methods for automatically extracting translation lexicons from parallel texts. The accuracy of these methods has been significantly improved by exploiting known properties of parallel texts and of particular language pairs. The state of the art has advanced to the point where non-compositional compounds can be automatically identified with high reliability, and their translations can be found. Most importantly, all of these methods can be smoothly integrated into the usual workflow of MT system developers. Semi-automatic MT lexicon construction is likely to be more efficient and more accurate than either fully automatic or fully manual methods alone.
The MT engine of the JANUS speech-to-speech translation system is designed around four main principles: 1) an interlingua approach that allows the efficient addition of new languages, 2) the use of semantic grammars that yield low cost high quality translations for limited domains, 3) modular grammars that support easy expansion into new domains, and 4) efficient integration of multiple grammars using multi-domain parse lattices and domain re-scoring. Within the framework of the C-STAR-II speech-to-speech translation effort, these principles are tested against the challenge of providing translation for a number of domains and language pairs with the additional restriction of a common interchange format.
This paper describes a refinement to our procedure for porting lexical conceptual structure (LCS) into new languages. Specifically we describe a two-step process for creating candidate thematic grids for Mandarin Chinese verbs, using the English verb heading the VP in the subdefinitions to separate senses, and roughly parsing the verb complement structure to match thematic structure templates. We accomplished a substantial reduction in manual effort, without substantive loss. The procedure is part of a larger process of creating a usable lexicon for interlingual machine translation from a large on-line resource with both too much and too little information.
The TTL (Translation Template Learner) algorithm learns lexical-level correspondences between two translation examples by using analogical reasoning. The sentences used as translation examples have similar and different parts in the source language, which must correspond to the similar and different parts in the target language. These correspondences are therefore learned as translation templates. The learned translation templates are used in the translation of other sentences. However, we need to assign confidence factors to these translation templates in order to rank translation results according to the assigned confidence factors. This paper proposes a method for assigning confidence factors to translation templates learned by the TTL algorithm. Training data is used for collecting statistical information that will be used in the confidence factor assignment process. In this process, each template is assigned a confidence factor according to the statistical information obtained from the training data. Furthermore, some template combinations are also assigned confidence factors in order to eliminate certain combinations that result in bad translations.
The speech-to-speech translation system Verbmobil integrates deep and shallow analysis modules that produce linguistic representations in parallel. Thus, the input representations for the transfer module differ with respect to their depth and quality. This gives rise to two problems: (i) the transfer database has to be adjusted according to input quality, and (ii) translations produced have to be ranked with respect to their quality in order to select the most appropriate result. This paper presents an operationalized solution to both problems.
Parallel corpora are a valuable resource for machine translation, but at present their availability and utility are limited by genre- and domain-specificity, licensing restrictions, and the basic difficulty of locating parallel texts in all but the most dominant of the world’s languages. A parallel corpus resource not yet explored is the World Wide Web, which hosts an abundance of pages in parallel translation, offering a potential solution to some of these problems and unique opportunities of its own. This paper presents the necessary first step in that exploration: a method for automatically finding parallel translated documents on the Web. The technique is conceptually simple, fully language independent, and scalable, and preliminary evaluation results indicate that the method may be accurate enough to apply without human intervention.
This paper describes the integration of a Turkish generation system with the KANT knowledge-based machine translation system to produce a prototype English-Turkish interlingua-based machine translation system. These two independently constructed systems were successfully integrated within a period of two months, through development of a module which maps KANT interlingua expressions to Turkish syntactic structures. The combined system is able to translate completely and correctly 44 of 52 benchmark sentences in the domain of broadcast news captions. This study is the first known application of knowledge-based machine translation from English to Turkish, and our initial results show promise for future development.
This paper reports on an experiment in assembling a domain-specific machine translation prototype system from off-the-shelf components. The design goals of this experiment were to reuse existing components, to use machine-learning techniques for parser specialization and for transfer lexicon extraction, and to use an expressive, lexicalized formalism for the transfer component.
The Multi-Engine MT (MEMT) architecture combines the outputs of multiple MT engines using a statistical language model of the target language. It has been used successfully in a number of MT research systems, for both text and speech translation. Despite its perceived benefits, there has never been a rigorous, published, double-blind evaluation of the claim that the combined output of a MEMT system is in fact better than that of any one of the component MT engines. We report here the results of such an evaluation. The combined MEMT output is shown to indeed be better overall than the output of the component engines in a Croatian ↔ English MT system. This result is consistent in both translation directions, and between different raters.
The main problem with natural language analysis is the ambiguity found at various levels of linguistic information. Syntactic analysis with word senses is frequently not enough to resolve all ambiguities found in a sentence. Although natural languages are highly connected to real-world knowledge, most parsing architectures do not make effective use of it. In this paper, a new methodology is proposed for analyzing Turkish sentences which is heavily based on the constraints in the ontology. The methodology also makes use of the morphological marks of Turkish, which generally denote semantic properties. The analysis aims to find the propositional structure of the input utterance without constructing a deep syntactic tree; instead, it utilizes a weak interaction between syntax and semantics. The architecture constructs a specific meaning representation on top of the analyzed propositional structure.
Although the problem of full machine translation (MT) is still unsolved, computer-aided translation (CAT) is making progress. In this field we created a work environment for a monolingual translator. This package of tools generally enables a user who masters a source language to translate texts to a target language which the user does not master. The application is for the Hebrew-to-Russian case, emphasizing specific problems of these languages, but it can also be adapted for other pairs of languages. After Source Text Preparation, Morphological Analysis provides all the meanings for every word. The ambiguity problem is very serious in languages with incomplete writing, like Hebrew. But the main problem is the translation itself. The mapping of word meanings between languages is M:M, i.e., almost every source word has a number of possible translations, and almost every target word can be a translation of several words. Many methods for resolving these ambiguities propose using large databases, like dictionaries with semantic fields based on θ-theory. The amount of information needed to deal with general texts is prohibitively large. We propose here to solve ambiguities by a new method: Accumulation with Inversion and then Weighted Selection, plus Learning, using only two regular dictionaries: from source to target and from target to source languages. The method is built from a number of phases: (1) during Accumulation with Inversion, all the possible translations of every word into the target language are collected, and every one of them is translated back to the source language; (2) Selection of suitable suggestions is made by the user in the source language; this is the only manual phase; (3) Weighting of the selection’s results is performed by software and determines the most suitable translation to the target language; (4) Learning of a word’s context will provide a preferred translation in the future.
Target Text Generation is based on morphological records in the target language that are produced by the disambiguation phase. To complete the features missing for word building, we propose here a method of Features Expansion. This method is based on assumptions about feature flow through the sentence, and on the dependence of grammatical phenomena in the two languages. The workstation software combines four tools: Source Text Preparation, Morphological Analysis, Disambiguation and Target Text Generation. The application includes an elaborate windows interface, on which the user’s work is based.
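The first three phases of the disambiguation method in the abstract above (Accumulation with Inversion, user Selection, Weighted Selection) can be sketched roughly as follows. This is only an illustration under stated assumptions: the two dictionaries are toy data with placeholder target words, the function names are invented here, and the weight used (a count of the back-translations the user accepted) is an assumed stand-in, since the abstract does not specify the actual weighting scheme.

```python
def accumulate_with_inversion(word, src_to_tgt, tgt_to_src):
    """Phase 1: collect every target translation of `word` and translate
    each one back into the source language."""
    return {
        target: list(tgt_to_src.get(target, []))
        for target in src_to_tgt.get(word, [])
    }


def weighted_selection(candidates, selected_by_user):
    """Phase 3 (assumed weighting): score each target candidate by how many
    of its back-translations the user accepted in phase 2."""
    scores = {
        target: sum(1 for back in backs if back in selected_by_user)
        for target, backs in candidates.items()
    }
    best = max(scores, key=scores.get) if scores else None
    return best, scores


if __name__ == "__main__":
    # Toy dictionaries; "t_river" and "t_money" are placeholder target words.
    src_to_tgt = {"bank": ["t_river", "t_money"]}
    tgt_to_src = {
        "t_river": ["bank", "shore"],
        "t_money": ["bank", "fund", "treasury"],
    }
    candidates = accumulate_with_inversion("bank", src_to_tgt, tgt_to_src)
    # Phase 2 is manual: the user marks suitable source-language suggestions.
    best, scores = weighted_selection(candidates, {"fund", "treasury"})
    print(best, scores)
```

The point of the design is that the user only ever reads source-language words; the choice among target-language candidates is derived from those selections.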
We describe a statistical algorithm for machine translation intended to provide translations of large document collections at speeds far in excess of traditional machine translation systems, and of sufficiently high quality to perform information retrieval on the translated document collections. The model is trained from a parallel corpus and is capable of disambiguating senses of words. Information retrieval (IR) experiments on a French language dataset from a recent cross-language information retrieval evaluation yield results superior to those obtained by participants in the evaluation, and confirm the importance of word sense disambiguation in cross-language information retrieval.
The Controlled Automotive Service Language project at General Motors is combining machine translation (MT) with a variety of other language technologies into an existing translation environment. In keeping with the theme of this conference, this report elaborates on the elements of this mixture, and how they are being blended together to form a coordinated whole. The primary concept is that machine translation cannot be viewed independently of the context in which it will be used. That entire context must be prepared and managed in order to accommodate MT without undue business risk. Further, until high-quality MT is available in a much wider variety of languages, any MT production application is likely to co-exist with traditional human translation, which requires additional considerations.
EasyEnglish is an authoring tool which is part of IBM’s internal SGML editing environment, Information Development Workbench. EasyEnglish is used as a preprocessing step for machine-translating IBM manuals. Although EasyEnglish does some traditional grammar checking, its focus is on problems of structural ambiguity. Such problems include ambiguous attachment of participles, ambiguous scope in coordination, and ambiguous attachment of the agent phrase for double passives. Since we deal with truly ambiguous constructions, the system has no way of deciding on the desired interpretation; the system provides the user with a choice of rewriting suggestions, each forcing an unambiguous attachment. This paper describes the techniques for identifying structural ambiguities and generating unambiguous rewriting suggestions.
This paper addresses the problems of the so-called ‘Multiple-Subject Constructions’ in Korean-to-English and Korean-to-German MT. They are often encountered in a dialogue, so that they must be especially taken into account in designing a spoken-language translation system. They do not only raise questions about their syntactic and semantic nature but also cause such problems as structural changes in the MT. The proper treatment of these constructions is also of importance in constructing a multilingual MT-System, because they are one of the major characteristics which distinguish the so-called ‘topic-oriented’ languages such as Korean and Japanese from the ‘subject-oriented’ languages such as English and German. In this paper we employ linguistic knowledge such as subcategorization, linear precedence and lexical functions for the analysis and the transfer of the constructions of this sort. Using the proposed methods, the specific transfer-rules for each language pair can be avoided.
This paper describes a sentence alignment technique based on a machine-readable dictionary. Alignment takes place in a single pass through the text, based on the scores of matches between pairs of source and target sentences. Pairings consisting of sets of matches are evaluated using a version of the Gale-Shapley solution to the stable marriage problem. An algorithm is described which can handle N-to-1 (or 1-to-N) matches, for N ≥ 0, i.e., deletions, 1-to-1 (including scrambling), and 1-to-many matches. A simple frequency-based method for acquiring supplemental dictionary entries is also discussed. We achieve high quality alignments using available bilingual dictionaries, both for closely related language pairs (Spanish/English) and more distantly related pairs (Japanese/English).
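For readers unfamiliar with the stable marriage problem mentioned in the abstract above, here is a minimal sketch of the textbook 1-to-1 Gale-Shapley procedure applied to a matrix of sentence match scores. It is not the paper's variant, which also covers deletions and 1-to-many matches; the function name and the score matrix are assumptions made for this example, and the scores would in practice come from dictionary matches.

```python
def stable_alignment(scores):
    """Textbook Gale-Shapley on scores[i][j], the match score between source
    sentence i and target sentence j. Source sentences "propose" to targets
    in order of decreasing score. Returns sorted (source, target) pairs."""
    n_src = len(scores)
    n_tgt = len(scores[0]) if scores else 0
    # Each source sentence ranks target sentences by descending score.
    prefs = [
        sorted(range(n_tgt), key=lambda j, i=i: -scores[i][j])
        for i in range(n_src)
    ]
    next_choice = [0] * n_src   # index of the next target each source will try
    match_of_tgt = {}           # target index -> currently matched source
    free = list(range(n_src))
    while free:
        i = free.pop()
        if next_choice[i] >= n_tgt:
            continue            # no targets left: this sentence stays unaligned
        j = prefs[i][next_choice[i]]
        next_choice[i] += 1
        current = match_of_tgt.get(j)
        if current is None:
            match_of_tgt[j] = i
        elif scores[i][j] > scores[current][j]:
            match_of_tgt[j] = i     # target j prefers the new proposer
            free.append(current)
        else:
            free.append(i)          # rejected: try the next target later
    return sorted((i, j) for j, i in match_of_tgt.items())


if __name__ == "__main__":
    toy_scores = [
        [0.9, 0.1, 0.2],
        [0.8, 0.7, 0.1],
        [0.1, 0.2, 0.6],
    ]
    print(stable_alignment(toy_scores))  # [(0, 0), (1, 1), (2, 2)]
```

Note that source sentence 1 scores highest against target 0 but ends up paired with target 1, because target 0 has a stronger match with source 0; no pair would both prefer each other over their assigned partners.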
Machine-readable dictionaries have been regarded as a rich knowledge source from which various relations in lexical semantics can be effectively extracted. These semantic relations have been found useful for supporting a wide range of natural language processing tasks, from information retrieval to interpretation of noun sequences, and to resolution of prepositional phrase attachment. In this paper, we address issues related to problems in building a semantic hierarchy from machine-readable dictionaries: genus disambiguation, discovery of covert categories, and bilingual taxonomy. In addressing these issues, we will discuss the limiting factors in dictionary definitions and ways of eradicating these problems. We will also compare the taxonomy extracted in this way from a typical MRD and that of the WordNet. We argue that although the MRD-derived taxonomy is considerably flatter than the WordNet, it nevertheless provides a functional core for a variety of semantic relations and inferences which is vital in natural language processing.
It is well known that Machine Translation (MT) has not approached the quality of human translations. It has also been noted that MT research has largely ignored the work of professionals and researchers in the field of translation, and that MT might benefit from collaboration with this field. In this paper, I look at a specialized type of translation, Simultaneous Interpretation (SI), in the light of possible applications to MT. I survey the research and practice of SI, and note that explanatory analyses of SI do not yet exist. However, descriptive analyses do, arrived at through anecdotal, empirical, and model-based methods. These descriptive analyses include “techniques” humans use for interpreting, and I suggest possible ways MT might use these techniques. I conclude by noting further questions which must be answered before we can fully understand SI, and how it might help MT.
Grammatically incorrect sentences result from an unknown (possibly misspelled) word, an incorrect word order, or an omitted or redundant word. Sentences with these errors are a bottleneck for NLP systems because they cannot be parsed correctly. Human beings are able to overcome this problem (whether it occurs in spoken or written language) since they are capable of doing a semantic similarity search to find out if a similar utterance has been heard before, or a syntactic similarity search for a stored utterance that shares structural similarities with the input. If the syntactic and semantic analysis of the rest of the input can be done correctly, then a ‘gap’ that exists in the utterance can be uniquely identified. This paper presents SAUCOLA, a system based on a concept lattice that mimics human skills in resolving knowledge gaps in written language. The preliminary results show that correct stored sentences can be retrieved based on the words contained in the incorrect input sentence.
Due to the explosive growth of the WWW, very large multilingual textual resources have motivated research in Cross-Language Information Retrieval and online Web Machine Translation. In this paper, the integration of language translation and text processing systems is proposed to build a multilingual information system. A distributed English-Chinese system on the WWW is introduced to illustrate how to integrate query translation, search engines, and a web translation system. Since July 1997, more than 46,000 users have accessed our system and about 250,000 English web pages have been translated to pages in Chinese or bilingual English-Chinese versions. The average degree of user satisfaction at the document level is 67.47%.
Names can serve several purposes in the field of Machine Translation. The problems range from identifying to processing the various types of names. The paper begins with a short description of the search strategy and then continues with the classification of types into a typology. We present our findings according to degrees of translation from which we highlight clues. These clues indicate a first step towards formalization.
On December 9 1997, SYSTRAN and the AltaVista Search Network launched the first widely available, real-time, high-speed and free translation service on the Internet. This initial deployment, treated as a global experiment, has become a tremendous success. Through this service, machine translation (MT) technology has been pushed to the forefront of worldwide awareness. Besides growing media coverage, user response during the first five months has been overwhelming. This paper is a study of the user feedback from the MT developer’s perspective, addressing such questions as: Who are the users? What are their needs? What is their acceptance of MT? What types of texts are being translated? What suggestions do users offer? Finally, this paper outlines our view on opportunities and challenges, and on how to use this feedback to guide future development priorities.
We present an approach to semantic interpretation of syntactically parsed Japanese sentences that is largely parser-independent. The approach relies on a standardized parse tree format that restricts the number of syntactic configurations that the semantic interpretation rules have to anticipate. All parse trees are converted to this format prior to semantic interpretation. This setup allows us not only to apply the same set of semantic interpretation rules to output from different parsers, but also to independently develop parsers and semantic interpretation rules.
This paper presents the activities of Euromat (European Machine Translation) office in Greece, which has been functioning as a centre for Machine Translation Services for the Greek Public Sector since 1994. It describes the user profile, his/her attitude towards MT, strategies of promotion and the collected corpus for the first three years. User data were collected by questionnaires, interviews and corpus statistics. The general conclusions which have come out from our surveys are discussed.
In conventional approaches to Korean analysis, verb subcategorization has generally been used as lexical knowledge. A problem arises, however, when we are given long sentences in which two or more verbs of the same subcategorization are involved. In those sentences, a noun phrase may be taken as the constituent of more than one verb and cause an ambiguity. This paper presents an approach to solving this problem by using structural patterns acquired by a statistical method from corpora. Structural patterns can be the processing units for syntactic analysis and for translation into other languages as well. We have collected 10,686 unique structural patterns from a Korean corpus of 1.27 million words. We have analyzed 2,672 sentences and shown that structural patterns can improve the accuracy of Korean analysis.
We describe a streamlined knowledge acquisition method for semi-automatically constructing knowledge bases for a Knowledge Based Machine Translation (KBMT) system. This method forms the basis of a very simple Java-based user interface that enables a language expert to build lexical and syntactic transfer knowledge bases without extensive specialized training as an MT system builder. Following [Wu 1997], we assume that the permutation of binary-branching structures is a sufficient reordering mechanism for MT. Our syntactic knowledge is based on a novel, highly constrained grammar construction environment in which the only re-ordering mechanism is the permutation of binary-branching structures (Twisted Pair Grammar). We describe preliminary results for several fully implemented components of a Hindi/Urdu to English MT prototype being built with this interface.
This paper describes an implemented algorithm for syntactic realization of a target-language sentence from an interlingual representation called Lexical Conceptual Structure (LCS). We provide a mapping between LCS thematic roles and Abstract Meaning Representation (AMR) relations; these relations serve as input to an off-the-shelf generator (Nitrogen). There are two contributions of this work: (1) the development of a thematic hierarchy that provides ordering information for realization of arguments in their surface positions; (2) the provision of a diagnostic tool for detecting inconsistencies in an existing online LCS-based lexicon that allows us to enhance principles for thematic-role assignment.
We present a newly designed transformational system for the MT system LMT, consisting of a transformational formalism, LMT-TL, and an algorithm for applying transformations written in this formalism. LMT-TL is both expressive and simple because of the systematic use of a powerful pattern matching mechanism that focuses on dependency trees. LMT-TL is a language in its own right, with no “escapes” to underlying programming languages. We first provide an overview of the complete LMT translation process (all newly redesigned), and then give a self-contained description of LMT-TL, with examples.
A not-translated word (NTW) is a token which a machine translation (MT) system is unable to translate, leaving it untranslated in the output. The number of not-translated words in a document is used as one measure in the evaluation of MT systems. Many MT developers agree that in order to reduce the number of NTWs in their systems, designers must increase the size or coverage of the lexicon to include these untranslated tokens, so that the system can handle them in future processing. While we accept this method for enhancing MT capabilities, in assessing the nature of NTWs in real-world documents, we found surprising results. Our study looked at the NTW output from two commercially available MT systems (Systran and Globalink) and found that lexical coverage played a relatively small role in the words marked as not translated. In fact, 45% of the tokens in the list failed to translate for reasons other than that they were valid source language words not included in the MT lexicon. For instance, e-mail addresses, words already in the target language and acronyms were marked as not-translated words. This paper presents our analysis of NTWs and uses these results to argue that in addition to lexicon enhancement, MT systems could benefit from more sophisticated pre- and postprocessing of real-world documents in order to weed out such NTWs.
As part of the Machine Translation (MT) Proficiency Scale project at the US Federal Intelligent Document Understanding Laboratory (FIDUL), Litton PRC is developing a method to measure MT systems in terms of the tasks for which their output may be successfully used. This paper describes the development of a task inventory, i.e., a comprehensive list of the tasks analysts perform with translated material and details the capture of subjective user judgments and insights about MT samples. Also described are the user exercises conducted using machine and human translation samples and the assessment of task performance. By analyzing translation errors, user judgments about errors that interfere with task performance, and user task performance results, we isolate source language patterns which produce output problems. These patterns can then be captured in a single diagnostic test set, to be easily applied to any new Japanese-English system to predict the utility of its output.
Multilingual thesauri play a key role in multilingual text retrieval. At present, only a small number of on-line thesauri contain translations of terms in languages other than English. This is the case of the Unified Medical Language System (UMLS) Metathesaurus that includes the same term in different languages (e.g., English and Spanish). However, only a subset of terms in English have a corresponding translation in Spanish. In this work, I present an approach and some experimental results for reusing translated terms to expand the Metathesaurus. The approach includes two main tasks: finding patterns and formulating rules to automate the translation of English terms into Spanish terms. The approach is based on pattern matching, morphological rules, and word order inversion.
The Internet is rapidly changing the face of business and dramatically transforming people’s working and private lives. These developments present both a challenge and an opportunity to many technologies, one of the most important being Machine Translation. The Internet will soon be the most important medium for offering and finding information, and one of the principal means of communication for both companies and private users. There are many players on the Internet scene, each with different needs. Some players require help in presenting their information to an international audience, others require help in finding the information they seek and, because the Internet is increasingly multilingual, help in understanding that which they find. This paper attempts to identify the players and their needs, and outlines the products and services with which Machine Translation can help them to fully participate in the Internet revolution.
In this paper, we present a method to automatically revise morphological analysis errors caused by unregistered person names. In order to detect and revise these errors, we propose the Person Name Construction Model for kanji characters composing Japanese names. Our method has the advantage of not requiring context information, such as a suffix, to recognize person names. Through an experiment, we show that our proposed model is effective.
This paper argues that, contrary to received wisdom in the MT research community, a transfer system such as LMT is well suited to deal with most of the problems that MT faces. It may in fact be superior to other approaches in that it can handle target surface-structure constraints, variation of syntactic patterns, discourse-structure constraints, and stylistic preference. The paper describes the linguistic issues involved in LMT’s English⇒German transformational component, its interaction with the lexical transfer component, and types of transformations. It identifies context-dependent and context-independent transformations and among the context-dependent ones, it differentiates between those that are triggered by instructions in the lexicon, by semantic category, by syntactic context, and by setting of stylistic preference. The paper concludes with some examples of divergence between English and German and shows how LMT handles them.
Statistical models have recently been applied to machine translation with interesting results. Algorithms for processing these models have not received wide circulation, however. By contrast, general finite-state transduction algorithms have been applied in a variety of tasks. This paper gives a finite-state reconstruction of statistical translation and demonstrates the use of standard tools to compute statistically likely translations. Ours is the first translation algorithm for “fertility/permutation” statistical models to be described in replicable detail.
This paper describes experiments for testing the power of large-scale resources for lexical selection in machine translation (MT) and cross-language information retrieval (CLIR). We adopt the view that verbs with similar argument structure share certain meaning components, but that those meaning components are more relevant to argument realization than to idiosyncratic verb meaning. We verify this by demonstrating that verbs with similar argument structure as encoded in Lexical Conceptual Structure (LCS) are rarely synonymous in WordNet. We then use the results of this work to guide our implementation of an algorithm for cross-language selection of lexical items, exploiting the strengths of each resource: LCS for semantic structure and WordNet for semantic content. We use the Parka Knowledge-Based System to encode LCS representations and WordNet synonym sets and we implement our lexical-selection algorithm as Parka-based queries into a knowledge base containing both information types.
Translation systems tend to have more trouble with long sentences than with short ones for a variety of reasons. When the source and target languages differ rather markedly, as do Japanese and English, this problem is reflected in lower quality output. To improve readability, we experimented with automatically splitting long sentences into shorter ones. This paper outlines the problem, describes the sentence splitting procedure and rules, and provides an evaluation of the results.
This paper proposes a design of verb entries in Interlingua to facilitate the machine translation (MT) of two languages with transitivity divergence as derived from their shared and individual linguistic characteristics. It suggests that the transitivity difference is best treated with verb entries containing information of the causal relation of the expressed events. It also demonstrates how the proposed design of verb entries gives a principled treatment of aspect divergence in semantically corresponding verbs of a source language (SL) and a target language (TL). Although the current paper focuses on English and Japanese, the proposed treatment should be applicable to the MT of similarly divergent languages, since the proposed lexicon in language-independent Interlingua contains information on causal relations of events as necessary to bridge the transitivity difference.
Cross-language retrieval systems use queries in one natural language to guide retrieval of documents that might be written in another. Acquisition and representation of translation knowledge play a central role in this process. This paper explores the utility of two sources of translation knowledge for cross-language retrieval. We have implemented six query translation techniques that use bilingual term lists and one based on direct use of the translation output from an existing machine translation system; these are compared with a document translation technique that uses output from the same machine translation system. Average precision measures on a TREC collection suggest that arbitrarily selecting a single dictionary translation is typically no less effective than using every translation in the dictionary, that query translation using a machine translation system can achieve somewhat better effectiveness than simpler techniques, and that document translation may result in further improvements in retrieval effectiveness under some conditions.
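The contrast drawn in the abstract above, between using every dictionary translation and arbitrarily selecting a single one, can be illustrated with a small sketch of term-list query translation. The function name, the toy term list, the use of the first listed translation as the "arbitrary" choice, and the pass-through of terms missing from the list are all assumptions made for this example; the paper's six techniques are not specified in the abstract.

```python
def translate_query(terms, term_list, every=True):
    """Translate query terms with a bilingual term list.

    every=True  -> keep every listed translation of each term
    every=False -> keep one translation per term (here: the first listed)
    """
    translated = []
    for term in terms:
        translations = term_list.get(term)
        if not translations:
            translated.append(term)          # assumption: pass unknown terms through
        elif every:
            translated.extend(translations)
        else:
            translated.append(translations[0])
    return translated


if __name__ == "__main__":
    # Toy French-to-English term list for illustration.
    toy_term_list = {"maison": ["house", "home"], "rouge": ["red"]}
    query = ["maison", "rouge", "Paris"]
    print(translate_query(query, toy_term_list, every=True))   # ['house', 'home', 'red', 'Paris']
    print(translate_query(query, toy_term_list, every=False))  # ['house', 'red', 'Paris']
```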
Given the high labor costs of developing new lexical resources for Machine Translation (MT) and language processing systems, it is desirable to make the most of those resources already in existence. This paper describes the work being carried out on two MT projects that share a common goal: the creation, maintenance and reuse of lexical information. This goal calls into play a range of tasks from dictionary mining of machine-readable dictionaries (MRDs) to the definition of a repository capable of housing this diverse lexical information. This paper outlines the two efforts, focusing on the problems encountered and the intermediate results achieved. While the ultimate goal of the automated processing of on-line resources into multi-purpose lexical repositories is far from being achieved, our experience has shown that there are significant applications that can make use of the partially processed information produced en route. We will describe our experience with two projects, with a focus on one which utilized multiple lexical resources to provide the basis for two natural language processing (NLP) tools: a segmenter and a glosser for Thai. Finally, we make recommendations for future resource development, with a view toward mitigating the difficulties of merging information from diverse sources.