Conference of the Association for Machine Translation in the Americas (2000)
- Proceedings of the Fourth Conference of the Association for Machine Translation in the Americas: Tutorial Descriptions 6 papers
- Proceedings of the Fourth Conference of the Association for Machine Translation in the Americas: Technical Papers 18 papers
- Proceedings of the Fourth Conference of the Association for Machine Translation in the Americas: System Descriptions 7 papers
- Proceedings of the Fourth Conference of the Association for Machine Translation in the Americas: User Studies 3 papers
- Proceedings of the Workshop on Machine translation in practice: from old guard to new guard 7 papers
Proceedings of the Fourth Conference of the Association for Machine Translation in the Americas: Technical Papers
This paper addresses the problem of building conceptual resources for multilingual applications. We describe new techniques for large-scale construction of a Chinese-English lexicon for verbs, using thematic-role information to create links between Chinese and English conceptual information. We then present an approach to compensating for gaps in the existing resources. The resulting lexicon is used for multilingual applications such as machine translation and cross-language information retrieval.
Cross-language information retrieval (CLIR), where queries and documents are in different languages, requires translating queries and/or documents so as to standardize both into a common representation. For this purpose, machine translation is an effective approach. However, the computational cost of translating large-scale document collections is prohibitive. To resolve this problem, we propose a two-stage CLIR method. First, we translate a given query into the document language and retrieve a limited number of foreign documents. Second, we machine-translate only those documents into the user language and re-rank them based on the translation result. We demonstrate the effectiveness of our method through experiments using Japanese queries and English technical documents.
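The two-stage pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the word-for-word translation tables and the token-overlap scorer are invented stand-ins for the MT engine and the retrieval model.

```python
# Hypothetical toy translation tables (stand-ins for a real MT engine).
JA_TO_EN = {"kikai": "machine", "honyaku": "translation"}
EN_TO_JA = {v: k for k, v in JA_TO_EN.items()}

def translate(tokens, table):
    """Word-for-word 'translation'; unknown tokens pass through unchanged."""
    return [table.get(t, t) for t in tokens]

def overlap(query, doc):
    """Crude relevance score: number of shared token types."""
    return len(set(query) & set(doc))

def two_stage_clir(query_en, docs_ja, k=2):
    # Stage 1: translate only the (short, cheap) query into the document
    # language and retrieve a limited number k of candidate documents.
    query_ja = translate(query_en, EN_TO_JA)
    ranked = sorted(docs_ja, key=lambda d: overlap(query_ja, d), reverse=True)
    candidates = ranked[:k]
    # Stage 2: machine-translate just those k candidates into the user
    # language and re-rank them against the original query.
    translated = [translate(d, JA_TO_EN) for d in candidates]
    rescored = sorted(zip(candidates, translated),
                      key=lambda pair: overlap(query_en, pair[1]),
                      reverse=True)
    return [orig for orig, _ in rescored]
```

The point of the design is that the expensive operation (document translation) is applied to only k documents instead of the whole collection.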
A mixed-initiative system is one which allows more interactivity between the system and user, as the system is reasoning. We present some observations on the task of translating Web pages for users and suggest that a more interactive approach to this problem may be desirable. The aim is to interact with the user who is requesting the translation and the challenge is to determine the circumstances under which the user should be able to take the initiative to direct the processing or the system should be able to take the initiative to solicit further input from the user. In fact, we envision a need to support interactive translation of Web pages as the World Wide Web becomes more accessible to people with varying needs and abilities throughout the world.
This paper describes a language independent method for alignment of parallel texts that re-uses acquired knowledge. The system extracts word translation equivalents and re-uses them as correspondence points in order to enhance the alignment of parallel texts. Points that may cause misalignment are filtered using confidence bands of linear regression analysis instead of heuristics, which are not theoretically reliable. Homographs bootstrap the alignment process so as to build the primary word translation lexicon. At each step, the previously acquired lexicon is re-used so as to repeatedly make finer-grained alignments and produce more reliable translation lexicons.
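The confidence-band filtering step can be illustrated roughly as follows: candidate correspondence points (source offset, target offset) in a parallel text should lie near the regression line of target position on source position, so points far outside the band are likely misalignments. The simple residual test and the band width z = 2 below are assumptions for illustration, not the paper's exact formulation.

```python
import math

def fit_line(points):
    """Ordinary least-squares fit of y on x for (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return slope, my - slope * mx

def filter_points(points, z=2.0):
    """Keep only correspondence points inside a +/- z*sd band around
    the regression line; the rest are treated as likely misalignments."""
    slope, intercept = fit_line(points)
    residuals = [y - (slope * x + intercept) for x, y in points]
    sd = math.sqrt(sum(r * r for r in residuals) / (len(points) - 2))
    return [p for p, r in zip(points, residuals) if abs(r) <= z * sd]
```

A full confidence band also widens toward the ends of the x range; the constant-width band here is the simplest version of the idea.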
Handling structural divergences and recovering dropped arguments in a Korean/English machine translation system
Chung-hye Han | Benoit Lavoie | Martha Palmer | Owen Rambow | Richard Kittredge | Tanya Korelsky | Nari Kim | Myunghee Kim
This paper describes an approach for handling structural divergences and recovering dropped arguments in an implemented Korean to English machine translation system. The approach relies on canonical predicate-argument structures (or dependency structures), which provide a suitable pivot representation for the handling of structural divergences and the recovery of dropped arguments. It can also be converted to and from the interface representations of many off-the-shelf parsers and generators.
Research in computational linguistics, computer graphics and autonomous agents has led to the development of increasingly sophisticated communicative agents over the past few years, bringing a new perspective to machine translation research. The engineering of language-based smooth, expressive, natural-looking human gestures can give us useful insights into the design principles that have evolved in natural communication between people. In this paper we prototype a machine translation system from English to American Sign Language (ASL), taking into account not only linguistic but also visual and spatial information associated with ASL signs.
This paper describes a language independent linearization engine, oxyGen. This system compiles target language grammars into programs that take feature graphs as inputs and generate word lattices that can be passed along to the statistical extraction module of the generation system Nitrogen. The grammars are written using a flexible and powerful language, oxyL, that has the power of a programming language but focuses on natural language realization. This engine has been used successfully in creating an English linearization program that is currently employed as part of a Chinese-English machine translation system.
This paper presents the implementation part of my doctoral research at the University of Cambridge. It provides a description of the Information Structure Transfer (IST), a machine translation prototype designed within the framework of the Spoken Language Translator (SLT by SRI, Cambridge/Palo Alto) and based on the Core Language Engine. The IST includes two discourse-processing modules: the pre-transfer Information Structure Activator (ISA) and the post-transfer Information Structure Generator (ISG). The IST prototype calculates and processes vital features of information structure explored in the context of structural differences between positional and nonpositional languages. It offers algorithmic solutions and an implementation framework for local discourse processing in machine translation. Under scrutiny is a web of interrelated factors such as pronominalization, anaphora resolution, zero anaphors, definiteness and constituent order.
Translations produced by an MT system can automatically be assigned a number that reflects the MT system’s confidence in their quality. We describe the design of such a confidence index, with focus on the contribution of source analysis, which plays a crucial role in many MT systems, including ours. Various problematic areas of source analysis are identified, and their impact on the overall confidence index is given. We will describe two methods of training the confidence index, one by hand-tuning of the heuristics, the other by linear regression analysis.
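The second training method mentioned above, linear regression, can be sketched like this: each translation is described by counts of problematic source-analysis phenomena, and weights are fit against human quality judgments by solving the normal equations. The feature names and toy data are invented, and this generic least-squares fit is only an illustration of the technique, not the authors' system.

```python
def solve(a, b):
    """Gaussian elimination with partial pivoting for a small linear system."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_confidence_weights(features, scores):
    """Least squares: minimize ||Xw - y||^2 by solving X^T X w = X^T y.
    A leading 1 is prepended to each row for the intercept term."""
    rows = [[1.0] + list(f) for f in features]
    n = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    xty = [sum(r[i] * s for r, s in zip(rows, scores)) for i in range(n)]
    return solve(xtx, xty)

def confidence(weights, feature_vector):
    """Confidence index: intercept plus weighted sum of problem counts."""
    return weights[0] + sum(w * f for w, f in zip(weights[1:], feature_vector))
```

With features such as (unknown-word count, parse-failure count) and human scores as the target, the fitted weights quantify how much each problematic area of source analysis should lower the confidence index.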
Researchers, developers, translators and information consumers all share the problem that there is no accepted standard for machine translation evaluation. The problem is further compounded by the fact that MT evaluations properly done require a considerable commitment of time and resources, an anachronism in this day of cross-lingual information processing, when new MT systems may be developed in weeks instead of years. This paper surveys the needs addressed by several of the classic “types” of MT evaluation, and speculates on ways that each of these types might be automated to create relevant, near-instantaneous evaluation of approaches and systems.
Machine Translation evaluation has been more magic and opinion than science. The history of MT evaluation is long and checkered - the search for objective, measurable, resource-reduced methods of evaluation continues. A recent trend towards task-based evaluation inspires the question - can we use methods of evaluation of language competence in language learners and apply them reasonably to MT evaluation? This paper is the first in a series of steps to look at this question. In this paper, we will present the theoretical framework for our ideas, the notions we ultimately aim towards and some very preliminary results of a small experiment along these lines.
Parallel corpora enriched with descriptive annotations facilitate the development of multilingual authoring tools. Starting from an annotated bitext, we show how SGML markup can be recycled to produce complementary language resources. On the one hand, several translation memory databases, together with glossaries of proper nouns, have been produced. On the other, DTDs for source and target documents have been derived and put into correspondence. This paper discusses how these resources have been automatically generated and applied to an interactive bilingual authoring system. This tool is capable of handling a substantial proportion of text in both the composition and the translation of structured documents.
This paper presents an approach to extracting invertible translation examples from pre-aligned reference translations. The set of invertible translation examples is used for translation in the Example-Based Machine Translation (EBMT) system EDGAR. Invertible bilingual grammars eliminate translation ambiguities such that each source language parse tree maps into only one target language string. The translation results of EDGAR are compared and combined with those of a translation memory (TM). It is shown that i) the best translation results for the EBMT system are achieved when using a bilingual lexicon to support the alignment process, ii) TMs and EBMT systems can be linked in a dynamical sequential manner, and iii) the combined translation of TMs and EBMT is in any case better than that of either single system.
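One plausible reading of the "dynamical sequential" linking of TM and EBMT described above is: let the translation memory answer exact or sufficiently close matches, and hand everything else to the EBMT engine. The fuzzy-match threshold and the stand-in `ebmt_translate` callback below are assumptions for illustration.

```python
import difflib

def tm_then_ebmt(sentence, memory, ebmt_translate, threshold=0.9):
    """Sequential TM/EBMT combination: return the TM translation when the
    closest memory entry is similar enough, otherwise fall back to EBMT."""
    best, best_score = None, 0.0
    for source, target in memory.items():
        score = difflib.SequenceMatcher(None, sentence, source).ratio()
        if score > best_score:
            best, best_score = target, score
    if best_score >= threshold:
        return best                  # TM hit: reuse the stored translation
    return ebmt_translate(sentence)  # otherwise invoke the EBMT engine
```

The abstract's finding iii) corresponds to this combined pipeline outperforming either component run alone.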
Although undeniably useful for the translation of certain types of repetitive document, current translation memory technology is limited by the rudimentary techniques employed for approximate matching. Such systems, moreover, incorporate no real notion of a document, since the databases that underlie them are essentially composed of isolated sentence strings. As a result, current TM products can only exploit a small portion of the knowledge residing in translators’ past production. This paper examines some of the changes that will have to be implemented if the technology is to be made more widely applicable.
The paper deals with the question whether representations of verb semantics formulated on the basis of a lexically and syntactically restricted domain (weather forecasts) can apply to other, less restricted textual domains. An analysis of a group of Polish polysemous verbs of motion, existence and appearance inspired by cognitive semantics, especially the metaphor theory, is presented, and the usefulness of the conceptual representations of the Polish motion/appearance/existence verbs for automatic translation of texts belonging to less restricted domains is evaluated and discussed.
Machine translation has proved to be easier between closely related languages, such as German and English, while translation between distant languages, such as Chinese and English, encounters many more problems. The present study focuses on Swedish and Norwegian, two languages so closely related that they would be referred to as dialects were it not for the fact that each has a Royal house and an army connected to it. Despite their similarity, though, some differences make the translation phase much less straightforward than might be expected. Starting from sentence-aligned parallel texts, this study aims to highlight some of these differences and to formalise the results. To do so, the texts have been aligned on smaller units by a simple cognate alignment method. Not at all surprisingly, longer words were easier to align, while shorter and often high-frequency words were a problem. Likewise, when trying to align to a specific word sense in a dictionary, content words gave better results. We therefore abandoned the use of single-word units and searched for multi-word units whenever possible. This study reinforces the view that machine translation should rest upon methods based on multi-word unit searches.
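A minimal cognate-alignment step in the spirit described above pairs each Swedish word with the Norwegian word it most resembles, accepting a pair only when string similarity is high enough. The 0.7 threshold and the use of `difflib` similarity are assumptions for illustration; as the abstract notes, short high-frequency words fare worst under exactly this kind of matching.

```python
import difflib

def cognate_align(swedish_words, norwegian_words, threshold=0.7):
    """Greedy cognate alignment: for each Swedish word, pick the most
    similar Norwegian word; discard pairs below the similarity threshold."""
    pairs = []
    for sv in swedish_words:
        scored = [(difflib.SequenceMatcher(None, sv, no).ratio(), no)
                  for no in norwegian_words]
        best_ratio, best = max(scored)
        if best_ratio >= threshold:
            pairs.append((sv, best))
    return pairs
```

Long cognates accumulate enough matching characters to clear the threshold, while short function words (e.g. Swedish "och" vs Norwegian "og") do not, which mirrors the difficulty the study reports.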
We describe our experience in adapting an existing high- quality, interlingual, unidirectional machine translation system to a new domain and bidirectional translation for a new language pair (English and Italian). We focus on the interlingua design changes which were necessary to achieve high quality output in view of the language mismatches between English and Italian. The representation we propose contains features that are interpreted differently, depending on the translation direction. This decision simplified the process of creating the interlingua for individual sentences, and allows the system to defer mapping of language-specific features (such as tense and aspect), which are realized when the target syntactic feature structure is created. We also describe a set of problems we encountered in translating modal verbs, and discuss the representation of modality in our interlingua.
In this paper we propose a representation for what we have called an interpretation of a text. We base this representation on TMR (Text Meaning Representation), an interlingual representation developed for Machine Translation purposes. A TMR consists of a complex feature-value structure, with the feature names and filler values drawn from an ontology, in this case, ONTOS, developed concurrently with TMR. We suggest on the basis of previous work, that a representation of an interpretation of a text must build on a TMR structure for the text in several ways: (1) by the inclusion of additional required features and feature values (which may themselves be complex feature structures); (2) by pragmatically filling in empty slots in the TMR structure itself; and (3) by supporting the connections between feature values by including, as part of the TMR itself, the chains of inferencing that link various parts of the structure.
Proceedings of the Fourth Conference of the Association for Machine Translation in the Americas: System Descriptions
The majority of textual content on the Internet is in English, which represents an obstacle to non-English-speaking users who wish to access it. The idea behind this MT-based application is to allow any Arabic user to search and navigate the Internet in Arabic, without the need for prior knowledge of English. The infrastructure of TARJIM.COM relies on three core components: (1) the bidirectional English-Arabic machine translation engine, (2) the intelligent Web-page layout-preserving component, and (3) the search-engine query interceptor.
In this paper we describe the KANTOO machine translation environment, a set of software services and tools for multilingual document production. KANTOO includes modules for source language analysis, target language generation, source terminology management, target terminology management, and knowledge source development. The KANTOO system represents a complete re-design and re-implementation of the KANT machine translation system.
ARL’s FALCon system has proven its integrated OCR and MT technology to be a valuable asset to soldiers in the field in both Bosnia and Haiti. Now it is being extended to include six more SYSTRAN language pairs in response to the military’s need for automatic translation capabilities as they pursue US national objectives in East Asia. The Pacific Rim Portable Translator will provide robust automatic translation bidirectionally for English, Chinese, Japanese, and Korean, which will allow not only rapid assimilation of foreign information, but two-way communication as well for both the public and private sectors.
The LabelTool/TrTool system is designed to administer text strings that are shown in devices with a very limited display area and translated into a very large number of foreign languages. Automation of character set handling and file naming and storage, together with real-time simulation of text string input, are the main features of this application.
The LogoVista ES translation system translates English text to Spanish. It is a member of LEC’s family of translation tools and uses the same engine as LogoVista EJ. This engine, which has been under development for ten years, is heavily linguistic and rule-based. It includes a very large, highly annotated English dictionary that contains detailed syntactic, semantic and domain information; a binary parser that produces multiple parses for each sentence; a 12,000+-rule, context-free English grammar; and a synthesis file of rules that convert each parsed English structure into a Spanish structure. The main tasks involved in developing a new language pair include the addition of target-language translations to the dictionary and the addition of rules to the synthesis file. The system’s modular design allows the work to be carried out by linguists, independent of engineers.
One of the most important components of any machine translation system is the translation lexicon. The size and quality of the lexicon, as well as the coverage of the lexicon for a particular use, greatly influence the applicability of machine translation for a user. The high cost of lexicon development limits the extent to which even mature machine translation vendors can expand and specialize their lexicons, and frequently prevents users from building extensive lexicons at all. To address the high cost of lexicography for machine translation, L&H is building a Lexicography Toolkit that includes tools that can significantly improve the process of creating custom lexicons. The toolkit is based on the concept of using automatic methods of data acquisition, using text corpora, to generate lexicon entries. Of course, lexicon entries must be accurate, so the work of the toolkit must be checked by human experts at several stages. However, this checking mostly consists of removing erroneous results, rather than adding data and entire entries. This article will explore how the Lexicography Toolkit would be used to create a lexicon that is specific to the user’s domain.
This paper describes some of the features of the new 32-bit Windows version of PAHO’s English-Spanish (ENGSPAN®) and Spanish-English (SPANAM®) machine translation software. The new dictionary update interface is designed to help users add their own terminology to the lexicon and encourage them to write context-sensitive rules to improve the quality of the output. Expanded search capabilities provide instant access to related source and target entries, expressions, and rules. A live system demonstration will accompany this presentation.
Proceedings of the Fourth Conference of the Association for Machine Translation in the Americas: User Studies
This paper discusses an informal methodology for evaluating Machine Translation software documentation with reference to a case study, in which a number of currently available MT packages are evaluated. Different types of documentation style are discussed, as well as different user profiles. It is found that documentation is often inadequate in identifying the level of linguistic background and knowledge necessary to use translation software, and in explaining technical (linguistic) terms needed to use the software effectively. In particular, the level of knowledge and training needed to use the software is often incompatible with the user profile implied by the documentation. Also, guidance on how to perform more complex tasks, which may be especially idiosyncratic, is often inadequate or missing altogether.
“Embedded” machine translation (MT) refers to an end-to-end computational process of which MT is one of the components. Integrating these components and evaluating the whole has proved to be problematic. As an example of embedded MT, we describe a prototype system called Falcon, which permits paper documents to be scanned and translated into English. MT is thus embedded in the preprocessing of hardcopy pages and subject to its noise. Because Falcon is intended for use by people in the military who are trying to screen foreign documents, and not to understand them in detail, its application makes low demands on translation quality. We report on a series of user trials that speak to the utility of embedded MT in army tasks.
We describe four machine translation systems: E-K (English to Korean), K-E (Korean to English), J-K (Japanese to Korean), and K-J (Korean to Japanese). Among these, the E-K and K-J systems have been released commercially, and development of the other systems is complete. This paper describes the structure and function of each system, with figures and translation results.
The Internet is no longer English-only. The data is voluminous, and the number of proficient linguists cannot match the day-to-day needs of several government agencies. Handling foreign languages is not limited to translating documents but goes beyond journalistic written formats. Military, diplomatic and official interactions in the US and abroad require more than one or two foreign language skills. The CHALLENGE is both managing the user’s expectations and stimulating new areas for MT research and development.
The application of MT on the Internet has certainly attracted much attention in recent years, and many observers see its future mostly in this arena of real-time raw translation. However, the need for high-volume, fast turn-around translation of publication quality has not abated. This paper will take stock of that particular use of MT and venture predictions as to its future.
This paper is concerned with the technology of using the PARS English-Russian bidirectional machine translation systems in teaching English as a foreign language. This technology has no connection with the old form of computer-assisted language learning, which uses “drill-and-practice” computer exercises and provides a sort of surrogate “electronic teacher”. The main objective of the educational application of PARS is to help the learner become familiar with words in their normal contexts. The introduction of a machine translation system into the teaching of foreign languages is intended to get the most fruitful pedagogical results from the use of personal computers and to expose learners to up-to-date information technologies.
ENGSPAN, a machine translation program (English-Spanish), has been used by the Translation Services unit of the Pan American Health Organization since 1985. In 1999, a total of 2,106,178 words were translated in that language combination, 86% of which were done with the help of ENGSPAN; the cost per word was 8.75 cents, that is, 31% below the normal rate. These positive results are explained by a combination of factors: the use of an MT program especially designed to meet the needs of the institution; the close collaboration of translators and computational linguists in the improvement of the program; the application of a pragmatic, flexible, and selective approach with regard to the quality of the end product; and in particular the support of competent translators who do the postediting work.
Our project Wired for Peace: Virtual Diplomacy in Northeast Asia (http://www-neacd.ucsd.edu/) aims to provide policymakers and researchers in the U.S., China, Russia, Japan, and Korea with Internet-based tools that allow continuous communication on issues of regional security and cooperation. From the very beginning of the project, we have understood that Web-based translation between English and Asian languages would be one of the tools most necessary for its successful development. With this understanding, we have partnered with Systran (www.systransoft.com), one of the leaders in the MT field, to develop Internet-based tools for both synchronous and asynchronous translation of texts and discussions. This submission is a report on work in progress.
The Internet is a wonderful medium that frees its users from the confines of geographic boundaries. While acceptance of the Internet is pervasive, the language barrier is somewhat tougher to overcome. Several options exist on the market to deliver multilingual content, but few solutions can stand up to the dynamic demands of a modern website. Language context, translation turnaround times, and various business models are all barriers to creating a total solution for globalization and localization of websites. We will examine the difficulties in localizing a dynamic website and discuss the challenges we have overcome to create a dynamic translation platform.