Building a Chinese-English mapping between verb concepts for multilingual applications
Bonnie J. Dorr
This paper addresses the problem of building conceptual resources for multilingual applications. We describe new techniques for large-scale construction of a Chinese-English lexicon for verbs, using thematic-role information to create links between Chinese and English conceptual information. We then present an approach to compensating for gaps in the existing resources. The resulting lexicon is used for multilingual applications such as machine translation and cross-language information retrieval.
Applying machine translation to two-stage cross-language information retrieval
Cross-language information retrieval (CLIR), where queries and documents are in different languages, requires that queries and/or documents be translated so that both are standardized into a common representation. For this purpose, machine translation is an effective approach. However, the computational cost of translating large-scale document collections is prohibitive. To resolve this problem, we propose a two-stage CLIR method. First, we translate a given query into the document language and retrieve a limited number of foreign documents. Second, we machine translate only those documents into the user language and re-rank them based on the translation result. We also show the effectiveness of our method by way of experiments using Japanese queries and English technical documents.
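The two-stage method outlined above can be sketched as a small pipeline. This is a rough illustration only; the component parameters (translate_query, retrieve, mt_translate, rerank_score) are hypothetical placeholders, not the paper's actual modules:

```python
def two_stage_clir(query, documents, translate_query, retrieve,
                   mt_translate, rerank_score, k=1000):
    """Sketch of two-stage CLIR (hypothetical component names).

    Stage 1: translate the query into the document language and
    retrieve a limited number (k) of foreign-language documents.
    Stage 2: machine translate only those k documents into the
    user language and re-rank them on the translation result.
    """
    # Stage 1: query translation + retrieval in the document language.
    foreign_query = translate_query(query)
    candidates = retrieve(foreign_query, documents, limit=k)
    # Stage 2: MT only the k candidates, then re-rank in the user
    # language, so the expensive MT step never sees the full collection.
    translated = [(doc, mt_translate(doc)) for doc in candidates]
    translated.sort(key=lambda p: rerank_score(query, p[1]), reverse=True)
    return [doc for doc, _ in translated]
```

The point of the two stages is that document translation cost scales with k, not with the collection size.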
Mixed-initiative translation of Web pages
A mixed-initiative system is one that allows greater interactivity between the system and the user while the system is reasoning. We present some observations on the task of translating Web pages for users and suggest that a more interactive approach to this problem may be desirable. The aim is to interact with the user who is requesting the translation, and the challenge is to determine the circumstances under which the user should be able to take the initiative to direct the processing, or the system should be able to take the initiative to solicit further input from the user. Indeed, we envision a need to support interactive translation of Web pages as the World Wide Web becomes more accessible to people with varying needs and abilities throughout the world.
A self-learning method of parallel texts alignment
This paper describes a language independent method for alignment of parallel texts that re-uses acquired knowledge. The system extracts word translation equivalents and re-uses them as correspondence points in order to enhance the alignment of parallel texts. Points that may cause misalignment are filtered using confidence bands of linear regression analysis instead of heuristics, which are not theoretically reliable. Homographs bootstrap the alignment process so as to build the primary word translation lexicon. At each step, the previously acquired lexicon is re-used so as to repeatedly make finer-grained alignments and produce more reliable translation lexicons.
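The regression-based filtering step can be illustrated with a small sketch. This is only an approximation of the idea (a fixed ±k·σ residual band around a fitted line, rather than the paper's exact confidence-band formula), and the function name is invented:

```python
import math

def filter_by_confidence_band(points, num_sd=2.0):
    """Filter candidate correspondence points (x, y) = (source offset,
    target offset): fit a least-squares line y = a + b*x, then keep
    only the points whose residual lies within num_sd residual
    standard deviations of the line. Outliers likely to cause
    misalignment fall outside the band and are discarded.
    """
    n = len(points)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in points]
    # Residual standard deviation with n - 2 degrees of freedom.
    sd = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return [(x, y) for (x, y), r in zip(points, residuals)
            if abs(r) <= num_sd * sd]
```

Since correspondence points in parallel texts should grow roughly linearly together, points far from the fitted line are the ones most likely to be spurious.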
Handling structural divergences and recovering dropped arguments in a Korean/English machine translation system
This paper describes an approach for handling structural divergences and recovering dropped arguments in an implemented Korean to English machine translation system. The approach relies on canonical predicate-argument structures (or dependency structures), which provide a suitable pivot representation for the handling of structural divergences and the recovery of dropped arguments. It can also be converted to and from the interface representations of many off-the-shelf parsers and generators.
A machine translation system from English to American Sign Language
Research in computational linguistics, computer graphics and autonomous agents has led to the development of increasingly sophisticated communicative agents over the past few years, bringing a new perspective to machine translation research. The engineering of smooth, expressive, natural-looking, language-based human gestures can give us useful insights into the design principles that have evolved in natural communication between people. In this paper we prototype a machine translation system from English to American Sign Language (ASL), taking into account not only linguistic but also visual and spatial information associated with ASL signs.
Oxygen: a language independent linearization engine
This paper describes a language independent linearization engine, oxyGen. This system compiles target language grammars into programs that take feature graphs as inputs and generate word lattices that can be passed along to the statistical extraction module of the generation system Nitrogen. The grammars are written using a flexible and powerful language, oxyL, that has the power of a programming language but focuses on natural language realization. This engine has been used successfully in creating an English linearization program that is currently employed as part of a Chinese-English machine translation system.
Information structure transfer: bridging the information gap in structurally different languages
This paper presents the implementation part of my doctoral research at the University of Cambridge. It provides a description of the Information Structure Transfer (IST), a machine translation prototype designed within the framework of the Spoken Language Translator (SLT by SRI, Cambridge/Palo Alto) and based on the Core Language Engine. The IST includes two discourse-processing modules: the pre-transfer Information Structure Activator (ISA) and the post-transfer Information Structure Generator (ISG). The IST prototype calculates and processes vital features of information structure explored in the context of structural differences between positional and nonpositional languages. It offers algorithmic solutions and an implementation framework for local discourse processing in machine translation. Under scrutiny is a web of interrelated factors such as pronominalization, anaphora resolution, zero anaphors, definiteness and constituent order.
The effect of source analysis on translation confidence
Michael C. McCord
Translations produced by an MT system can automatically be assigned a number that reflects the MT system’s confidence in their quality. We describe the design of such a confidence index, with a focus on the contribution of source analysis, which plays a crucial role in many MT systems, including ours. Various problematic areas of source analysis are identified, and their impact on the overall confidence index is given. We describe two methods of training the confidence index: one by hand-tuning the heuristics, the other by linear regression analysis.
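As a purely illustrative sketch, such a confidence index might combine per-area problem counts from source analysis into a single score via trained weights. The problem-area names and weight values below are invented for illustration, not taken from the paper:

```python
def confidence_index(problem_counts, weights, base=1.0):
    """Combine counts of problematic source-analysis phenomena into a
    confidence score in [0, 1]. The weights could be hand-tuned or
    fitted by linear regression against human quality judgments (the
    two training methods the abstract mentions); the ones used here
    are invented placeholders.
    """
    penalty = sum(weights.get(area, 0.0) * count
                  for area, count in problem_counts.items())
    # Clamp to [0, 1] so many severe problems cannot drive the
    # index below zero.
    return max(0.0, min(1.0, base - penalty))

# Hypothetical problem areas with hand-tuned penalty weights:
WEIGHTS = {"unknown_word": 0.15,
           "parse_failure": 0.40,
           "attachment_ambiguity": 0.05}
```

A clean parse with no flagged phenomena would score 1.0; each detected problem area subtracts its weighted penalty.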
Contemplating automatic MT evaluation
John S. White
Researchers, developers, translators and information consumers all share the problem that there is no accepted evaluation standard for machine translation. The problem is further confounded by the fact that MT evaluations properly done require a considerable commitment of time and resources, an anachronism in this day of cross-lingual information processing, when new MT systems may be developed in weeks instead of years. This paper surveys the needs addressed by several of the classic “types” of MT evaluation, and speculates on ways that each of these types might be automated to create relevant, near-instantaneous evaluation of approaches and systems.
How are you doing? A look at MT evaluation
Machine translation evaluation has been more magic and opinion than science. The history of MT evaluation is long and checkered; the search for objective, measurable, resource-reduced methods of evaluation continues. A recent trend towards task-based evaluation inspires the question: can we take methods for evaluating language competence in language learners and apply them reasonably to MT evaluation? This paper is the first in a series of steps to examine this question. We present the theoretical framework for our ideas, the notions we ultimately aim towards, and some very preliminary results of a small experiment along these lines.
Recycling annotated parallel corpora for bilingual document composition
Parallel corpora enriched with descriptive annotations facilitate the development of multilingual authoring tools. Starting from an annotated bitext, we show how SGML markup can be recycled to produce complementary language resources. On the one hand, several translation memory databases, together with glossaries of proper nouns, have been produced. On the other, DTDs for source and target documents have been derived and put into correspondence. This paper discusses how these resources have been automatically generated and applied in an interactive bilingual authoring system. This tool is capable of handling a substantial proportion of text in both the composition and translation of structured documents.
Combining invertible example-based machine translation with translation memory technology
This paper presents an approach to extracting invertible translation examples from pre-aligned reference translations. The set of invertible translation examples is used in the Example-Based Machine Translation (EBMT) system EDGAR for translation. Invertible bilingual grammars eliminate translation ambiguities such that each source language parse tree maps into only one target language string. The translation results of EDGAR are compared and combined with those of a translation memory (TM). It is shown that i) the best translation results are achieved for the EBMT system when using a bilingual lexicon to support the alignment process, ii) TMs and EBMT systems can be linked in a dynamic sequential manner, and iii) the combined translation of TMs and EBMT is in every case better than that of either single system.
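One simple way to link a TM and an EBMT system sequentially, in the spirit of the combination described above, can be sketched as follows. The component names and the fixed threshold are hypothetical, not EDGAR's actual interface:

```python
def combined_translate(sentence, tm_lookup, ebmt_translate, threshold=0.8):
    """Sequential TM/EBMT combination (sketch with invented names).

    Consult the translation memory first: if its best fuzzy match
    scores at or above the threshold, reuse the stored translation;
    otherwise fall back to example-based translation.
    """
    match = tm_lookup(sentence)  # -> (score, translation) or None
    if match is not None and match[0] >= threshold:
        return match[1]
    return ebmt_translate(sentence)
```

The TM handles exact and near-exact repetitions cheaply, while the EBMT system covers novel input the memory cannot match.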
What’s been forgotten in translation memory
Although undeniably useful for the translation of certain types of repetitive document, current translation memory technology is limited by the rudimentary techniques employed for approximate matching. Such systems, moreover, incorporate no real notion of a document, since the databases that underlie them are essentially composed of isolated sentence strings. As a result, current TM products can only exploit a small portion of the knowledge residing in translators’ past production. This paper examines some of the changes that will have to be implemented if the technology is to be made more widely applicable.
Understanding politics by studying weather: a cognitive approach to representation of Polish verbs of motion, appearance, and existence
The paper deals with the question of whether representations of verb semantics formulated on the basis of a lexically and syntactically restricted domain (weather forecasts) can apply to other, less restricted textual domains. An analysis of a group of Polish polysemous verbs of motion, existence and appearance, inspired by cognitive semantics and especially metaphor theory, is presented, and the usefulness of the conceptual representations of the Polish motion/appearance/existence verbs for automatic translation of texts belonging to less restricted domains is evaluated and discussed.
Small but efficient: the misconception of high-frequency words in Scandinavian translation
Machine translation has proved easier between languages that are closely related, such as German and English, while distantly related languages, such as Chinese and English, present many more problems. The present study focuses on Swedish and Norwegian, two languages so closely related that they would be referred to as dialects were it not for the fact that each has a royal house and an army connected to it. Despite their similarity, however, some differences make the translation phase much less straightforward than might be expected. Starting from sentence-aligned parallel texts, this study aims to highlight some of these differences and to formalise the results. To do so, the texts have been aligned on smaller units by a simple cognate alignment method. Not at all surprisingly, the longer words were easier to align, while shorter and often high-frequency words posed a problem. Also, when trying to align to a specific word sense in a dictionary, content words gave better results. We therefore abandoned the use of single-word units and searched for multi-word units whenever possible. This study reinforces the view that machine translation should rest upon methods based on multi-word unit searches.
Challenges in adapting an interlingua for bidirectional English-Italian translation
We describe our experience in adapting an existing high-quality, interlingual, unidirectional machine translation system to a new domain and to bidirectional translation for a new language pair (English and Italian). We focus on the interlingua design changes which were necessary to achieve high-quality output in view of the language mismatches between English and Italian. The representation we propose contains features that are interpreted differently, depending on the translation direction. This decision simplified the process of creating the interlingua for individual sentences, and allows the system to defer the mapping of language-specific features (such as tense and aspect), which are realized when the target syntactic feature structure is created. We also describe a set of problems we encountered in translating modal verbs, and discuss the representation of modality in our interlingua.
Text meaning representation as a basis for representation of text interpretation
In this paper we propose a representation for what we have called an interpretation of a text. We base this representation on TMR (Text Meaning Representation), an interlingual representation developed for Machine Translation purposes. A TMR consists of a complex feature-value structure, with the feature names and filler values drawn from an ontology, in this case, ONTOS, developed concurrently with TMR. We suggest on the basis of previous work, that a representation of an interpretation of a text must build on a TMR structure for the text in several ways: (1) by the inclusion of additional required features and feature values (which may themselves be complex feature structures); (2) by pragmatically filling in empty slots in the TMR structure itself; and (3) by supporting the connections between feature values by including, as part of the TMR itself, the chains of inferencing that link various parts of the structure.