pdf
bib
Proceedings of Machine Translation Summit VIII
Bente Maegaard
pdf
bib
Towards a new vision for MT
John Hutchins
pdf
bib
abs
Generation for multilingual MT
Takako Aikawa
|
Maite Melero
|
Lee Schwartz
|
Andi Wu
This paper presents an overview of the broad-coverage, application-independent natural language generation component of the NLP system being developed at Microsoft Research. It demonstrates how this component functions within a multilingual Machine Translation system (MSR-MT), using the languages that we are currently working on (English, Spanish, Japanese, and Chinese). Section 1 provides a system description of MSR-MT. Section 2 focuses on the generation component and its set of core rules. Section 3 describes an additional layer of generation rules with examples that address issues specific to MT. Section 4 presents evaluation results in the context of MSR-MT. Section 5 addresses generation issues outside of MT.
pdf
abs
Using multiple edit distances to automatically rank machine translation output
Yasuhiro Akiba
|
Kenji Imamura
|
Eiichiro Sumita
This paper addresses the challenging problem of automatically evaluating output from machine translation (MT) systems in order to support the developers of these systems. Conventional approaches to the problem include methods that automatically assign a rank such as A, B, C, or D to MT output according to a single edit distance between this output and a correct translation example. The single edit distance can be designed in different ways, but changing its design makes the assignment of one rank more accurate and that of another rank less accurate. This limits how far the accuracy of rank assignment can be improved. To overcome this obstacle, this paper proposes an automatic ranking method that, by using multiple edit distances, encodes machine-translated sentences with a rank assigned by humans into multi-dimensional vectors from which a classifier of ranks is learned in the form of a decision tree (DT). The proposed method assigns a rank to MT output through the learned DT. The proposed method is evaluated using transcribed texts of real conversations in the travel arrangement domain. Experimental results show that the proposed method is more accurate than the single-edit-distance-based ranking methods, in both closed and open tests. Moreover, the proposed method could estimate MT quality within 3% error in some cases.
pdf
abs
Collapsing morphological information in lexical databases for NLP applications
Juan A. Alonso
|
Ramón Fanlo
|
Albert Llorens
The morphology of inflectional languages poses specific problems in the processing of morphological alternations. Regular alternations at morpheme boundaries can be elegantly captured by the use of rule formalisms based on the two-level morphology model. Stem alternations and completely irregular alternations at morpheme boundaries, however, need to be captured in some way in the lexicon. This paper presents four possible solutions to the problem and makes a claim in favor of one of them. The proposed approach makes use of feature bundles that contain the necessary linguistic information to uniquely identify allomorphic variations of stems in the lexicon. The proposal is an improvement in that it simplifies the representation of allomorphic variations in the lexicon by avoiding duplication of stem allomorphs to capture cross-combination of several morphosyntactic features in stem+flex sequences.
pdf
abs
Finding translation correspondences from parallel parsed corpus for example-based translation
Eiji Aramaki
|
Sadao Kurohashi
|
Satoshi Sato
|
Hideo Watanabe
This paper describes a system for finding phrasal translation correspondences in a parallel parsed corpus, that is, a collection of paired English and Japanese sentences. First, the system finds phrasal correspondences by consulting a Japanese-English translation dictionary. Then, the system finds correspondences in the remaining phrases by using the sentences' dependency structures and the balance of all correspondences. The method is based on the assumption that in a parallel corpus most fragments in a source sentence have corresponding fragments in a target sentence.
pdf
abs
Generation of noun-noun compounds in the Spanish-English machine translation system SPANAM®
Julia Aymerich
The translation of Spanish Noun + preposition + Noun (NPN) constructions into English Noun-Noun (NN) compounds in many cases produces output with a higher level of fluency than if the NPN ordering is preserved. However, overgeneration of NN compounds can be dangerous because it may introduce ambiguity in the translation. This paper presents the strategy implemented in SPANAM to address this issue. The strategy involves dictionary coding of key words and expressions that allow or prohibit NN formation as well as an algorithm that generates NN compounds automatically when no dictionary coding is present. Certain conditions specified in the algorithm may also override the dictionary coding. The strategy makes use of syntactic and lexical information. No semantic coding is required. The last step in the strategy involves post-editing macros that allow the posteditor to quickly create or undo NN compounds if SPANAM did not generate the desired result.
pdf
abs
Gross-grained RST through XML metadata for multilingual document generation
Guillermo Barrutieta
|
Joseba Abaitua
|
Josuha Díaz
We present an RST-based discourse annotation proposal used in the construction of a trial multilingual XML-tagged corpus of teaching material in Basque, English and Spanish. The corpus feeds an experimental multilingual document generation system for the web. The main contributions of this paper are an implementation of RST through XML metadata and the adoption of gross-grained RST to avoid non-isomorphism in multilingual corpora.
pdf
abs
A taste of MALT
Ulrike Bernardi
|
Petra Gieselmann
|
Steve McLaughlin
Globalisation is bringing translation and multilingual information processing to areas where they were previously unknown or relatively unimportant. Today, translation is not only important for reaching global audiences, it is becoming an indispensable component inside other systems and workflows. MALT (Modular Architecture for Linguistic Tools) represents a fresh approach to a relatively new problem: how to provide translation capabilities plus any other vital linguistic tools and components inside a common framework, possibly together with other external applications. MALT’s modular structure and multi-tier architecture simplify integration into complex workflow scenarios, and the functional separation in the MALT interface permits new components to be added extremely quickly. The applications and components running under MALT can be accessed locally, in a network environment or as engines of a distributed client-server system such as DTS.
pdf
abs
An integrated solution: applying PROMT machine translation technology, terminology mining, and the TRADOS TWB translation memory to SAP content translation
Ulrich Boehme
|
Svetlana Svetova
This paper describes the experiences of SAP and PROMT specialists with applying the PROMT English-Russian machine translation system, the PROMT Terminology Manager Tool for automatic terminology extraction, and the TRADOS TWB translation memory system to the automated process of translation of SAP content from English into Russian.
pdf
abs
Design and construction of a machine-tractable Japanese-Malay dictionary
Francis Bond
|
Ruhaida Binti Sulong
|
Takefumi Yamazaki
|
Kentaro Ogura
We present a method for combining two bilingual dictionaries to make a third, using one language as a pivot. In this case we combine a Japanese-English dictionary with a Malay-English dictionary, to produce a Japanese-Malay dictionary suitable for use in a machine translation system. Our method differs from previous methods in its use of semantic classes to rank translation equivalents: word pairs with compatible semantic classes are preferred to those with dissimilar classes. We also experiment with the use of two pivot languages. We have made a prototype dictionary of over 75,000 pairs.
pdf
abs
Machine translation - evolution not revolution
Jennifer A Brundage
The continuous trend towards globalization means that even the most modern of industries must constantly re-evaluate its strategies and adapt to new technologies. This involves not only living up to the demands set by the product life cycles but also finding solutions that satisfy additional internal needs. As a long-time supporter of MT and TM technology, SAP has shown that it can make productive use of competitive, commercial NLP products. As a first step, an integrated solution using TM together with MT was targeted. Now that different solutions have been implemented for two types of documentation, the focus is not merely on integrating other technologies (e.g. terminology mining or controlled language) but on providing a uniform solution for processing any type of text. This involves supporting the needs not only of technical writers and translators but of all employees in their multilingual working environment.
pdf
abs
A program for automatically selecting the best output from multiple machine translation engines
Chris Callison-Burch
|
Raymond S. Flournoy
This paper describes a program that automatically selects the best translation from a set of translations produced by multiple commercial machine translation engines. The program is simplified by assuming that the most fluent item in the set is the best translation. Fluency is determined using a trigram language model. Results are provided illustrating how well the program performs for human ranked data as compared to each of its constituent engines.
pdf
abs
The ISLE in the ocean. Transatlantic standards for multilingual lexicons (with an eye to machine translation)
Nicoletta Calzolari
|
Alessandro Lenci
|
Antonio Zampolli
|
Nuria Bel
|
Marta Villegas
|
Gregor Thurmair
The ISLE project is a continuation of the long standing EAGLES initiative, carried out under the Human Language Technology (HLT) programme in collaboration between American and European groups in the framework of the EU-US International Research Co-operation, supported by NSF and EC. In this paper we concentrate on the current position of the ISLE Computational Lexicon Working Group (CLWG), whose activities aim at defining a general schema for a multilingual lexical entry (MILE), as the basis for a standard framework for multilingual computational lexicons. The needs and features of existing Machine Translation systems provide the main reference points for the process of consensual definition of the MILE. The overall structure of the MILE will be illustrated with particular attention to some of the issues raised for multilingual lexicons by the need to express complex transfer conditions among translation equivalents.
pdf
abs
The Spanish<>Catalan machine translation system interNOSTRUM
R. Canals-Marote
|
A. Esteve-Guillén
|
A. Garrido-Alenda
|
M. I. Guardiola-Savall
|
A. Iturraspe-Bellver
|
S. Montserrat-Buendia
|
S. Ortiz-Rojas
|
H. Pastor-Pina
|
P. M. Pérez-Antón
|
M. L. Forcada
This paper describes interNOSTRUM, a Spanish<>Catalan machine translation system currently under development that achieves great speed through the use of finite-state technologies (so that it may be integrated with internet browsing) and a reasonable accuracy using an advanced morphological transfer strategy (to produce fast translation drafts ready for light postedition).
pdf
abs
Trial and error: an evaluation project on Japanese <> English MT output quality
Maki Darwin
This paper describes a small-scale but organized attempt to evaluate the output quality of several Japanese MT systems. The project also served as the first trial implementation of the in-house MT evaluation guidelines created in 2000. Since time was limited and the budget was not infinite, it was launched with the following compact components: five people, 300 source sentences per language pair, and 160 hours per evaluator. The quantitative results showed noteworthy phenomena. Although the test materials had been presented in a way that evaluators could not identify the performance of any particular system, the results were quite consistent. The scoring ratio that the two E-to-J evaluators employed was almost identical, while that of the J-to-E evaluators was similar. This indicates that high-quality output has universal appeal. Additionally, the evaluators noted that stronger systems, regardless of language pair, tended to be superior in source sentence analysis, target sentence arrangement, word choice, and lexicon entries whereas weaker systems tended to be inferior in these areas. As for language-pair comparison, the results indicate that English-to-Japanese systems may require more improvement than their counterparts, judging from the scores given and the number of unfound words recorded.
pdf
abs
Dictionary development workflow for MT: design and management
Mike Dillinger
An important part of the development of any machine translation system is the creation of lexical resources. We describe an analysis of the dictionary development workflow and supporting tools currently in use and under development at Logos. This workflow identifies the component processes of: setting goals, locating and acquiring lexical resources, transforming the resources to a common format, classifying and routing entries for special processing, importing entries, and verifying their adequacy in translation. Our approach has been to emphasize the tools necessary to support increased automation and use of resources available in electronic formats, in the context of a systematic workflow design.
pdf
abs
The importance of MT for the survival of minority languages: Spanish-Galician MT system
Inés Diz Gamallo
Our society is going through a lot of changes that are connected, basically, with information. Languages that are present in this challenge may survive, while languages that do not meet those changes may disappear. The Linguistics section of the Centro Ramón Piñeiro para a Investigación en Humanidades (CRP) is devoted to the development of basic language resources for Galician, in order to close the existing gap in computational resources and to make it possible for Galician to be present in the new information society. The aim of this paper is to explain how we have developed a Spanish-Galician Machine Translation system, what tools we have made use of, which difficulties we have found in our task and what the final results of the project are.
pdf
FUDR-based MT, head switching and the lexicon
Kurt Eberle
pdf
abs
Multilingual authoring through an artificial language
Marcos Franco Sabarís
|
José Luis Rojas Alonso
|
C. Dafonte
|
B. Arcay
Nowadays, there is a growing need for dissemination of documents in several languages. Machine translation is usually regarded as a possible solution for this, but so far it cannot provide acceptable translations of unedited texts. Several methods which involve human participation in computerized processes of translation have been proposed, but none has given really satisfactory results (except in some restricted contexts). In the UTL (Universal Translation Language) project, which we present here, we propose a new approach to multilingualization, based on the usage of an artificial unambiguous human language in which the human translator writes the source text, and then gives it to the machine to translate into other languages. The nature of this constructed language, which is optimized for this role, ensures the high quality of the results rendered by the computer.
pdf
abs
Evaluation method for determining groups of users who find MT “useful”
M. Fuji
|
N. Hatanaka
|
E. Ito
|
S. Kamei
|
H. Kumai
|
T. Sukehiro
|
T. Yoshimi
|
H. Isahara
This paper describes an evaluation experiment designed to determine groups of subjects who prefer reading MT outputs to reading the original text. Our approach can be applied to any language pair, but we will explain the methodology by taking English to Japanese translation as an example. In the case of E-J MT, it can be assumed that the main users are Japanese and that most of them have some knowledge of English. It is often the case with E-J MT systems that people who are comfortable with reading English do not find E-J MT outputs useful, and in many cases they would rather read the original English text. On the other hand, E-J MT outputs prove to be useful to those who find it hard to read the original English texts. We have used the reading comprehension part of the Test Of English for International Communication (TOEIC) to determine the threshold English ability level dividing these two user groups.
pdf
abs
Using machine learning for system-internal evaluation of transferred linguistic representations
Michael Gamon
|
Hisami Suzuki
|
Simon Corston-Oliver
We present an automated, system-internal evaluation technique for linguistic representations in a large-scale, multilingual MT system. We use machine-learned classifiers to distinguish linguistic representations generated by transfer in an MT context from representations that are produced by "native" analysis of the target language. In the MT scenario, convergence of the two is the desired result. Holding the feature set and the learning algorithm constant, the accuracy of the classifiers provides a measure of the overall difference between the two sets of linguistic representations: classifiers with higher accuracy correspond to more pronounced differences between representations. More importantly, the classifiers yield the basis for error-analysis by providing a ranking of the importance of linguistic features. The more salient a linguistic criterion is in discriminating transferred representations from "native" representations, the more work will be needed in order to get closer to the goal of producing native-like MT. We present results from using this approach on the Microsoft MT system and discuss its advantages and possible extensions.
pdf
abs
Search algorithms for statistical machine translation based on dynamic programming and pruning techniques
Ismael García-Varea
|
Francisco Casacuberta
The increasing interest in the statistical approach to Machine Translation is due to the development of effective algorithms for training the probabilistic models proposed so far. However, one of the open problems with statistical machine translation is the design of efficient algorithms for translating a given input string. For some interesting models, only (good) approximate solutions can be found. Recently, a dynamic programming-like algorithm for the IBM-Model 2 has been proposed which is based on an iterative process of refining solutions. A new dynamic programming-like algorithm is proposed here to deal with more complex IBM models (models 3 to 5). The computational cost of the algorithm is reduced by using an alignment-based pruning technique. Experimental results with the so-called “Tourist Task” are also presented.
pdf
abs
PolVerbNet: an experimental database for Polish verbs
Barbara Gawronska
Representing the semantics of verbs in a computational lexicon is known to pose a great number of difficulties. The Slavic languages are especially challenging in this respect because of the huge complexity of verbs, where the stems are combined with prefixes indicating aspect and Aktionsart. The current paper describes an approach to building PolVerbNet, a database for Polish verbs, considering the internal structure of the aspect-Aktionsart system. PolVerbNet is thus implemented in a larger English-Polish MT system, which incorporates WordNet. We report our translation procedure, and we evaluate and discuss the system’s performance.
pdf
abs
Derivational morphology to the rescue: how it can help resolve unfound words in MT
Claudia Gdaniec
|
Esmé Manandise
|
Michael C. McCord
Machine Translation (MT) systems that process unrestricted text should be able to deal with words that are not found in the MT lexicon. Without some kind of recognition, the parse may be incomplete, there is no transfer for the unfound word, and tests for transfers for surrounding words will often fail, resulting in poor translation. Interestingly, not much has been published on unfound-word guessing in the context of MT although such work has been going on for other applications. In our work on the IBM MT system, we implemented a far-reaching strategy for recognizing unfound words based on rules of word formation and for generating transfers. What distinguishes our approach from others is the use of semantic and syntactic features for both analysis and transfer, a scoring system to assign levels of confidence to possible word structures, and the creation of transfers in the transformation component. We also successfully applied rules of derivational morphological analysis to non-derived unfound words.
pdf
abs
Semi-automatic evaluation of the grammatical coverage of machine translation systems
A. Guessoum
|
R. Zantout
In this paper we present a methodology for automating the evaluation of the grammatical coverage of machine translation (MT) systems. The methodology is based on the importance of unfolded grammatical structures, which represent the most basic syntactic pattern for a sentence in a given language. A database of unfolded grammatical structures is built to evaluate the parser of any NLP or MT system. The evaluation results in an overall measure called the grammatical coverage. The results of implementing the above approach on three English-to-Arabic commercial MT systems are presented.
pdf
Large scale language independent generation using thematic hierarchies
Nizar Habash
|
Bonnie Dorr
pdf
abs
AGILE - a system for multilingual generation of technical instructions
Anthony Hartley
|
Donia Scott
|
John Bateman
|
Danail Dochev
This paper presents a multilingual Natural Language Generation system that produces technical instruction texts in Bulgarian, Czech and Russian. It generates several types of texts, common for software manuals, in two styles. We illustrate the system’s functionality with examples of its input and output behaviour. We discuss the criteria and procedures adopted for evaluating the system and summarise their results. The system embodies novel approaches to providing multilingual documentation, ranging from the re-use of a large-scale, broad coverage grammar of English in order to develop the lexico-grammatical resources necessary for the generation in the three target languages, through to the adoption of a ‘knowledge editing’ approach to specifying the desired content of the texts to be generated independently of the target languages in which those texts finally appear.
pdf
abs
Decision lists for determining adjective dependency in Japanese
Taiichi Hashimoto
|
Kosuke Nishidate
|
Kiyoaki Shirai
|
Takenobu Tokunaga
|
Hozumi Tanaka
In Japanese constructions of the form [N1 no Adj N2], the adjective Adj modifies either N1 or N2. Determining the semantic dependencies of adjectives in such phrases is an important task for machine translation. This paper describes a method for determining the adjective dependency in such constructions using decision lists, and inducing decision lists from training contexts with correct semantic dependencies and without. Based on evaluation, our method is able to determine adjective dependency with a precision of about 94%. We further analyze rules in the induced decision lists and examine effective features to determine the semantic dependencies of adjectives.
pdf
abs
ALT-J/C: a prototype Japanese-to-Chinese automatic language translation system
Minoru Hayashi
|
Setsuo Yamada
|
Akira Kataoka
|
Akio Yokoo
This paper describes a prototype Japanese-to-Chinese automatic language translation system. ALT-J/C (Automatic Language Translator - Japanese-to-Chinese) is a semantic transfer based system, which is based on ALT-J/E (a Japanese-to-English system), but written to cope with Unicode. It is also designed to cope with constructions specific to Chinese. This system has the potential to become a framework for multilingual translation systems.
pdf
PRIME: a system for multi-lingual patent retrieval
Shigeto Higuchi
|
Masatoshi Fukui
|
Atsushi Fujii
|
Tetsuya Ishikawa
pdf
abs
Machine translation using bilingual term entries extracted from parallel texts
Tatsuya Izuha
Patent summaries are machine-translated using bilingual term entries extracted from parallel texts for evaluation. The result shows that bilingual term entries extracted from 2,000 pairs of parallel texts which share a specific domain with the input texts introduce more improvements than a technical term dictionary with 38,000 entries which covers a broader domain. The result also shows that only 10 pairs of parallel texts found by similar document retrieval have comparable effects to the technical term dictionary, suggesting that parallel texts to be used do not need to be classified into fields prior to term extraction.
pdf
abs
Generation of named entities
Marisa Jiménez
In this paper we present an overview of an approach developed at Microsoft Research to generate strings for named entities such as places and dates. This approach uses abstract representations as input. We first provide an overview of our system to identify named entities in text. Next we present our approach to generate these entities from abstract representations, known as “logical forms” in our system. We then focus on the generation of place names in Spanish. We discuss our technique to generate Spanish place names from a logical form where language-specific features, such as word order, or capitalization conventions do not exist. We finally present the details of a study that we carried out to help us make sound linguistic decisions in the generation of place names in Spanish.
pdf
abs
Ontology-based word sense disambiguation using semi-automatically constructed ontology
Sin-Jae Kang
|
Jong-Hyeok Lee
This paper describes a method for disambiguating word senses by using a semi-automatically constructed ontology. The ontology stores rich semantic constraints among 1,110 concepts, and enables a natural language processing system to resolve semantic ambiguities by making inferences with the concept network of the ontology. In order to acquire a reasonably practical ontology in limited time and with less manpower, we extend the existing Kadokawa thesaurus by inserting additional semantic relations into its hierarchy, which are classified as case relations and other semantic relations. The former can be obtained by converting valency information and case frames from previously-built electronic dictionaries used in machine translation. The latter can be acquired from concept co-occurrence information, which is extracted automatically from large corpora. In our practical machine translation system, our word sense disambiguation method achieved a 9.2% improvement over methods which do not use an ontology for Korean translation.
pdf
abs
WASP-Bench: an MT lexicographers’ workstation supporting state-of-the-art lexical disambiguation
Adam Kilgarriff
|
David Tugwell
Most MT lexicography is devoted to developing rules of the kind, “in context C, translate source-language word S as target-language word T”. Very many such rules are required, producing them is laborious, and MT companies standardly spend large sums on it. We present the WASP-Bench, a lexicographer's workstation for the rapid and semi-automatic development of such rule-sets. The WASP-Bench makes use of a large source-language corpus and state-of-the-art techniques for Word Sense Disambiguation. We show that the WSD accuracy is on a par with the best results published to date, with the advantage that the WASP-Bench, unlike other high-performance systems, does not require a sense-disambiguated training corpus as input. The WASP-Bench is designed to fit readily with MT companies' working practices, as it may be used for as many or as few source language words as present disambiguation problems for a given target.
pdf
abs
A test suite for evaluation of English-to-Korean machine translation systems
Sungryong Koh
|
Jinee Maeng
|
Ji-Young Lee
|
Young-Sook Chae
|
Key-Sun Choi
This paper describes KORTERM’s test suite and its practicability. The test-sets are being constructed on the basis of a fine-grained classification of linguistic phenomena in order to evaluate the technical status of English-to-Korean MT systems systematically. The suite consists of about 5000 test-sets and is growing. Each test-set contains an English sentence, a model Korean translation, a linguistic phenomenon category, and a yes/no question about the linguistic phenomenon. Two commercial systems were evaluated with a yes/no test of prepared questions. Total accuracy rates of the two systems were different (50% vs. 66%). In addition, a comprehension test was carried out. We found that one system was more comprehensible than the other system. These results seem to show that our test suite is practicable.
pdf
abs
Integrating bilingual lexicons in a probabilistic translation assistant
Philippe Langlais
|
George Foster
|
Guy Lapalme
In this paper, we present a way to integrate bilingual lexicons into an operational probabilistic translation assistant (TransType). These lexicons could be any resource available to the translator (e.g. terminological lexicons) or any resource statistically derived from training material. We describe a bilingual lexicon acquisition process that we developed and we evaluate from a theoretical point of view its benefits to a translation completion task.
pdf
abs
SPANAM® and ENGSPAN® for Windows 2000: an MT pioneer keeps up with technology
Marjorie León
The Pan American Health Organization (PAHO) is proud to present the latest release of its fully automatic Spanish-to-English and English-to-Spanish machine translation systems. SPANAM and ENGSPAN have been ported to the 32-bit Windows platform. The bilingual graphical user interface provides easy access to all the features of the system. The translation engine can be accessed in three different ways: file translation from the desktop or word processing application, sentence translation from within the dictionary update module, or cut-and-paste translation using an ActiveX component. Any user can view all of PAHO's dictionary entries (words, expressions, and rules), and dictionary coders can add new entries of every type and modify all but a small number of protected records. The system is designed to be used by translation professionals in an institutional setting. Administrative utilities include job accounting, dictionary update log, terminology import and export, and dictionary merge. Users can view and print side-by-side listings of source and target texts, lists of not-found words, and the parse of any sentence.
pdf
abs
Combining tools to improve automatic translation
Terence Lewis
This paper takes a practical look at ways of combining language engineering tools to produce more accurate, “more human” automatic translations. Whilst specific products are discussed, the author believes that the methodology could be successfully implemented with different sets of tools.
pdf
abs
The Open Lexicon Interchange Format (OLIF) comes of age
Christian Lieske
|
Susan McCormick
|
Gregor Thurmair
This paper summarizes the current status of version 2 of the Open Lexicon Interchange Format (OLIF). As a natural extension of the OLIF prototype (OLIF version 1), version 2 has been modified with respect to content and formalization (e.g., it is now XML-compliant). These enhancements now make it possible to use OLIF in a variety of Natural Language Processing applications and general language technology environments (e.g., terminology management systems). At the time of writing, several industrial partners of the OLIF Consortium had already started work on implementing OLIF support. Details on OLIF can be found on www.olif.net.
pdf
abs
Utilizing agglutinative features in Japanese-Uighur machine translation
Muhtar Mahsut
|
Yasuhiro Ogawa
|
Kazue Sugino
|
Yasuyoshi Inagaki
Japanese and Uighur are agglutinative languages with many syntactic and morphological similarities. Roughly speaking, we can translate Japanese into Uighur sequentially by replacing Japanese words with the corresponding Uighur ones after morphological analysis. However, agglutinated suffixes must be translated carefully to produce a correct translation, because they play important roles in both languages. In this paper, we focus on these suffixes and propose a Japanese-Uighur machine translation system that utilizes the agglutinative features of both languages. To deal with the agglutinative features, we use derivational grammar, which makes the similarities between the two languages clearer. This makes the proposed system simple and systematic. We have implemented the machine translation system and evaluated how effectively it works.
pdf
abs
Evaluation of machine translation systems at CLS Corporate Language Services AG
Elisabeth Maier
|
Anthony Clarke
|
Hans-Udo Stadler
This paper describes the evaluation of Machine Translation (MT) systems for use in a large company. To take into account the specific requirements of such an environment, a pragmatic approach to the evaluation was developed. It consists of five steps ranging from a specification of the evaluation process to the integration of the chosen MT system in a given infrastructure. The process includes a specification of MT evaluation criteria relevant to systems which have to be employed for a large customer base. The paper also shows the results of such an evaluation study, which was recently carried out at CLS Corporate Language Services AG, where COMPRENDIUM has since been employed as the corporate MT system.
pdf
abs
Scaling the ISLE taxonomy: development of metrics for the multi-dimensional characterization of machine translation quality
Keith J. Miller
|
Michelle Vanni
The DARPA MT evaluations of the early 1990s, along with subsequent work on the MT Scale, and the International Standards for Language Engineering (ISLE) MT Evaluation framework represent two of the principal efforts in Machine Translation Evaluation (MTE) over the past decade. We describe a research program that builds on both of these efforts. This paper focuses on the selection of MT output features suggested in the ISLE framework, as well as the development of metrics for the features to be used in the study. We define each metric and describe the rationale for its development. We also discuss several of the finer points of the evaluation measures that arose as a result of verification of the measures against sample output texts from three machine translation systems.
pdf
abs
Pronominal anaphora resolution in KANTOO English-to-Spanish machine translation system
Teruko Mitamura
|
Eric Nyberg
|
Enrique Torrejon
|
David Svoboda
|
Kathryn Baker
We describe the automatic resolution of pronominal anaphora using KANT Controlled English (KCE) and the KANTOO English-to-Spanish MT system. Our algorithm is based on a robust, syntax-based approach that applies a set of restrictions and preferences to select the correct antecedent. We report a success rate of 89.6% on a training corpus with 289 anaphors, and 87.5% on held-out data containing 145 anaphors. Resolution of anaphors is important in translation, due to gender mismatches among languages; our approach translates anaphors to Spanish with 97.2% accuracy.
pdf
abs
Multiple argument ellipses resolution in Japanese
Shigeko Nariyama
Some Japanese clauses contain more than one argument ellipsis, and yet this fact has not been adequately accounted for in the current literature on ellipsis resolution, which predominantly focuses on resolving one ellipsis per sentence. This paper proposes a method using a "salient referent list", which identifies the referents of such multiple argument ellipses and also supports ellipsis resolution as a whole by considering contextual information.
pdf
abs
Morpho-syntactic analysis for reordering in statistical machine translation
Sonja Niessen
|
Hermann Ney
In the framework of statistical machine translation (SMT), correspondences between the words in the source and the target language are learned from bilingual corpora on the basis of so-called alignment models. Among other things these are meant to capture the differences in word order in different languages. In this paper we show that SMT can take advantage of the explicit introduction of some linguistic knowledge about the sentence structure in the languages under consideration. In contrast to previous publications dealing with the incorporation of morphological and syntactic information into SMT, we focus on two aspects of reordering for the language pair German and English, namely question inversion and detachable German verb prefixes. The results of systematic experiments are reported and demonstrate the applicability of the approach to both translation directions on a German-English corpus.
pdf
abs
Statistical multi-source translation
Franz Josef Och
|
Hermann Ney
We describe methods for translating a text given in multiple source languages into a single target language. The goal is to improve translation quality in applications where the ultimate goal is to translate the same document into many languages. We describe a statistical approach and two specific statistical models to deal with this problem. Our method is generally applicable as it is independent of specific models, languages or application domains. We evaluate the approach on a multilingual corpus covering all eleven official European Union languages that was collected automatically from the Internet. In various tests we show that these methods can significantly improve translation quality. As a side effect, we also compare the quality of statistical machine translation systems for many European languages in the same domain.
pdf
Implicit cues for explicit generation: using telicity as a cue for tense structure in a Chinese to English MT system
Mari Olsen
|
David Traum
|
Carol van Ess-Dykema
|
Amy Weinberg
pdf
abs
Translation knowledge recycling for related languages
Michael Paul
An increasing interest in multi-lingual translation systems demands a reconsideration of the development costs of machine translation engines for language pairs. This paper proposes an approach that reuses the existing translation knowledge resources of high-quality translation engines for translation into different, but related languages. The lexical information of the target representation is utilized to generate the corresponding translation in the related language by using a transfer dictionary for the mapping of words and a set of heuristic rules for the mapping of structural information. Experiments using a Japanese-English translation engine for the generation of German translations show a minor decrease of up to 5% in the acceptability of the German output compared with the English translation of unseen Japanese input.
pdf
abs
The Commission's MT system: today and tomorrow
Angeliki Petrits
|
Francine Braun-Chen
|
Jesús Manuel Martínez García
|
Cameron Ross
|
Rosemarie Sauer
|
Angelo Torquati
|
Alain Reichling
This paper presents a snapshot of how the Commission's MT system (EC SYSTRAN) is used today and a glimpse of how that picture will change tomorrow. It looks in turn at: the origins of the system; how it is accessed; who requests MT and why; how users can influence the quality of output; the Rapid Post-editing Service; and the latest usage statistics, which augur well for the future. The paper closes with a look at that future, touching on the move to a new computer platform and plans for new language pairs, concluding that after twenty-five years of development, MT has become an integral part of the Commission's working environment.
pdf
abs
Rapid assembly of a large-scale French-English MT system
Jessie Pinkham
|
Monica Corston-Oliver
|
Martine Smets
|
Martine Pettenaro
Past research has shown that the ideal MT system should be modular and devoid of language-pair-specific information in its design. We describe here the assembly of TAMTAM (Traduction Automatique Microsoft), the French-English research MT system under development at Microsoft, which was constructed from a combination of pre-existing rule-based components and automatically created components. At this stage, the system has not been adapted either computationally or linguistically to the French-English context, and yet it performs only slightly below the French-English Systran system in independent blind human evaluations.
pdf
abs
Ape: reducing the monkey business in post-editing by automating the task intelligently
Claus Povlsen
|
Annelise Bech
For a professional user of MT, quality, performance and cost efficiency are critical. It is therefore surprising that little attention, both in theory and in practice, has been given to the task of post-editing machine translated texts. This paper will focus on this important user aspect and demonstrate that substantial savings in time and effort can be achieved by implementing intelligent automatic tools. Our point of departure is the PaTrans MT-system, developed by CST and used by the Danish translation company Lingtech. An intelligent post-editing facility, Ape, has been developed and added to the system. We will outline and discuss this mechanism and its positive effects on the output. The underlying idea of the intelligent post-editing facility is to exploit the lexical and grammatical knowledge already present in the MT-system’s linguistic components. Conceptually, our approach is general, although its implementation remains system specific. Surveys of post-editor satisfaction and cost-efficiency improvements, as well as a quantitative, benchmark-based evaluation of the effect of Ape, demonstrate the success of the approach and encourage further development.
pdf
abs
Cognates alignment
António Ribeiro
|
Gaël Dias
|
Gabriel Lopes
|
João Mexia
Some authors (Simard et al.; Melamed; Danielsson & Mühlenbock) have suggested measures of similarity of words in different languages so as to find extra clues for alignment of parallel texts. Cognate words, like ‘Parliament’ and ‘Parlement’, in English and French respectively, provide extra anchors that help to improve the quality of the alignment. In this paper, we will extend an alignment algorithm proposed by Ribeiro et al. using typical contiguous and non-contiguous sequences of characters extracted using a statistically sound method (Dias et al.). With these typical sequences, we are able to find more reliable correspondence points and improve the alignment quality without resorting to heuristics to identify cognates.
pdf
abs
Achieving commercial-quality translation with example-based methods
Stephen Richardson
|
William Dolan
|
Arul Menezes
|
Jessie Pinkham
We describe MSR-MT, a large-scale example-based machine translation system under development for several language pairs. In a blind evaluation of the system trained on aligned English-Spanish technical prose, MSR-MT’s integration of rule-based parsers, example-based processing, and statistical techniques produces translations whose quality in this domain exceeds that of uncustomized commercial MT systems.
pdf
abs
Managing translation and localisation projects with LTC Organiser
Adriane Rinsche
Using an invented case study, the paper describes how multilingual translation projects can be managed efficiently with an enterprise resource management tool called “LTC Organiser”, which was developed specifically for the particular requirements of the language industry. The talk will describe the most important aspects of the integrated solution, such as client and supplier management, project and finance management, managing tools used in the translation process, reporting facilities, security and user management, directory management, sort and search facilities as well as web functionality available at several levels.
pdf
abs
A morphological analyser for machine translation based on finite-state transducers
Alberto Sanchis
|
David Picó
|
Joan Miquel del Val
|
Ferran Fabregat
|
Jesús Tomás
|
Moisés Pastor
|
Francisco Casacuberta
|
Enrique Vidal
A finite-state, rule-based morphological analyser is presented here, within the framework of the machine translation system TAVAL. This morphological analyser introduces specific features which are particularly useful for translation, such as the detection and morphological tagging of word groups that act as a single lexical unit for translation purposes. The case where words in one such group are not strictly contiguous is also covered. A brief description of the Spanish-to-Catalan and Catalan-to-Spanish translation system TAVAL is given in the paper.
pdf
abs
New generation Systran translation system
Jean Senellart
|
Péter Dienes
|
Tamás Váradi
In this paper, we present the design of the new generation Systran translation systems, currently utilized in the development of English-Hungarian, English-Polish, English-Arabic, French-Arabic, Hungarian-French and Polish-French language pairs. The new design, based on the traditional Systran machine translation expertise and the existing linguistic resources, addresses the following aspects: efficiency, modularity, declarativity, reusability, and maintainability. Technically, the new systems rely on intensive use of state-of-the-art finite automaton and formal grammar implementation. The finite automata provide the essential lookup facilities and the natural capacity of factorizing intuitive linguistic sets. Linguistically, we have introduced a full monolingual description of linguistic information and the concept of implicit transfer. Finally, we present some by-products that are directly derived from the new architecture: intuitive coding tools, spell checker and syntactic tagger.
pdf
abs
Resource alignment for machine translation or implicit transfer
Jean Senellart
|
Mirko Plitt
|
Christophe Bailly
|
Françoise Cardoso
In this article we present the concept of “implicit transfer” rules. We will show that they represent a valid compromise between huge direct transfer terminology lists and large sets of transfer rules, which are very complex to maintain. We present a concrete, real-life application of this concept in a customization project (TOLEDO project) concerning the automatic translation of Autodesk (ADSK) support pages. In this application, the alignment is moreover combined with a graph representation substituting linear dictionaries. We show how the concept could be extended to increase coverage of traditional translation dictionaries as well as to extract terminology from large existing multilingual corpora. We also introduce the concept of "alignment dictionary" which seems promising in its ability to extend the pragmatic limits of multilingual dictionary management.
pdf
CaptionEye/EK: a English-to-Korean caption translation system using the sentence pattern
Young-Ae Seo
|
Yoon-Hyung Roh
|
Ki-Young Lee
|
Sang-Kyu Park
pdf
abs
Collaborative translation environment on the Web
Sayori Shimohata
|
Mihoko Kitamura
|
Tatsuya Sukehiro
|
Toshiki Murata
This paper describes a comprehensive translation environment built on the Internet. This environment is designed not only to translate web pages but also to support translation work on the web. We first introduce the basic idea and implementation of this environment and then compare it to conventional machine translation (MT) systems available on the web and to translation memories.
pdf
abs
Sub-sentential exploitation of translation memories
Michel Simard
|
Philippe Langlais
Translation memory systems (TMS) are a family of computer tools whose purpose is to facilitate and encourage the re-use of existing translations. By searching a database of past translations, these systems can retrieve the translation of whole segments of text and propose them to the translator for re-use. However, the usefulness of existing TMS’s is limited by the nature of the text segments that they are able to put in correspondence, generally whole sentences. This article examines the potential of a type of system that is able to recover the translation of sub-sentential sequences of words.
pdf
abs
Using information technology to optimise translation processes at PricewaterhouseCoopers Madrid
Ross Smith
This paper describes how information technology is used by the Translation Department of PricewaterhouseCoopers in Madrid to optimise translation processes. It commences by describing a mechanism for handling workflow via the corporate network, designed to maximise speed and efficiency in translation requests and also to function as an automated record for administration purposes. This is followed by an appraisal of the CAT system used in the Translation Department, namely the Trados Workbench and related applications. Finally, an ongoing project for making MT (Systran) available to PwC employees around the world over the Firm's intranet is outlined.
pdf
abs
Precise measurement method of a speech translation system’s capability with a paired comparison method between the system and humans
Fumiaki Sugaya
|
Keiji Yasuda
|
Toshiyuki Takezawa
|
Seiichi Yamamoto
The main goal of the present paper is to propose new schemes for the overall evaluation of a speech translation system. These schemes are expected to support and improve the design of the target application system, and precisely determine its performance. Experiments are conducted on the Japanese-to-English speech translation system ATR-MATRIX, which was developed at ATR Interpreting Telecommunications Research Laboratories. In the proposed schemes, the system’s translations are compared with those of a native Japanese taking the Test of English for International Communication (TOEIC), which is used as a measure of one’s speech translation capability. Subjective and automatic comparisons are made and the results are compared. A regression analysis on the subjective results shows that the speech translation capability of ATR-MATRIX matches a Japanese person scoring around 500 on the TOEIC. The automatic comparisons also show promising results.
pdf
abs
Converting a bilingual dictionary into a bilingual knowledge bank based on the synchronous SSTC
Enya Kong Tang
|
Mosleh H. Al-Adhaileh
In this paper, we present an approach to constructing a huge Bilingual Knowledge Bank (BKB) from an English-Malay bilingual dictionary based on the idea of synchronous Structured String-Tree Correspondence (SSTC). The SSTC is a general structure that can associate an arbitrary tree structure with a string in a language, as desired by the annotator, to be the interpretation structure of the string. More importantly, it provides the facility to specify the correspondence between the string and the associated tree, which can be non-projective. With this structure, we are able to match linguistic units at different levels of the structure (i.e. define the correspondence between substrings in the sentence, nodes in the tree, subtrees in the tree and sub-correspondences in the SSTC). This flexibility makes synchronous SSTC very well suited for the construction of the Bilingual Knowledge Bank we need for the English-Malay MT application.
pdf
abs
Monotone statistical translation using word groups
Jesús Tomás
|
Francisco Casacuberta
A new system for statistical natural language translation for languages with similar grammar is introduced. Specifically, it can be used with Romance languages, such as French, Spanish or Catalan. The statistical translation uses two sources of information: a language model and a translation model. The language model used is a standard trigram model. A new approach is defined in the translation model. The two main properties of the translation model are: the translation probabilities are computed between groups of words and the alignment between those groups is monotone. That is, the order of the word groups in the source sentence is preserved in the target sentence. Once the translation model has been defined, we present an algorithm to infer its parameters from training samples. The translation process is carried out with an efficient algorithm based on stack-decoding. Finally, we present some translation results from Catalan to Spanish and compare our model with other conventional models.
pdf
abs
Translatability checker: a tool to help decide whether to use MT
Nancy Underwood
|
Bart Jongejan
This paper describes a tool designed to assess the machine translatability of English source texts by assigning a translatability index to both individual sentences and the text as a whole. The tool is designed to be both stand-alone and integratable into a suite of other tools which together help to improve the quality of professional translation in the preparatory phase of the translation workflow. Assessing translatability is an important element in ensuring the most efficient and cost effective use of current translation technology, and the tool must be able to quickly determine the translatability of a text without itself using too many resources. It is therefore based on rather simple tagging and pattern matching technologies which bring with them a certain level of indeterminacy. This potential disadvantage can, however, be offset by the fact that an annotated version of the text is simultaneously produced to allow the user to interpret the results of the checker.
pdf
abs
Sentence boundary detection: a comparison of paradigms for improving MT quality
Daniel J. Walker
|
David E. Clements
|
Maki Darwin
|
Jan W. Amtrup
The reliable detection of sentence boundaries in running text is one of the first important steps in preparing an input document for translation. Although this is often neglected, it is necessary to obtain a translation with a high degree of quality. In this paper, we present a comparison of different paradigms for the detection of sentence boundaries in written text. We compare three different approaches: Directly encoding the knowledge in a program, a rule-based system relying on regular expressions to describe boundaries, and a statistical maximum-entropy learning algorithm to obtain knowledge about boundaries. Using the statistical system, we obtain a recall of 98.14%, classifying boundaries of six types, and using a training corpus of under 10,000 sentences.
pdf
abs
An automatic evaluation method of translation quality using translation answer candidates queried from a parallel corpus
Keiji Yasuda
|
Fumiaki Sugaya
|
Toshiyuki Takezawa
|
Seiichi Yamamoto
|
Masuzo Yanagida
An automatic translation quality evaluation method is proposed. In the proposed method, a parallel corpus is used to query translation answer candidates. The translation output is evaluated by measuring the similarity between the translation output and translation answer candidates with DP matching. This method evaluates a language translation subsystem of the Japanese-to-English ATR-MATRIX speech translation system developed at ATR Interpreting Telecommunications Research Laboratories. Discriminant analysis is then carried out to examine the evaluation performance of the proposed method. Experimental results show the effectiveness of the proposed method. The discriminant ratio is 83.5% for 2-class discrimination between absolutely correct and less appropriate translations classified subjectively. Also discussed are issues of the proposed method when it is applied to speech translation systems which inevitably make recognition errors.
pdf
abs
An automatic evaluation method for machine translation using two-way MT
Shoichi Yokoyama
|
Hideki Kashioka
|
Akira Kumano
|
Masaki Matsudaira
|
Yoshiko Shirokizawa
|
Shuji Kodama
|
Terumasa Ehara
|
Shinichiro Miyazawa
|
Yuzo Murata
Evaluation of machine translation is one of the most important issues in this field. We have previously proposed a quantitative evaluation method for machine translation systems. Roughly, an example sentence in Japanese is machine translated into English and then back into Japanese using several systems, and the output Japanese sentences are compared with the original Japanese sentence with respect to word identification, correctness of modification, syntactic dependency, and parataxis. By calculating a score, we could quantitatively evaluate the English machine translation. However, the extraction of word identification and the other features was done by humans, which affects the correctness of the evaluation. To solve this problem, we developed an automatic evaluation system. We report the details of the system in this paper.
pdf
abs
Pre-processing of bilingual corpora for Mandarin-English EBMT
Ying Zhang
|
Ralf Brown
|
Robert Frederking
|
Alon Lavie
Pre-processing of bilingual corpora plays an important role in Example-Based Machine Translation (EBMT) and Statistical-Based Machine Translation (SBMT). For our Mandarin-English EBMT system, pre-processing includes segmentation for Mandarin, bracketing for English and building a statistical dictionary from the corpora. We used the Mandarin segmenter from the Linguistic Data Consortium (LDC). It uses dynamic programming with a frequency dictionary to segment the text. Although the frequency dictionary is large, it does not completely cover the corpora. In this paper, we describe the work we have done to improve the segmentation for Mandarin and the bracketing process for English to increase the length of English phrases. A statistical dictionary is built from the aligned bilingual corpus. It is used as feedback to segmentation and bracketing to re-segment / re-bracket the corpus. The process iterates several times to achieve better results. The final results of the corpus pre-processing are a segmented/bracketed aligned bilingual corpus and a statistical dictionary. We achieved positive results, increasing the average length of Chinese terms by about 60% and of English terms by 10%. The statistical dictionary gained about a 30% increase in coverage.