Proceedings of Machine Translation Summit X: Papers

September 13-15
Phuket, Thailand

Extracting Representative Arguments from Dictionaries for Resolving Zero Pronouns
Shigeko Nariyama | Eric Nichols | Francis Bond | Takaaki Tanaka | Hiromi Nakaiwa

We propose a method to alleviate the problem of referential granularity in Japanese zero pronoun resolution. We use dictionary definition sentences to extract ‘representative’ arguments of predicative definition words; e.g. ‘arrest’ is likely to take police as its subject and criminal as its object. These representative arguments are far more informative than the generic ‘person’ provided by other valency dictionaries. They are extracted automatically using both shallow and deep parsing, for greater quality and quantity. Initial results are highly promising, yielding more specific information about selectional preferences. An architecture for zero pronoun resolution using these representative arguments is also described.

Selection of Entries for a Bilingual Dictionary from Aligned Translation Equivalents using Support Vector Machines
Takeshi Kutsumi | Takehiko Yoshimi | Katsunori Kotani | Ichiko Sata | Hitoshi Isahara

This paper argues that constructing a dictionary from bilingual pairs obtained from parallel corpora requires not only the correct alignment of two noun phrases but also a judgment of each pair's appropriateness as an entry. It specifically addresses the latter task, which has received little attention. We demonstrate a method for selecting suitable entries using Support Vector Machines, taking as features the common and differing parts between a current translation and a new translation. Through experiments, we examine how selection performance is affected by four ways of representing these common and differing parts: morphemes, parts of speech, semantic markers, and upper-level semantic markers. We also used n-grams of the common and differing parts for the above four kinds of features. The experimental results show that representation by morphemes performed best, with an F-measure of 0.803.

Subword Clusters as Light-Weight Interlingua for Multilingual Document Retrieval
Udo Hahn | Kornel Marko | Stefan Schulz

We introduce a light-weight interlingua for a cross-language document retrieval system in the medical domain. It is composed of equivalence classes of semantically primitive, language-specific subwords which are clustered by interlingual and intralingual synonymy. Each subword cluster represents a basic conceptual entity of the language-independent interlingua. Documents, as well as queries, are mapped to this interlingua level on which retrieval operations are performed. Evaluation experiments reveal that this interlingua-based retrieval model outperforms a direct translation approach.

Example-based Machine Translation Based on TSC and Statistical Generation
Zhanyi Liu | Haifeng Wang | Hua Wu

This paper proposes a novel Example-Based Machine Translation (EBMT) method based on Tree String Correspondence (TSC) and statistical generation. In this method, translation examples are represented as TSCs, each consisting of three parts: a parse tree in the source language, a string in the target language, and the correspondences between the leaf nodes of the source-language tree and the substrings of the target-language string. During translation, the input sentence is first parsed into a tree. Then the TSC forest that best matches the parse tree is retrieved. The translation is generated by using a statistical generation model to combine the target-language strings in the TSCs. The generation model consists of three parts: semantic similarity between words, word translation probability, and a target language model. Based on this method, we built an English-to-Chinese Machine Translation (ECMT) system. Experimental results indicate that the performance of our system is comparable with that of state-of-the-art commercial ECMT systems.

Learning Translations from Monolingual Corpora
Hirokazu Suzuki | Akira Kumano

This paper proposes a method by which a machine translation (MT) system automatically selects and learns translation words that suit the user's tastes or document fields, using a monolingual corpus manually compiled by the user, in order to achieve high-quality translation. We have constructed a system based on this method and carried out experiments to verify the validity of the proposed approach.

A Practical Memory-based Approach for Improving Accuracy of MT
Sitthaa Phaholphinyo | Teerapong Modhiran | Nattapol Kritsuthikul | Thepchai Supnithi

The Rule-Based Machine Translation (RBMT) approach [1] is a major approach in MT research. It needs linguistic knowledge to create appropriate translation rules. However, we cannot add all linguistic rules to the system, because adding new rules may conflict with the old ones. We therefore propose a memory-based approach that improves translation quality without modifying the existing linguistic rules. This paper analyses the translation problems and shows how the approach works.

Machine Translation on the Medical Domain: The Role of BLEU/NIST and METEOR in a Controlled Vocabulary Setting
Andre Castilla | Alice Bacic | Sergio Furuie

The main objective of our project is to extract clinical information from thoracic radiology reports in Portuguese using Machine Translation (MT) and cross-language information retrieval techniques. To accomplish this task we need to evaluate the machine translation system involved. Since human MT evaluation is costly and time-consuming, we opted for automated methods. We propose an evaluation methodology using the NIST/BLEU and METEOR algorithms and a controlled medical vocabulary, the Unified Medical Language System (UMLS). A set of documents is generated; these are either machine translated or used as evaluation references. This methodology is used to evaluate the performance of our specialized Portuguese-English translation dictionary. We demonstrate a significant improvement in evaluation scores after incorporating the dictionary into a commercial MT system. The use of UMLS and automated MT evaluation techniques can aid the development of applications in the medical domain. Our methodology can also be used in general MT research for evaluation and testing purposes.

A Report on the Machine Translation Market in Japan
Setsuo Yamada | Syuuji Kodama | Taeko Matsuoka | Hiroshi Araki | Yoshiaki Murakami | Osamu Takano | Yoshiyuki Sakamoto

In conducting market research on machine translation, we have continuously surveyed sales volumes in order to determine the scale of the machine translation market in Japan, and we have officially announced these figures every year. Furthermore, since 2003, we have administered questionnaires regarding Web translation.

Document Authoring the Bible for Minority Language Translation
Stephen Beale | Sergei Nirenburg | Marjorie McShane | Tod Allman

This paper describes one approach to document authoring and natural language generation being pursued by the Summer Institute of Linguistics in cooperation with the University of Maryland, Baltimore County. We will describe the tools provided for document authoring, including a glimpse at the underlying controlled language and the semantic representation of the textual meaning. We will also introduce The Bible Translator’s Assistant© (TBTA), which is used to elicit and enter target language data as well as perform the actual text generation process. We conclude with a discussion of the usefulness of this paradigm from a Bible translation perspective and suggest several ways in which this work will benefit the field of computational linguistics.

Building an Annotated Japanese-Chinese Parallel Corpus – A Part of NICT Multilingual Corpora
Yujie Zhang | Kiyotaka Uchimoto | Qing Ma | Hitoshi Isahara

We are constructing a Japanese-Chinese parallel corpus as part of the NICT Multilingual Corpora. The corpus covers the general domain, is large in scale (about 40,000 sentence pairs of long sentences), and is annotated with detailed, high-quality information. To the best of our knowledge, this will be the first annotated Japanese-Chinese parallel corpus in the world. We created the corpus by selecting Japanese sentences from the Mainichi Newspaper and manually translating them into Chinese. We then annotated the corpus with morphological and syntactic structures and with alignments at the word and phrase levels. This paper describes the specifications for human translation and detailed annotation, and the tools we developed in the project. We also share the experience we gained and the points that required special attention, for the benefit of other researchers engaged in corpus construction.

Europarl: A Parallel Corpus for Statistical Machine Translation
Philipp Koehn

We collected a corpus of parallel text in 11 languages from the proceedings of the European Parliament, which are published on the web. This corpus has found widespread use in the NLP community. Here, we focus on its acquisition and its application as training data for statistical machine translation (SMT). We trained SMT systems for 110 language pairs, which reveal interesting clues about the challenges ahead.

Construction of Thai WordNet Lexical Database from Machine Readable Dictionaries
Patanakul Sathapornrungkij | Charnyote Pluempitiwiriyawej

We describe a method of constructing the Thai WordNet, a lexical database in which Thai words are organized by their meanings. Our methodology takes the WordNet and LEXiTRON machine-readable dictionaries into account: the semantic relations between English words in WordNet and the translation relations between English and Thai words in LEXiTRON are both considered. The methodology is implemented in the WordNet Builder system. This paper provides an overview of the WordNet Builder architecture and reports on some of our experience with the prototype implementation.

Augmentation of Modality Translation Rules in Korean-to-English Machine Translation by Rule Learning
Seong-Bae Park | Jeong-Woo Son | Yoon-Shik Tae

Semantically Relatable Sets: Building Blocks for Representing Semantics
Rajat Kumar Mohanty | Anupama Dutta | Pushpak Bhattacharyya

Maximum Entropy Models for Realization Ranking
Erik Velldal | Stephan Oepen

In this paper we describe and evaluate different statistical models for the task of realization ranking, i.e. the problem of discriminating between competing surface realizations generated for a given input semantics. Three models are trained and tested: an n-gram language model, a discriminative maximum entropy model using structural features, and a combination of the two. Our realization component forms part of a larger, hybrid MT system.

Evaluation of Machine Translation with Predictive Metrics beyond BLEU/NIST: CESTA Evaluation Campaign # 1
Sylvain Surcin | Olivier Hamon | Antony Hartley | Martin Rajman | Andrei Popescu-Belis | Widad Mustafa El Hadi | Ismaïl Timimi | Marianne Dabbadie | Khalid Choukri

In this paper, we report on the results of a full-size evaluation campaign of various MT systems. This campaign is novel compared to the classical DARPA/NIST MT evaluation campaigns in the sense that French is the target language, and that it includes an experiment of meta-evaluation of various metrics claiming to better predict different attributes of translation quality. We first describe the campaign, its context, its protocol and the data we used. Then we summarise the results obtained by the participating systems and discuss the meta-evaluation of the metrics used.

Inter-rater Agreement Measures, and the Refinement of Metrics in the PLATO MT Evaluation Paradigm
Keith J. Miller | Michelle Vanni

A Multi-aligner for Japanese-Chinese Parallel Corpora
Yujie Zhang | Qun Liu | Qing Ma | Hitoshi Isahara

Automatic word alignment is an important technology for extracting translation knowledge from parallel corpora. However, automatic techniques cannot resolve this problem completely because of variances in translations. We therefore need to investigate the performance potential of automatic word alignment and then decide how to suitably apply it. In this paper we first propose a lexical knowledge-based approach to word alignment on a Japanese-Chinese corpus. Then we evaluate the performance of the proposed approach on the corpus. At the same time we also apply a statistics-based approach, the well-known toolkit GIZA++, to the same test data. Through comparison of the performances of the two approaches, we propose a multi-aligner, exploiting the lexical knowledge-based aligner and the statistics-based aligner at the same time. Quantitative results confirmed the effectiveness of the multi-aligner.
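
The combination idea in this abstract can be illustrated with a small sketch. This is a hypothetical illustration, not the authors' implementation: the function name, the (source index, target index) link format, and the particular combination heuristic are all assumptions.

```python
def combine_alignments(lexical_links, statistical_links):
    """Combine the link sets produced by two word aligners.
    Links proposed by both aligners are kept (high precision);
    lexicon-based links for still-unaligned source words are then
    added back to recover recall. A link is a (src, tgt) index pair."""
    lexical = set(lexical_links)
    statistical = set(statistical_links)
    combined = lexical & statistical  # links both aligners agree on
    aligned_sources = {src for src, _ in combined}
    # trust the lexical aligner for source words left uncovered
    combined |= {link for link in lexical if link[0] not in aligned_sources}
    return sorted(combined)
```

In this sketch the lexical aligner is given the benefit of the doubt for uncovered source words, mirroring the idea of exploiting the lexical knowledge-based and statistics-based aligners at the same time.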

Thot: a Toolkit To Train Phrase-based Statistical Translation Models
Daniel Ortiz-Martínez | Ismael García-Varea | Francisco Casacuberta

In this paper, we present the Thot toolkit, a set of tools for training phrase-based models for statistical machine translation, which is publicly available as open-source software. The toolkit obtains phrase-based models from word-based alignment models; to our knowledge, this functionality has not been offered by any other publicly available toolkit. Thot also implements a new method for estimating phrase models, which yields more complete phrase models than those described in the literature, including a segmentation-length submodel. The toolkit's output can be produced in different formats for use by other statistical machine translation tools such as Pharaoh, a beam-search decoder for phrase-based alignment models, which we used to perform translation experiments with the generated models. Additionally, the Thot toolkit can be used to obtain the best phrase-level alignment for a sentence pair.

Machine Translation of Bi-lingual Hindi-English (Hinglish) Text
R. Mahesh K. Sinha | Anil Thakur

In the present communication-based society, no natural language seems to have been left untouched by the trend of code-mixing. For different communicative purposes, a language uses linguistic codes from other languages. This gives rise to a mixed language that is neither entirely the host language nor the foreign language. Such a mixed language poses a new challenge for machine translation: it is necessary to identify the “foreign” elements in the source language and process them accordingly. The foreign elements may not appear in their original form and may be morphologically transformed according to the host language. Further, in a complex sentence, one clause or utterance may be in the host language while another is in the foreign language. Code-mixing of Hindi and English, with Hindi as the host language, is a common phenomenon in day-to-day language use in Indian metropolises. The scenario is so common that people have started considering it a distinct variety altogether, calling it Hinglish. In this paper, we present a mechanism for machine translation of Hinglish into pure (standard) Hindi and pure English forms.

Dealing with Replicative Words in Hindi for Machine Translation to English
R. Mahesh K. Sinha | Anil Thakur

The South Asian languages are well known for their replicative words: words of almost all grammatical categories can occur in reduplicated form. Hindi is one such language, quite rich in the various types of replicative words in its lexicon. Traditional grammars and some research works have discussed the topic to some extent, particularly from the point of view of description and classification. However, a detailed study of the topic becomes significant in view of the complexity involved in handling such replicative words in natural language processing, particularly for machine translation. In this paper, we discuss different types of replicative words in Hindi and their syntactic and semantic characteristics, and formulate rules and strategies to identify their multiple functions and their mapping patterns in English for machine translation from Hindi to English.

SEM-I Rational MT: Enriching Deep Grammars with a Semantic Interface for Scalable Machine Translation
Dan Flickinger | Jan Tore Lønning | Helge Dyvik | Stephan Oepen | Francis Bond

In the LOGON machine translation system, where semantic transfer using Minimal Recursion Semantics is being developed in conjunction with two existing broad-coverage grammars of Norwegian and English, we motivate the use of a grammar-specific semantic interface (SEM-I) to facilitate the construction and maintenance of a scalable translation engine. The SEM-I is a theoretically grounded component of each grammar, capturing several classes of lexical regularities while also serving the crucial engineering function of supplying a reliable and complete specification of the elementary predications the grammar can realize. We make extensive use of underspecification and type hierarchies to maximize generality and precision.

DEMOCRAT: Deciding between Multiple Outputs Created by Automatic Translation
Menno van Zaanen | Harold Somers

Customizing a Korean-English MT System for Patent Translation
Munpyo Hong | Young-Gil Kim | Chang-Hyun Kim | Seong-Il Yang | Young-Ae Seo | Cheol Ryu | Sang-Kyu Park

This paper addresses the customization of a Korean-English MT system for patent translation. The major customization steps include terminology construction, linguistic study, and modification of the existing analysis and generation modules. To our knowledge, this is the first substantial large-scale customization effort of an MT system for Korean and English. This research was performed under the auspices of the MIC (Ministry of Information and Communication) of the Korean government. A prototype patent MT system for the electronics domain has been installed and is being tested at the Korean Intellectual Property Office.

Practicing Controlled Language through a Help System integrated into the Medical Speech Translation System (MedSLT)
Marianne Starlander | Pierrette Bouillon | Nikos Chatzichrisafis | Marianne Santaholma | Manny Rayner | Beth Ann Hockey | Hitoshi Isahara | Kyoko Kanzaki | Yukie Nakao

In this paper, we present evidence that providing users of a speech-to-speech translation system for emergency diagnosis (MedSLT) with a tool that helps them learn the coverage greatly improves their success in using the system. In MedSLT, the system uses a grammar-based recogniser that provides more predictable results to the translation component. The help module aims to address the lack of robustness inherent in this type of approach. It takes as input the result of a robust statistical recogniser, which performs better on out-of-coverage data, and produces a list of in-coverage example sentences. These examples are selected from a predefined list using a heuristic that prioritises sentences maximising the number of N-grams shared with those extracted from the recognition result.

The FAME Speech-to-Speech Translation System for Catalan, English, and Spanish
Victoria Arranz | Elisabet Comelles | David Farwell

This paper describes the evaluation of the FAME interlingua-based speech-to-speech translation system for Catalan, English and Spanish, an extension of the existing NESPOLE! system, which translates between English, French, German and Italian. After a brief introduction, we describe the system architecture and the components of the translation module, including the speech recognizer, the analysis chain, the generation chain and the speech synthesizer. We then explain the interlingua formalism used, called Interchange Format (IF), present the results obtained from the evaluation of the system, and describe the three types of evaluation performed. We also compare the results of our system with those obtained by a stochastic translator developed independently over the course of the FAME project. Finally, we conclude with future work.

Assessing Degradation of Spoken Language Translation by Measuring Speech Recognizer’s Output against Non-native Speakers’ Listening Capabilities
Toshiyuki Takezawa | Keiji Yasuda | Masahide Mizushima | Genichiro Kikui

Integration of SYSTRAN MT Systems in an Open Workflow
Mats Attnäs | Pierre Senellart | Jean Senellart

Probabilistic Model for Example-based Machine Translation
Eiji Aramaki | Sadao Kurohashi | Hideki Kashioka | Naoto Kato

Example-based machine translation (EBMT) systems have so far relied on heuristic measures for retrieving translation examples. Such heuristic measures take time to adjust and can obscure the underlying algorithm. This paper presents a probabilistic model for EBMT. Under the proposed model, the system searches for the combination of translation examples with the highest probability. The proposed model clearly formalizes the EBMT process. In addition, the model naturally incorporates the contextual similarity of translation examples. The experimental results demonstrate that the proposed model achieves slightly better translation quality than state-of-the-art EBMT systems.

Low Cost Portability for Statistical Machine Translation based on N-gram Coverage
Matthias Eck | Stephan Vogel | Alex Waibel

Statistical machine translation relies heavily on the available training data. However, in some cases it is necessary to limit the amount of training data that can be created for, or actually used by, the systems. To solve this problem, we introduce a weighting scheme that selects the more informative sentences first. The selection is based on the previously unseen n-grams the sentences contain, and it allows us to sort the sentences according to their estimated importance. After sorting, we can construct smaller training corpora, and we are able to demonstrate that systems trained on much less training data show very competitive performance compared to baseline systems using all available training data.
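
The selection described in this abstract can be illustrated with a greedy sketch. This is a hypothetical illustration (the paper's actual weighting scheme may differ): each step picks the sentence that contributes the most previously unseen n-grams.

```python
def select_sentences(sentences, max_n=3):
    """Greedily order sentences so that each step picks the one
    contributing the largest number of previously unseen n-grams
    (n = 1 .. max_n). Earlier sentences are the most informative."""
    def ngrams(sent):
        toks = sent.split()
        return {tuple(toks[i:i + n])
                for n in range(1, max_n + 1)
                for i in range(len(toks) - n + 1)}

    seen, ordered = set(), []
    remaining = list(sentences)
    while remaining:
        best = max(remaining, key=lambda s: len(ngrams(s) - seen))
        remaining.remove(best)
        seen |= ngrams(best)
        ordered.append(best)
    return ordered
```

Truncating the ordered list then yields the smaller training corpora whose competitive performance the abstract reports.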

Automatic Rating of Machine Translatability
Kiyotaka Uchimoto | Naoko Hayashida | Toru Ishida | Hitoshi Isahara

We describe a method for automatically rating the machine translatability of a sentence for various machine translation (MT) systems. The method requires that the MT system be able to translate bidirectionally between the source and target languages, but it does not require reference translations, as is usual for automatic MT evaluation. By applying this method to every component of a sentence in a given source language, we can automatically identify the machine-translatable and non-machine-translatable parts of the sentence for a particular MT system. We show that the parts of a sentence automatically identified as non-machine-translatable provide useful information for paraphrasing or revising the sentence in the source language, thus improving the quality of the final translation.

Learning Phrase Translation using Level of Detail Approach
Hendra Setiawan | Haizhou Li | Min Zhang

We propose a simplified Level Of Detail (LOD) algorithm to learn phrase translations for statistical machine translation. In particular, LOD learns unknown phrase translations from parallel texts without linguistic knowledge. LOD uses an agglomerative method to attack the combinatorial explosion that results when generating candidate phrase translations. Although LOD was previously proposed by Setiawan et al. (2005), we improve the original algorithm in two ways: simplifying the algorithm and using a simpler translation model. Experimental results show that our algorithm provides comparable performance while demonstrating a significant reduction in computation time.

PESA: Phrase Pair Extraction as Sentence Splitting
Stephan Vogel

Most statistical machine translation systems use phrase-to-phrase translations to capture local context information, leading to better lexical choice and more reliable local reordering. The quality of the phrase alignment is crucial to the quality of the resulting translations. Here, we propose a new phrase alignment method that is not based on the Viterbi path of word alignment models. Phrase alignment is viewed as a sentence-splitting task: for a given splitting of the source sentence (source phrase, left segment, right segment), find the splitting of the target sentence that optimizes the overall sentence alignment probability. Experiments on different translation tasks show that this phrase alignment method leads to highly competitive translation results.

Statistical Machine Translation of European Parliamentary Speeches
David Vilar | Evgeny Matusov | Sasa Hasan | Richard Zens | Hermann Ney

In this paper we present the ongoing work at RWTH Aachen University for building a speech-to-speech translation system within the TC-Star project. The corpus we work on consists of parliamentary speeches held in the European Plenary Sessions. To our knowledge, this is the first project that focuses on speech-to-speech translation applied to a real-life task. We describe the statistical approach used in the development of our system and analyze its performance under different conditions: dealing with syntactically correct input, dealing with the exact transcription of speech and dealing with the (noisy) output of an automatic speech recognition system. Experimental results show that our system is able to perform adequately in each of these conditions.

Practical Approach to Syntax-based Statistical Machine Translation
Kenji Imamura | Hideo Okuma | Eiichiro Sumita

This paper presents a practical approach to statistical machine translation (SMT) based on syntactic transfer. Conventional phrase-based SMT generates an output sentence by combining phrase (multi-word sequence) translation and phrase reordering without syntax. SMT based on tree-to-tree mapping, on the other hand, involves syntactic information but has remained largely theoretical, so its behaviour is unclear from the viewpoint of a practical system. The SMT approach proposed in this paper translates phrases with hierarchical reordering based on the bilingual parse tree. In our experiments, the best translations were obtained when both phrases and syntactic information were used in the translation process.

Bilingual N-gram Statistical Machine Translation
José B. Mariño | Rafael E. Banchs | Josep M. Crego | Adrià de Gispert | Patrik Lambert | José A. R. Fonollosa | Marta Ruiz

This paper describes a statistical machine translation system that uses a translation model based on bilingual n-grams. When this translation model is log-linearly combined with four specific feature functions, state-of-the-art translations are achieved for Spanish-to-English and English-to-Spanish translation tasks. Specific results obtained on the EPPS (European Parliament Plenary Sessions) data are presented and discussed. Finally, future research issues are outlined.

Reordered Search, and Tuple Unfolding for Ngram-based SMT
Josep M. Crego | José B. Mariño | Adrià de Gispert

In statistical machine translation, the use of reordering for certain language pairs can produce a significant improvement in translation accuracy. However, the search problem is known to be NP-hard when arbitrary reorderings are allowed. This paper addresses the question of reordering for an Ngram-based SMT approach through two complementary strategies, namely reordered search and tuple unfolding. These strategies interact to improve translation quality in a Chinese-to-English task. On the one hand, we allow the Ngram-based decoder (MARIE) to perform a reordered search over the source sentence, while combining a translation-tuple Ngram model, a target language model, a word penalty and a word distance model. Interestingly, even though the translation units are learnt sequentially, the reordered search produces improved translations. On the other hand, we allow a modification of the translation units that unfolds the tuples, so that shorter units are learnt from a new parallel corpus in which the source sentences are reordered according to the target language. This tuple unfolding technique reduces data sparseness and, when combined with the reordered search, further boosts translation performance. Translation accuracy and efficiency results are reported for the IWSLT 2004 Chinese-to-English task.

Improving Online Machine Translation Systems
Bart Mellebeek | Anna Khasin | Karolina Owczarzak | Josef Van Genabith | Andy Way

In (Mellebeek et al., 2005), we proposed the design, implementation and evaluation of a novel, modular approach to boosting the translation performance of existing, wide-coverage, freely available machine translation systems, based on reliable and fast automatic decomposition of the translation input and corresponding composition of the translation output. Despite showing some initial promise, our method did not improve on the baseline Logomedia and Systran MT systems. In this paper, we improve on the algorithm presented in (Mellebeek et al., 2005) and, on the same test data, show increased scores for a range of automatic evaluation metrics. Our algorithm now outperforms Logomedia, obtains similar results to SDL and falls tantalisingly short of the performance achieved by Systran.

The Effect of Adding Rules into the Rule-based MT System
Zhu Jiang | Wang Haifeng

This paper investigates the relationship between the number of rules and the performance of a rule-based machine translation system. We keep adding rules to the system and observe the successive changes in translation quality. Evaluations of translation quality reveal that the more rules there are, the better the translation quality. A linear regression analysis shows a positive linear relationship between translation quality and the number of rules. We use this linear model to make predictions and test them with newly developed rules. Experimental results indicate that the linear model effectively predicts the performance the rule-based machine translation system may achieve as more rules are added.
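
The regression step can be sketched as follows; the data points below are invented for illustration and do not come from the paper.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b, relating the
    number of rules (x) to a translation-quality score (y)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

# Hypothetical observations: (rule count, quality score)
rules = [1000, 2000, 3000, 4000]
quality = [0.20, 0.25, 0.30, 0.35]
a, b = fit_line(rules, quality)
predicted = a * 5000 + b  # extrapolated quality with 5000 rules
```

Extrapolating along the fitted line is exactly the kind of prediction the paper tests against newly developed rules.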

Cognates and Word Alignment in Bitexts
Grzegorz Kondrak

We evaluate several orthographic word similarity measures in the context of bitext word alignment. We investigate the relationship between the length of the words and the length of their longest common subsequence. We present an alternative to the longest common subsequence ratio (LCSR), a widely-used orthographic word similarity measure. Experiments involving identification of cognates in bitexts suggest that the alternative method outperforms LCSR. Our results also indicate that alignment links can be used as a substitute for cognates for the purpose of evaluating word similarity measures.
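
The LCSR baseline discussed in this abstract can be sketched directly; this follows the standard definition of the measure, with a textbook dynamic-programming LCS.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings,
    via the standard dynamic program (two rolling rows)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0]
        for j, ch_b in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ch_a == ch_b
                       else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def lcsr(a, b):
    """Longest common subsequence ratio: LCS length divided by
    the length of the longer word."""
    return lcs_length(a, b) / max(len(a), len(b))
```

For example, `lcsr("colour", "color")` is 5/6, high enough for the pair to be flagged as likely cognates under a typical threshold.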

Boosting Statistical Word Alignment
Hua Wu | Haifeng Wang

This paper proposes an approach to improving statistical word alignment with the boosting method. Applying boosting to word alignment requires solving two problems. The first is how to build the reference set for the training data. We propose an approach that automatically builds a pseudo reference set, which avoids manual annotation of the training set. The second is how to calculate the error rate of each individual word aligner. We solve this by calculating the error rate on a manually annotated held-out data set instead of on the entire training set. In addition, the final ensemble takes into account the weights of the alignment links produced by the individual word aligners. Experimental results indicate that the proposed boosting method performs much better than the original word aligner, achieving a large reduction in error rate.