Proceedings of the 8th Conference of the Association for Machine Translation in the Americas: Research Papers
Spanish-to-Basque MultiEngine Machine Translation for a Restricted Domain
Iñaki Alegria | Arantza Casillas | Arantza Diaz de Ilarraza | Jon Igartua | Gorka Labaka | Mikel Lersundi | Aingeru Mayor | Kepa Sarasola
We present our initial strategy for Spanish-to-Basque MultiEngine Machine Translation, a language pair with very different structure and word order and with no huge parallel corpus available. This hybrid proposal is based on the combination of three different MT paradigms: Example-Based MT, Statistical MT and Rule- Based MT. We have evaluated the system, reporting automatic evaluation metrics for a corpus in a test domain. The first results obtained are encouraging.
This paper presents a method for exploiting document-level similarity between the documents in the training corpus for a corpus-driven (statistical or example-based) machine translation system and the input documents it must translate. The method is simple to implement, efficient (increases the translation time of an example-based system by only a few percent), and robust (still works even when the actual document boundaries in the input text are not known). Experiments on French-English and Arabic-English showed relative gains over the same system without using document-level similarity of up to 7.4% and 5.4%, respectively, on the BLEU metric.
This work extends phrase-based statistical MT (SMT) with shallow syntax dependencies. Two string-to-chunks translation models are proposed: a factored model, which augments phrase-based SMT with layered dependencies, and a joint model, that extends the phrase translation table with microtags, i.e. per-word projections of chunk labels. Both rely on n-gram models of target sequences with different granularity: single words, micro-tags, chunks. In particular, n-grams defined over syntactic chunks should model syntactic constraints coping with word-group movements. Experimental analysis and evaluation conducted on two popular Chinese-English tasks suggest that the shallow-syntax joint-translation model has potential to outperform state-of-the-art phrase-based translation, with a reasonable computational overhead.
We construct a discriminative, syntactic language model (LM) by using a latent support vector machine (SVM) to train an unlexicalized parser to judge sentences. That is, the parser is optimized so that correct sentences receive high-scoring trees, while incorrect sentences do not. Because of this alternative objective, the parser can be trained with only a part-of-speech dictionary and binary-labeled sentences. We follow the paradigm of discriminative language modeling with pseudo-negative examples (Okanohara and Tsujii, 2007), and demonstrate significant improvements in distinguishing real sentences from pseudo-negatives. We also investigate the related task of separating machine-translation (MT) outputs from reference translations, again showing large improvements. Finally, we test our LM in MT reranking, and investigate the language-modeling parser in the context of unsupervised parsing.
Convergence and simplification are two of the so-called universals in translation studies. The first one postulates that translated texts tend to be more similar than non-translated texts. The second one postulates that translated texts are simpler, easier-to-understand than non-translated ones. This paper discusses the results of a project which applies NLP techniques over comparable corpora of translated and non-translated texts in Spanish seeking to establish whether these two universals hold Corpas Pastor (2008).
Reordering is one source of error in statistical machine translation (SMT). This paper extends the study of the statistical machine reordering (SMR) approach, which uses the powerful techniques of the SMT systems to solve reordering problems. Here, the novelties yield in: (1) using the SMR approach in a SMT phrase-based system, (2) adding a feature function in the SMR step, and (3) analyzing the reordering hypotheses at several stages. Coherent improvements are reported in the TC-STAR task (Es/En) at a relatively low computational cost.
Source languages with complex word-formation rules present a challenge for statistical machine translation (SMT). In this paper, we take on three facets of this challenge: (1) common stems are fragmented into many different forms in training data, (2) rare and unknown words are frequent in test data, and (3) spelling variation creates additional sparseness problems. We present a novel, lightweight technique for dealing with this fragmentation, based on bilingual data, and we also present a combination of linguistic and statistical techniques for dealing with rare and unknown words. Taking these techniques together, we demonstrate +1.3 and +1.6 BLEU increases on top of strong baselines for Arabic-English machine translation.
To aid research and development in machine translation, we have produced a test collection for Japanese/English machine translation. To obtain a parallel corpus, we extracted patent documents for the same or related inventions published in Japan and the United States. Our test collection includes approximately 2000000 sentence pairs in Japanese and English, which were extracted automatically from our parallel corpus. These sentence pairs can be used to train and evaluate machine translation systems. Our test collection also includes search topics for cross-lingual patent retrieval, which can be used to evaluate the contribution of machine translation to retrieving patent documents across languages. This paper describes our test collection, methods for evaluating machine translation, and preliminary experiments.
We present an approach for online handling of Out-of-Vocabulary (OOV) terms in Urdu-English MT. Since Urdu is morphologically richer than English, we expect a large portion of the OOV terms to be Urdu morphological variations that are irrelevant to English. We describe an approach to automatically learn English-irrelevant (target-irrelevant) Urdu (source) morphological variation rules from standard phrase tables. These rules are learned in an unsupervised (or lightly supervised) manner by exploiting redundancy in Urdu and collocation with English translations. We use these rules to hypothesize in-vocabulary alternatives to the OOV terms. Our results show that we reduce the OOV rate from a standard baseline average of 2.6% to an average of 0.3% (or 89% relative decrease). We also increase the BLEU score by 0.45 (absolute) and 2.8% (relative) on a standard test set. A manual error analysis shows that 28% of handled OOV cases produce acceptable translations in context.
Phrase-based translation models are widely studied in statistical machine translation (SMT). However, the existing phrase-based translation models either can not deal with non-contiguous phrases or reorder phrases only by the rules without an effective reordering model. In this paper, we propose a generalized reordering model (GREM) for phrase-based statistical machine translation, which is not only able to capture the knowledge on the local and global reordering of phrases, but also is able to obtain some capabilities of phrasal generalization by using non-contiguous phrases. The experimental results have indicated that our model out- performs MEBTG (enhanced BTG with a maximum entropy-based reordering model) and HPTM (hierarchical phrase-based translation model) by improvement of 1.54% and 0.66% in BLEU.
This paper describes a new alignment method that extracts high quality multi-word alignments from sentence-aligned multilingual parallel corpora. The method can handle several languages at once. The phrase tables obtained by the method have a comparable accuracy and a higher coverage than those obtained by current methods. They are also obtained much faster.
We extend discriminative n-gram language modeling techniques originally proposed for automatic speech recognition to a statistical machine translation task. In this context, we propose a novel data selection method that leads to good models using a fraction of the training data. We carry out systematic experiments on several benchmark tests for Chinese to English translation using a hierarchical phrase-based machine translation system, and show that a discriminative language model significantly improves upon a state-of-the-art baseline. The experiments also highlight the benefits of our data selection method.
Most state-of-the-art statistical machine translation systems use log-linear models, which are defined in terms of hypothesis features and weights for those features. It is standard to tune the feature weights in order to maximize a translation quality metric, using held-out test sentences and their corresponding reference translations. However, obtaining reference translations is expensive. In our earlier work (Madnani et al., 2007), we introduced a new full-sentence paraphrase technique, based on English-to-English decoding with an MT system, and demonstrated that the resulting paraphrases can be used to cut the number of human reference translations needed in half. In this paper, we take the idea a step further, asking how far it is possible to get with just a single good reference translation for each item in the development set. Our analysis suggests that it is necessary to invest in four or more human translations in order to significantly improve on a single translation augmented by monolingual paraphrases.
This paper presents an attempt at developing a technique of acquiring translation pairs of technical terms with sufficiently high precision from parallel patent documents. The approach taken in the proposed technique is based on integrating the phrase translation table of a state-of-the-art statistical phrase-based machine translation model, and compositional translation generation based on an existing bilingual lexicon for human use. Our evaluation results clearly show that the agreement between the two individual techniques definitely contribute to improving precision of translation candidates. We then apply the Support Vector Machines (SVMs) to the task of automatically validating translation candidates in the phrase translation table. Experimental evaluation results again show that the SVMs based approach to translation candidates validation can contribute to improving the precision of translation candidates in the phrase translation table.
In this paper, we propose a probabilistic phrase alignment model based on dependency trees. This model is linguistically-motivated, using syntactic information during alignment process. The main advantage of this model is that the linguistic difference between source and target languages is successfully absorbed. It is composed of two models: Model1 is using content word translation probability and function word translation probability; Model2 uses dependency relation probability which is defined for a pair of positional relations on dependency trees. Relation probability acts as tree-based phrase reordering model. Since this model is directed, we combine two alignment results from bi-directional training by symmetrization heuristics to get definitive alignment. We conduct experiments on a Japanese-English corpus, and achieve reasonably high quality of alignment compared with word-based alignment model.
Most work in syntax-based machine translation has been in translation modeling, but there are many reasons why we may instead want to focus on the language model. We experiment with parsers as language models for machine translation in a simple translation model. This approach demands much more of the language models, allowing us to isolate their strengths and weaknesses. We find that unmodified parsers do not improve BLEU scores over ngram language models, and provide an analysis of their strengths and weaknesses.
This paper applies nonparametric statistical techniques to Machine Translation (MT) Evaluation using data from a large scale task-based study. In particular, the relationship between human task performance on an information extraction task with translated documents and well-known automated translation evaluation metric scores for those documents is studied. Findings from a correlation analysis of this connection are presented and contrasted with current strategies for evaluating translations. An extended analysis that involves a novel idea for assessing partial rank correlation within the presence of grouping factors is also discussed. This work exposes the limitations of descriptive statistics generally used in this area, mainly correlation analysis, when using automated metrics for assessments in task handling purposes.
State-of-the-art statistical machine translation systems use hypotheses from several maximum a posteriori inference steps, including word alignments and parse trees, to identify translational structure and estimate the parameters of translation models. While this approach leads to a modular pipeline of independently developed components, errors made in these “single-best” hypotheses can propagate to downstream estimation steps that treat these inputs as clean, trustworthy training data. In this work we integrate N-best alignments and parses by using a probability distribution over these alternatives to generate posterior fractional counts for use in downstream estimation. Using these fractional counts in a DOP-inspired syntax-based translation system, we show significant improvements in translation quality over a single-best trained baseline.
The continuous emergence of new technical terms and the difficulty of keeping up with neologism in parallel corpora deteriorate the performance of statistical machine translation (SMT) systems. This paper explores the use of morphological information to improve English-to-Chinese translation for technical terms. To reduce the morpheme-level translation ambiguity, we group the morphemes into morpheme phrases and propose the use of domain information for translation candidate selection. In order to find correspondences of morpheme phrases between the source and target languages, we propose an algorithm to mine morpheme phrase translation pairs from a bilingual lexicon. We also build a cascaded translation model that dynamically shifts translation units from phrase level to word and morpheme phrase levels. The experimental results show the significant improvements over the current phrase-based SMT systems.
We introduce a method for learning to find domain-specific translations for a given term on the Web. In our approach, the source term is transformed into an expanded query aimed at maximizing the probability of retrieving translations from a very large collection of mixed-code documents. The method involves automatically generating sets of target-language words from training data in specific domains, automatically selecting target words for effectiveness in retrieving documents containing the sought-after translations. At run time, the given term is transformed into an expanded query and submitted to a search engine, and ranked translations are extracted from the document snippets returned by the search engine. We present a prototype, TermMine, which applies the method to a Web search engine. Evaluations over a set of domains and terms show that TermMine outperforms state-of-the-art machine translation systems.
We propose a two-stage system for spoken language machine translation. In the first stage, the source sentence is parsed and paraphrased into an intermediate language which retains the words in the source language but follows the word order of the target language as much as feasible. This stage is mostly linguistic. In the second stage, a statistical MT is performed to translate the intermediate language into the target language. For the task of English-to-Mandarin translation, we achieved a 2.5 increase in BLEU score and a 45% decrease in GIZA-Alignment Crossover, on IWSLT-06 data. In a human evaluation of the sentences that differed, the two-stage system was preferred three times as often as the baseline.