Proceedings of Machine Translation Summit X: Posters
This paper presents TTPlayer, a trace file analysis tool used to develop TransType, an innovative computer-aided translation system. We first discuss the context of the project and the design of the tracing tool. We show how it was used for discovering interesting patterns of use as well to guide further developments in the TT2 project.
In Thai language, the word boundary is not explicitly clear, therefore, word segmentation is needed to determine word boundary in Thai sentences. Many applications of Thai Language Processing require the word segmentation. Several approaches of Thai word segmentation such as maximal matching, longest matching and n-gram model do not take semantics into consideration. This paper presents a Thai word segmentation system using semantic corpus which is composed of four steps: generating all possible candidates, proper noun consideration, semantic tagging and semantic checking. The first three steps are conducted using a dictionary. Semantic checking is carried out on the basis of corpus-based approach. Finally, we assign the semantic scores to segmented words and select the ones that contain maximum semantic scores. In order to assign semantic scores, we use a Thai proper noun database and the semantic corpus derived from ORCHID corpus. This approach is more reliable than other approaches that do not take the meaning into consideration and performs the level of accuracy at 96-99% depending on the characteristic of input and the dictionary used in the segmentation.
Existing syntactic ordered tree mining methods for extracting characteristic contents from text sets have two problems: 1) subtrees which are semantically the same but are different ordered trees fail to be considered equivalent, and 2) raw extracted subtrees can be difficult to understand. In order to avoid these problems, we have developed a method of transforming all ordered trees so that the ordered trees having the same meaning are considered equivalent. We have also developed a method of constructing Japanese texts from extracted subtrees, and evaluated the effectiveness of our methods as applied to syntactic tree mining.
The issue of translation divergence is an important research topic in the area of machine translation. An exhaustive study of the divergence issues in MT is necessary for their proper classification and resolution. In the literature on MT, scholars have examined the issue and have proposed ways for their classification and resolution (Dorr 1993, 1994). However, the topic still needs further exploration to identify different sources of translation divergence in different pairs of translation languages. In this paper, we discuss translation patterns between Hindi and English of different types of constructions with a view to identifying the potential topics of the translation divergences. We take Dorr’s (1993, 1994) classification of translation divergence as the base to examine the different topics of translation divergence in Hindi and English. The primary goal of the paper is to point out different types of translation divergences in Hindi and English MT that have not been discussed in the existing literature.
In the paper we present an outline of our approach to identify languages and encoding schemes in extremely large sets of multi-lingual documents. The large sets we are analyzing in our Language Observatory project  are formed by dozens of millions of text documents. In the paper we present an approach which allows us to analyze about 250 documents every second (about 20 million documents/day) on a single Linux machine. Using a multithread processing on a cluster of Linux servers we are able to analyze easily more than 100 million documents/day.
ki is an indeclinable element (particle) in Hindi which is used in multiple roles that have multiple mapping patterns in English. In one of its uses, ki functions as a clause complementizer and is mapped usually by that in declarative clauses and by various wh-words (such as what, why, where, how, etc.) in interrogative clauses. The contexts of these mappings are dependent on syntactic-semantic types of the clause. In its non-complementizer use, ki is used to denote various other functions such as coordinate conjunction, purpose and reason clause conjunction, yes-no question particle, etc. It is a difficult task to identify the different uses of ki and determine its multiple mapping patterns in the context of Hindi-English machine translation. A detailed linguistic analysis is needed to disambiguate the different contexts of ki in Hindi. In this paper, we examine the multiple uses and patterns of ki in Hindi and propose strategies for their identification and disambiguation for Hindi-English MT.
This paper describes a generalized translation memory system, which takes advantage of sentence level matching, sub-sentential matching, and pattern-based machine translation technologies. All of the three techniques generate translation suggestions with the assistance of word alignment information. For the sentence level matching, the system generates the translation suggestion by modifying the translations of the most similar example with word alignment information. For sub-sentential matching, the system locates the translation fragments in several examples with word alignment information, and then generates the translation suggestion by combining these translation fragments. For pattern-based machine translation, the system first extracts translation patterns from examples using word alignment information and then generates translation suggestions with pattern matching. This system is compared with a traditional translation memory system without word alignment information in terms of translation efficiency and quality. Evaluation results indicate that our system improves the translation quality and saves about 20% translation time.
The present work describes a Phrasal Example Based Machine Translation system from English to Bengali that identifies the phrases in the input through a shallow analysis, retrieves the target phrases using a Phrasal Example base and finally combines the target language phrases employing some heuristics based on the phrase ordering rules for Bengali. The paper focuses on the structure of the noun, verb and prepositional phrases in English and how these phrases are realized in Bengali. This study has an effect on the design of the phrasal Example Base and recombination rules for the target language phrases.
This paper describes an attempt to recycle parts of the Czech-to-Russian machine translation system (MT) in the new Czech-to-English MT system. The paper describes the overall architecture of the new system and the details of the modules which have been added. A special attention is paid to the problem of named entity recognition and to the method of automatic acquisition of lexico-syntactic information for the bilingual dictionary of the system.
In this document we will describe a semi-automated process for creating elicitation corpora. An elicitation corpus is translated by a bilingual consultant in order to produce high quality word aligned sentence pairs. The corpus sentences are automatically generated from detailed feature structures using the GenKit generation program. Feature structures themselves are automatically generated from information that is provided by a linguist using our corpus specification software. This helps us to build small, flexible corpora for testing and development of machine translation systems.
This paper presents a strategy for detecting and using multi-word expressions in Statistical Machine Translation. Performance of the proposed strategy is evaluated in terms of alignment quality as well as translation accuracy. Evaluations are performed by using the Verbmobil corpus. Results from translation tasks from English-to-Spanish and from Spanish-to-English are presented and discussed.
This paper presents an online Thai-English MT system, called PARSITTE, which is an extension of PARSIT English-Thai one. We aim to assist foreigners and Thai in exchanging more easily their information. The system is a rule-based and Interlingua approach. To improve the system, we concentrate on pre-processing and rule analysis phases, which are considered necessary because of some specific problems of Thai language.
The use of n-gram metrics to evaluate the output of MT systems is widespread. Typically, they are used in system development, where an increase in the score is taken to represent an improvement in the output of the system. However, purchasers of MT systems or services are more concerned to know how well a score predicts the acceptability of the output to a reader-user. Moreover, they usually want to know if these predictions will hold across a range of target languages and text types. We describe an experiment involving human and automated evaluations of four MT systems across two text types and 23 language directions. It establishes that the correlation between human and automated scores is high, but that the predictive power of these scores depends crucially on target language and text type.
This paper reports the result of our experiment, the aim of which is to examine the efficiency of reading support systems such as a sentence-machine translation system, a word-machine translation system, and so on. Our evaluation method used in the experiment is able to handle the different reading support systems by assessing the usability of the systems, i.e., comprehension, reading speed, and effective speed. The result shows that the reading-speed procedure is able to evaluate the support systems as well as the comprehension-based procedure proposed by Ohguro (1993) and Fuji et al. (2001).
An aligned corpus is an important resource for developing machine translation systems. We consider suitable units for constructing the translation model through observing an aligned parallel corpus. We examine the characteristics of the aligned corpus. Long sentences are especially difficult for word alignment because the sentences can become very complicated. Also, each (source/target) word has a higher possibility to correspond to the (target/source) word. This paper introduces an alignment viewer a developer can use to correct alignment information. We discuss using the viewer on a patent parallel corpus because sentences in patents are often long and complicated.
Uyghur is one of the Turkic languages in the Altaic language family. We are developing a machine translation system to translate from English into Uyghur. As there are no previous researches devoted to machine translation between English and Uyghur and being short of related works that we could use as a base for our research, we noted that by making clear the morphological and syntactic similarities and differences between Japanese and Uyghur we can make use of the approaches and methods of English-Japanese machine translation to make faster progress in our research. In order to attain this goal, we have performed a comparative study on the Japanese and Uyghur grammars. In this paper, we describe the similarities as well as differences between Japanese and Uyghur in both levels of morphology and syntax and we give a brief description of our English-Uyghur transfer method to which we are aiming at applying our comparative study on Japanese and Uyghur grammars.
This paper investigates optimal ways to get maximal coverage from minimal input training corpus. In effect, it seems antagonistic to think of minimal input training with a statistical machine translation system. Since statistics work well with repetition and thus capture well highly occurring words, one challenge has been to figure out the optimal number of “new” words that the system needs to be appropriately trained. Additionally, the goal is to minimize the human translation time for training a new language. In order to account for rapid ramp-up translation, we ran several experiments to figure out the minimal amount of data to obtain optimal translation results.
This paper describes an approach to preprocess SMS text for Machine Translation. As SMS text behaves differently from normal written text and to reduce the tremendous effort required to customize or adapt the language model of the traditional translation system to handle SMS text style, normalization is performed to moderate the irregularities in English SMS text using a noisy channel model. A mapping model is used to model the three major problems in SMS text. They are (1) substitution of word using non-standard acronym, (2) insertion of flavour word, and (3) omission of auxiliary verb and subject pronoun. Experiment results show that with normalization before translation, the rejection rate of our English-to-Chinese SMS translation for broadcasting purpose is reduced by 15.5%. We believe that the performance of normalization can be further improved with deeper linguistic processing.
This paper presents a look inside the ITC-irst large-vocabulary SMT system developed for the NIST 2005 Chinese-to-English evaluation campaign. Experiments on official NIST test sets provide a thorough overview of the performance of the system, supplying information on how single components contribute to the global performance. The presented system exhibits performance comparable to that of the best systems participating in the NIST 2002-2004 MT evaluation campaigns: on the three test sets, achieved BLEU scores are 26.35%, 26.92% and 28.13%, respectively.
The paper describes the architecture and functionality of LTC Communicator, a software product from the Language Technology Centre Ltd, which offers an innovative and cost-effective response to the growing need for multilingual web based communication in various user contexts. LTC Communicator was originally developed to support software vendors operating in international markets facing the need to offer web based multilingual support to diverse customers in a variety of countries, where end users may not speak the same language as the helpdesk. This is followed by a short description of several additional application areas of this software for which LTC has received EU funding: The AMBIENT project carries out a market validation for multilingual and multimodal eLearning for business and innovation management, the EUCAM project tests multilingual eLearning in the automotive industry, including a major car manufacturer and the German and European Metal Workers Associations, and the ALADDIN project provides a mobile multilingual environment for tour guides, interacting between tour operators and tourists, with the objective of optimising their travel experience. Finally, a case study of multilingual email exchange in conjunction with web based product sales is described.
A survey of the machine translation systems that have been developed in India for translation from English to Indian languages and among Indian languages reveals that the MT softwares are used in field testing or are available as web translation service. These systems are also used for teaching machine translation to the students and researchers. Most of these systems are in the English-Hindi or Indian language-Indian language domain. The translation domains are mostly government documents/reports and news stories. There are a number of other MT systems that are at their various phases of development and have been demonstrated at various forums. Many of these systems cover other Indian languages beside Hindi.
This paper describes a cellular-telephone-based text-to-text translation system developed at Transclick, Inc. The application translates messages bi-directionally in English, French, German, Italian, Spanish and Portuguese. This paper describes design features uniquely suited to hand-held-device based translation systems. In particular, we discuss some of the usability conditions unique to this type of application and present strategies for overcoming usability obstacles encountered in the design phase of the product.