2009
Normalization for Automated Metrics: English and Arabic Speech Translation
Sherri Condon | Gregory A. Sanders | Dan Parvaz | Alan Rubenstein | Christy Doran | John Aberdeen | Beatrice Oshika
Proceedings of Machine Translation Summit XII: Papers
2008
Odds of Successful Transfer of Low-Level Concepts: a Key Metric for Bidirectional Speech-to-Speech Machine Translation in DARPA’s TRANSTAC Program
Gregory Sanders | Sébastien Bronsart | Sherri Condon | Craig Schlenoff
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
The Spoken Language Communication and Translation System for Tactical Use (TRANSTAC) program is a Defense Advanced Research Projects Agency (DARPA) program to create bidirectional speech-to-speech machine translation (MT) that will allow U.S. Soldiers and Marines, speaking only English, to communicate in tactical situations with civilian populations who speak only other languages (for example, Iraqi Arabic). A key metric for the program is the odds of successfully transferring low-level concepts, defined as the source-language content words. The National Institute of Standards and Technology (NIST) has now carried out two large-scale evaluations of TRANSTAC systems using that metric. In this paper we discuss the merits of the metric, which has proven to be quite informative. We describe exactly how we defined it and how we obtained values for it from panels of bilingual judges, so that others can replicate our approach. We compare results on this metric to Likert-type judgments of semantic adequacy from the same panels of bilingual judges, as well as to a suite of typical automated MT metrics (BLEU, TER, METEOR).
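The concept-transfer odds lend themselves to a small worked example. Below is a minimal Python sketch of how such odds might be computed from per-concept judge decisions; the data layout and field names are hypothetical and are not the actual NIST annotation format.

```python
# Minimal sketch: odds of successful low-level concept transfer from
# per-concept judge decisions. Field names and the demo data are
# hypothetical, not the actual NIST data layout.

from dataclasses import dataclass


@dataclass
class ConceptJudgment:
    utterance_id: str
    concept: str       # a source-language content word
    transferred: bool  # panel decision: was the concept conveyed?


def concept_transfer_odds(judgments: list[ConceptJudgment]) -> float:
    """Odds = successful transfers / failed transfers over all judged concepts."""
    successes = sum(j.transferred for j in judgments)
    failures = len(judgments) - successes
    if failures == 0:
        return float("inf")  # every concept transferred
    return successes / failures


if __name__ == "__main__":
    demo = [
        ConceptJudgment("utt1", "checkpoint", True),
        ConceptJudgment("utt1", "vehicle", True),
        ConceptJudgment("utt2", "curfew", False),
        ConceptJudgment("utt2", "tonight", True),
    ]
    print(f"odds of successful transfer: {concept_transfer_odds(demo):.2f}")  # 3.00
```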
Applying Automated Metrics to Speech Translation Dialogs
Sherri Condon | Jon Phillips | Christy Doran | John Aberdeen | Dan Parvaz | Beatrice Oshika | Greg Sanders | Craig Schlenoff
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
Over the past five years, the Defense Advanced Research Projects Agency (DARPA) has funded development of speech translation systems for tactical applications. A key component of the research program has been extensive system evaluation, with dual objectives of assessing progress overall and comparing among systems. This paper describes the methods used to obtain BLEU, TER, and METEOR scores for two-way English-Iraqi Arabic systems. We compare the scores with measures based on human judgments and demonstrate the effects of normalization operations on BLEU scores. Issues that are highlighted include the quality of test data and differential results of applying automated metrics to Arabic vs. English.
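To make the normalization step concrete, the following is a minimal Python sketch of the kind of preprocessing that can be applied to hypotheses and references before BLEU scoring. It assumes the sacrebleu package is available; the specific rules shown (lowercasing, punctuation removal, a few standard Arabic orthographic merges) are illustrative only and not necessarily the exact operations used in these evaluations.

```python
# Minimal sketch: normalizing hypotheses and references before BLEU
# scoring. The rules below are illustrative, not the exact TRANSTAC
# normalization operations.

import re
import sacrebleu  # pip install sacrebleu

ARABIC_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")  # tanween, short vowels, shadda, sukun, dagger alef
PUNCTUATION = re.compile(r"[^\w\s]")


def normalize(text: str, arabic: bool = False) -> str:
    text = text.lower()
    if arabic:
        text = ARABIC_DIACRITICS.sub("", text)
        text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)  # alef variants -> bare alef
        text = text.replace("\u0629", "\u0647")                # ta marbuta -> ha
        text = text.replace("\u0649", "\u064A")                # alef maqsura -> ya
    text = PUNCTUATION.sub(" ", text)
    return " ".join(text.split())


def bleu_with_normalization(hyps: list[str], refs: list[str], arabic: bool = False) -> float:
    """Corpus BLEU over normalized hypothesis/reference pairs (single reference)."""
    hyps = [normalize(h, arabic) for h in hyps]
    refs = [normalize(r, arabic) for r in refs]
    return sacrebleu.corpus_bleu(hyps, [refs]).score
```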
Performance Evaluation of Speech Translation Systems
Brian Weiss | Craig Schlenoff | Greg Sanders | Michelle Steves | Sherri Condon | Jon Phillips | Dan Parvaz
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
One of the most challenging tasks for uniformed service personnel serving in foreign countries is effective verbal communication with the local population. To remedy this problem, several companies and academic institutions have been funded to develop machine translation systems as part of the DARPA TRANSTAC (Spoken Language Communication and Translation System for Tactical Use) program. The goal of this program is to demonstrate capabilities to rapidly develop and field free-form, two-way translation systems that would enable speakers of different languages to communicate with one another in real-world tactical situations. DARPA has mandated that each TRANSTAC technology be evaluated numerous times throughout the life of the program and has tasked the National Institute of Standards and Technology (NIST) to lead this effort. This paper describes the experimental design methodology and test procedures from the most recent evaluation, conducted in July 2007, which focused on English to/from Iraqi Arabic.
2006
Edit Distance: A Metric for Machine Translation Evaluation
Mark Przybocki | Gregory Sanders | Audrey Le
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
NIST has coordinated machine translation (MT) evaluations for several years using an automatic and repeatable evaluation measure. Under the Global Autonomous Language Exploitation (GALE) program, NIST is tasked with implementing an edit-distance-based evaluation of MT. Here, edit distance is defined as the number of modifications a human editor must make to a system translation so that the edited translation expresses, in easily understandable English, the complete meaning of a single high-quality human reference translation. In preparation for this change in evaluation paradigm, NIST conducted two proof-of-concept exercises specifically designed to probe the data space, to answer questions related to editor agreement, and to establish protocols for the formal GALE evaluations. We report here our experimental design, the data used, and our findings from these exercises.
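As a concrete illustration of the edit counting, here is a minimal Python sketch that computes the word-level edit distance (insertions, deletions, and substitutions) between a system translation and a human-edited version of it. The GALE protocol relies on human editors; this sketch only shows how the resulting edit count can be tallied, and the example sentences are invented.

```python
# Minimal sketch: counting word-level edits (insertions, deletions,
# substitutions) between a system translation and its human-edited
# version. This is an illustrative Levenshtein computation, not the
# exact editing protocol or tooling used in the GALE evaluations.

def word_edit_distance(system: str, edited: str) -> int:
    s, e = system.split(), edited.split()
    # dp[i][j] = minimum edits to turn s[:i] into e[:j]
    dp = [[0] * (len(e) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        dp[i][0] = i                              # delete remaining system words
    for j in range(len(e) + 1):
        dp[0][j] = j                              # insert remaining edited words
    for i in range(1, len(s) + 1):
        for j in range(1, len(e) + 1):
            cost = 0 if s[i - 1] == e[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(s)][len(e)]


if __name__ == "__main__":
    sys_out = "the soldier ask where is clinic"
    post_edit = "the soldier asked where the clinic is"
    print(word_edit_distance(sys_out, post_edit))  # 3 edits
```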
2004
NIST Language Technology Evaluation Cookbook
Alvin F. Martin | John S. Garofolo | Jonathan C. Fiscus | Audrey N. Le | David S. Pallett | Mark A. Przybocki | Gregory A. Sanders
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)