Margaret King
In this paper, we propose a formal framework that takes into account the influence of the intended context of use of an NLP system on the procedure and the metrics used to evaluate the system. In particular, we introduce the notion of a context-dependent quality model and explain how it can be adapted to a given context of use. More specifically, we define vector-space representations of contexts of use and of quality models, which are connected by a generic contextual quality model (GCQM). For each domain, experts in evaluation are needed to build a GCQM based on analytic knowledge and on previous evaluations, using the mechanism proposed here. The main source of inspiration for this work is the FEMTI framework for the evaluation of machine translation, which partly implements the present model and which is described briefly, along with insights from other domains.
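To make the vector-space idea concrete, here is a minimal sketch, assuming the GCQM can be read as a matrix that maps a context-of-use vector to weights over quality characteristics; the feature names, characteristic names and numeric values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Hypothetical context-of-use features and quality characteristics
# (labels chosen for illustration only).
CONTEXT_FEATURES = ["assimilation", "dissemination", "time_pressure"]
QUALITY_CHARACTERISTICS = ["fidelity", "fluency", "speed", "terminology"]

# One possible reading of a generic contextual quality model (GCQM):
# entry [i, j] encodes how strongly context feature j raises the importance
# of quality characteristic i. The numbers are made up.
GCQM = np.array([
    [0.2, 0.9, 0.1],   # fidelity
    [0.3, 0.8, 0.1],   # fluency
    [0.7, 0.1, 0.9],   # speed
    [0.1, 0.6, 0.2],   # terminology
])

def context_dependent_quality_model(context_vector):
    """Derive normalised weights over quality characteristics
    from a context-of-use vector."""
    raw = GCQM @ np.asarray(context_vector, dtype=float)
    weights = raw / raw.sum()
    return dict(zip(QUALITY_CHARACTERISTICS, weights))

# Example: a dissemination-oriented context under mild time pressure.
print(context_dependent_quality_model([0.0, 1.0, 0.2]))
```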
This paper looks at a class of systems which pose severe problems in evaluation design for current conventional approaches to evaluation. After describing the two conventional evaluation paradigms, the functionality paradigm typified by evaluation campaigns and the ISO-inspired, user-centred paradigm typified by the work of the EAGLES and ISLE projects, it goes on to outline the problems posed by the evaluation of systems which are designed to work in critical interaction with a human expert user and to work over vast amounts of data. These systems pose problems for both paradigms, although for different reasons. The primary aim of this paper is to provoke discussion and the search for solutions; we have no proven solutions at present. However, we describe a programme of exploratory research on which we have already embarked, involving ground-clearing work that we expect to result in a deep understanding of the systems and their users, a prerequisite for developing a general framework for evaluation in this field.
This paper presents FEMTI, a web-based Framework for the Evaluation of Machine Translation in ISLE. FEMTI offers structured descriptions of potential user needs, linked to an overview of the technical characteristics of MT systems. The description of possible systems is mainly articulated around the quality characteristics for software products set out in ISO/IEC standard 9126. Following the philosophy set out there and in the related 14598 series of standards, each quality characteristic bottoms out in metrics which can be applied to a particular instance of a system in order to judge how satisfactory the system is with respect to that characteristic. An evaluator can use the description of user needs to help identify the specific needs of his evaluation and the relations between them. He can then follow the pointers to the system description to determine which metrics should be applied and how. In the current state of the framework, the emphasis is on being exhaustive, including as much as possible of the information available in the literature on machine translation evaluation. Future work will aim at being more analytic: looking at how characteristics and metrics relate to one another, validating metrics, and investigating the correlation between particular metrics and human judgement.
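The way quality characteristics "bottom out" in metrics can be pictured with a small sketch: a tree of characteristics, each node carrying metrics that are applied to a system instance and then aggregated. The class names, characteristic names and the simple averaging below are assumptions for illustration, not the actual FEMTI taxonomy or its aggregation procedure.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Metric:
    name: str
    measure: Callable[[dict], float]   # maps evaluation data to a score in [0, 1]

@dataclass
class QualityCharacteristic:
    name: str
    metrics: List[Metric] = field(default_factory=list)
    sub_characteristics: List["QualityCharacteristic"] = field(default_factory=list)

def evaluate(characteristic: QualityCharacteristic, data: dict) -> float:
    """Score a characteristic by averaging its metrics and sub-characteristics
    (a deliberately simple aggregation; FEMTI leaves this choice to the evaluator)."""
    scores = [m.measure(data) for m in characteristic.metrics]
    scores += [evaluate(sub, data) for sub in characteristic.sub_characteristics]
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical fragment of a quality model: accuracy under functionality.
accuracy = QualityCharacteristic(
    "accuracy",
    metrics=[Metric("exact_match_rate", lambda d: d["exact_matches"] / d["segments"])],
)
functionality = QualityCharacteristic("functionality", sub_characteristics=[accuracy])

print(evaluate(functionality, {"exact_matches": 42, "segments": 100}))
```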
This panel deals with the general topic of evaluation of machine translation systems. The first contribution sets out some recent work on creating standards for the design of evaluations. The second, by Eduard Hovy, takes up the particular issue of how metrics can be differentiated and systematized. Benjamin K. T'sou suggests that whilst men may evaluate machines, machines may also evaluate men. John S. White focuses on the question of the role of the user in evaluation design, and Yusoff Zaharin points out that circumstances and settings may have a major influence on evaluation design.
Translation managers often have to decide on the most appropriate way to deal with a translation project. Possible options may include human translation, translation using a specific terminology resource, translation in interaction with a translation memory system, and machine translation. The decision making involved is complex, and it is not always easy to decide by inspection whether a specific text lends itself to certain kinds of treatment. TransRouter supports the decision making by offering a suite of computer-based tools which can be used to analyse the text to be translated. Some tools, such as the word counter, the repetition detector, the sentence length estimator and the sentence simplicity checker, look at characteristics of the text itself. A version comparison tool compares the new text to previously translated texts. Other tools, such as the unknown-terms detector and the translation memory coverage estimator, estimate the overlap between the text and a set of known resources. The information gained, combined with further information provided by the user, is input to a decision kernel which calculates possible routes towards achieving the translation, together with their cost and their consequences for translation quality. The user may influence the kernel by, for example, specifying particular resources or refining routes under investigation. The final decision on how to treat the project rests with the translation manager.
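As a rough illustration of how tool outputs might feed a decision kernel, the sketch below profiles a text and ranks a few candidate routes by an estimated cost. The tool set, the routes and the cost formulas are illustrative assumptions; they are not TransRouter's actual components or decision logic.

```python
def analyse(text: str, tm_segments: list) -> dict:
    """Toy stand-ins for the text-analysis tools (word count, repetition,
    sentence length, translation memory coverage)."""
    words = text.split()
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    return {
        "word_count": len(words),
        "repetition_rate": 1 - len(set(words)) / max(len(words), 1),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "tm_coverage": sum(seg in text for seg in tm_segments) / max(len(tm_segments), 1),
    }

def decision_kernel(profile: dict, rates: dict) -> list:
    """Rank candidate routes by a crude cost estimate; a real kernel would
    also weigh quality consequences and user-supplied constraints."""
    routes = {
        "human_translation": profile["word_count"] * rates["human_per_word"],
        "translation_memory": profile["word_count"] * (1 - profile["tm_coverage"])
                              * rates["human_per_word"] + rates["tm_setup"],
        "machine_translation": profile["word_count"] * rates["post_edit_per_word"],
    }
    return sorted(routes.items(), key=lambda item: item[1])

profile = analyse("The pump must be primed. The pump must be primed before use.",
                  ["The pump must be primed"])
for route, cost in decision_kernel(profile, {"human_per_word": 0.12,
                                             "tm_setup": 5.0,
                                             "post_edit_per_word": 0.05}):
    print(route, round(cost, 2))
```

The kernel in this sketch only orders the options; as in the abstract, the final choice would still rest with the translation manager.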