Given multiple translations of the same source sentence, how can we combine them to produce a translation that is better than any single system output? We propose a hierarchical system combination framework for machine translation. This framework integrates multiple MT systems' outputs at the word, phrase and sentence levels. By boosting common word and phrase translation pairs, pruning unused phrases, and exploring decoding paths adopted by other MT systems, this framework achieves better translation quality with much less re-decoding time. The full-sentence translation hypotheses from multiple systems are additionally selected based on N-gram language models trained on a word/word-POS mixed stream, which further improves the translation quality. We consistently observed significant improvements on several test sets in multiple languages covering different genres.
1 Introduction
Many machine translation (MT) frameworks have been developed, including rule-based transfer MT, corpus-based MT (statistical MT and example-based MT), syntax-based MT and the hybrid, statistical MT augmented with syntactic structures. Different MT paradigms have their strengths and weaknesses.

(This work was done when the author was at IBM Research.)
Systems adopting the same framework usually produce different translations for the same input, due to their differences in training data, preprocessing, alignment and decoding strategies. It is beneficial to design a framework that combines the decoding strategies of multiple systems as well as their outputs, and produces translations better than any single system output. More recently, within the GALE¹ project, multiple MT systems have been developed in each consortium, thus system combination becomes more important.
Traditionally, system combination has been conducted in two ways: glass-box combination and black-box combination. In glass-box combination, each MT system provides detailed decoding information, such as word and phrase translation pairs and decoding lattices. For example, in the multi-engine machine translation system (Nirenburg and Frederking, 1994), target language phrases from each system and their corresponding source phrases are recorded in a chart structure, together with their confidence scores. A chart-walk algorithm is used to select the best translation from the chart. To combine words and phrases from multiple systems, it is preferable that all the systems adopt similar preprocessing strategies.
In black-box combination, individual MT systems only output their top-N translation hypotheses without decoding details. This is particularly appealing when combining the translation outputs from COTS MT systems. The final translation may be selected by voted language models and appropriate confidence rescaling schemes (Tidhar and Küssner, 2006). That approach decomposes source sentences into meaningful constituents, translates them with component MT systems, then selects the best segment translations and combines them based on majority voting, language models and confidence scores.

(Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 277-286, Prague, June 2007. © 2007 Association for Computational Linguistics)
(Jayaraman and Lavie, 2005) proposed another black-box system combination strategy. Given single top-one translation outputs from multiple MT systems, their approach reconstructs a phrase lattice by aligning words from different MT hypotheses. The alignment is based on the surface form of individual words, their stems (after morphology analysis) and part-of-speech (POS) tags. Aligned words are connected via edges. The algorithm finds the best alignment that minimizes the number of crossing edges. Finally, the system generates a new translation by searching the lattice based on alignment information, each system's confidence scores and a language model score.
(Matusov et al., 2006) and (Rosti et al., 2007) constructed a confusion network from multiple MT hypotheses, and a consensus translation is selected by re-decoding the lattice with arc costs and confidence scores.
In this paper, we introduce our hierarchical system combination strategy. This approach allows combination at the word, phrase and sentence levels. Similar to glass-box combination, each MT system provides detailed information about the translation process, such as which source word(s) generates which target word(s) in what order. Such information can be combined with existing word and phrase translation tables, and the augmented phrase table is significantly pruned according to reliable MT hypotheses. We select an MT system to re-translate the test sentences with the refined models, and encourage search along decoding paths adopted by other MT systems. Thanks to the refined translation models, this approach produces better translations with a much shorter re-decoding time.
As in black-box combination, we select full-sentence translation hypotheses from multiple system outputs based on N-gram language models. This hierarchical system combination strategy avoids problems like translation output alignment and confidence score normalization. It seamlessly integrates detailed decoding information and translation hypotheses from multiple MT engines, and produces better translations in an efficient manner. Empirical studies in a later section show that this algorithm improves MT quality by 2.4 BLEU points over the best baseline decoder, with a 1.4-point TER reduction. We also observed consistent improvements on several evaluation test sets in multiple languages covering different genres by combining several state-of-the-art MT systems.
The rest of the paper is organized as follows: In section 2, we briefly introduce several baseline MT systems whose outputs are used in the system combination. In section 3, we present the proposed hierarchical system combination framework, describing word and phrase combination and pruning, decoding path imitation and sentence translation selection. We show our experimental results in section 4 and conclusions in section 5.
2 Baseline MT System Overview
In our experiments, we take the translation outputs from multiple MT systems. These include phrase-based statistical MT systems (Al-Onaizan and Papineni, 2006) (Block) and (Hewavitharana et al., 2005) (CMLLSMT), a direct translation model (DTM) system (Ittycheriah and Roukos, 2007) and a hierarchical phrase-based MT system (Hiero) (Chiang, 2005).
Different translation frameworks are adopted by different decoders: the DTM decoder combines different features (source words, morphemes and POS tags; target words and POS tags) in a maximum entropy framework. These features are integrated with a phrase translation table for flexible distortion modeling and word selection. The CMLLSMT decoder extracts testset-specific bilingual phrases on the fly with the PESA algorithm. The Hiero system extracts context-free grammar rules for long-range constituent reordering.
We select the IBM block decoder to re-translate the test set for glass-box system combination. This system is a multi-stack, multi-beam search decoder. Given a source sentence, the decoder tries to find the translation hypothesis with the minimum translation cost. The overall cost is the log-linear combination of different feature functions, such as translation model cost, language model cost, distortion cost and sentence length cost.
The translation cost between a phrase translation pair (f, e) is defined as the log-linear combination

cost(f, e) = Σ_k λ_k φ_k(f, e),   (1)

where the feature cost functions φ_k include a word translation cost based on t(f_j | e_i), the word translation probabilities estimated from word alignment frequencies over all the training data (i and j are word positions in the target and source phrases), and S(e, f), a phrase translation cost estimated according to their relative alignment frequency in the bilingual training data. The λ's in Equation 1 are the weights of the different feature functions, learned to maximize development-set BLEU scores using a method similar to (Och, 2003).
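The log-linear cost combination above can be sketched as follows; the feature names and weight values are illustrative, not the paper's actual features or tuned weights.

```python
def phrase_cost(feature_costs, weights):
    """Log-linear combination (Equation 1): the overall cost is the
    weighted sum of the individual feature cost functions."""
    return sum(weights[name] * cost for name, cost in feature_costs.items())

# Hypothetical feature costs (negative log probabilities) for one phrase pair.
costs = {"tm": 2.3, "lm": 1.7, "distortion": 0.4, "length": 0.1}
weights = {"tm": 1.0, "lm": 0.8, "distortion": 0.5, "length": 0.2}
total_cost = phrase_cost(costs, weights)
```

In the real decoder the weights would be tuned on a development set rather than set by hand.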
The SMT system is trained with testset-specific training data. This is not cheating: given a test set, from a large bilingual corpus we select parallel sentence pairs covering n-grams from the source sentences. Phrase translation pairs are extracted from the sub-sampled alignments. This not only reduces the size of the phrase table, but also improves the topic relevancy of the extracted phrase pairs. As a result, it improves both the efficiency and the performance of machine translation.
3 Hierarchical System Combination Framework
The overall system combination framework is shown in Figure 1. The source text is translated by multiple baseline MT systems. Each system produces its top-one translation hypothesis as well as the phrase pairs and decoding path used during translation. This information is shared through a common XML file format, as shown in Figure 2. It demonstrates how a source sentence is segmented into a sequence of phrases, the order and translation of each source phrase as well as the translation scores, and a vector of feature scores for the whole test sentence. Such XML files are generated by all the systems when they translate the source test set.
We collect phrase translation pairs from each decoder's output. Within each phrase pair, we identify word alignments and estimate word translation probabilities. We combine the testset-specific word translation model with a general model. We augment the baseline phrase table with phrase translation pairs extracted from system outputs, then prune the table with translation hypotheses. We re-translate the source text using the block decoder with the updated word and phrase translation models.
Additionally, to take advantage of the flexible reordering strategies of other decoders, we develop a word order cost function to reinforce search along decoding paths adopted by other decoders. With the refined translation models and focused search space, the block decoder efficiently produces a better translation output. Finally, the sentence hypothesis selection module selects the best translation from each system's top-one output based on language model scores. Note that the hypothesis selection module does not require detailed decoding information, and thus can take in any MT system's output.
3.1 Word Translation Combination
The baseline word translation model is too general for the given test set. Our goal is to construct a testset-specific word translation model and combine it with the general model to boost consensus word translations. Bilingual phrase translation pairs are read from each system-generated XML file. Word alignments are identified within a phrase pair based on IBM Model-1 probabilities. As the phrase pairs are typically short, word alignments are quite accurate. We collect word alignment counts from the whole test set translation, and estimate both source-to-target and target-to-source word translation probabilities. We combine such a testset-specific translation model with the general model:
p(e | f) = λ · t'(e | f) + (1 − λ) · t(e | f),

where t'(e | f) is the testset-specific source-to-target word translation probability, and t(e | f) is the probability from the general model. λ is the linear combination weight, set according to the confidence in the quality of the system outputs. In our experiments, we set λ to 0.8. We combine both source-to-target and target-to-source word translation models, and update the word translation costs, −log p(e | f) and −log p(f | e), accordingly.

Figure 2: Sample XML file format. This includes a source sentence (segmented as a sequence of source phrases), their translations as well as a vector of feature scores (language model scores, translation model scores, distortion model scores and a sentence length score).
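The word-model interpolation described above can be sketched as follows; the function and table names are illustrative, and each model is assumed to be a dictionary mapping target words to probabilities for one source word.

```python
def interpolate(t_testset, t_general, lam=0.8):
    """Linear interpolation of the testset-specific word translation model
    t'(e|f) with the general model t(e|f):
        p(e|f) = lam * t'(e|f) + (1 - lam) * t(e|f)."""
    words = set(t_testset) | set(t_general)
    return {e: lam * t_testset.get(e, 0.0) + (1 - lam) * t_general.get(e, 0.0)
            for e in words}

# Hypothetical distributions for a single source word f.
p = interpolate({"house": 1.0}, {"house": 0.5, "home": 0.5}, lam=0.8)
```

With λ = 0.8 the consensus choice of the systems dominates, while the general model still reserves mass for translations unseen in the system outputs.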
3.2 Phrase Translation Combination and Pruning
Phrase translation pairs can be combined in two different ways. We may collect and merge testset-specific phrase translation tables from each system, if they are available. Essentially, this is similar to combining the training data of multiple MT systems. The new phrase translation probability is calculated according to the updated phrase alignment frequencies:
where C_b is the phrase pair count from the baseline block decoder, and C_m is the count from other MT systems. α_m is a system-specific linear combination weight. If not all the phrase tables are available, we collect phrase translation pairs from the system outputs and merge them with C_b. In that case, we may adjust α to balance the small counts from system outputs against the large counts from C_b. The corresponding phrase translation cost is updated accordingly.
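The count-merging step can be sketched as below; representing phrase tables as dictionaries of phrase-pair counts, and renormalizing per source phrase, are assumptions about implementation detail, not the paper's code.

```python
from collections import Counter

def combine_phrase_counts(c_baseline, other_counts, alphas):
    """Merge baseline phrase-pair counts C_b with counts C_m from other
    systems, each scaled by its system-specific weight alpha_m, then
    renormalize per source phrase to obtain p(e|f)."""
    merged = Counter()
    for (f, e), c in c_baseline.items():
        merged[(f, e)] += c
    for counts, alpha in zip(other_counts, alphas):
        for (f, e), c in counts.items():
            merged[(f, e)] += alpha * c
    # Normalize over all target phrases e sharing the same source phrase f.
    totals = Counter()
    for (f, e), c in merged.items():
        totals[f] += c
    return {(f, e): c / totals[f] for (f, e), c in merged.items()}
```

A small α down-weights the sparse counts collected from system outputs relative to the large baseline counts.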
Another phrase combination strategy works at the sentence level. This strategy relies on the consensus of different MT systems when translating the same source sentence. It collects the phrase translation pairs used by different MT systems to translate the same sentence. Similarly, it boosts common phrase pairs that are selected by multiple decoders:

where β is a boosting factor, 0 < β < 1, and |C(f, e)| is the number of systems that use the phrase pair (f, e) to translate the input sentence. A phrase translation pair selected by multiple systems is more likely a good translation, and thus costs less.
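One plausible reading of this boosting step is a multiplicative discount on the phrase translation cost; the exact form of the update is not recoverable from the text, so the discount below is an assumption.

```python
def boost_phrase_cost(cost, num_systems, beta=0.5):
    """Discount a phrase pair's translation cost in proportion to
    |C(f,e)|, the number of systems that used it for this sentence.
    With 0 < beta < 1, pairs chosen by more systems cost less."""
    return cost * (beta ** num_systems)
```

A pair used by no other system keeps its original cost only when num_systems is 0; every additional agreeing system shrinks the cost by another factor of β.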
The combined phrase table contains multiple translations for each source phrase. Many of them are unlikely translations given the context. These phrase pairs produce low-quality partial hypotheses during hypothesis expansion, incur unnecessary model cost calculation and a larger search space, and reduce translation efficiency. More importantly, the translation probabilities of correct phrase pairs are reduced as some probability mass is distributed among incorrect phrase pairs. As a result, good phrase pairs may not be selected in the final translation.
Oracle experiments show that if we prune the phrase table and only keep phrases that appear in the reference translations, we can improve the translation quality by 10 BLEU points. This shows the potential gain from appropriate phrase pruning.
We developed a phrase pruning technique based on self-training. This approach reinforces phrase translations learned from MT system output. Assuming we have reasonable first-pass translation outputs, we only keep phrase pairs whose target phrase is covered by existing system translations. These phrase pairs include those selected in the final translations, as well as their combinations or sub-phrases. As a result, the size of the phrase table is reduced by 80-90%, and the re-decoding time is reduced by 80%. Because correct phrase translations are assigned higher probabilities, this generates better translations with higher BLEU scores.
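The self-training-based pruning rule can be sketched as follows; treating "covered by" as whitespace-delimited substring matching against the first-pass translations is an assumption.

```python
def prune_phrase_table(phrase_table, system_translations):
    """Keep only phrase pairs whose target side appears inside some
    first-pass system translation (so selected phrases and their
    sub-phrases survive; everything else is pruned)."""
    # Pad with spaces so matches respect word boundaries.
    hyps = [" " + h + " " for h in system_translations]
    return {(f, e): cost for (f, e), cost in phrase_table.items()
            if any((" " + e + " ") in h for h in hyps)}
```

On the paper's data this kind of filter removed 80-90% of the phrase table, which is where the re-decoding speedup comes from.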
3.3 Decoding Path Imitation
Because of different reordering models, words in the source sentence can be translated in different orders. The block decoder has a local reordering capability that allows source words within a given window to jump forward or backward with a certain cost. The DTM decoder takes a similar reordering strategy, with some variants such as a dynamic window width depending on the POS tag of the current source word.
The Hiero system allows long-range constituent reordering based on context-free grammar rules. To combine the different reordering strategies of the various decoders, we developed a reordering cost function that encourages search along decoding paths adopted by other decoders.
From each system's XML file, we identify the order in which source words are translated, based on word alignment information. For example, given the following hypothesis path, we find that the source phrase containing words [0,1] is first translated into the target phrase "izzat ibrahim '", which is followed by the translation of source word 2 into the single target word "receives", and so on.
We identify the word alignment within the phrase translation pairs based on IBM Model-1 scores. As a result, we get the following source word translation sequence from the above hypothesis (note: source word 5 is translated as NULL):
Such a decoding sequence determines the translation order between any source word pairs; e.g., word 4 should be translated before words 3, 6 and 7. We collect such ordered word pairs from all system outputs' paths.
When re-translating the source sentence, for each partially expanded decoding path we compute the ratio of word pairs that satisfy such ordering constraints². Specifically, given a partially expanded path P = (s_1 < s_2 < ... < s_m), the word pair (s_i < s_j) implies that s_i is translated before s_j. If the word pair (s_i < s_j) is covered by a full decoding path Q (from other system outputs), we denote the relationship as (s_i < s_j) ∈ Q. For any ordered word pair (s_i < s_j) ∈ P, we define its matching ratio as the percentage of full decoding paths that cover it:

m(s_i < s_j) = |{Q : (s_i < s_j) ∈ Q}| / N,

where N is the total number of full decoding paths.
We then define the path matching cost function over these matching ratios, where the denominator is the total number of ordered word pairs in path P. As a result, partial paths are boosted if they take similar source word translation orders as other system outputs. This cost function is multiplied by a manually tuned model weight before being integrated into the log-linear cost model framework.
² We set no constraints for source words that are translated into NULL.
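The matching ratio and an averaged path-matching score can be sketched as below; representing decoding paths as lists of source word indices, and converting the average match into a cost via 1 − score, are assumptions about details the text leaves open.

```python
from itertools import combinations

def matching_ratio(pair, full_paths):
    """m(si < sj): fraction of full decoding paths Q that translate
    si before sj (paths are lists of source word indices)."""
    si, sj = pair
    covered = sum(1 for q in full_paths
                  if si in q and sj in q and q.index(si) < q.index(sj))
    return covered / len(full_paths)

def path_matching_cost(partial_path, full_paths):
    """Average matching ratio over all ordered word pairs in the partial
    path P, turned into a cost so that paths imitating other systems'
    word orders cost less."""
    pairs = list(combinations(partial_path, 2))  # preserves path order
    if not pairs:
        return 0.0
    score = sum(matching_ratio(p, full_paths) for p in pairs) / len(pairs)
    return 1.0 - score
```

A partial path whose pairwise orderings all agree with every full path gets cost 0; one that contradicts every full path gets cost 1.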
3.4 Sentence Hypothesis Selection
The sentence hypothesis selection module only takes the final translation outputs from the individual systems, including the output from the glass-box combination. For each input source sentence, it selects the "optimal" system output based on certain feature functions. We experiment with two feature functions. One is a typical 5-gram word language model (LM).
The optimal translation output E' is selected among the top-one hypotheses from all the systems according to their LM scores. Let e_i be a word in sentence E with history (e_{i-4}, e_{i-3}, e_{i-2}, e_{i-1}); E' is the hypothesis that maximizes the sum of log p(e_i | e_{i-4}, e_{i-3}, e_{i-2}, e_{i-1}) over the sentence.
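The selection step reduces to an argmax over LM scores; in the sketch below, lm_logprob is a placeholder for any 5-gram LM scorer, not a specific toolkit's API.

```python
def select_hypothesis(hypotheses, lm_logprob):
    """Pick the top-one system output with the best language model score.
    lm_logprob maps a sentence string to its total log probability."""
    return max(hypotheses, key=lm_logprob)
```

For example, with a toy scorer that simply prefers shorter outputs, `select_hypothesis(["a b", "a a a"], lambda s: -len(s.split()))` returns the two-word hypothesis.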
Another feature function is based on the 5-gram LM score calculated on the mixed stream of words and POS tags of the translation output. We run POS tagging on the translation hypotheses. We keep the word identities of the top N frequent words (N = 1000 in our experiments), and the remaining words are replaced with their POS tags. As a result, the mixed stream is like a skeleton of the original sentence, as shown in Figure 3. With this model, the optimal translation output E* is selected by the analogous formula over the mixed stream. Compared with a class-based LM, this model is less prone to data sparseness problems.
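Building the word/word-POS mixed stream can be sketched as follows; tagged_sentences is assumed to be a list of sentences, each a list of (word, POS) pairs from the tagger.

```python
from collections import Counter

def mixed_stream(tagged_sentences, top_n=1000):
    """Keep the identities of the top_n most frequent words; replace
    every other word with its POS tag, producing the 'skeleton'
    word/word-POS mixed stream."""
    freq = Counter(w for sent in tagged_sentences for w, _ in sent)
    keep = {w for w, _ in freq.most_common(top_n)}
    return [[w if w in keep else pos for w, pos in sent]
            for sent in tagged_sentences]
```

The mixed-stream 5-gram LM is then trained on this output instead of raw words, which is why it tolerates far less training data.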
4 Experiments

We experiment with different system combination strategies on the NIST 2003 Arabic-English MT evaluation test set, using BLEU and TER (Snover et al., 2006) as the MT evaluation metrics.

Table 1: Translation results with phrase combination and pruning (configurations include Tstcom+Sentcom and Tstcom+Sentcom+Prune).
We evaluate the translation quality of different combination strategies:

• WdCom: Combine the testset-specific word translation model with the baseline model, as described in section 3.1.

• PhrCom: Combine and prune phrase translation tables from all systems, as described in section 3.2. This includes testset-specific phrase table combination (Tstcom), sentence-level phrase combination (Sentcom) and phrase pruning based on translation hypotheses (Prune).

• Path: Encourage search along the decoding paths adopted by other systems via the path matching cost function, as described in section 3.3.

• SenSel: Select the whole-sentence translation hypothesis among all systems' top-one outputs based on N-gram language models trained on a word stream (word) and a word-POS mixed stream (wdpos).
Table 1 shows the improvement from combining phrase tables from multiple MT systems using different combination strategies. We only show the highest and lowest baseline system scores. By combining testset-specific phrase translation tables (Tstcom), we achieved a 1.0 BLEU improvement and a 0.5 TER reduction. Sentence-level phrase combination and pruning additionally improve the BLEU score by 0.7 point and reduce TER by 0.4 percent.
Table 2 shows the improvement with different sentence translation hypothesis selection approaches (SentSel-word and SentSel-wpmix). The word-based LM is trained with about 1.75G words from newswire text. A distributed large-scale language model architecture is developed to handle such large training corpora³, as described in (Emami et al., 2007).

Table 2: Translation results with different sentence hypothesis selection strategies.

Table 3: Translation results with the hierarchical system combination strategy (WdCom+PhrCom, WdCom+PhrCom+Path and WdCom+PhrCom+Path+SenSel).

Table 4: System combination results on Chinese-English translation (BLEUr4n4c).

Table 5: System combination results for Arabic-English web log translation.
The word-based LM shows both an improvement in BLEU score and an error reduction in TER. On the other hand, even though the word-POS LM is trained with much less data (about 136M words), it improves the BLEU score more effectively, though there is no change in TER.
Table 3 shows the improvements from the hierarchical system combination strategy. We find that word-based translation combination improves the baseline block decoder by 0.16 BLEU point and reduces TER by 0.5 point.
Phrase-based translation combination (including phrase table combination, sentence-level phrase combination and phrase pruning) further improves the BLEU score by 1.9 points (another 0.6 drop in TER). By encouraging search along other decoders' decoding paths, we observed an additional 0.15 BLEU improvement and a 0.2 TER reduction.
Finally, sentence translation hypothesis selection with the word-based LM led to a 0.2 BLEU point improvement and a 0.16 point reduction in TER.
³ The same LM is also used during first-pass decoding by both the block and the DTM decoders.
To summarize, with the hierarchical system combination framework, we achieved a 2.4 BLEU point improvement over the best baseline system, and reduced TER by 1.4 points.
Table 4 shows the system combination results on Chinese-English newswire translation. The test data is the NIST MT03 Chinese-English evaluation test set. In addition to the 4 baseline MT systems, we also add another phrase-based MT system (Lee et al., 2006). The system combination improves over the best baseline system by 2 BLEU points, and reduces the TER score by 1.6 percent.
Thanks to the long-range constituent reordering capability of different baseline systems, path imitation improves the BLEU score by 0.4 point.
We consistently notice improved translation quality with system combination on unstructured text and speech translations, as shown in Tables 5 and 6. With one reference translation, we notice a 1.2 BLEU point improvement over the baseline block decoder (with a 2.5 point TER reduction) on web log translation, and about a 2.1 point BLEU improvement (with a 0.9 point TER reduction) on Broadcast News speech translation.
Table 6: System combination results for Arabic-English speech translation (BLEUr1n4c).
5 Related Work
Much system combination research has been done recently. (Matusov et al., 2006) computes a consensus translation by voting on a confusion network, which is created by pairwise word alignment of multiple baseline MT hypotheses. This is similar to the sentence- and word-level combinations in (Rosti et al., 2007), where TER is used to align multiple hypotheses. Both approaches adopt a black-box combination strategy, as target translations are combined independently of the source sentences.
(Rosti et al., 2007) extracts phrase translation pairs in the phrase-level combination.
Our proposed method incorporates bilingual information from source and target sentences in a hierarchical framework: word, phrase and decoding path combinations. Such information proves very helpful in our experiments. We also developed a path matching cost function to encourage decoding path imitation, thus enabling one decoder to take advantage of the rich reordering models of other MT systems.
We only combine the top-one hypothesis from each system, and did not apply system confidence measures or minimum error rate training to tune the system combination weights. This will be our future work.
6 Conclusion
Our hierarchical system combination strategy effectively integrates word and phrase translation combinations, decoding path imitation and sentence hypothesis selection from multiple MT systems. By boosting common word and phrase translation pairs and pruning unused ones, we obtain better translation quality with less re-decoding time. By imitating the decoding paths, we take advantage of the various reordering schemes of different decoders. The sentence hypothesis selection based on N-gram language models further improves the translation quality. This effectiveness has been consistently demonstrated in several empirical studies with test sets in different languages and covering different genres.
7 Acknowledgment

The authors would like to thank Yaser Al-Onaizan, Abraham Ittycheriah and Salim Roukos for helpful discussions and suggestions. This work is supported under the DARPA GALE project, contract No. HR0011-06-2-0001.
