We present a method for improving word alignment for statistical syntax-based machine translation that employs a syntactically informed alignment model closer to the translation model than commonly used word alignment models. This leads to extraction of more useful linguistic patterns and improved BLEU scores on translation experiments in Chinese and Arabic.
1 Methods of statistical MT

Roughly speaking, there are two paths commonly taken in statistical machine translation (Figure 1). The idealistic path uses an unsupervised learning algorithm such as EM (Dempster et al., 1977) to learn parameters for some proposed translation model from a bitext training corpus, and then translates directly using the weighted model.
Some examples of the idealistic approach are the direct IBM word model (Berger et al., 1994; Germann et al., 2001), the phrase-based approach of Marcu and Wong (2002), and the syntax approaches of Wu (1996) and Yamada and Knight (2001).
Idealistic approaches are conceptually simple and thus easy to relate to observed phenomena. However, as more parameters are added to the model, the idealistic approach has not scaled well: it becomes increasingly difficult to incorporate large amounts of training data efficiently over an increasingly large search space. Additionally, the EM procedure has a tendency to overfit its training data when the input units have varying explanatory power, such as variable-size phrases or variable-height trees.
The realistic path also learns a model of translation, but uses that model only to obtain Viterbi word-for-word alignments for the training corpus. The bitext and corresponding alignments are then used as input to a pattern extraction algorithm, which yields a set of patterns or rules for a second translation model (which often has a wider parameter space than that used to obtain the word-for-word alignments). Weights for the second model are then set, typically by counting and smoothing, and this weighted model is used for translation.

Realistic approaches scale to large data sets and have yielded better BLEU performance than their idealistic counterparts, but there is a disconnect between the first model (hereafter, the alignment model) and the second (the translation model). Examples of realistic systems are the phrase-based ATS system of Och and Ney (2004), the phrasal-syntax hybrid system Hiero (Chiang, 2005), and the GHKM syntax system (Galley et al., 2004; Galley et al., 2006). For an alignment model, most of these use the Aachen HMM approach (Vogel et al., 1996), the implementation of IBM Model 4 in GIZA++ (Och and Ney, 2000) or, more recently, the semi-supervised EMD algorithm (Fraser and Marcu, 2006).
The two-model approach of the realistic path has undeniable empirical advantages and scales to large data sets, but new research tends to focus on developing higher-order translation models that are informed only by low-order alignments. We would like to add the analytic power gained from modern translation models to the underlying alignment model without sacrificing the efficiency and empirical gains of the two-model approach.
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 360-368, Prague, June 2007. © 2007 Association for Computational Linguistics

[Figure 1: General approach to idealistic and realistic statistical MT systems.]

By adding the syntactic information used in the translation model to our alignment model, we may improve alignment quality such that rule quality and, in turn, system quality are improved.
In the remainder of this work we show how a touch of idealism can improve an existing realistic syntax-based translation system.

2 Multi-level syntactic rules for syntax MT
Galley et al. (2004) and Galley et al. (2006) describe a syntactic translation model that relates English trees to foreign strings. The model describes joint production of a (tree, string) pair via a non-deterministic selection of weighted rules. Each rule has an English tree fragment with variables and a corresponding foreign string fragment with the same variables. A series of rules forms an explanation (or derivation) of the complete pair.

As an example, consider the parsed English and corresponding Chinese at the top of Figure 2. The three columns underneath the example are different rule sequences that can explain this pair; there are many other possibilities. Note how rules specify rotation (e.g. R10, R5), direct translation (R12, R8), insertion and deletion (R11, R1), and tree traversal (R7, R15). Note too that the rules explain variable-size fragments (e.g. R6 vs. R14), and thus the possible derivation trees of rules that explain a sentence pair have varying sizes. The smallest such derivation tree has a single large rule (which does not appear in Figure 2; we leave the description of such a rule as an exercise for the reader).

A string-to-tree decoder constructs a derivation forest of derivation trees where the right sides of the rules in a tree, taken together, explain a candidate source sentence. It then outputs the English tree corresponding to the highest-scoring derivation in the forest.
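As a concrete illustration, a tree-to-string rule of the kind described above pairs an English tree fragment with a foreign string fragment sharing the same variables. The sketch below is our own simplified encoding, not the authors' implementation; the rule shown, its node labels, and the helper are invented for illustration:

```python
# A minimal sketch of a multilevel tree-to-string rule: an English tree
# fragment with variables, plus a foreign-side string that reuses them.
class Rule:
    def __init__(self, root, tree, rhs):
        self.root = root  # syntactic type this rule can replace
        self.tree = tree  # nested tuples; ("x0", "NNP") marks a variable
        self.rhs = rhs    # foreign-side tokens, variables as "x0", "x1", ...

# Invented rule: NP(NPB(x0:NNP, POS('s)), x1:NN) -> x0 x1
r = Rule("NP",
         ("NP", ("NPB", ("x0", "NNP"), ("POS", "'s")), ("x1", "NN")),
         ["x0", "x1"])

def variables(tree):
    """Collect variable names from a tree fragment, left to right."""
    if isinstance(tree, tuple):
        if tree[0].startswith("x"):
            return [tree[0]]
        out = []
        for child in tree[1:]:
            out.extend(variables(child))
        return out
    return []  # bare lexical leaf

print(variables(r.tree))  # ['x0', 'x1']
```

The shared variable names are what let a rule express rotation: the foreign side may emit its variables in a different order than they appear in the tree fragment.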
3 Introducing syntax into the alignment

We now lay the ground for a syntactically motivated alignment model. We begin by reviewing an alignment model commonly seen in realistic MT systems and compare it to a syntactically aware alignment model.

3.1 The traditional IBM alignment model

IBM Model 4 (Brown et al., 1993) learns a set of four probability tables to compute p(f | e), given a foreign sentence f and its target translation e, via the following (greatly simplified) generative story:
[Figure 2: A (English tree, Chinese string) pair and three different sets of multilevel tree-to-string rules that can explain it; the first set is obtained from bootstrap alignments, the second from this paper's re-alignment procedure, and the third is a viable, if poor quality, alternative that is not learned.]
[Figure 3: The impact of a bad alignment on rule extraction. Including the alignment link indicated by the dotted line in the example leads to the rule set in the second row. The re-alignment procedure described in Section 3.2 learns to prefer the rule set at bottom, which omits the bad link.]
- A fertility φ for each word e_i in e is chosen with probability p_fert(φ | e_i).
- A null word is inserted next to each fertility-expanded word with probability p_null.
- Each token e_i in the fertility-expanded word and null string is translated into some foreign word f_i in f with probability p_trans(f_i | e_i).
- The position of each foreign word f_i that was translated from e_i is changed by Δ (which may be positive, negative, or zero) with probability p_distortion(Δ | A(e_i), B(f_i)), where A and B are functions over the source and target vocabularies, respectively.
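The four tables of this generative story can be exercised on a toy example. In the sketch below the words, table entries, and the collapsed distortion conditioning are invented simplifications of ours, not part of the model as published:

```python
import math

# Toy (greatly simplified) Model-4-style tables; all entries are invented.
p_fert = {("taiwan", 1): 0.9}            # p_fert(phi | e_i)
p_null = 0.05                            # null-insertion probability
p_trans = {("taiwan", "taiwan_zh"): 0.8} # p_trans(f_i | e_i)
p_dist = {0: 0.7, 1: 0.15, -1: 0.15}     # distortion, conditioning collapsed

def score(e_word, f_word, fert, delta, inserted_null=False):
    """Log-probability of one word's sequence of generative steps:
    fertility choice, null decision, translation, distortion."""
    lp = math.log(p_fert[(e_word, fert)])
    lp += math.log(p_null if inserted_null else 1 - p_null)
    lp += math.log(p_trans[(e_word, f_word)])
    lp += math.log(p_dist[delta])
    return lp

lp = score("taiwan", "taiwan_zh", fert=1, delta=0)
print(round(math.exp(lp), 4))  # 0.4788  (= 0.9 * 0.95 * 0.8 * 0.7)
```

The point of the sketch is only the factorization: the sentence probability is a product of per-step table lookups, which is what EM re-estimates.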
Brown et al. (1993) describe an EM algorithm for estimating values for the four tables in the generative story. However, searching the space of all possible alignments is intractable for EM, so in practice the procedure is bootstrapped by models with a narrower search space, such as IBM Model 1 (Brown et al., 1993) or Aachen HMM (Vogel et al., 1996).
Now let us contrast this commonly used model for obtaining alignments with a syntactically motivated alternative. We recall the rules described in Section 2. Our model learns a single probability table to compute p(etree, f), given a foreign sentence f and a parsed target translation etree. In the following generative story we assume a starting variable with syntactic type v.
- Choose a rule r to replace v, with probability p_rule(r | v).
- For each variable with syntactic type v_i in the partially completed (tree, string) pair, continue to choose rules r_i with probability p_rule(r_i | v_i) to replace these variables, until there are no variables remaining.
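Under this story, a derivation's probability is simply the product of its rule-choice probabilities. A minimal sketch, with invented rule names, types, and probabilities:

```python
import math

# Invented rule table: p_rule(r | v) for a few (rule, variable-type) pairs.
p_rule = {("RA", "NP"): 0.4, ("RB", "NNP"): 0.5}

def derivation_logprob(choices):
    """choices: the (rule, variable-type) pairs used in a derivation.
    The derivation's probability is the product of p_rule over them."""
    return sum(math.log(p_rule[c]) for c in choices)

lp = derivation_logprob([("RA", "NP"), ("RB", "NNP")])
print(round(math.exp(lp), 6))  # 0.2  (= 0.4 * 0.5)
```

Note the contrast with Model 4: a single table over rules of varying size replaces the four word-level tables, which is also what makes derivation length matter (Section 5.4).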
In Section 5.1 we discuss an EM learning procedure for estimating these rule probabilities. As in the IBM approach, we must mitigate intractability by limiting the parameter space searched, which is potentially much wider than in the word-to-word case. We would like to supply to EM all possible rules that explain the training data, but this implies a rule relating each possible tree fragment to each possible string fragment, which is infeasible. We follow the approach of bootstrapping from a model with a narrower parameter space, as is done in, e.g., Och and Ney (2000) and Fraser and Marcu (2006).

To reduce the model space we employ the rule acquisition technique of Galley et al. (2004), which obtains rules given a (tree, string) pair as well as an initial alignment between them. We are agnostic about the source of this bootstrap alignment, and in Section 5 we present results based on several different bootstrap alignment qualities. We require an initial set of alignments, which we obtain from a word-for-word alignment procedure such as GIZA++ or EMD. Thus, we are not aligning input data, but rather re-aligning it with a syntax model.
4 The appeal of a syntax alignment model

Consider the example of Figure 2 again. The leftmost derivation is obtained from the bootstrap alignment set. This derivation is reasonable, but some of its rules are poorly motivated from a linguistic standpoint.
The Chinese word roughly means "the two shores" in this context, but the rule R6 learned from the alignment incorrectly includes "between". However, other sentences in the training corpus have the correct alignment, which yields rule R16.

[Table 1: Tuning and testing data sets for the MT system described in Section 5.2.]
Meanwhile, rules R13 and R14, learned from yet other sentences in the training corpus, handle the ... structure (which roughly translates to "in between"), thus allowing the middle derivation. EM distributes rule probabilities in such a way as to maximize the probability of the training corpus. It thus prefers, where possible, to use one rule many times rather than several different rules for the same situation over several sentences. R6 is a possible rule in 46 of the 329,031 sentence pairs in the training corpus, while R16 is a possible rule in 100 sentence pairs.
Well-formed rules are more usable than ill-formed rules, and the partial alignments behind these rules, which are generally also well-formed, become favored as well.
The top row of Figure 3 contains an example of an alignment learned by the bootstrap alignment model that includes an incorrect link. Rule R24, which is extracted from this alignment, is a poor rule. A set of commonly seen rules learned from other training sentences provides a more likely explanation of the data, and the consequent alignment omits the spurious link.
5 Experiments

In this section, we describe the implementation of our semi-idealistic model and our means of evaluating the resulting re-alignments in an MT task.
We begin with a training corpus of Chinese-English and Arabic-English bitexts, the English side parsed by a reimplementation of the standard Collins model (Bikel, 2004).
In order to acquire a syntactic rule set, we also need a bootstrap alignment of each training sentence.
We use an implementation of the GHKM algorithm (Galley et al., 2004) to obtain a rule set for each bootstrap alignment.

[Table 2: A comparison of Chinese BLEU performance between the GIZA baseline (no re-alignment), re-alignment as proposed in Section 3.2, and re-alignment as modified in Section 5.4.]
Now we need an EM algorithm for learning the parameters of the rule set that maximize ∏ p(tree, string). Such an algorithm is presented by Graehl and Knight (2004).
The algorithm consists of two components: Deriv, a procedure for constructing a packed forest of derivation trees of rules that explain a (tree, string) bitext corpus, given that corpus and a rule set; and Train, an iterative parameter-setting procedure.
We initially attempted to use the top-down Deriv algorithm of Graehl and Knight (2004), but as the constraints of the derivation forests are largely lexical, too much time was spent exploring dead ends.
Instead we build derivation forests using the following sequence of operations:
- Binarize rules using the synchronous binarization algorithm for tree-to-string transducers described in Zhang et al. (2006).
- Construct a parse chart with a CKY parser simultaneously constrained on the foreign string and English tree, similar to the bilingual parsing of Wu (1997).¹
- Recover all reachable edges by traversing the chart, starting from the topmost entry.

Since the chart is constructed bottom-up, leaf lexical constraints are encountered immediately, resulting in a narrower search space and faster running time than the top-down Deriv algorithm for this application.
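The reachable-edge recovery in the last step can be sketched as a top-down traversal of a packed forest. The forest encoding below (a map from chart node to alternative edges, each edge listing its child nodes) is our own simplification, not the authors' data structure:

```python
def reachable_edges(forest, root):
    """Collect every edge reachable from the topmost chart entry.
    forest: dict mapping a node to a list of edges; each edge is a
    tuple of child nodes (empty for a leaf edge)."""
    seen, stack, edges = set(), [root], []
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        for edge in forest.get(node, []):
            edges.append((node, edge))
            stack.extend(edge)  # visit the edge's child nodes

    return edges

# Toy chart: the root has two alternative derivations; "dead" was built
# bottom-up but is never reachable from the root, so it is pruned away.
forest = {"root": [("A", "B"), ("C",)], "A": [()], "B": [()],
          "C": [()], "dead": [()]}
edges = reachable_edges(forest, "root")
assert all(node != "dead" for node, _ in edges)
print(len(edges))  # 5
```

The traversal visits each node once, so pruning unreachable bottom-up entries costs only linear time in the size of the chart.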
Derivation forest construction takes around 400 hours of cumulative machine time (on 4-processor machines) for Chinese.
The actual running of EM iterations (which directly implements the Train algorithm of Graehl and Knight (2004)) takes about 10 minutes, after which the Viterbi derivation trees are directly recoverable.

¹ In the cases where a rule is not synchronous-binarizable, standard left-right binarization is performed, and proper permutation of the disjoint English tree spans must be verified when building the part of the chart that uses this rule.
The Viterbi derivation tree tells us which English words produce which Chinese words, so we can extract a word-to-word alignment from it.
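Reading the alignment off the derivation can be sketched as follows: each rule carries the partial alignment noted at extraction time, and the sentence-level alignment is the union of those partial links, offset into sentence positions. The data layout here is an invented simplification:

```python
def alignment_from_derivation(rules_used):
    """rules_used: (e_offset, f_offset, partial_links) triples, one per
    rule in the Viterbi derivation; partial_links are the rule-internal
    (e_idx, f_idx) pairs recorded when the rule was extracted."""
    links = set()
    for e_off, f_off, partial in rules_used:
        for e_idx, f_idx in partial:
            links.add((e_off + e_idx, f_off + f_idx))
    return sorted(links)

# Two rules covering different spans of an invented sentence pair:
print(alignment_from_derivation([(0, 0, [(0, 1), (1, 0)]),
                                 (2, 2, [(0, 0)])]))
# [(0, 1), (1, 0), (2, 2)]
```

Because each rule's partial alignment is well-formed, the union inherits that property, which is what lets re-alignment drop spurious links.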
We summarize the approach described in this paper as follows:

- Obtain bootstrap alignments for a training corpus using GIZA++.
- Extract rules from the corpus and alignments using GHKM, noting the partial alignment that is used to extract each rule.
- Construct derivation forests for each (tree, string) pair, ignoring the alignments, and run EM to obtain Viterbi derivation trees; then use the annotated partial alignments to obtain Viterbi alignments.
- Use the new alignments as input to the MT system described below.
A truly idealistic MT system would directly apply the rule weight parameters learned via EM to a machine translation task. As mentioned in Section 1, we maintain the two-model, or realistic, approach.
Below we briefly describe the translation model, focusing on comparison with the previously described alignment model. Galley et al. (2006) provide a more complete description of the translation model, and DeNeefe et al. (2007) provide a more complete description of the end-to-end translation pipeline.
Although in principle the re-alignment model and translation model learn parameter weights over the same rule space, in practice we limit the rules used for re-alignment to the set of smallest rules that explain the training corpus and are consistent with the bootstrap alignments. This is a compromise made to reduce the search space for EM.
The translation model learns multiple derivations of rules consistent with the re-alignments for each sentence, and learns weights for these by counting and smoothing. A dozen other features are also added to the rules.

[Table 3: Machine translation experimental results evaluated with case-insensitive BLEU4.]
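Setting weights "by counting and smoothing" amounts to relative-frequency estimation over the extracted rules. A sketch, where the add-alpha smoothing scheme is our assumption rather than the paper's:

```python
from collections import Counter

def relative_freq(rule_counts, alpha=0.1):
    """Relative frequency of each rule given its root symbol, with
    add-alpha smoothing. rule_counts: Counter over (root, rule) pairs."""
    root_totals = Counter()   # total count of rules under each root
    root_types = Counter()    # number of distinct rules under each root
    for (root, _), c in rule_counts.items():
        root_totals[root] += c
        root_types[root] += 1
    return {(root, rule): (c + alpha) /
            (root_totals[root] + alpha * root_types[root])
            for (root, rule), c in rule_counts.items()}

counts = Counter({("NP", "R11"): 3, ("NP", "R2"): 1})
probs = relative_freq(counts, alpha=0.0)  # alpha=0: plain relative frequency
print(probs[("NP", "R11")])  # 0.75
```

With alpha > 0 the estimate shifts mass toward rarely seen rules, which is one common way to realize the smoothing step.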
We obtain weights for the combinations of the features by performing minimum error rate training (Och, 2003) on held-out data. We then use a CKY decoder to translate unseen test data using the rules and tuned weights.
Table 1 summarizes the data used in tuning and testing. An initial re-alignment experiment shows a reasonable rise in BLEU scores from the baseline (Table 2), but closer inspection of the rules favored by EM implies we can do even better. EM has a tendency to favor a few large rules over many small rules, even when the small rules are more useful.
Referring to the rules in Figure 2, note that the possible derivations for (taiwan 's, ...)² are R2, R11-R12, and R17-R18.
Clearly the third derivation is not desirable, and we do not discuss it further. Between the first two derivations, R11-R12 is preferred over R2, as the conditioning for possessive insertion is not related to the specific Chinese word being inserted. Of the 1,902 sentences in the training corpus where this pair is seen, the bootstrap alignments yield the R2 derivation 1,649 times and the R11-R12 derivation 0 times. Re-alignment does not change the result much; the new alignments yield the R2 derivation 1,613 times and again never choose R11-R12.
The rules in the second derivation are themselves not rarely seen: R11 is in 13,311 forests other than those where R2 is seen, and R12 is in 2,500 additional forests.

² The Chinese gloss is simply "taiwan".
EM gives R11 a probability of e^-7.72 (better than 98.7% of rules) and R12 a probability of e^-2.96. But R2 receives a probability of e^-6.32 and is preferred over the R11-R12 derivation, which has a combined probability of e^-10.68.
The preference for shorter derivations containing large rules over longer derivations containing small rules is due to a general tendency for EM to prefer derivations with few atoms. Marcu and Wong (2002) note this preference but consider the phenomenon a feature, rather than a bug. Zollmann and Sima'an (2005) combat the overfitting aspect for parsing by using a held-out corpus and a straight maximum likelihood estimate, rather than EM. We take a modeling approach to the phenomenon. As the probability of a derivation is determined by the product of its atom probabilities, longer derivations with more probabilities to multiply have an inherent disadvantage against shorter derivations, all else being equal.
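The inherent disadvantage is easy to see numerically with the probabilities EM assigned in the example above:

```python
import math

# R2 alone beats the two-rule R11-R12 derivation on raw product, even
# though R11's probability is better than 98.7% of rules.
big = math.exp(-6.32)                      # R2
small = math.exp(-7.72) * math.exp(-2.96)  # R11 * R12 = e^-10.68
print(big > small)  # True: the shorter derivation wins
```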
EM is an iterative procedure, and thus such a bias can lead the procedure to converge with artificially raised probabilities for short derivations and the large rules that comprise them. The relatively rare applicability of large rules (and thus lower observed partial counts) does not overcome the inherent advantage of large coverage.
[Table 4: Re-alignment performance with semi-supervised EMD bootstrap alignments.]

To combat this, we introduce size terms into our generative story, ensuring that all competing derivations for the same sentence contain the same number of atoms:
- Choose a rule size s with cost c_size(s)^(s-1).
- Choose a rule r (of size s) to replace the start symbol, with probability p_rule(r | s, v).
- For each variable in the partially completed (tree, string) pair, continue to choose sizes followed by rules, recursively, to replace these variables until there are no variables remaining.
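The size-augmented story can be sketched as follows; the atom count below shows why competing derivations now compare fairly (the sizes and costs are invented for illustration):

```python
import math

# Invented costs for choosing each rule size.
c_size = {1: 0.6, 2: 0.3, 3: 0.1}

def atoms(rule_sizes):
    """Total atoms in a derivation: one rule atom per rule, plus s-1
    size atoms for each size-s rule (from the exponent in the story)."""
    return sum(1 + (s - 1) for s in rule_sizes)

def size_cost(rule_sizes):
    """Product of c_size(s)^(s-1) over the rules in a derivation."""
    return math.prod(c_size[s] ** (s - 1) for s in rule_sizes)

# One size-3 rule, a size-2 + size-1 pair, and three size-1 rules all
# spend three atoms, so their probabilities compare fairly.
print(atoms([3]), atoms([2, 1]), atoms([1, 1, 1]))  # 3 3 3
```

Note that size-1 rules contribute no size atoms at all (c_size(1)^0 = 1), so a derivation built entirely from them pays only its rule probabilities.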
This generative story changes the derivation comparison from R2 vs. R11-R12 to S2-R2 vs. R11-R12, where S2 is the atom that represents the choice of size 2 (the size of a rule in this context is the number of non-leaf and non-root nodes in its tree fragment). Note that the variable number of inclusions implied by the exponent in the generative story above ensures that all derivations have the same size. For example, a derivation with one size-3 rule, a derivation with one size-2 and one size-1 rule, and a derivation with three size-1 rules would each have three atoms.

With this revised model, which allows a fair comparison of derivations, the R11-R12 derivation is chosen 1,636 times, and S2-R2 is not chosen. R2 does, however, appear in the translation model, as the expanded rule extraction described in Section 5.2 creates R2 by joining R11 and R12. The probability of size atoms, like that of rule atoms, is decided by EM. The revised generative story tends to encourage smaller sizes by virtue of the exponent. This does not, however, simply ensure that the largest number of rules per derivation is used in all cases.
Ill-fitting and poorly motivated rules such as R22, R23, and R24 in Figure 3 are not preferred over R16, even though they are smaller.
However, R14 and R16 are preferred over R6, as the former are useful rules. Although the modified model does not sum to 1, it leads to an improvement in BLEU score, as can be seen in the last row of Table 2.
We performed primary experiments on two different bootstrap setups in two languages: the initial experiment uses the same data set for the GIZA++ initial alignment as is used in the re-alignment, while an experiment on better quality bootstrap alignments uses a much larger data set. For each bootstrapping in each language we compared the baseline of using these alignments directly in an MT system with the experiment of using the alignments obtained from the re-alignment procedure described in Section 5.4.

For each experiment we report: the number of rules extracted by the expanded GHKM algorithm of Galley et al. (2006) for the translation model, converged BLEU scores on the tuning set, and finally BLEU performance on the held-out test set. Data set specifics for the GIZA++ bootstrapping and BLEU results are summarized in Table 3.
The results presented demonstrate that we are able to improve on unsupervised GIZA++ alignments by about 1 BLEU point for Chinese and around 0.4 BLEU points for Arabic using an additional unsupervised algorithm that requires no human-aligned data.
If human-aligned data is available, the EMD algorithm provides higher baseline alignments than GIZA++ that have led to better MT performance (Fraser and Marcu, 2006). As a further experiment we repeated the experimental conditions from Table 3, this time bootstrapped with the semi-supervised EMD method, which uses the larger bootstrap GIZA corpora described in Table 3 and an additional 64,469 / 48,650 words of hand-aligned English-Chinese and 43,782 / 31,457 words of hand-aligned English-Arabic.
The results of this advanced experiment are in Table 4. We show a 0.42 BLEU gain for Arabic, but no movement for Chinese.
We believe increasing the size of the re-alignment corpora will increase BLEU gains in this experimental condition, but we leave those results for future work.
We can see from the results presented that the syntax-aware re-alignment procedure of Section 3.2, coupled with the addition of size parameters to the generative story from Section 5.4, serves to remove links from the bootstrap alignments that cause less useful rules to be extracted, and thus increases the overall quality of the rules and, hence, the system performance.
We thus see the benefit of including syntax in an alignment model, bringing the two models of the realistic machine translation path somewhat closer together.
Acknowledgments

We thank David Chiang, Steve DeNeefe, Alex Fraser, Victoria Fossum, Jonathan Graehl, Liang Huang, Daniel Marcu, Oana Postolache, Michael Pust, Jason Riesa, Jens Vockler, and Wei Wang for help and discussion.
This research was supported by NSF (grant IIS-0428020) and DARPA (contract HR0011-06-C-0022).
