Word
alignment
is
the
problem
of
annotating
parallel
text
with
translational
correspondence
.
Previous
generative
word
alignment
models
have
made
structural
assumptions
such
as
the
1-to-1
,
1-to-N
,
or
phrase-based
consecutive
word
assumptions
,
while
previous
discriminative
models
have
either
made
such
an
assumption
directly
or
used
features
derived
from
a
generative
model
making
one
of
these
assumptions
.
We
present
a
new
generative
alignment
model
which
avoids
these
structural
limitations
,
and
show
that
it
is
effective
when
trained
using
both
unsupervised
and
semi-supervised
training
methods
.
1
Introduction
Several
generative
models
and
a
large
number
of
discriminatively
trained
models
have
been
proposed
in
the
literature
to
solve
the
problem
of
automatic
word
alignment
of
bitexts
.
The
generative
proposals
have
required
unrealistic
assumptions
about
the
structure
of
the
word
alignments
.
Two
assumptions
are
particularly
common
.
The
first
is
the
1-to-N
assumption
,
meaning
that
each
source
word
generates
zero
or
more
target
words
,
which
requires
heuristic
techniques
in
order
to
obtain
alignments
suitable
for
training
an SMT system
.
The
second
is
the
consecutive
word-based
"
phrasal
SMT
"
assumption
.
This
does
not
allow
gaps
,
which
can
be
used
to
particular
advantage
by
SMT
models
that
model
hierarchical
structure
.
Previous
discriminative
models
have
either
made
such
assumptions
directly
or
used
features
from
a
generative
model
making
such
an
assumption
.
Our
objective
is
to
automatically
produce
alignments
which
can
be
used
to
build
high
quality
machine
translation
systems
.
These
are
presumably
close
to
the
alignments
that
trained
bilingual
speakers
produce
.
Human
annotated
alignments
often
contain
M-to-N
alignments
,
where
several
source
words
are
aligned
to
several
target
words
and
the
resulting
unit
cannot be further decomposed
.
Source
or
target
words
in
a
single
unit
are
sometimes
non-consecutive
.
In
this
paper
,
we
describe
a
new
generative
model
which
directly
models
M-to-N
non-consecutive
word
alignments
.
The
rest
of
the
paper
is
organized
as
follows
.
The
generative
story
is
presented
,
followed
by
the
mathematical
formulation
.
Details
of
the
unsupervised
training
procedure
are
described
.
The
generative
model
is
then
decomposed
into
feature
functions
used
in
a
log-linear
model
which
is
trained
using
a
semi-supervised
algorithm
.
Experiments show improvements in word alignment accuracy, and using the generated alignments in hierarchical and phrasal SMT systems results in increased BLEU scores
Previous
work
is
discussed
and
this
is
followed
by
the
conclusion
.
2
LEAF
:
a
generative
word
alignment
model
We
introduce
a
new
generative
story
which
enables
the
capture
of
non-consecutive
M-to-N
alignment
structure
.
We
have
attempted
to
use
the
same
labels
as
the
generative
story
for
Model
4
(
Brown
et
al.
,
1993
)
,
which
we
are
extending
.
Our
generative
story
describes
the
stochastic
generation
of
a
target
string
f
(
sometimes
referred
to
as
the
French
string
,
or
foreign
string
)
from
a
source
string
e
(
sometimes
referred
to
as
the
English
string
)
,
consisting
of
l
words
.
The
variable
m
is
the
length
of
f.
We
generally
use
the
index
i
to
refer
to
source
words
(
e_i
is
the
English
word
at
position
i
)
,
and
j
to
refer
to
target
words
.
Our
generative
story
makes
the
distinction
between
different
types
of
source
words
.
There
are
head
words
,
non-head
words
,
and
deleted
words
.
Similarly
,
for
target
words
,
there
are
head
words
,
non-head
words
,
and
spurious
words
.
A
head
word
is
linked
to
zero
or
more
non-head
words
;
each non-head word is linked to exactly one head word
The
purpose
of
head
words
is
to
try
to
provide
a
robust
representation
of
the
semantic
features
necessary
to
determine
translational
correspondence
.
This
is
similar
to
the
use
of
syntactic
head
words
in
statistical
parsers
to
provide
a
robust
representation
of
the
syntactic
features
of
a
parse
sub-tree
.
A
minimal
translational
correspondence
consists
of
a
linkage
between
a
source
head
word
and
a
target
head
word
(
and
by
implication
,
the
non-head
words
linked
to
them
)
.
Deleted
source
words
are
not
involved
in
a
minimal
translational
correspondence
,
as
they
were
"
deleted
"
by
the
translation
process
.
Spurious
target
words
are
also
not
involved
in
a
minimal
translational
correspondence
,
as
they
spontaneously
appeared
during
the
generation
of
other
target
words
.
Figure
1
shows
a
simple
example
of
the
stochastic
generation
of
a
French
sentence
from
an
English
sentence
,
annotated
with
the
step
number
in
the
generative
story
.
1. Choose the source word type.
2. Choose the identity of the head word for each non-head word.
3. Choose the identity of the generated target head word for each source head word.
4. Choose the number of words in a target cept, conditioned on the identity of the source head word and the source cept size (γ_i is 1 if the cept size is 1, and 2 if the cept size is greater).
5. Choose the number of spurious words: choose φ_0 according to the distribution p(φ_0 | Σ_i φ_i) = C(Σ_i φ_i, φ_0) p_0^(Σ_i φ_i − φ_0) p_1^(φ_0), where p_0 and p_1 are defined below.
6. Choose the identity of the spurious words.
7. Choose the identity of the target non-head words linked to each target head word.
8. Choose the position of the target head and non-head words.
[Figure 1: Generative story example; (number) indicates the step number. The English sentence "absolutely , they do not want to spend that money" generates the French sentence "ils ne desirent pas depenser cet argent aujourd'hui"; "absolutely" and the comma are deleted source words, and "aujourd'hui" is a spurious target word.]
If any position in step 8 was chosen twice, return "failure".
9. Choose the position of the spuriously generated words.
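To make the objects in this story concrete, the following minimal sketch (ours, not the authors' code; all class and constant names, and the type encoding head = 1, non-head = −1, deleted/spurious = 0, are our assumptions) represents the structure LEAF predicts over: typed words on each side, head links for non-head words, and head-to-head links forming minimal translational correspondences.

```python
# A minimal sketch (ours, not the authors' code) of the alignment structure
# LEAF predicts over. Assumed type encoding: head = 1, non-head = -1,
# deleted (source) or spurious (target) = 0.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

HEAD, NONHEAD, OTHER = 1, -1, 0

@dataclass
class Word:
    token: str
    wtype: int                    # HEAD, NONHEAD, or OTHER
    head: Optional[int] = None    # index of this word's head if NONHEAD

@dataclass
class Alignment:
    src: List[Word]               # English side
    tgt: List[Word]               # French side
    links: List[Tuple[int, int]] = field(default_factory=list)  # head-to-head

    def minimal_units(self):
        """Yield each minimal translational correspondence: a linked pair of
        head words plus the non-head words attached to each of them."""
        for i, j in self.links:
            src_cept = [i] + [k for k, w in enumerate(self.src)
                              if w.wtype == NONHEAD and w.head == i]
            tgt_cept = [j] + [k for k, w in enumerate(self.tgt)
                              if w.wtype == NONHEAD and w.head == j]
            yield sorted(src_cept), sorted(tgt_cept)
```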
We
note
that
the
steps
which
return
"
failure
"
are
required
because
the
model
is
deficient
.
Deficiency
means
that
a
portion
of
the
probability
mass
in
the
model
is
allocated
towards
generative
stories
which
would
result
in
infeasible
alignment
structures
.
Our
model
has
deficiency
in
the
non-spurious
target
word
placement
,
just
as
Model
4
does
.
It
has
additional
deficiency
in
the
source
word
linking
decisions
.
(
Och
and
Ney
,
2003
)
presented
results
suggesting
that
the
additional
parameters
required
to
ensure
that
a
model
is
not
deficient
result
in
inferior
performance
,
but
we
plan
to
study
whether
this
is
the
case
for
our
generative
model
in
future
work
.
Given
e
,
f
and
a
candidate
alignment
a
,
which
represents
both
the
links
between
source
and
target
head words
and
the
head-word
connections
of
the
non-head
words
,
we
would
like
to
calculate
p(f, a | e). The formula is the product of the probabilities of each of the decisions in the generative story. In it, δ(i, i′)
is
the
Kronecker
delta
function
which
is
equal
to
1
if
i = i′
and
0
otherwise
.
ρ_i
is
the
position
of
the
closest
English
head
word
to
the
left
of
the
word
at
i
or
0
if
there
is
no
such
word
.
class_e(e_i) is the word class of the English word at position i, class_f(f_j) is the word class of the French word at position j, and class_h(f_j) is the word class of the French head word at position j.
p_0 and p_1 are parameters describing the probability of not generating and of generating a target spurious word from each non-spurious target word, with p_0 + p_1 = 1.
The
alignment
structure
used
in
many
other
models
can
be
modeled
using
special
cases
of this
framework
.
We
can
express
the
1-to-N
structure
of
models
like
Model
4
by
disallowing τ_i = −1, while for 1-to-1 structure we both disallow τ_i = −1 and deterministically set φ_i = τ_i.
We
can
also
specialize
our
generative
story
to
the
consecutive
word
M-to-N
alignments
used
in
"
phrase-based
"
models
,
though
in
this
case
the
conditioning
of
the
generation
decisions
would
be
quite
different
.
This
involves
adding
checks
on
source
and
target
connection
geometry
to
the
generative
story
which
,
if
violated
,
would
return
"
failure
"
;
naturally
this
is
at
the
cost
of
additional
deficiency
.
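These structural special cases can be written as feasibility checks; below is a sketch under the encoding assumed earlier (τ_i = −1 for source non-head words, 1 for head words, 0 for deleted words, with φ_i the target cept size).

```python
# Sketch (our encoding) of the structural special cases described above:
# 1-to-N forbids source non-head words (tau_i = -1), and 1-to-1 further
# sets phi_i = tau_i, so head words (tau_i = 1) emit exactly one target
# word and deleted words (tau_i = 0) emit none.
def feasible_1_to_n(tau):
    return all(t != -1 for t in tau)

def feasible_1_to_1(tau, phi):
    return feasible_1_to_n(tau) and all(p == t for p, t in zip(phi, tau))
```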
2.2
Unsupervised
Parameter
Estimation
We
can
perform
maximum
likelihood
estimation
of
the
parameters
of
this
model
in
a
similar
fashion
to
that
of
Model
4
(
Brown
et
al.
,
1993
)
,
described
thoroughly
in
(
Och
and
Ney
,
2003
)
.
We
use
Viterbi
training
(
Brown
et
al.
,
1993
)
but
neighborhood
estimation
(
Al-Onaizan
et
al.
,
1999
;
Och
and
Ney
,
2003
)
or
"
pegging
"
(
Brown
et
al.
,
1993
)
could
also
be
used
.
To
initialize
the
parameters
of
the
generative
model
for
the
first
iteration
,
we
use
bootstrapping
from
a
1-to-N
and
an
M-to-1
alignment
.
We
use
the
intersection
of
the
1-to-N
and
M-to-1
alignments
to
establish
the
head
word
relationship
,
the
1-to-N
alignment
to
delineate
the
target
word
cepts
,
and
the
M-to-1
alignment
to
delineate
the
source
word
cepts
.
In
bootstrapping
,
a
problem
arises
when
we
encounter
infeasible
alignment
structure
where
,
for
instance
,
a
source
word
generates
target
words
but
no
link
between
any
of
the
target
words
and
the
source
word
appears
in
the
intersection
,
so
it
is
not
clear
which
target
word
is
the
target
head
word
.
To
address
this
,
we
consider
each
of
the
N
generated
target
words
as
the
target
head
word
in
turn
and
assign
this
configuration
1
/
N
of
the
counts
.
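A sketch of this count-splitting step follows (the function and variable names are ours, and `counts` stands in for whatever sufficient statistics the training procedure collects).

```python
# Sketch (our names) of the bootstrapping fix described above: when the
# intersection identifies no head word among the N target words generated
# by a source word, each candidate head configuration receives 1/N of the
# counts instead of a full count.
from collections import defaultdict

def add_split_counts(counts, source_word, candidate_heads):
    n = len(candidate_heads)
    for tgt in candidate_heads:
        counts[(source_word, tgt)] += 1.0 / n

counts = defaultdict(float)
add_split_counts(counts, "money", ["cet", "argent"])  # each head gets 0.5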
For
each
iteration
of
training
we
search
for
the
Viterbi
solution
for
millions
of
sentences
.
Evidence
that
inference
over
the
space
of
all
possible
alignments
is
intractable
has
been
presented
,
for
a
similar
problem
,
in
(
Knight
,
1999
)
.
Unlike
phrase-based
SMT
,
left-to-right
hypothesis
extension
using
a
beam
decoder
is
unlikely
to
be
effective
because
in
word
alignment
reordering
is
not
limited
to
a
small
local
window
and
so
the
necessary
beam
would
be
very
large
.
We
are
not
aware
of
admissible
or
inadmissible
search
heuristics
which
have
been
shown
to
be
effective
when
used
in
conjunction
with
a
search
algorithm
similar
to
A*
search
for
a
model
predicting
over
a
structure
like
ours
.
Therefore
we
use
a
simple
local
search
algorithm
which
operates
on
complete
hypotheses
.
(
Brown
et
al.
,
1993
)
defined
two
local
search
operations
for
their
1-to-N
alignment
Models
3
,
4
and
5
.
All
alignments
which
are
reachable
via
these
operations
from
the
starting
alignment
are
considered
.
One
operation
is
to
change
the
generation
decision
for
a
French
word
to
a
different
English
word
(
move
)
,
and
the
other
is
to
swap
the
generation
decision
for
two
French
words
(
swap
)
.
All
possible
operations
are
tried
and
the
best
is
chosen
.
This
is
repeated
.
The
search
is
terminated
when
no
operation
results
in
an
improvement
.
(
Och
and
Ney
,
2003
)
discussed
efficient
implementation
.
In
our
model
,
because
the
alignment
structure
is
richer
,
we
define
the
following
operations
:
move
French
non-head
word
to
new
head
,
move
English
non-head
word
to
new
head
,
swap
heads
of
two
French
non-head
words
,
swap
heads
of
two
English
non-head
words
,
swap
English
head
word
links
of
two
French
head
words
,
link
English
word
to
French
word
making
new
head
words
,
unlink
English
and
French
head
words
.
We
use
multiple
restarts
to
try
to
reduce
search
errors
.
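The search loop itself is a standard greedy hill climb over complete hypotheses; a sketch follows (names are ours; the `neighbors` callback would enumerate the move/swap/link/unlink operations above, and `score` would be the model probability of an alignment).

```python
# A generic greedy hill climb over complete hypotheses (our sketch). The
# neighbors callback would enumerate the move/swap/link/unlink operations
# defined above; score would be the model probability of an alignment.
def hill_climb(start, neighbors, score):
    current, best = start, score(start)
    improved = True
    while improved:                      # stop when no operation improves
        improved = False
        for cand in neighbors(current):  # try all possible operations
            s = score(cand)
            if s > best:
                current, best, improved = cand, s, True
    return current, best

def hill_climb_with_restarts(starts, neighbors, score):
    # multiple restarts reduce, but do not eliminate, search errors
    return max((hill_climb(a, neighbors, score) for a in starts),
               key=lambda pair: pair[1])
```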
(
Germann
et
al.
,
2004
;
Marcu
and
Wong
,
2002
)
have
some
similar
operations
without
the
head
word
distinction
.
3
Semi-supervised
parameter
estimation
Equation 6 defines a log-linear model. Each feature function h_m has an associated weight λ_m:

p(a | e, f) = exp(Σ_m λ_m h_m(a, e, f)) / Σ_{a′} exp(Σ_m λ_m h_m(a′, e, f))   (6)

Given a vector of these weights λ, the alignment search problem, i.e. the search to return the best alignment â of the sentences e and f according to the model, is specified by Equation 7:

â = argmax_a Σ_m λ_m h_m(a, e, f)   (7)
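Since the normalizer in Equation 6 does not depend on a, the search of Equation 7 only needs the weighted feature sum; a sketch (names are ours):

```python
# Sketch (our names) of the decision rule in Equations 6 and 7: the
# normalizer of Equation 6 is constant in a, so the best alignment is the
# one maximizing the weighted sum of feature function values.
def loglinear_score(lam, features, a, e, f):
    return sum(l * h(a, e, f) for l, h in zip(lam, features))

def best_alignment(candidates, lam, features, e, f):
    return max(candidates,
               key=lambda a: loglinear_score(lam, features, a, e, f))
```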
We
decompose
the
new
generative
model
presented
in
Section
2
in
both
translation
directions
to
provide
the
initial
feature
functions
for
our
log-linear model (features 1 to 10 and 16 to 25 in Table 1); the remaining features are features 11 to 15 and 26 to 30 in Table 1.
We
use
the
semi-supervised
EMD
algorithm
(
Fraser
and
Marcu
,
2006b
)
to
train
the
model
.
The
initial
M-step
bootstraps
parameters
as
described
in
Section
2.2
from
an
M-to-1
and
a
1-to-N
alignment
.
We
then
perform
the
D-step
following
(
Fraser
and
Marcu, 2006b).

[Figure 2: Two alignments with the same translational correspondence.]
Given
the
feature
function
parameters
estimated
in
the
M-step
and
the
feature
function
weights
A
determined
in
the
D-step
,
the
E-step
searches
for
the
Viterbi
alignment
for
the
full
training
corpus
.
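Schematically, the EMD loop interleaves these three steps; the sketch below (every function name is a placeholder for the corresponding step, not the authors' API) shows the control flow.

```python
# Control-flow sketch of the EMD loop described above (every function name
# here is a placeholder for the corresponding step, not the authors' API).
def emd_train(corpus, labeled_data, bootstrap, d_step, e_step, m_step,
              iterations=2):
    params = bootstrap(corpus)                    # initial M-step
    lam = None
    for _ in range(iterations):
        lam = d_step(params, labeled_data)        # minimize 1 - F-Measure
        viterbi = e_step(corpus, params, lam)     # re-align the corpus
        params = m_step(viterbi)                  # re-estimate parameters
    return params, lam
```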
We
use
1 − F-Measure
as
our
error
criterion
.
(
Fraser
and
Marcu
,
2006a
)
established
that
it
is
important
to
tune
α
(
the
trade-off
between
Precision
and
Recall
)
to
maximize
performance
.
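The α-weighted F-Measure of (Fraser and Marcu, 2006a) is the weighted harmonic mean of Precision and Recall, recovering the balanced F-Measure at α = 0.5; a sketch:

```python
# Sketch of the alpha-weighted F-Measure of (Fraser and Marcu, 2006a):
# the weighted harmonic mean of Precision and Recall. alpha = 0.5 gives
# the usual balanced F-Measure; the error criterion is 1 - F-Measure.
def f_measure(precision, recall, alpha):
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)

assert abs(f_measure(0.8, 0.6, 0.5) - 2 * 0.8 * 0.6 / (0.8 + 0.6)) < 1e-12
```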
In
working
with
LEAF
,
we
discovered
a
methodological
problem
with
our
baseline
systems
,
which
is
that
two
alignments
which
have
the
same
translational
correspondence
can
have
different
F-Measures
.
An
example
is
shown
in
Figure
2
.
To
overcome
this
problem
we
fully
interlinked
the
transitive
closure
of
the
undirected
bigraph
formed
by
each
alignment
hypothesized
by
our
baseline
alignment
systems¹
.
This
operation
maps
the
alignment
shown
to
the
left
in
Figure
2
to
the
alignment
shown
to
the
right
.
This
operation
does
not
change
the
collection
of
phrases
or
rules
extracted
from
a
hypothesized
alignment
;
see
,
for
instance
,
(
Koehn
et
al.
,
2003
)
.
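Full interlinking amounts to taking connected components of the undirected link graph and linking every source/target pair within a component; a sketch (ours) using union-find:

```python
# Sketch (ours) of the full-interlinking operation: compute connected
# components of the undirected bigraph formed by the links, then link
# every source word to every target word within each component.
def fully_interlink(links):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for i, j in links:
        parent[find(("s", i))] = find(("t", j))   # union the two endpoints

    components = {}
    for i, j in links:
        src, tgt = components.setdefault(find(("s", i)), (set(), set()))
        src.add(i)
        tgt.add(j)
    return sorted({(i, j) for src, tgt in components.values()
                   for i in src for j in tgt})

# e.g. the links {(0,0), (1,0), (1,1)} become the fully linked
# {(0,0), (0,1), (1,0), (1,1)}, as in Figure 2.
print(fully_interlink([(0, 0), (1, 0), (1, 1)]))
```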
Working
with
this
fully
interlinked
representation
we
found
that
the
best
settings
of
α were α = 0.1 for the Arabic/English task and α = 0.4
for
the
French
/
English
task
.
4 Experiments

We perform experiments on two large alignment tasks
,
for
Arabic
/
English
and
French
/
English
data
sets
.
Statistics
for
these
sets
are
shown
in
Table
2
.
All
of
the
data
used
is
available
from
the
Linguistic
Data
Consortium
except
for
the
French
/
English
gold standard alignments, which are available from the authors.

¹All of the gold standard alignments were fully interlinked as distributed. We did not modify the gold standard alignments.

Table 1: Feature functions
  t(f_j | e_i): translation without dependency on word type
  t(f_j | e_i): translation table from the final HMM iteration
  s(φ_i | γ_i): target cept size without dependency on the source head word e
  s_0(φ_0 | Σ_i φ_i): number of unaligned target words
  t_0(f_j): identity of unaligned target words
  s(φ_i | e_i): target cept size without dependency on γ_i
  target spurious word penalty
  (same features, other direction)
To
build
all
alignment
systems
,
we
start
with
5
iterations
of
Model
1
followed
by
4
iterations
of
HMM
(
Vogel
et
al.
,
1996
)
,
as
implemented
in
GIZA++
(
Och
and
Ney
,
2003
)
.
For
all
non-LEAF
systems
,
we
take
the
best
performing
of
the
"
union
"
,
"
refined
"
and
"
intersection
"
symmetrization
heuristics
(
Och
and
Ney
,
2003
)
to
combine
the
1-to-N
and
M-to-1
directions
resulting
in
an
M-to-N
alignment
.
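A sketch of the "union" and "intersection" heuristics on link sets follows (the "refined" heuristic, which grows the intersection toward the union, is omitted here).

```python
# Sketch of the "union" and "intersection" symmetrization heuristics on
# link sets (the "refined" heuristic, which grows the intersection toward
# the union, is omitted here).
def union_sym(one_to_n_links, m_to_one_links):
    return sorted(set(one_to_n_links) | set(m_to_one_links))

def intersection_sym(one_to_n_links, m_to_one_links):
    return sorted(set(one_to_n_links) & set(m_to_one_links))
```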
Because
these
systems
do
not
output
fully
linked
alignments
,
we
fully
link
the
resulting
alignments
as
described
at
the
end
of
Section
3
.
The
reader
should
recall
that
this
does
not
change
the
set
of
rules
or
phrases
that
can
be
extracted
using
the
alignment
.
We
perform
one
main
comparison, of semi-supervised systems, since these are what we will use to produce alignments for SMT.
We
compare
semi-supervised
LEAF
with
a
previous
state
of
the
art
semi-supervised
system
(
Fraser
and
Marcu
,
2006b
)
.
We
performed
translation
experiments
on
the
alignments
generated
using
semi-supervised
training
to
verify
that
the
improvements
in
F-Measure
result
in
increases
in
BLEU
.
We
also
compare
the
unsupervised
LEAF
system
with
GIZA++
Model
4
to
give
some
idea
of
the
performance
of
the
unsupervised
model
.
We
made
an
effort
to
optimize
the
free
parameters
of
GIZA++
,
while
for
unsupervised
LEAF
there
are
no
free
parameters
to
optimize
.
A
single
iteration
of
unsupervised
LEAF²
is
compared
with
heuristic
symmetrization
of
GIZA++
's
extension
of
Model
4
(
which
was
run
for
four
iterations
)
.
LEAF
was
bootstrapped
as
described
in
Section
2.2
from
the
HMM
Viterbi
alignments
.
Results
for
the
experiments
on
the
French
/
English
data
set
are
shown
in
Table
3
.
We
ran
GIZA++
for
four
iterations
of
Model
4
and
used
the
"
refined
"
heuristic
(
line
1
)
.
We
ran
the
baseline
semi-supervised
system
for
two
iterations
(
line
2
)
,
and
in
contrast
with
(
Fraser
and
Marcu
,
2006b
)
we
found
that
the
best
symmetrization
heuristic
for
this
system
was
"
union
"
,
which
is
most
likely
due
to
our
use
of
fully
linked
alignments
which
was
discussed
at
the
end
of
Section
3
.
We
observe
that
LEAF
unsupervised
(
line
3
)
is
competitive
with
GIZA++
(
line
1
)
,
and
is
in
fact
competitive
with
the
baseline
semi-supervised
result
(
line
2
)
.
We
ran
the
LEAF
semi-supervised
system
for
two
iterations
(
line
4
)
.
The
best
result
is
the
LEAF
semi-supervised
system
,
with
a
gain
of
1.8
F-Measure
over
the
LEAF
unsupervised
system
.
For
French
/
English
translation
we
use
a
state
of
the
art
phrase-based
MT
system
similar
to
(
Och
and
Ney
,
2004
;
Koehn
et
al.
,
2003
)
.
The
translation
test
data
is
described
in
Table
2
.
We
use
two
trigram
language
models
,
one
built
using
the
English
portion
of
the
training
data
and
the
other
built
using
additional
English
news
data
.
The
BLEU
scores
reported
in
this
work
are
calculated
using
lowercased
and
tokenized
data
.
For
semi-supervised
LEAF
the
gain
of
0.46
BLEU
over
the
semi-supervised
baseline
is
not
statistically
significant
(
a
gain
of
0.78
BLEU
would
be
required
)
,
but
LEAF
semi-supervised
compared
with
GIZA++
is
significant
,
with
a
gain
of
1.23
BLEU
.
We
note
that
this
shows
a
large
gain
in
translation quality over that obtained using GIZA++, because BLEU is calculated using only a single reference for the French/English task.

²Unsupervised LEAF corresponds to using the feature functions derived from the generative model, while setting λ_m = 0 for other values of m.

[Table 2: Data set statistics: training words and singletons, alignment discriminative training and test words and links, translation dev and test sentences and words.]
Results
for
the
Arabic
/
English
data
set
are
also
shown
in
Table
3
.
We
used
a
large
gold
standard
word
alignment
set
available
from
the
LDC
.
We
ran
GIZA++
for
four
iterations
of
Model
4
and
used
the
"
union
"
heuristic
.
We
compare
GIZA++
(
line
1
)
with
one
iteration
of
the
unsupervised
LEAF
model
(
line
2
)
.
The
unsupervised
LEAF
system
is
worse
than
four
iterations
of
GIZA++
Model
4
.
We
believe
that
the
features
in
LEAF
are
too high-dimensional
to
use
for
the
Arabic
/
English
task
without
the
backoffs
available
in
the
semi-supervised
models
.
The
baseline
semi-supervised
system
(
line
3
)
was
run
for
three
iterations
and
the
resulting
alignments
were
combined
with
the
"
union
"
heuristic
.
We
ran
the
LEAF
semi-supervised
system
for
two
iterations
.
The
best
result
is
the
LEAF
semi-supervised
system
(
line
4
)
,
with
a
gain
of
5.4
F-Measure
over
the
baseline
semi-supervised
system
.
For
Arabic
/
English
translation
we
train
a
state
of
the
art
hierarchical
model
similar
to
(
Chiang
,
2005
)
using
our
Viterbi
alignments
.
The
translation
test
data
used
is
described
in
Table
2
.
We
use
two
trigram
language
models
,
one
built
using
the
English
portion
of
the
training
data
and
the
other
built
using
additional
English
news
data
.
The
test
set
is
from
the
NIST
2005
translation
task
.
LEAF
had
the
best
performance
scoring
1.43
BLEU
better
than
the
baseline
semi-supervised
system
,
which
is
statistically
significant
.
5
Previous
Work
The
LEAF
model
is
inspired
by
the
literature
on
generative
modeling
for
statistical
word
alignment
and
particularly
by
Model
4
(
Brown
et
al.
,
1993
)
.
Much
of
the
additional
work
on
generative
modeling
of
1-to-N
word
alignments
is
based
on
the
HMM
model
(
Vogel
et
al.
,
1996
)
.
(
Toutanova
et
al.
,
2002
)
and
(
Lopez
and
Resnik
,
2005
)
presented
a
variety
of
refinements
of
the
HMM
model
particularly
effective
for
low
data
conditions
.
(
Deng
and
Byrne
,
2005
)
described
work
on
extending
the
HMM
model
using
a
bigram
formulation
to
generate
1-to-N
alignment
structure
.
The
common
thread
connecting
these
works
is
their
reliance
on
the
1-to-N
approximation
,
while
we
have
defined
a
generative
model
which
does
not
require
use
of
this
approximation
,
at
the
cost
of
having
to
rely
on
local
search
.
There
has
also
been
work
on
generative
models
for
other
alignment
structures
.
(
Wang
and
Waibel
,
1998
)
introduced
a
generative
story
based
on
extension
of
the
generative
story
of
Model
4
.
The
alignment
structure
modeled
was
"
consecutive
M
to
non-consecutive
N
"
.
(
Marcu
and
Wong
,
2002
)
defined
the
Joint
model
,
which
modeled
consecutive
word
M-to-N
alignments
.
(
Matusov
et
al.
,
2004
)
presented
a
model
capable
of
modeling
1-to-N
and
M-to-1
alignments
(
but
not
arbitrary
M-to-N
alignments
)
which
was
bootstrapped
from
Model
4
.
LEAF
directly
models
non-consecutive
M-to-N
alignments
.
One
important
aspect
of
LEAF
is
its
symmetry
.
(
Och
and
Ney
,
2003
)
invented
heuristic
symmetrization of the output of a 1-to-N model and an M-to-1 model, resulting in an M-to-N alignment; this was extended in (Koehn et al., 2003).

[Table 3: Experimental results for the French/English and Arabic/English tasks, comparing LEAF unsupervised and LEAF semi-supervised with the baselines.]
We
have
used
insights
from
these
works
to
help
determine
the
structure
of
our
generative
model
.
(
Zens
et
al.
,
2004
)
introduced
a
model
featuring
a
symmetrized
lexicon
.
(
Liang
et
al.
,
2006
)
showed
how
to
train
two
HMM
models
,
a
1-to-N
model
and
a
M-to-1
model
,
to
agree
in
predicting
all
of
the
links
generated
,
resulting
in
a
1-to-1
alignment
with
occasional
rare
1-to-N
or
M-to-1
links
.
We
improve
on
these
works
by
choosing
a
new
structure
for
our
generative
model
,
the
head
word
link
structure
,
which
is
both
symmetric
and
a
robust
structure
for
modeling non-consecutive
M-to-N
alignments
.
In
designing
LEAF
,
we
were
also
inspired
by
dependency-based
alignment
models
(
Wu
,
1997
;
Alshawi
et
al.
,
2000
;
Yamada
and
Knight
,
2001
;
Cherry
and
Lin
,
2003
;
Zhang
and
Gildea
,
2004
)
.
In
contrast
with
their
approaches
,
we
have
a
very
flat
,
one-level
notion
of
dependency
,
which
is
bilingually
motivated
and
learned
automatically
from
the
parallel
corpus
.
This
idea
of
dependency
has
some
similarity
with
hierarchical
SMT
models
such
as
(
Chiang
,
2005
)
.
The
discriminative
component
of
our
work
is
based
on
a
plethora
of
recent
literature
.
This
literature
generally
views
the
discriminative
modeling
problem
as
a
supervised
problem
involving
the
combination
of
heuristically
derived
feature
functions
.
These
feature
functions
generally
include
the
prediction
of
some
type
of
generative
model
,
such
as
the
HMM
model
or
Model
4
.
A
discriminatively
trained
1-to-N
model
with
feature
functions
specifically
designed
for
Arabic
was
presented
in
(
Ittycheriah
and
Roukos
,
2005
)
.
(
Lacoste-Julien
et
al.
,
2006
)
created
a
discriminative
model
able
to
model
1-to-1
,
1-to-2
and
2-to-1
alignments
for
which
the
best
results
were
obtained
using
features
based
on
symmetric
HMMs
trained
to
agree
(Liang et al., 2006), and
intersected
Model
4
.
(
Ayan
and
Dorr
,
2006
)
defined
a
discriminative
model
which
learns
how
to
combine
the
predictions
of
several
alignment
algorithms
.
The
experiments
performed
included
Model
4
and
the
HMM
extensions
of
(
Lopez
and
Resnik
,
2005
)
.
(
Moore
et
al.
,
2006
)
introduced
a
discriminative
model
of
1-to-N
and
M-to-1
alignments
,
and
similarly
to
(
Lacoste-Julien
et
al.
,
2006
)
the
best
results
were
obtained
using
HMMs
trained
to
agree
and
intersected
Model
4
.
LEAF
is
not
bound
by
the
structural
restrictions
present
either
directly
in
these
models
,
or
in
the
features
derived
from
the
generative
models
used
.
We
also
iterate
the
generative
/
discriminative
process
,
which
allows
the
discriminative
predictions
to
influence
the
generative
model
.
Our
work
is
most
similar
to
work
using
discriminative
log-linear
models
for
alignment
,
which
is
similar
to
discriminative
log-linear
models
used
for
the
SMT
decoding
(
translation
)
problem
(Och and Ney, 2002). (Liu et al., 2005) described a
log-linear
model
combining
IBM
Model
3
trained
in
both
directions
with
heuristic
features
which
resulted
in
a
1-to-1
alignment
.
(
Fraser
and
Marcu
,
2006b
)
described
symmetrized
training
of
a
1-to-N log-linear model and an M-to-1
log-linear
model
.
These
models
took
advantage
of
features
derived
from
both
training
directions
,
similar
to
the
symmetrized
lexicons
of
(
Zens
et
al.
,
2004
)
,
including
features
derived
from
the
HMM
model
and
Model
4
.
However
,
despite
the
symmetric
lexicons
,
these
models
were
only
able
to
optimize
the
performance
of
the
1-to-N
model
and
the
M-to-1
model
separately
,
and
the
predictions
of
the
two
models
required
combination
with
symmetrization
heuristics
.
We
have
overcome
the
limitations
of
that
work
by
defining
new
feature
functions
,
based
on
the
LEAF
generative
model
,
which
score
non-consecutive
M-to-N
alignments
so
that
the
final
performance
criterion
can
be
optimized
directly
.
6
Conclusion
We have found a new structure over which we can robustly predict, one which directly models translational correspondence in a way commensurate with how it is used in hierarchical SMT systems.
Our
new
generative
model
,
LEAF
,
is
able
to
model
alignments
which
consist
of
M-to-N
non-consecutive
translational
correspondences
.
Unsupervised
LEAF
is
comparable
with
a
strong
baseline
.
When
coupled
with
a
discriminative
training
procedure
,
the
model
leads
to
increases
between
3
and
9
F-score
points
in
alignment
accuracy
and
1.2
and
2.8
BLEU
points
in
translation
accuracy
over
strong
French
/
English
and
Arabic
/
English
baselines
.
7
Acknowledgments
This
work
was
partially
supported
under
the
GALE
program
of
the
Defense
Advanced
Research
Projects
Agency
,
Contract
No.
HR0011-06-C-0022
.
We
would
like
to
thank
the
USC
Center
for
High
Performance
Computing
and
Communications
.
