We compare and contrast the strengths and weaknesses of a syntax-based machine translation model with those of a phrase-based machine translation model on several levels. We briefly describe each model, highlighting points where they differ. We include a quantitative comparison of the phrase pairs that each model has to work with, as well as the reasons why some phrase pairs are not learned by the syntax-based model. We then evaluate proposed improvements to the syntax-based extraction techniques in light of the phrase pairs captured. We also compare the translation accuracy of all variations.
1 Introduction

String models are popular in statistical machine translation. Approaches include word substitution systems (Brown et al., 1993), phrase substitution systems (Koehn et al., 2003; Och and Ney, 2004), and synchronous context-free grammar systems (Wu and Wong, 1998; Chiang, 2005), all of which train on string pairs and seek to establish connections between source and target strings. By contrast, explicit syntax approaches seek to directly model the relations learned from parsed data, including models between source trees and target trees (Gildea, 2003; Eisner, 2003; Melamed, 2004; Cowan et al., 2006), source trees and target strings (Quirk et al., 2005; Huang et al., 2006), or source strings and target trees (Yamada and Knight, 2001; Galley et al., 2004). It is unclear which of these important pursuits will best explain human translation data, as each has advantages and disadvantages.
A strength of phrase models is that they can acquire all phrase pairs consistent with computed word alignments, snap those phrases together easily by concatenation, and reorder them under several cost models. An advantage of syntax-based models is that outputs tend to be syntactically well-formed, with re-ordering influenced by syntactic context and function words introduced to serve specific syntactic purposes.

A great number of MT models have recently been proposed, and other papers have gone over the expressive advantages of syntax-based approaches. But it is rare to see an in-depth, quantitative study of the strengths and weaknesses of particular models with respect to each other. This is important for a scientific understanding of how these models work in practice.
Our main novel contribution is a comparison of phrase-based and syntax-based extraction methods and their phrase pair coverage. We also add to the literature a new method of improving that coverage. Additionally, we conduct a careful study of several syntax-based extraction techniques, testing whether (and how much) they affect phrase pair coverage, and whether (and how much) they affect end-to-end MT accuracy. The MT accuracy tests are needed because we want to see the individual effects of particular techniques under the same testing conditions.
For this comparison, we choose a previously established statistical phrase-based model (Och and Ney, 2004) and a previously established statistical string-to-tree model (Galley et al., 2004). These two models are chosen because they are the basis of two of the most successful systems in the NIST 2006 MT evaluation1.

Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 755-763, Prague, June 2007. © 2007 Association for Computational Linguistics
2 Phrase-based Extraction

The Alignment Template system (ATS) described by Och and Ney (2004) is representative of statistical phrase-based models. The basic unit of translation is the phrase pair, which consists of a sequence of words in the source language, a sequence of words in the target language, and a vector of feature values that describe this pair's likelihood. Decoding produces a string in the target language, in order, from beginning to end. During decoding, features from each phrase pair are combined with other features (e.g., re-ordering, language models) using a log-linear model to compute the score of the entire translation.
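The log-linear combination described above can be sketched as follows; the feature values and weights here are purely illustrative assumptions, not taken from either system.

```python
import math

def loglinear_score(feature_values, weights):
    """Score a translation candidate with a log-linear model: the
    score is the weighted sum of log feature values (equivalently,
    the log of a weighted product of the features)."""
    return sum(w * math.log(v) for v, w in zip(feature_values, weights))

# Hypothetical features for one candidate translation:
# phrase-pair probability, language-model probability, re-ordering score.
features = [0.25, 0.05, 0.5]
weights = [1.0, 0.8, 0.4]
score = loglinear_score(features, weights)  # higher (less negative) is better
```

In practice the weights are tuned on held-out data (Section 6 mentions maximum-BLEU tuning); the sketch only shows how per-phrase features and global features combine into one translation score.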
The ATS phrase extraction algorithm learns these phrase pairs from an aligned, parallel corpus. This corpus is conceptually a list of tuples of <source sentence, target sentence, bi-directional word alignments> which serve as training examples, one of which is shown in Figure 1.

Figure 1: a phrase-based training example (target sentence: "i felt obliged to do my part .")

For each training example, the algorithm identifies and extracts all pairs of <source sequence, target sequence> that are consistent with the alignments. It does this by first enumerating all source-side word sequences up to a length limit L; for each source sequence, it identifies all target words aligned to those source words.
For example, in Figure 1, for the source phrase, the target words it aligns to are felt, obliged, and do. These words, and all those between them, are the proposed target phrase. If no words in the proposed target phrase align to words outside of the source phrase, then this phrase pair is extracted. The extraction algorithm can also look to the left and right of the proposed target phrase for neighboring unaligned words and extract additional phrases. For example, for the phrase pair whose target side is felt obliged, the word to is a neighboring unaligned word. The algorithm constructs new target phrases by adding on consecutive unaligned words in both directions, and extracts those in new pairs, too (e.g., the same source phrase paired with felt obliged to). For efficiency reasons, implementations often skip this step.
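A minimal sketch of this extraction procedure, including the optional unaligned-neighbor extension step (the function and its representation are assumptions for illustration, not the ATS implementation; spans are inclusive index pairs and the alignment is a set of (source, target) links):

```python
def extract_phrase_pairs(src_len, tgt_len, alignment, max_len=4):
    """Extract all phrase pairs consistent with a word alignment.

    A source span and target span are consistent when no alignment
    link leaves the box they define. Unaligned target words bordering
    a consistent target span are also absorbed into extra pairs,
    mirroring the optional extension step described above."""
    aligned_tgt = {j for (_, j) in alignment}
    pairs = set()
    for i1 in range(src_len):
        for i2 in range(i1, min(i1 + max_len, src_len)):
            # target positions linked to the source span [i1, i2]
            linked = [j for (i, j) in alignment if i1 <= i <= i2]
            if not linked:
                continue
            j1, j2 = min(linked), max(linked)
            # reject if the target span links back outside the source span
            if any(j1 <= j <= j2 and not i1 <= i <= i2 for (i, j) in alignment):
                continue
            # grow over neighboring unaligned target words, both directions
            lo = j1
            while lo > 0 and lo - 1 not in aligned_tgt:
                lo -= 1
            hi = j2
            while hi + 1 < tgt_len and hi + 1 not in aligned_tgt:
                hi += 1
            for a in range(lo, j1 + 1):
                for b in range(j2, hi + 1):
                    pairs.add(((i1, i2), (a, b)))
    return pairs
```

For a toy alignment {(0, 0), (1, 2)} over two source and three target words, target word 1 is unaligned, so both ((0, 0), (0, 0)) and the extended pair ((0, 0), (0, 1)) are extracted.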
Figure 2 shows the complete set of phrase pairs up to length 4 that are extracted from the Figure 1 training example. Notice that no extracted phrase pair contains the character. Because of the alignments, the smallest legal phrase pair containing it, i felt obliged to do my, is beyond the size limit of 4, so it is not extracted in this example.

Figure 2: phrases up to length 4 extracted from the example in Figure 1 (target sides include felt obliged; felt obliged to do; obliged to do; part .)
Phrase pairs are extracted over the entire training corpus. Due to differing alignments, some phrase pairs that cannot be learned from one example may be learned from another. These pairs are then counted, once for each time they are seen in a training example, and these counts are used as the basis for maximum likelihood probability features, such as p(f|e) and p(e|f).
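The counting and relative-frequency estimation can be sketched as follows (a simplified illustration with hypothetical counts; real systems also apply smoothing and lexical weighting):

```python
from collections import defaultdict

def relative_frequency(pair_counts):
    """Turn phrase-pair counts c(f, e) into maximum likelihood
    features p(f|e) = c(f, e) / c(e) and p(e|f) = c(f, e) / c(f)."""
    e_totals = defaultdict(float)
    f_totals = defaultdict(float)
    for (f, e), c in pair_counts.items():
        e_totals[e] += c
        f_totals[f] += c
    p_f_given_e = {(f, e): c / e_totals[e] for (f, e), c in pair_counts.items()}
    p_e_given_f = {(f, e): c / f_totals[f] for (f, e), c in pair_counts.items()}
    return p_f_given_e, p_e_given_f

# Hypothetical counts over a tiny corpus: ("f1", "e1") was seen 3 times, etc.
counts = {("f1", "e1"): 3, ("f2", "e1"): 1, ("f1", "e2"): 1}
p_fe, p_ef = relative_frequency(counts)
```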
3 Syntax-based Extraction

The GHKM syntax-based extraction method for learning statistical syntax-based translation rules, presented first in Galley et al. (2004) and expanded on in Galley et al. (2006), is similar to phrase-based extraction in that it extracts rules consistent with given word alignments. A primary difference is the use of syntax trees on the target side, rather than sequences of words. The basic unit of translation is the translation rule, consisting of a sequence of words and variables in the source language, a syntax tree in the target language having words or variables at the leaves, and again a vector of feature values that describe this pair's likelihood. Translation rules can:

• look like phrase pairs with syntax decoration: NPB(NNP(prime) NNP(minister) NNP(keizo) NNP(obuchi))
• carry extra contextual constraints (according to one such rule, a source word can translate to said only if some Chinese sequence to its right is translated into an SBAR-C)
• be non-constituent phrases: VP(VBD(pointed) ...)
• contain non-contiguous phrases, effectively "phrases with holes"
• be purely structural (no words)
• re-order their children
Decoding with this model produces a tree in the target language, bottom-up, by parsing the foreign string using a CYK parser and a binarized rule set (Zhang et al., 2006). During decoding, features from each translation rule are combined with a language model using a log-linear model to compute the score of the entire translation.

The GHKM extractor learns translation rules from an aligned parallel corpus where the target side has been parsed. This corpus is conceptually a list of tuples of <source sentence, target tree, bi-directional word alignments> which serve as training examples, one of which is shown in Figure 3.

Figure 3: a syntax-based training example
For each training example, the GHKM extractor computes the set of minimally-sized translation rules that can explain the training example while remaining consistent with the alignments. This is, in effect, a non-overlapping tiling of translation rules over the tree-string pair. If there are no unaligned words in the source sentence, this is a unique set. This set, ordered into a tree of rule applications, is called the derivation tree of the training example.
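The heart of this computation is identifying the tree nodes at which minimal rules can be rooted (the "frontier" nodes of Galley et al.). A simplified sketch, under the assumption that trees are (label, children) tuples and the alignment maps target-leaf indexes to sets of source positions; it returns node labels only, for illustration:

```python
def frontier_labels(tree, align):
    """Return labels of nodes where minimal rules may be rooted.

    A node qualifies when the closed source span of its leaves
    contains no source position aligned to a leaf outside the node."""
    leaf_count = 0
    nodes = []

    def collect(node):
        nonlocal leaf_count
        label, children = node
        idxs = []
        for child in children:
            if isinstance(child, tuple):
                idxs.extend(collect(child))
            else:  # leaf word
                idxs.append(leaf_count)
                leaf_count += 1
        nodes.append((label, set(idxs)))
        return idxs

    collect(tree)
    result = []
    for label, idxs in nodes:
        inside = set().union(*(align.get(i, set()) for i in idxs)) if idxs else set()
        if not inside:
            continue
        outside = set()
        for i in range(leaf_count):
            if i not in idxs:
                outside |= align.get(i, set())
        lo, hi = min(inside), max(inside)
        # frontier: nothing outside the node aligns inside [lo, hi]
        if all(s < lo or s > hi for s in outside):
            result.append(label)
    return result

# Toy tree: (S (NP i) (VP (VBD felt) (ADJP obliged))); "felt" and
# "obliged" both align to the same source word, so neither VBD nor
# ADJP can root a rule of its own, but VP can.
tree = ("S", [("NP", ["i"]), ("VP", [("VBD", ["felt"]), ("ADJP", ["obliged"])])])
align = {0: {0}, 1: {1}, 2: {1}}
```

Rules are then read off by cutting the tree at these nodes, which yields the non-overlapping tiling described above.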
Unlike the ATS model, there are no inherent size limits, just the constraint that the rules be as small as possible for the example. Ignoring the unaligned source word for the moment, there are seven minimal translation rules extracted from the example in Figure 3, as shown in Figure 4. Notice that rule 6 is rather large and applies to a very limited syntactic context. The only constituent node that covers both i and my is the S, so the rule rooted at S is extracted, with variables for every branch below this top constituent that can be explained by other rules. Note also that to becomes a part of this rule naturally.
If the alignments were not as constraining (e.g., if my were unaligned), then instead of this one big rule, many smaller rules would be extracted, such as structural rules (e.g., VP(x0:VBD x1:VP-C) → x0 x1) and function word insertion rules (e.g., VP(TO(to) x0:VP-C) → x0).

Figure 4: rules extracted from the training example (among them PERIOD(.) ↔ .)

We ignored unaligned source words in the example above. Galley et al. (2004) attach the unaligned source word to the highest possible location, in our example the S. Thus it is extracted along with our large rule 6, changing that rule's target language sequence to include the unaligned word. This treatment still results in a unique derivation tree no matter how many unaligned words are present.
In Galley et al. (2006), instead of a unique derivation tree, the extractor computes several derivation trees, each with the unaligned word added to a different rule such that the data is still explained. For example, for the tree-string pair in Figure 3, the unaligned word could be added not only to rule 6, but alternatively to rule 4 or 5, to make new rules (e.g., variants of NN(part) and PERIOD(.)). This results in three different derivations: one with the character in rule 4 (with rules 5 and 6 as originally shown), another with the character in rule 5 (with rules 4 and 6 as originally shown), and lastly one with the character in rule 6 (with rules 4 and 5 as originally shown), as in the original paper (Galley et al., 2004). In total, ten different rules are extracted from this training example.
As with ATS, translation rules are extracted and counted over the entire training corpus, a count of one for each time they appear in a training example. These counts are used to estimate several features, including maximum likelihood probability features for P(etree, fwords | ehead), P(ewords | fwords), and P(fwords | ewords).
4 Differences in Phrasal Coverage

Both the ATS model and the GHKM model extract linguistic knowledge from parallel corpora, but each has fundamentally different constraints and assumptions. To compare the models empirically, we extracted phrase pairs (for the ATS model) and translation rules (for the GHKM model) from the parallel training corpora described in Table 1. The ATS model was limited to phrases of length 10 on the source side and length 20 on the target side. A superset of the parallel data was word aligned by GIZA union (Och and Ney, 2003) and EMD (Fraser and Marcu, 2006). The English side of the training data was parsed using an implementation of Collins' model 2 (Collins, 2003).
Table 1: parallel corpora used to train both models (columns: document IDs; # of segments; # of words in foreign corpus; # of words in English corpus)
Table 2 shows the total number of GHKM rules extracted, and a breakdown of the different kinds of rules. Non-lexical rules are those whose source side is composed entirely of variables; there are no source words in them. Because of this, they potentially apply to any sentence. Lexical rules (their counterpart) far outnumber non-lexical rules. Of the lexical rules, a rule is considered a phrasal rule if its source side and the yield of its target side contain exactly one contiguous phrase each, optionally with one or more variables on either side of the phrase. Non-phrasal rules include structural rules, re-ordering rules, and non-contiguous phrases.
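Under that definition, classifying a rule as phrasal amounts to checking that the word tokens form a single contiguous run on each side. A sketch, assuming a hypothetical flat-token representation in which variables are written x0, x1, ... (this is not the extractor's own data structure):

```python
def is_variable(token):
    """Variables are written x0, x1, ... in this sketch."""
    return token.startswith("x") and token[1:].isdigit()

def one_contiguous_phrase(tokens):
    """True if the word tokens (non-variables) form exactly one
    contiguous run, optionally flanked by variables."""
    runs, in_run = 0, False
    for t in tokens:
        word = not is_variable(t)
        if word and not in_run:
            runs += 1
        in_run = word
    return runs == 1

def is_phrasal_rule(source_tokens, target_yield):
    """A phrasal rule has exactly one contiguous phrase on the source
    side and one in the yield of its target tree."""
    return one_contiguous_phrase(source_tokens) and one_contiguous_phrase(target_yield)
```

For example, a rule with source x0 w1 w2 and target yield x0 prime minister would count as phrasal, while one whose source words are split by a variable (w1 x0 w2) would not.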
These rules are not easy to compare directly to any phrase pairs from the ATS model, so we do not focus on them here. Phrasal rules can be directly compared to ATS phrase pairs, the easiest way being to discard the syntactic context and look at the phrases contained in the rules.

Table 2: a breakdown of how many rules the GHKM extraction algorithm produces, and how many phrase pairs can be derived from them (rows: total translation rules; non-lexical rules; lexical rules; phrasal rules; distinct GHKM-derived phrase pairs; distinct corpus-specific GHKM-derived phrase pairs)
The second-to-last line of Table 2 shows the number of phrase pairs that can be derived from the above phrasal rules. The number of GHKM-derived phrase pairs is lower than the number of phrasal rules because some rules represent the same phrasal translation, but with different syntactic contexts. The last line of Table 2 shows the subset of phrase pairs that contain source phrases found in our development corpus. Table 3 compares these corpus-specific GHKM-derived phrase pairs with the corpus-specific ATS phrase pairs. Note that the number of phrase pairs derived from the GHKM rules is less than the number of phrase pairs extracted by ATS. Moreover, only slightly over half of the phrase pairs extracted by the ATS model are common to both models. The limits and constraints of each model are responsible for this difference in the contiguous phrases learned.

Table 3: comparison of corpus-specific phrase pairs from each model (rows: GHKM-derived; overlap between models; GHKM only; ATS only; ATS-useful only)
GHKM learns some contiguous phrase pairs that the phrase-based extractor does not. Only a small portion of these are due to the fact that the GHKM model has no inherent size limit, while the phrase-based system has limits. More numerous are cases where unaligned English words are not added to an ATS phrase pair while GHKM adopts them at a syntactically motivated location, or where a larger rule contains mostly syntactic structure but happens to have some unaligned words in it. For example, consider Figure 5.
Because basic and will are unaligned, ATS will learn no phrase pairs that translate to these words alone, though they will be learned as part of larger phrases.

Figure 5: situation where GHKM is able to learn rules that translate into basic and will, but ATS is not (target sentence: "This basic relationship will not change")

GHKM, however, will learn several phrasal rules that translate to basic, based on the syntactic context, and one phrasal rule that translates into will. The quality of such phrases may vary. For example, the first translation of the source word (literally: "one" or "a") to basic above is a phrase pair of poor quality, while the other two for basic and the one for will are arguably reasonable.
However, Table 3 shows that ATS was able to learn many more phrase pairs that GHKM could not. Even more significant is the subset of these missing phrase pairs that the ATS decoder used in its best2 (i.e., highest-scoring) translation of the corpus. According to the phrase-based system, these are the most "useful" phrase pairs, and GHKM could not learn them. Since this is a clear deficiency, we will focus on analyzing these phrase pairs (which we call ATS-useful) and the reasons they were not learned.
Table 4 shows a breakdown, categorizing each of these missing ATS-useful phrase pairs and the reasons they could not be learned. The most common reason is straightforward: by extracting only the minimally-sized rules, GHKM is unable to learn many larger phrases that ATS learns. If GHKM can make a word-level analysis, it will do so, at the expense of a phrase-level analysis. Galley et al. (2006) propose one solution to this problem and Marcu et al. (2006) propose another, both of which we explore in Sections 5.1 and 5.2.

Table 4: reasons that ATS-useful phrase pairs could not be extracted by GHKM as phrasal rules (the largest category being "not minimal")
The second reason is that the GHKM model is sometimes forced by its syntactic constraints to include extra words. Sometimes these are only target language words, and this is often useful: the rules are learning to insert these words in their proper context. But most of the time, source language words are also forced to be part of the rule, and this is harmful: it makes the rules less general. This latter case is often due to poorly aligned target language words (such as the character in our Section 3 rule extraction example), or unaligned words under large, flat constituents.
Another factor here: some of the phrase pairs are learned by both systems, but GHKM is more specific about the context of use. This can be both a strength and a weakness. It is a strength when the syntactic context helps the phrase to be used in a syntactically correct way, as in the earlier example where the syntax rule requires a constituent of type SBAR-C. Conversely, its weakness is seen when the context is too constrained. For example, ATS can easily learn the phrase pair translating to prime minister and is then free to use it in many contexts. But GHKM learns 45 different rules, each of which translates this phrase pair in a unique context. Figure 6 shows a sampling. Notice that though many variations are present, the decoder is unable to use any of these rules to produce certain noun phrases, such as "current Japanese Prime Minister Shinzo Abe", because no rule has the proper number of English modifiers.

Figure 6: a sampling of the 45 rules that translate into prime minister
5 Coverage Improvements

Each of the models presented so far has advantages and disadvantages. In this section, we consider ideas that make up for deficiencies in the GHKM model, drawing our inspiration from the strong points of the ATS model. We then measure the effects of each idea empirically, showing both what is gained and the potential limits of each modification.
5.1 Composed Rules

Galley et al. (2006) proposed the idea of composed rules. This removes the minimality constraint required earlier: any two or more rules in a parent-child relationship in the derivation tree can be combined to form a larger, composed rule. This change is similar in spirit to the move from word-based to phrase-based MT models, or to parsing with a DOP model (Bod et al., 2003) rather than a plain PCFG.
Because this results in exponentially many variations, a size limit is employed: for any two or more rules to be allowed to combine, the size of the resulting rule must be at most n. The size of a rule is defined as the number of non-part-of-speech, non-leaf constituent labels in a rule's target tree. For example, rules 1-5 shown in Section 3 have a size of 0, and rule 6 has a size of 10.
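Under this definition, rule size can be computed by walking the rule's target tree and counting internal labels that are not preterminals. This is one plausible reading of the definition, and the (label, children) tree representation is an assumption for illustration:

```python
def rule_size(tree):
    """Count non-leaf, non-part-of-speech labels in a rule's target
    tree. A node whose children are all leaves (words or variables)
    is treated as a part-of-speech tag and is not counted."""
    label, children = tree
    subtrees = [c for c in children if isinstance(c, tuple)]
    own = 0 if not subtrees else 1  # preterminals contribute nothing
    return own + sum(rule_size(t) for t in subtrees)

# A preterminal target like NNP(prime) has size 0, consistent with
# small rules having size 0; NPB(NNP(prime) NNP(minister)) has size 1.
```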
Composed rules are extracted in addition to minimal rules, which means that a larger n limit always results in a superset of the rules extracted with a smaller n value. When n is set to 0, only minimal rules are extracted. Table 5 shows the growth in the number of rules extracted for several size limits.

Table 5: increasing the size limit of composed rules significantly increases the number of rules extracted
In our previous analysis, the main reason that GHKM did not learn translations for ATS-useful phrase pairs was its minimal-only approach. Table 6 shows the effect that composed rule extraction has on the total number of ATS-useful phrases missing. Note that as the allowed size of composed rules increases, we are able to extract a greater percentage of the missing ATS-useful phrase pairs.

Table 6: number of ATS-useful phrases still missing when using GHKM composed rule extraction

Unfortunately, a comparison of Tables 5 and 6 indicates that the number of ATS-useful phrase pairs gained grows at a much slower rate than the total number of rules. From a practical standpoint, more rules mean more processing work and longer decoding times, so there are diminishing returns from continuing to explore larger size limits.
5.2 SPMT Model 1

An alternative for extracting larger rules, called SPMT model 1, is presented by Marcu et al. (2006). Though originally presented as part of a separate model, the method of rule extraction itself builds upon the minimal GHKM method just as composed rules do. For each training example, the method considers all source language phrases up to length L. For each of these phrases, it extracts the smallest possible syntax rule that does not violate the alignments. Table 7 shows that this method is able to extract rules that cover useful phrases, and can be combined with size-4 composed rules to even better effect. Since there is some overlap between these methods, when combining the two we eliminate any redundant rules.

Table 7: ATS-useful phrases still missing after different non-minimal methods are applied (rows include SPMT model 1 alone)
Note that having more phrasal rules is not the only advantage of composed rules. Here, combining both composed and SPMT model 1 rules, our gain in useful phrases is not very large, but we do gain additional, larger syntax rules. As discussed in (Galley et al., 2006), composed rules also allow the learning of more context; such a rule is not learned by SPMT model 1 because it is not the smallest rule that can explain the phrase pair, but it is still valuable for its syntactic context.
5.3 Restructuring Trees

Table 8 updates the causes of missing ATS-useful phrase pairs. Most are now caused by syntactic constraints, so we need to address these in some way. GHKM translation rules are affected by large, flat constituents in syntax trees, as in the prime minister example earlier. One way to soften this constraint is to binarize the trees, so that wide constituents are broken down into multiple levels of tree structure.
The approach we take here is head-out binarization (Wang et al., 2007), where any constituent with more than two children is split into partial constituents. The children to the left of the head word are binarized in one direction, while the children to the right are binarized in the other direction.

Table 8: reasons that ATS-useful phrase pairs are still not extracted as phrasal rules, with composed and SPMT model 1 rules in place (categorized by type of ATS-useful phrase pair)
The top node retains its original label (e.g., NPB), while the new partial constituents are labeled with a bar over the original label. Figure 7 shows an example.

Figure 7: head-out binarization in the target language: S, NPB, and VP are binarized according to the head word
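A simple variant of this binarization can be sketched as follows. Bar labels are written with a -BAR suffix here, since plain text has no overbar, and the exact folding order is an assumption; only the one-level split of a wide constituent is shown:

```python
def head_out_binarize(label, children, head):
    """Binarize one constituent head-out: peel off the outermost
    child on the non-head side one at a time, so the children left
    of the head and those right of it end up binarized in opposite
    directions. The top node keeps its original label; the new
    partial constituents get a bar label (label + "-BAR")."""
    def build(lbl, kids, h):
        if len(kids) <= 2:
            return (lbl, kids)
        bar = label + "-BAR"
        if h > 0:  # children remain to the left of the head: peel leftmost
            return (lbl, [kids[0], build(bar, kids[1:], h - 1)])
        # head is leftmost: peel rightmost
        return (lbl, [build(bar, kids[:-1], h), kids[-1]])
    return build(label, children, head)

# A flat NPB with four children and its head on the final child:
tree = head_out_binarize("NPB", ["DT", "JJ", "NN", "NNP"], head=3)
```

With the head on the final child, the left siblings fold in one at a time, giving NPB(DT NPB-BAR(JJ NPB-BAR(NN NNP))), so rules can now be extracted at the intermediate bar nodes as well.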
Table 9 shows the effect of binarization on phrasal coverage, using both composed and SPMT rules. By eliminating some of the syntactic constraints we allow more freedom, which increases phrasal coverage but also generates more rules.

Table 9: reasons that ATS-useful phrase pairs still could not be extracted as phrasal rules after binarization (categories: too large; extra target words in GHKM rules; extra source words in GHKM rules; total missing useful phrase pairs)
6 Evaluation of Translations

To evaluate the translation quality of each of these models and methods, we ran the ATS decoder using its extracted phrase pairs and the syntax-based decoder using all the rule sets mentioned above. Table 10 describes the development and test datasets used, along with the four references for measuring BLEU. Tuning was done using maximum-BLEU hill climbing (Och, 2003). Features used for the ATS system were the standard set. For the syntax-based translation system, we used a similar set of features.

Table 10: development and test corpora (development and test sets for Chinese and Arabic)
Table 11 shows the case-insensitive NIST BLEU4 scores for both our development and test decodings. The BLEU scores indicate, first of all, that relative to the phrase-based system, the syntax-based system is much stronger at translating Chinese than Arabic. Also, the ideas presented here for improving phrasal coverage generally improve syntax-based translation quality. In addition, composed rules are shown to be helpful as compared to the minimal runs. This is true even when SPMT model 1 is added, which indicates that the size-4 composed rules bring more than just improved phrasal coverage.

Table 11: evaluation results, reported in case-insensitive NIST BLEU4 (rows: baseline ATS; baseline GHKM (minimal only); GHKM composed size 2; GHKM composed size 3; GHKM composed size 4; with binarization; columns: Chinese dev/test and Arabic dev/test)
7 Conclusions

Both the ATS model for phrase-based machine translation and the GHKM model for syntax-based machine translation are state-of-the-art methods. Each extraction method has strengths and weaknesses as compared to the other, and there are surprising differences in phrasal coverage: neither is merely a superset of the other. We have shown that it is possible to apply insights gained from the strengths of the phrase-based extraction model to increase both the phrasal coverage and the translation accuracy of the syntax-based model.
However, there is still room for improvement in both models. For syntax models, there are still holes in phrasal coverage, and other areas need progress as well, such as decoding efficiency. For phrase-based models, incorporating syntactic knowledge and constraints may lead to improvements.
8 Acknowledgments

The authors wish to acknowledge our colleagues at ISI, especially David Chiang, for constructive criticism on an early draft of this document, and several reviewers for their detailed comments, which helped us make the paper stronger. We are also grateful to Jens-Sonke Vockler for his assistance in setting up an experimental pipeline, without which this work would have been much more tedious and difficult. This research was supported under DARPA Contract No. HR0011-06-C-0022.
