Among
syntax-based
translation
models
,
the
tree-based
approach
,
which
takes
as
input
a
parse
tree
of
the
source
sentence
,
is
a
promising
direction, being
faster
and
simpler
than
its
string-based
counterpart
.
However
,
current
tree-based
systems
suffer
from
a
major
drawback
:
they
only
use
the
1-best
parse
to
direct
the
translation
,
which
potentially
introduces
translation
mistakes
due
to
parsing
errors
.
We
propose
a
forest-based
approach
that
translates
a
packed
forest
of
exponentially
many
parses
,
which
encodes
many
more
alternatives
than
standard
n-best
lists
.
Large-scale
experiments
show
an
absolute
improvement
of
1.7
BLEU
points
over
the
1-best
baseline
.
This
result
is
also
0.8
points
higher
than
decoding
with
30-best
parses
,
and
takes
even
less
time
.
1
Introduction
Compared with their string-based counterparts, tree-based systems are much faster in decoding (linear time vs. cubic time; see (Huang et al., 2006)), do not require a binary-branching grammar as in string-based models (Zhang et al., 2006), and can have separate grammars for parsing and translation, say, a context-free grammar for the former and a tree substitution grammar for the latter (Huang et al., 2006).
However
,
despite
these
advantages
,
current
tree-based
systems
suffer
from
a
major
drawback
:
they
only
use
the
1-best
parse
tree
to
direct
the
translation
,
which
potentially
introduces
translation
mistakes
due
to
parsing
errors
(
Quirk
and
Corston-Oliver
,
2006
)
.
This
situation
becomes
worse
with
resource-poor
source
languages
without
enough
Treebank
data
to
train
a
high-accuracy
parser
.
One
obvious
solution
to
this
problem
is
to
take
as
input
k-best
parses
,
instead
of
a
single
tree
.
This
k-best
list
postpones
some
disambiguation
to
the
decoder
,
which
may
recover
from
parsing
errors
by
getting
a
better
translation
from
a
non-1-best
parse
.
However
,
a
k-best
list
,
with
its
limited
scope
,
often
has
too
few
variations
and
too
many
redundancies
;
for
example
,
a
50-best
list
typically
encodes
a
combination
of
5
or
6
binary
ambiguities
(
since
2^5 < 50 < 2^6
)
,
and
many
subtrees
are
repeated
across
different
parses
(
Huang
,
2008
)
.
It
is
thus
inefficient
to
decode
separately
with
each
of
these
very
similar
trees
.
Longer
sentences
will
also
aggravate
this
situation
as
the
number
of
parses
grows
exponentially
with
the
sentence
length
.
We
instead
propose
a
new
approach
,
forest-based
translation
(
Section
3
)
,
where
the
decoder
translates
a
packed
forest
of
exponentially
many
parses, which compactly encodes many more alternatives than k-best parses. (There has been some confusion in the MT literature regarding the term forest: the word "forest" in "forest-to-string rules" (Liu et al., 2007) was a misnomer, actually referring to a set of several unrelated subtrees over disjoint spans, and should not be confused with the standard concept of packed forest.)

Figure 1: An example translation rule (r3 in Fig. 2).
This
scheme
can
be
seen
as
a
compromise
between
the
string-based
and
tree-based
methods
,
while
combining
the
advantages
of
both
:
decoding
is
still
fast
,
yet
does
not
commit
to
a
single
parse
.
Large-scale
experiments
(
Section
4
)
show
an
improvement
of
1.7
BLEU
points
over
the
1-best
baseline
,
which
is
also
0.8
points
higher
than
decoding
with
30-best
trees
,
and
takes
even
less
time
thanks
to
the
sharing
of
common
subtrees
.
2
Tree-based
systems
Current
tree-based
systems
perform
translation
in
two
separate
steps
:
parsing
and
decoding
.
A
parser
first
parses
the
source
language
input
into
a
1-best
tree
T
,
and
the
decoder
then
searches
for
the
best
derivation
(
a
sequence
of
translation
steps
)
d
*
that
converts
source
tree
T
into
a
target-language
string
among
all
possible
derivations
D
:
d* = argmax_{d ∈ D} P(d | T).
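To make this pipeline concrete, here is a minimal sketch in Python of tree-to-string rule application, under strong simplifying assumptions: the Tree and Rule classes are hypothetical, rules are applied greedily rather than scored under P(d | T), and variables are written x1, x2, ... (a toy convention; real systems mark variables explicitly).

from dataclasses import dataclass, field

@dataclass
class Tree:
    label: str                       # constituent label or source word
    children: list = field(default_factory=list)

@dataclass
class Rule:
    pattern: Tree                    # source-side fragment; labels "x1", "x2", ... are variables
    target: list                     # target-side tokens; variables refer back to the pattern

def match(pattern, node, binding):
    # Pattern-match a rule fragment against a subtree, recording variable bindings.
    if pattern.label.startswith("x"):            # a variable matches any subtree
        binding[pattern.label] = node
        return True
    if pattern.label != node.label:
        return False
    if not pattern.children:                     # frontier word in the rule: node must be a leaf
        return not node.children
    if len(pattern.children) != len(node.children):
        return False
    return all(match(p, c, binding) for p, c in zip(pattern.children, node.children))

def translate(node, rules):
    # Greedy derivation: apply the first matching rule at each node.
    # A real decoder instead searches over all derivations for the argmax.
    for r in rules:
        binding = {}
        if match(r.pattern, node, binding):
            out = []
            for tok in r.target:
                out += translate(binding[tok], rules) if tok in binding else [tok]
            return out
    return [node.label]                          # fallback: copy the source word

# Rule r2 from the running example: NPB(NR(Bushi)) -> "Bush"
r2 = Rule(Tree("NPB", [Tree("NR", [Tree("Bushi")])]), ["Bush"])
print(translate(Tree("NPB", [Tree("NR", [Tree("Bushi")])]), [r2]))   # ['Bush']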
We
will
now
proceed
with
a
running
example
translating
from
Chinese
to
English
:
(2) Bùshí yǔ Shālóng jǔxíng le huìtán
    Bush  with/and  Sharon  hold  [past]  talk
'Bush held a talk with Sharon'
Figure
2
shows
how
this
process
works
.
The
Chinese
sentence
(
a
)
is
first
parsed
into
tree
(
b
)
,
which
will
be
converted
into
an
English
string
in
5
steps
.
First
,
at
the
root
node
,
we
apply
rule
r1
preserving
top-level
word-order
between
English
and
Chinese
,
which results in two unfinished subtrees in (c).

Figure 2: An example derivation of tree-to-string translation. Shaded regions denote the parts of the tree that are pattern-matched with the rule being applied.
Then rule r2 grabs the Bùshí subtree and transliterates it: (r2) NPB(NR(Bùshí)) → Bush.
Similarly
,
rule
r3
shown
in
Figure
1
is
applied
to
the
VP
subtree
,
which
swaps
the
two
NPBs
,
yielding
the
situation
in
(
d
)
.
This
rule
is
particularly
interesting
since
it
has
multiple
levels
on
the
source
side
,
which gives it more expressive power than synchronous context-free grammars, whose rules are flat.
More formally, a (tree-to-string) translation rule pairs a source-side tree fragment t, whose frontier nodes may be variables, with a target-side string s of terminals and the same variables. Finally, two phrasal rules perform phrasal translations for the two remaining subtrees, respectively, and we get the English translation in (e).
3
Forest-based
translation
We
now
extend
the
tree-based
idea
from
the
previous
section
to
the
case
of
forest-based
translation
.
Again
,
there
are
two
steps
,
parsing
and
decoding
.
In
the
former
,
a
(
modified
)
parser
will
parse
the
input
sentence
and
output
a
packed
forest
(
Section
3.1
)
rather
than
just
the
1-best
tree
.
Such
a
forest
is
usually
huge
in
size
,
so
we
use
the
forest
pruning
algorithm
(
Section
3.4
)
to
reduce
it
to
a
reasonable
size
.
The
pruned
parse
forest
will
then
be
used
to
direct
the
translation
.
In
the
decoding
step
,
we
first
convert
the
parse
forest
into
a
translation
forest
using
the
translation
rule
set
,
using pattern-matching techniques similar to those of tree-based decoding
(
Section
3.2
)
.
Then
the
decoder
searches
for
the
best
derivation
on
the
translation
forest
and
outputs
the
target
string
(
Section
3.3
)
.
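As a road map, the two-step pipeline just described can be sketched in a few lines; the stage functions are hypothetical stand-ins for the components detailed in Sections 3.1-3.4, and p = 12 is the pruning threshold used later in Section 4.

def forest_translate(sentence, parse, prune, convert, decode, p=12):
    """End-to-end sketch of forest-based translation (assumed components)."""
    forest = parse(sentence)          # packed parse forest (Section 3.1)
    forest = prune(forest, p)         # forest pruning (Section 3.4)
    tforest = convert(forest)         # parse forest -> translation forest (Section 3.2)
    return decode(tforest)            # 1-best/k-best search with LM (Section 3.3)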
3.1 Parse Forest

Informally, a
packed
parse
forest
,
or
forest
in
short
,
is
a
compact
representation
of
all
the
derivations
(
i.e.
,
parse
trees
)
for
a
given
sentence
under
a
context-free
grammar
(
Billot
and
Lang
,
1989
)
.
For
example
,
consider
the
Chinese
sentence
in
Example
(
2
)
above
,
which
has
(
at
least
)
two
readings
depending
on
the
part-of-speech
of
the
word
yu
,
which
can
be
either
a
preposition
(
P
"
with
"
)
or
a
conjunction
(
CC
"
and
"
)
.
The
parse
tree
for
the
preposition
case
is
shown
in
Figure
2
(
b
)
as
the
1-best
parse
,
while
for
the
conjunction
case
,
the
two
proper
nouns
(
Bùshí and Shālóng
)
are
combined
to
form
a
coordinated
NP
which
functions
as
the
subject
of
the
sentence
.
In
this
case
the
Chinese
sentence
is
translated
into
(
3
)
"
[
Bush
and
Sharon
]
held
a
talk
"
.
Shown
in
Figure
3
(
a
)
,
these
two
parse
trees
can
be
represented
as
a
single
forest
by
sharing
common
subtrees
such
as
NPB0,1 and VPB3,6.
Such
a
forest
has
a
structure
of
a
hypergraph
(
Klein
and
Manning
,
2001
;
Huang
and
Chiang
,
2005
)
,
where
items
like
NP0,3
are
called
nodes
,
and
deductive
steps
like
(
*
)
correspond
to
hyperedges
.
For example, the hyperedge corresponding to deduction (*) is ⟨(NPB0,1, CC1,2, NPB2,3), NP0,3⟩.
There
is
also
a
distinguished
root
node
TOP
in
each
forest
,
denoting
the
goal
item
in
parsing
,
which
is
simply
S0,l, where
S
is
the
start
symbol
and
l
is
the
sentence
length
.
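As one possible (hypothetical) encoding, such a forest can be stored as a hypergraph keyed by (label, begin, end) items; the fragment below, slightly simplified from Figure 3(a), shows the two readings sharing NPB0,1, NPB2,3, and VPB3,6.

from collections import defaultdict

class Forest:
    """A packed forest as a hypergraph: a node is a (label, i, j) item and
    each incoming hyperedge lists the node's tail nodes."""
    def __init__(self):
        self.incoming = defaultdict(list)      # head node -> list of tail tuples

    def add_edge(self, head, tails):
        self.incoming[head].append(tuple(tails))

f = Forest()
# the VPB covering "juxing le huitan" is shared by both readings
f.add_edge(("VPB", 3, 6), [("VV", 3, 4), ("AS", 4, 5), ("NPB", 5, 6)])
# reading 1 (the 1-best parse): yu is a preposition P
f.add_edge(("PP", 1, 3), [("P", 1, 2), ("NPB", 2, 3)])
f.add_edge(("VP", 1, 6), [("PP", 1, 3), ("VPB", 3, 6)])
f.add_edge(("S", 0, 6), [("NPB", 0, 1), ("VP", 1, 6)])   # the goal item S0,l
# reading 2: yu is a conjunction CC, giving the coordinated NP of deduction (*)
f.add_edge(("NP", 0, 3), [("NPB", 0, 1), ("CC", 1, 2), ("NPB", 2, 3)])
f.add_edge(("S", 0, 6), [("NP", 0, 3), ("VPB", 3, 6)])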
3.2
Translation
Forest
Given
a
parse
forest
and
a
translation
rule
set
R
,
we
can
generate
a
translation
forest
which
has
a
similar
hypergraph
structure
.
Basically
,
just as in
the
depth-first
traversal
procedure
in
tree-based
decoding
(
Figure
2
)
,
we
visit
in
top-down
order
each
node
v
in
the
parse forest, and try to pattern-match each translation rule r against the local sub-forest under node v.

Figure 3: (a) The parse forest of the example sentence; solid hyperedges denote the 1-best parse in Figure 2(b), while dashed hyperedges denote the alternative parse due to Deduction (*). (b) The corresponding translation forest after applying the translation rules (lexical rules not shown); the derivation shown in bold solid lines (e1 and e3) corresponds to the derivation in Figure 2; the one shown in dashed lines (e2, e5, and e6) uses the alternative parse and corresponds to the translation in Example (3). (c) The correspondence between translation hyperedges and translation rules.
For
example
,
in
Figure
3
(
a
)
,
at node VP1,6, two rules r3 and r7 both match
the
local
subforest
,
and
will
thus
generate
two
translation
hyperedges
e3
and
e4
(
see
Figure
3
(
b-c
)
)
.
More
formally
,
we
define
a
function
match
(
r
,
v
)
which
attempts
to
pattern-match
rule
r
at
node
v
in
the
parse
forest
,
and
in
case
of
success
,
returns
a
list
of
descendant
nodes
of
v
that
are
matched
to
the
variables
in
r
,
or
returns
an
empty
list
if
the
match
fails
.
Note
that
this
procedure
is
recursive
and
may
involve multiple parse hyperedges.

Pseudocode 1: The conversion algorithm.
Input: parse forest Hp and rule set R
Output: translation forest Ht
for each node v ∈ Vp in top-down order do
  for each translation rule r ∈ R do
    vars ← match(r, v)
    if vars is not empty then
      add translation hyperedge e = ⟨vars, v, s(r)⟩ to Ht
For example, the matching of rule r3 at node VP1,6 covers three parse hyperedges,
while
nodes
in
gray
do
not
pattern-match
any
rule
(
although
they
are
involved
in
the
matching
of
other
nodes
,
where
they
match
interior
nodes
of
the
source-side
tree
fragments
in
a
rule
)
.
We
can
thus
construct
a
translation
hyperedge
from
match
(
r
,
v
)
to
v
for
each
node
v
and
rule
r.
In
addition
,
we
also
need
to
keep
track
of
the
target
string
s
(
r
)
specified
by
rule
r
,
which
includes
target-language
terminals
and
variables
.
For
example
,
s
(
r3
)
=
"
held
x2
with
x1
"
.
The
subtranslations
of
the
matched
variable
nodes
will
be
substituted
for
the
variables
in
s
(
r
)
to
get
a
complete
translation
for
node
v.
So
a
translation
hyperedge
e
is
a
triple
⟨tails(e), head(e), s⟩
where
s
is
the
target
string
from
the
rule
,
for
example
,
e3
=
⟨(NPB2,3, NPB5,6), VP1,6, "held x2 with x1"⟩.
This
procedure
is
summarized
in
Pseudocode
1
.
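A direct transcription of Pseudocode 1 in Python might look as follows; nodes_top_down and match are assumptions standing in for the forest's topological node iterator and the pattern-matching function defined above (match returning [] on failure), and rules carry a .target string as in the earlier sketch.

from collections import defaultdict

def convert(nodes_top_down, rules, match):
    """Build the translation forest Ht from a parse forest Hp and rule set R."""
    incoming = defaultdict(list)              # head node -> [(tails(e), s(r))]
    for v in nodes_top_down:                  # each node v in Vp, top-down
        for r in rules:
            variables = match(r, v)           # descendant nodes bound to r's variables
            if variables:                     # non-empty list: the match succeeded
                # a translation hyperedge is the triple <tails(e), head(e), s>
                incoming[v].append((tuple(variables), r.target))
    return incoming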
3.3
Decoding
Algorithms
The
decoder
performs
two
tasks
on
the
translation
forest
:
1-best
search
with
integrated
language
model
(
LM
)
,
and
k-best
search
with
LM
to
be
used
in
minimum
error
rate
training
.
Both
tasks
can
be
done
efficiently
by
forest-based
algorithms
based
on
k-best
parsing
(
Huang
and
Chiang
,
2005
)
.
For
1-best
search
,
we
use
the
cube
pruning
technique
(
Chiang
,
2007
;
Huang
and
Chiang
,
2007
)
which
approximately
intersects
the
translation
forest
with
the
LM
.
Basically
,
cube
pruning
works
bottom
up
in
a
forest
,
keeping
at
most
k
+LM
items
at
each
node
,
and
uses
the
best-first
expansion
idea
from
Algorithm
2
of
Huang
and
Chiang
(
2005
)
to
speed
up
the
computation
.
An
+LM
item
of
node
v
has
the
form
(v^(a*b))
,
where
a
and
b
are
the
target-language
boundary
words
.
For
example
,
(VP^(held*Sharon))
is
an
+LM
item
with
its
translation
starting
with
"
held
"
and
ending
with
"
Sharon
"
.
This
scheme
can
be
easily
extended
to
work
with
a
general
n-gram
by
storing
n-1
words
at
both
ends
(
Chiang
,
2007
)
.
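The heart of cube pruning is a best-first enumeration over a grid of antecedent +LM items with a priority queue. Below is a simplified two-dimensional sketch: costs are negative log-probabilities, both input lists are assumed non-empty and sorted, and score is a hypothetical function adding the combination and LM costs (the LM cost is what makes the grid only approximately monotonic).

import heapq

def cube(items1, items2, score, k):
    """Pop at most k combinations of two sorted +LM item lists, best-first."""
    seen = {(0, 0)}
    heap = [(score(items1[0], items2[0]), 0, 0)]      # start at the grid corner
    out = []
    while heap and len(out) < k:
        cost, i, j = heapq.heappop(heap)
        out.append((cost, items1[i], items2[j]))
        for ni, nj in ((i + 1, j), (i, j + 1)):       # expand the two neighbors
            if ni < len(items1) and nj < len(items2) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (score(items1[ni], items2[nj]), ni, nj))
    return out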
For k-best search, after getting the 1-best derivation,
we
use
the
lazy
Algorithm
3
of
Huang
and
Chiang
(
2005
)
that
works
backwards
from
the
root
node
,
incrementally
computing
the
second
,
third
,
through
the
kth
best
alternatives
.
However
,
this
time
we
work
on
a
finer-grained
forest
,
called
translation+LM forest
,
resulting
from
the
intersection
of
the
translation
forest
and
the
LM
,
with
its
nodes
being
the
+LM
items
during
cube
pruning
.
Although
this
new
forest
is
prohibitively
large
,
Algorithm
3
is
very
efficient
with
minimal
overhead
on
top
of
the 1-best search.
3.4
Forest
Pruning
Algorithm
We
use
the
pruning
algorithm
of
(
Jonathan
Graehl
,
p.c.
;
Huang
,
2008
)
that
is
very
similar
to
the
method
based
on
marginal
probability
(
Charniak
and
Johnson
,
2005
)
,
except
that
it
prunes
hyperedges
as
well
as
nodes
.
Basically
,
we
use
an
Inside-Outside
algorithm
to
compute the Viterbi inside cost β(v) and the Viterbi outside cost α(v) for each node v, and then compute the merit αβ(e) for each hyperedge e:

αβ(e) = α(head(e)) + c(e) + Σ_{u ∈ tails(e)} β(u),

where c(e) is the cost of the hyperedge itself.
Intuitively
,
this
merit
is
the
cost
of
the
best
derivation
that
traverses
e
,
and
the
difference
δ(e) = αβ(e) − β(TOP)
can
be
seen
as
the
distance
away
from
the
globally
best
derivation
.
We
prune
away
a
hyperedge e if δ(e) > p for a threshold p.
Nodes
with
all
incoming
hyperedges
pruned
are
also
pruned
.
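Under the definitions above, the pruning step itself is a single sweep. Here is a sketch assuming the inside costs β and outside costs α have already been computed by a standard Viterbi inside-outside pass over the hypergraph, with hyperedges stored as (tails, cost) pairs.

from collections import defaultdict

def prune_forest(incoming, beta, alpha, top, p):
    """Drop every hyperedge e with delta(e) = alpha-beta(e) - beta(TOP) > p."""
    best = beta[top]                           # cost of the globally best derivation
    kept = defaultdict(list)
    for head, edges in incoming.items():
        for tails, cost in edges:
            merit = alpha[head] + cost + sum(beta[u] for u in tails)
            if merit - best <= p:              # within the beam: keep the edge
                kept[head].append((tails, cost))
    return kept                                # nodes losing all incoming edges vanish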
The
derivation
probability
conditioned
on the 1-best tree
,
P
(
d
|
T
)
,
should
now
be
replaced
by
P
(
d
|
Hp
)
where
Hp
is
the
parse
forest
,
which decomposes into the product of the probabilities of the translation rules r ∈ d:

P(d | Hp) = ∏_{r ∈ d} P(r),

where each P(r) is in turn the product of five probabilities:

P(r) = P(t | s) · P(s | t) · Plex(t | s) · Plex(s | t) · P(t | Hp).
Here
t
and
s
are
the
source-side
tree
and
target-side
string
of
rule
r
,
respectively
,
P
(
t
|
s
)
and
P
(
s
|
t
)
are
the
two
translation
probabilities
,
and
Plex
(
)
are
the
lexical
probabilities
.
The
only
extra
term
in
forest-based
decoding
is
P
(
t
|
Hp
)
denoting
the
source-side parsing probability of the current translation rule r in the parse forest, which is the product of the probabilities of each parse hyperedge ep covered in the pattern-match of t against Hp (and can be recorded at conversion time):

P(t | Hp) = ∏_{ep covered by t} P(ep).
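In the log domain, assembling a rule's score is then a matter of summation; the attribute names below are hypothetical, with matched_hyperedge_probs standing for the parse hyperedge probabilities recorded at conversion time.

import math

def rule_logprob(r):
    """log P(r): two translation, two lexical, and one parsing probability."""
    log_parse = sum(math.log(p) for p in r.matched_hyperedge_probs)   # log P(t | Hp)
    return (math.log(r.p_t_given_s) + math.log(r.p_s_given_t)
            + math.log(r.plex_t_given_s) + math.log(r.plex_s_given_t)
            + log_parse)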
4 Experiments

Our experiments are on Chinese-to-English translation, and we use the Chinese parser of Xiong et al. (2005) to parse the source side of the bitext.
Following
Huang
(
2008
)
,
we
modify
the
parser
to
output
a
packed
forest
for
each
sentence
.
From the parsed bitext we then extract translation rules.
Note
that
our
rule
extraction
is
still
done
on
1-best
parses
,
while
decoding
is
on
k-best
parses
or
packed
forests
.
We
also
use
the
SRI
Language
Modeling
Toolkit
(
Stolcke
,
2002
)
to
train
a
trigram
language
model
with
Kneser-Ney
smoothing
on
the
English
side
of
the
bitext
.
Figure
4
:
Comparison
of
decoding
on
forests
with
decoding
on
k-best
trees
.
We use the 2002 NIST MT Evaluation test set as our development set and the 2005 NIST MT Evaluation test set as our test set (1082 sentences), with on average 28.28 and 26.31 words per sentence, respectively.
We
evaluate
the
translation
quality
using
the
case-sensitive
BLEU-4
metric
(
Papineni
et
al.
,
2002
)
.
We
use
the
standard
minimum
error-rate
training
(
Och
,
2003
)
to
tune
the
feature
weights
to
maximize
the
system
's
BLEU
score
on
the
dev
set
.
On
dev
and
test
sets
,
we
prune
the
Chinese
parse
forests
by
the
forest
pruning
algorithm
in
Section
3.4
with
a
threshold
of
p
=
12
,
and
then
convert
them
into
translation
forests
using
the
algorithm
in
Section
3.2
.
To
increase
the
coverage
of
the
rule
set
,
we
also
introduce
a
default
translation
hyperedge
for
each
parse
hyperedge
by
monotonically
translating
each
tail
node
,
so
that
we
can
always
at
least
get
a
complete
translation
in
the
end
.
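Concretely, these default hyperedges might be added with one extra pass over the parse forest, reusing the (tails, cost) edge representation from the pruning sketch and the (tails, target string) representation of translation hyperedges; the target string simply echoes the tail nodes in source order.

def add_default_hyperedges(parse_incoming, t_incoming):
    """Give every parse hyperedge a monotone fallback translation hyperedge."""
    for head, edges in parse_incoming.items():
        for tails, _cost in edges:
            s = " ".join(f"x{i + 1}" for i in range(len(tails)))   # e.g. "x1 x2 x3"
            t_incoming[head].append((tuple(tails), s))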
The
BLEU
score
of
the
baseline
1-best
decoding
is
0.2325
,
which
is
consistent
with
the
result
of
0.2302
in
(
Liu
et
al.
,
2007
)
on
the
same
training
,
development
and
test
sets
,
and
with
the
same
rule
extraction
procedure
.
The
corresponding
BLEU
score
of
Pharaoh
(
Koehn
,
2004
)
is
0.2182
on
this
dataset
.
Figure
4
compares
forest
decoding
with
decoding
on
k-best
trees
in
terms
of
speed
and
quality
.
Using
more
than
one
parse
tree
apparently
improves
the
BLEU
score
,
but
at
the
cost
of
much
slower
decoding
,
since
each
of
the
top-k
trees
has
to
be
decoded
individually
although
they
share
many
common
subtrees
.
Forest
decoding
,
by
contrast
,
is
much
faster
and produces consistently better BLEU scores.

Figure 5: Percentage of the i-th best parse tree being picked in decoding. 32% of the distribution for forest decoding is beyond top-100 and is not shown on this plot.
With
pruning
threshold
p
=
12
,
it
achieved
a
BLEU
score
of
0.2485
,
which
is
an
absolute
improvement
of
1.6
%
points
over
the
1-best
baseline
,
and
is
statistically
significant
using
the
sign-test
of
Collins
et
al.
(
2005
)
(
p
<
0.01
)
.
We
also
investigate
the
question
of
how
often
the
ith-best
parse
tree
is
picked
to
direct
the
translation
(
i
=
1
,
2
,
.
.
.
)
,
in
both
k-best
and
forest
decoding
schemes
.
A
packed
forest
can
be
roughly
viewed
as
a
(
virtual
)
∞-best
list
,
and
we
can
thus
ask
how often a parse beyond the top k is used by the forest,
which
relates
to
the
fundamental
limitation
of
k-best
lists
.
Figure
5
shows
that the
1-best
parse
is
still
preferred
25
%
of
the
time
among
30-best
trees
,
and
23
%
of
the
time
by
the
forest
decoder
.
These
ratios
decrease
dramatically
as
i
increases
,
but
the
forest
curve
has
a
much
longer
tail
at large i.
Indeed
,
40% of the trees preferred by a forest are beyond top-30, 32% are beyond top-100, and even 20% are beyond top-1000.
This
confirms
the
fact
that
we would need exponentially large k-best lists to capture the explosion of alternatives, whereas a forest encodes this information compactly.
We
also
conduct
experiments
on
a
larger
dataset
,
which
contains
2.2M
training
sentence
pairs
.
Besides
the
trigram
language
model
trained
on
the
English
side
of
this bitext
,
we
also
use
another
trigram
model
trained
on
the
first
1
/
3
of
the
Xinhua
portion
of
Gigaword
corpus
.
The two LMs have distinct weights tuned by minimum error rate training.

Table 1: BLEU score results from training on large data (rows: approach; columns: ruleset).
The
dev
and
test
sets
remain
the
same
as
above
.
Furthermore
,
we
also
make
use
of
bilingual
phrases
to
improve
the
coverage
of
the
ruleset
.
Following
Liu
et
al.
(
2006
)
,
we
prepare
a
phrase-table
from
a
phrase-extractor
,
e.g.
Pharaoh
,
and
at
decoding
time
,
for
each
node
,
we
construct
on-the-fly
flat
translation
rules
from
phrases
that
match
the
source-side
span
of
the
node
.
These
phrases
are
called
syntactic
phrases
which
are
consistent
with
syntactic
constituents
(
Chiang
,
2005
)
,
and
have
been
shown
to
be
helpful
in
tree-based
systems
(
Galley
et
al.
,
2006
;
Liu
et
al.
,
2006
)
.
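A sketch of this on-the-fly construction, assuming the phrase table maps a tuple of source words to its candidate target strings and that nodes carry their spans as in Section 3.1:

def flat_rules_for_node(node, src_words, phrase_table):
    """Build flat rules node -> target for phrases matching the node's span."""
    _label, i, j = node                        # node is a (label, i, j) item
    src = tuple(src_words[i:j])                # a 'syntactic phrase' candidate
    return [(node, target) for target in phrase_table.get(src, [])]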
The
final
results
are
shown
in
Table
1
,
where
TR
denotes
translation
rule
only
,
and
TR+BP
denotes
the
inclusion
of
bilingual
phrases
.
The
BLEU
score
of
forest
decoder
with
TR
is
0.2839
,
which
is
a
1.7
%
points
improvement
over
the
1-best
baseline
,
and
this
difference
is
statistically
significant
(
p
<
0.01
)
.
Using
bilingual
phrases
further
improves
the
BLEU
score
by
3.1
%
points
,
which
is
2.1
%
points
higher
than
the
respective
1-best
baseline
.
We
suspect
this
larger
improvement
is
due
to
the
alternative
constituents
in
the
forest
,
which
activate
many
syntactic
phrases
suppressed
by
the
1-best
parse
.
5
Conclusion
and
future
work
We
have
presented
a
novel
forest-based
translation
approach
which
uses
a
packed
forest
rather
than
the
1-best
parse
tree
(
or
k-best
parse
trees
)
to
direct
the
translation
.
The forest provides a compact data structure for efficiently handling exponentially many tree structures,
and
is
shown
to
be
a
promising
direction
with
state-of-the-art
translation
results
and
reasonable
decoding
speed
.
This
work
can
thus
be
viewed
as
a
compromise
between
string-based
and
tree-based
paradigms
,
with
a
good
trade-off between speed and accuracy.
For
future
work
,
we
would
like
to
use
packed
forests
not
only
in
decoding
,
but
also
for
translation
rule
extraction
during
training
.
Acknowledgements

We would like to
thank
Chris
Quirk
for
inspirations
,
Yang
Liu
for
help
with
rule
extraction
,
Mark
Johnson
for
posing
the
question
of
the virtual ∞-best list
,
and
the
anonymous
reviewers
for
suggestions
.
