A
major
engineering
challenge
in
statistical
machine
translation
systems
is
the
efficient
representation
of
extremely
large
translation
rulesets
.
In
phrase-based
models
,
this
problem
can
be
addressed
by
storing
the
training
data
in
memory
and
using
a
suffix
array
as
an
efficient
index
to
quickly
look up
and
extract
rules
on
the
fly
.
Hierarchical
phrase-based
translation
introduces
the
added
wrinkle
of
source
phrases
with
gaps
.
Lookup
algorithms
used
for
contiguous
phrases
no
longer
apply
and
the
best
approximate
pattern
matching
algorithms
are
much
too
slow
,
taking
several
minutes
per
sentence
.
We
describe
new
lookup
algorithms
for
hierarchical
phrase-based
translation
that
reduce
the
empirical
computation
time
by
nearly
two
orders
of
magnitude
,
making
on-the-fly
lookup
feasible
for
source
phrases
with
gaps
.
1
Introduction
Current
statistical
machine
translation
systems
rely
on
very
large
rule
sets
.
In
phrase-based
systems
,
rules
are
extracted
from
parallel
corpora
containing
tens
or
hundreds
of
millions
of
words
.
This
can
result
in
millions
of
rules
using
even
the
most
conservative
extraction
heuristics
.
Efficient
algorithms
for
rule
storage
and
access
are
necessary
for
practical
decoding
algorithms
.
They
are
crucial
to
keeping
up
with
the
ever-increasing
size
of
parallel
corpora
,
as
well
as
the
introduction
of
new
data
sources
such
as
web-mined
and
comparable
corpora
.
Until
recently
,
most
approaches
to
this
problem
involved
substantial
tradeoffs
.
The
common
practice
of
test
set
filtering
renders
systems
impractical
for
all
but
batch
processing
.
Tight
restrictions
on
phrase
length
curtail
the
power
of
phrase-based
models
.
However
,
some
promising
engineering
solutions
are
emerging
.
Zens
and
Ney
(
2007
)
use
a
disk-based
prefix
tree
,
enabling
efficient
access
to
phrase
tables
much
too
large
to
fit
in
main
memory
.
An
alternative
approach
introduced
independently
by
both
Callison-Burch
et
al.
(
2005
)
and
Zhang
and
Vogel
(
2005
)
is
to
store
the
training
data
itself
in
memory
,
and
use
a
suffix
array
as
an
efficient
index
to
look
up
,
extract
,
and
score
phrase
pairs
on
the
fly
.
We
believe
that
the
latter
approach
has
several
important
applications
(
§
7
)
.
So
far
,
these
techniques
have
focused
on
phrase-based
models
using
contiguous
phrases
(
Koehn
et
al.
,
2003
;
Och
and
Ney
,
2004
)
.
Some
recent
models
permit
discontiguous
phrases
(
Chiang
,
2007
;
Quirk
et
al.
,
2005
;
Simard
et
al.
,
2005
)
.
Of
particular
interest
to
us
is
the
hierarchical
phrase-based
model
of
Chiang
(
2007
)
,
which
has
been
shown
to
be
superior
to
phrase-based
models
.
The
ruleset
extracted
by
this
model
is
a
superset
of
the
ruleset
in
an
equivalent
phrase-based
model
,
and
it
is
an
order
of
magnitude
larger
.
This
makes
efficient
rule
representation
even
more
critical
.
We
tackle
the
problem
using
the
online
rule
extraction
method
of
Callison-Burch
et
al.
(
2005
)
and
Zhang
and
Vogel
(
2005
)
.
The
problem
statement
for
our
work
is
:
Given
an
input
sentence
,
efficiently
find
all
hierarchical
phrase-based
translation
rules
for
that
sentence
in
the
training
corpus
.
We
first
review
suffix
arrays
(
§
2
)
and
hierarchical
phrase-based
translation
(
§
3
)
.
We
show
that
the
obvious
approach
using
state-of-the-art
pattern
matching
algorithms
is
hopelessly
inefficient
(
§
4
)
.
We
then
describe
a
series
of
algorithms
to
address
this
inefficiency
(
§
5
)
.
Our
algorithms
reduce
computation
time
by
two
orders
of
magnitude
,
making
the
approach
feasible
(
§
6
)
.
We
close
with
a
discussion
that
describes
several
applications
of
our
work
(
§
7
)
.
2
Suffix
Arrays
A
suffix
array
is
a
data
structure
representing
all
suffixes
of
a
corpus
in
lexicographical
order
(
Manber
and
Myers
,
1993
)
.
Formally
,
for
a
text
T
,
the
ith
suffix
of
T
is
the
substring
of
the
text
beginning
at
position
i
and
continuing
to
the
end
of
T.
This
suffix
can
be
uniquely
identified
by
the
index
i
of
its
first
word
.
The
suffix
array
SA_T
of
T
is
a
permutation
of
[
1
,
|
T
|
]
arranged
by
the
lexicographical
order
of
the
corresponding
suffixes
.
This
representation
enables
fast
lookup
of
any
contiguous
substring
using
binary
search
.
Specifically
,
all
occurrences
of
a
length-m
substring
can
be
found
in
O
(
m
+
log
|
T
|
)
time
(
Manber
and
Myers
,
1993
)
.
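To make the lookup concrete, here is a minimal Python sketch of our own (not an implementation from the cited work; it needs Python 3.10+ for bisect's key argument). It builds a suffix array over a tokenized corpus and finds all occurrences of a contiguous phrase by binary search; a careful implementation attains the O(m + log |T|) bound by not re-comparing already-matched prefix words.

```python
import bisect

def build_suffix_array(corpus):
    """Sort suffix start positions lexicographically. This naive
    construction is quadratic or worse; real systems use the method of
    Manber and Myers (1993) or a linear-time algorithm."""
    return sorted(range(len(corpus)), key=lambda i: corpus[i:])

def find_phrase(corpus, sa, phrase):
    """All start positions of `phrase`: the suffixes it prefixes form one
    contiguous band of the suffix array, found by binary search."""
    m = len(phrase)
    key = lambda i: corpus[i:i + m]   # compare only the first m words
    lo = bisect.bisect_left(sa, phrase, key=key)
    hi = bisect.bisect_right(sa, phrase, key=key)
    return sa[lo:hi]

corpus = "the cat sat on the mat while the cat slept".split()
sa = build_suffix_array(corpus)
print(find_phrase(corpus, sa, "the cat".split()))  # -> [0, 7]
```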
Callison-Burch
et
al.
(
2005
)
and
Zhang
and
Vogel
(
2005
)
use
suffix
arrays
as
follows
.
1. Load the source training text F, the suffix array SA_F, the target training text E, and the alignment A into memory.

2. For each input sentence, look up each substring (phrase) f of the sentence in the suffix array.

3. For each instance of f found in F, find its aligned phrase e using the phrase extraction method of Koehn et al. (2003).

4. Compute the relative frequency score p(e|f) of each pair using the count of the extracted pair and the marginal count of f.

5. Compute the lexical weighting score of the phrase pair using the alignment that gives the best score.¹

6. Use the scored rules to translate the input sentence with a standard decoding algorithm.

¹Abouelhoda et al. (2004) show that lookup can be done in optimal O(m) time using some auxiliary data structures. For our purposes O(m + log |T|) is practical, since for the 27M-word corpus used to carry out our experiments, log |T| ≈ 25.
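The loop structure of steps 2-4 and 6, with the sampling described below folded in, can be sketched as follows; this is schematic rather than either cited implementation, `find_phrase` is the helper from the §2 sketch, and `extract_target` and `decode` are hypothetical placeholders for the extraction method of Koehn et al. (2003) and a standard decoder (the lexical weighting of step 5 is omitted).

```python
import random
from collections import Counter

K = 100  # sample size used by both cited papers

def subphrases(sentence, max_len=5):
    """Enumerate the contiguous substrings (phrases) of the input."""
    for i in range(len(sentence)):
        for j in range(i + 1, min(i + max_len, len(sentence)) + 1):
            yield tuple(sentence[i:j])

def translate(sentence, F, sa_F, E, A, extract_target, decode):
    counts, marginals = Counter(), Counter()
    for f in set(subphrases(sentence)):               # step 2: lookup
        occurrences = find_phrase(F, sa_F, list(f))
        if len(occurrences) > K:                      # sample if frequent
            occurrences = random.sample(occurrences, K)
        for pos in occurrences:                       # step 3: extraction
            marginals[f] += 1
            e = extract_target(pos, len(f), E, A)     # may fail -> None
            if e is not None:
                counts[(f, e)] += 1
    scores = {(f, e): n / marginals[f]                # step 4: p(e|f)
              for (f, e), n in counts.items()}
    return decode(sentence, scores)                   # step 6
```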
A
difficulty
with
this
approach
is
step
3
,
which
can
be
quite
slow
.
Its
complexity
is
linear
in
the
number
of
occurrences
of
the
source
phrase
f
.
Both
Callison-Burch
et
al.
(
2005
)
and
Zhang
and
Vogel
(
2005
)
solve
this
with
sampling
.
If
a
source
phrase
appears
more
than
k
times
,
they
sample
only
k
occurrences
for
rule
extraction
.
Both
papers
report
that
translation
performance
is
nearly
identical
to
extracting
all
possible
phrases
when
k = 100.²
3
Hierarchical
Phrase-Based
Translation
We
consider
the
hierarchical
translation
model
of
Chiang
(
2007
)
.
Formally
,
this
model
is
a
synchronous
context-free
grammar
.
The
lexicalized
translation
rules
of
the
grammar
may
contain
a
single
nonterminal
symbol
,
denoted
X
.
We
will
use
a
,
b
,
c
and
d
to
denote
terminal
symbols
,
and
u
,
v
,
and
w
to
denote
(
possibly
empty
)
sequences
of
these
terminals
.
We will additionally use α and β to denote (possibly empty) sequences containing both terminals and nonterminals. A translation rule is written X → α / β. This rule states that a span of the input matching α is replaced by β in translation. We require that α and β contain an equal number (possibly zero) of coindexed nonterminals. An example rule with coindexes is X → uX₁vX₂w / u′X₂v′X₁w′.
When
discussing
only
the
source
side
of
such
rules
,
we
will
leave
out
the
coindexes
.
For
instance
,
the
source
side
of
the
above
rule
will
be
written
uXvXw.³
For
the
purposes
of
this
paper
,
we
adhere
to
the
restrictions
described
by
Chiang
(
2007
)
for
rules
extracted
from
the
training
data
.
•
Rules
can
contain
at
most
two
nonterminals
.
•
Rules
can
contain
at
most
five
terminals
.
•
Rules
can
span
at
most
ten
words
.
• Nonterminals must span at least two words.

• Adjacent nonterminals are disallowed in the source side of a rule.

²A sample size of 100 is actually quite small for many phrases, some of which occur tens or hundreds of thousands of times. It is perhaps surprising that such a small sample size works as well as the full data. However, recent work by Och (2005) and Federico and Bertoldi (2006) has shown that the statistics used by phrase-based systems are not very precise.

³In the canonical representation of the grammar, source-side coindexes are always in sorted order, making them unambiguous.
Expressed
more
economically
,
we
say
that
our
goal
is
to
search
for
source
phrases
in
the
form
u
,
uXv
,
or
uXvXw
,
where 1 ≤ |uvw| ≤ 5, and |v| > 0 in the final case.
Note
that
the
model
also
allows
rules
in
the
form
Xu
,
uX
,
XuX
,
XuXv
,
and
uXvX
.
However
,
these
rules
are
lexically
identical
to
other
rules
,
and
thus
will
match
the
same
locations
in
the
source
text
.
4
The
Collocation
Problem
On-the-fly
lookup
using
suffix
arrays
involves
an
added
complication
when
the
rules
are
in
the form
uXv
or
uXvXw
.
Binary
search
enables
fast
lookup
of
contiguous
substrings
.
However
,
it
cannot
be
used
for
discontiguous
substrings
.
Consider
the
rule
aXbXc
.
If
we
search
for
this
rule
in
the
following
logical
suffix
array
fragment
,
we
will
find
the
boldfaced
matches
.
Even
though
these
suffixes
are
in
lexicographical
order
,
matching
suffixes
are
interspersed
with
non-matching
suffixes
.
We
will
need
another
algorithm
to
find
the
source
rules
containing
at
least
one
X
surrounded
by
nonempty
sequences
of
terminal
symbols
.
4.1
Baseline
Approach
In
the
pattern-matching
literature
,
words
spanned
by
the
nonterminal
symbols
of
Chiang
's
grammar
are
called
don
't
cares
and
a
nonterminal
symbol
in
a
query
pattern
that
matches
a
sequence
of
don
't
cares
is
called
a
variable
length
gap
.
The
search
problem
for
patterns
containing
these
gaps
is
a
variant
of
approximate
pattern
matching
,
which
has
received
substantial
attention
(
Navarro
,
2001
)
.
The
best
algorithm
for
pattern
matching
with
variable-length
gaps
in
a
suffix
array
is
a
recent
algorithm
by
Rahman
et
al.
(
2006
)
.
It works on a pattern $w_1Xw_2X \ldots Xw_I$ consisting of $I$ contiguous substrings $w_1, w_2, \ldots, w_I$, each separated by a gap.
The
algorithm
is
straightforward
.
After identifying all $u_i$ occurrences of each $w_i$ in $O(|w_i| + \log|T|)$ time, collocations that meet the gap constraints are computed using an efficient data structure called a stratified tree (van Emde Boas et al., 1977).⁴ Although we refer the reader to the source text for a full description of this data structure, its salient characteristic is that it implements the priority queue operations insert and next-element in $O(\log\log|T|)$ time.
Therefore, the total running time for an algorithm to find all contiguous subpatterns and compute their collocations is $O(\sum_{i=1}^{I}[\,|w_i| + \log|T| + u_i \log\log|T|\,])$.
We
can
improve
on
the
algorithm
of
Rahman
et
al.
(
2006
)
using
a
variation
on
the
idea
of
hashing
.
We
exploit
the
fact
that
our
large
text
is
actually
a
collection
of
relatively
short
sentences
,
and
that
collocated
patterns
must
occur
in
the
same
sentence
in
order
to
be
considered
a
rule
.
Therefore
,
we
can
use
the
sentence
id
of
each
subpattern
occurrence
as
a
kind
of
hash
key
.
We
create
a
hash
table
whose
size
is
exactly
the
number
of
sentences
in
our
training
corpus
.
Each
location
of
the
partially
matched
pattern
$w_1X \ldots Xw_i$
is
inserted
into
the
hash
bucket
with
the
matching
sentence
id
.
To
find
collocated
patterns
wi+1
,
we
probe
the
hash
table
with
each
of
the
ui+1
locations
for
that
subpattern
.
When
a
match
is
found
,
we
compare
the
element
with
all
elements
in
the
bucket
to
see
if
it
is
within
the
window
imposed
by
the
phrase
length
constraints
.
Theoretically
,
the
worst
case
for
this
algorithm
occurs
when
all
elements
of
both
sets
resolve
to
the
same
hash
bucket
,
and
we
must
compare
all
elements
of
one
set
with
all
elements
of
the
other
set
.
This leads to a worst-case complexity of $O(\sum_{i=1}^{I}[\,|w_i| + \log|T|\,] + \prod_{i=1}^{I} u_i)$.
However, for real language data the performance for sets of any significant size will be $O(\sum_{i=1}^{I}[\,|w_i| + \log|T| + u_i\,])$, since most patterns will occur once in any given sentence.
It
is
instructive
to
compare
this
with
the
complexity
for
contiguous
phrases
.
In that case, total lookup time is $O(|w| + \log|T|)$ for a contiguous pattern w.
⁴Often known in the literature as a van Emde Boas tree or van Emde Boas priority queue.
The crucial difference between the contiguous and discontiguous case is the added term $\sum_{i=1}^{I} u_i$.
For
even
moderately
frequent
subpatterns
this
term
dominates
complexity
.
To
make
matters
concrete
,
consider
the
training
corpus
used
in
our
experiments
(
§
6
)
,
which
contains
27M
source
words
.
The
three
most
frequent
unigrams
occur
1.48M
,
1.16M
and
688K
times
-
the
first
two
occur
on
average
more
than
once
per
sentence
.
In
the
worst
case
,
looking
up
a
contiguous
phrase
containing
any
number
and
combination
of
these
unigrams
requires
no
more
than
25
comparison
operations
.
In
contrast
,
the
worst
case
scenario
for
a
pattern
with
a
single
gap
,
bookended
on
either
side
by
the
most
frequent
word
,
requires
over
two
million
operations
using
our
baseline
algorithm
and
over
thirteen
million
using
the
algorithm
of
Rahman
et
al.
(
2006
)
.
A
single
frequent
word
in
an
input
sentence
is
enough
to
cause
noticeable
slowdowns
,
since
it
can
appear
in
up
to
530
hierarchical
rules
.
To
analyze
the
cost
empirically
,
we
ran
our
baseline
algorithm
on
the
first
50
sentences
of
the
NIST
Chinese-English
2003
test
set
and
measured
the
CPU
time
taken
to
compute
collocations
.
We
found
that
,
on
average
,
it
took
2241.25
seconds
(
~
37
minutes
)
per
sentence
just
to
compute
all
of
the
needed
collocations
.
By
comparison
,
decoding
time
per
sentence
is
roughly
10
seconds
with
moderately
aggressive
pruning
,
using
the
Python
implementation
of
Chiang
(
2007
)
.
5
Solving
the
Collocation
Problem
Clearly
,
looking
up
patterns
in
this
way
is
not
practical
.
To
analyze
the
problem
,
we
measured
the
amount
of
CPU
time
per
computation
.
Cumulative
lookup
time
was
dominated
by
a
very
small
fraction
of
the
computations
(
Fig
.
1
)
.
As
expected
,
further
analysis
showed
that
these
expensive
computations
all
involved
one
or
more
very
frequent
subpatterns
.
In
the
worst
cases
a
single
collocation
took
several
seconds
to
compute
.
However
,
there
is
a
silver
lining
.
Patterns
follow
a
Zipf
distribution
,
so
the
number
of
pattern
types
that
cause
the
problem
is
actually
quite
small
.
The
vast
majority
of
patterns
are
rare
.
Therefore
,
our
solution
focuses
on
computations
where
one
or
more
of
the
component
patterns
is
frequent
.
[Figure 1: Ranked computations (x-axis: computations ranked by time) vs. cumulative time. A small fraction of all computations account for most of the computational time.]

Assume that we are computing a collocation of pattern $w_1X \ldots Xw_i$ and pattern $w_{i+1}$, and we know all locations of each.
There
are
three
cases
.
•
If
both
patterns
are
frequent
,
we
resort
to
a
precomputed
intersection
(
§
5.1
)
.
We
were
not
aware
of
any
algorithms
to
substantially
improve
the
efficiency
of
this
computation
when
it
is
requested
on
the
fly
,
but
precomputation
can
be
done
in
a
single
pass
over
the
text
at
decoder
startup
.
•
If
one
pattern
is
frequent
and
the
other
is
rare
,
we
use
an
algorithm
whose
complexity
is
dependent
mainly
on
the
frequency
of
the
rare
pattern
(
§
5.2
)
.
It
can
also
be
used
for
pairs
of
rare
patterns
when
one
pattern
is
much
rarer
than
the
other
.
•
If
both
patterns
are
rare
,
no
special
algorithms
are
needed
.
Any
linear
algorithm
will
suffice
.
However
,
for
reasons
described
in
§
5.3
,
our
other
collocation
algorithms
depend
on
sorted
sets
,
so
we
use
a
merge
algorithm
.
Finally
,
in
order
to
cut
down
on
the
number
of
unnecessary
computations
,
we
use
an
efficient
method
to
enumerate
the
phrases
to
look up
(
§
5.4
)
.
This
method
also
forms
the
basis
of
various
caching
strategies
for
additional
speedups
.
We
analyze
the
memory
use
of
our
algorithms
in
§
5.5
.
5.1
Precomputation
Precomputation
of
the
most
expensive
collocations
can
be
done
in
a
single
pass
over
the
text
.
As
input
,
our
algorithm
requires
the
identities
of
the
k
most
frequent
contiguous
patterns
.⁵
It
then
iterates
over
the
corpus
.
Whenever
a
pattern
from
the
list
is
seen
,
we
push
a
tuple
consisting
of
its
identity
and
current
location
onto
a
queue
.
Whenever
the
oldest
item
on
the
queue
falls
outside
the
maximum
phrase
length
window
with
respect
to
the
current
position
,
we
compute
that
item
's
collocation
with
all
succeeding
patterns
(
subject
to
pattern
length
constraints
)
and
pop
it
from
the
queue
.
We
repeat
this
step
for
every
item
that
falls
outside
the
window
.
At
the
end
of
each
sentence
,
we
compute
collocations
for
any
remaining
items
in
the
queue
and
then
empty
it
.
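A sketch of this pass, assuming `frequent` is the set of the k most frequent contiguous patterns (as token tuples) and `record` is a placeholder that stores each precomputed collocation; the per-pattern length constraints are elided.

```python
from collections import deque

MAX_SPAN = 10  # maximum phrase length window
MAX_PAT = 5    # longest contiguous pattern on the frequency list

def precompute_collocations(sentences, frequent, record):
    for sent_id, sentence in enumerate(sentences):
        queue = deque()  # (pattern, start position), oldest first
        for pos in range(len(sentence)):
            # push every frequent pattern starting at this position
            for n in range(1, MAX_PAT + 1):
                pat = tuple(sentence[pos:pos + n])
                if len(pat) == n and pat in frequent:
                    queue.append((pat, pos))
            # oldest item now outside the window: collocate it with all
            # succeeding patterns still on the queue, then pop it
            while queue and pos - queue[0][1] + 1 > MAX_SPAN:
                item = queue.popleft()
                for other in queue:
                    record(sent_id, item, other)
        while queue:                 # end of sentence: flush the queue
            item = queue.popleft()
            for other in queue:
                record(sent_id, item, other)
```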
Our
precomputation
includes
the
most
frequent
n-gram
subpatterns
.
Most
of
these
are
unigrams
,
but
in
our
experiments
we
found
5-grams
among
the
1000
most
frequent
patterns
.
We
precompute
the
locations
of
source
phrase
uXv
for
any
pair
u
and
v
that
both
appear
on
this
list
.
There
is
also
a
small
number
of
patterns
uXv
that
are
very
frequent
.
We
cannot
easily
obtain
a
list
of
these
in
advance
,
but
we
observe
that
they
always
consist
of
a
pair
u
and
v
of
patterns
from
near
the
top
of
the
frequency
list
.
Therefore
we
also
precompute
the locations of patterns uXvXw in
which
both
u
and
v
are
among
these
super-frequent
patterns
(
all
unigrams
)
,
treating
this
as
the
collocation
of
the
frequent
pattern
uXv
and
frequent
pattern
w.
We
also
compute
the
analogous
case
for
u
and
vXw
.
5.2
Fast
Intersection
For
collocations
of
frequent
and
rare
patterns
,
we
use
a
fast
set
intersection
method
for
sorted
sets
called
double
binary
search
(
Baeza-Yates
,
2004
)
.⁶
It
is
based
on
the
intuition
that
if
one
set
in
a
pair
of
sorted
sets
is
much
smaller
than
the
other
,
then
we
can
compute
their
intersection
efficiently
by
performing
a
binary
search
in
the
larger
data
set
D
for
each
element
of
the
smaller
query
set
Q.
Double
binary
search
takes
this
idea
a
step
further
.
It
performs
a
binary
search
in
D
for
the
median
element
of
Q.
Whether or not the element is found, the search divides both sets into two pairs of smaller sets that can be processed recursively. Detailed analysis and empirical results on an information retrieval task are reported in Baeza-Yates (2004) and Baeza-Yates and Salinger (2005).

⁵These can be identified using a single traversal over a longest common prefix (LCP) array, an auxiliary data structure of the suffix array, described by Manber and Myers (1993). Since we don't need the LCP array at runtime, we chose to do this computation once offline.

⁶Minor modifications are required since we are computing collocation rather than intersection. Due to space constraints, details and proof of correctness are available in Lopez (2007a).
If $|Q| \log|D| < |D|$ then the performance is guaranteed to be sublinear. In practice it is often sublinear even if $|Q| \log|D|$ is somewhat larger than $|D|$.
In our implementation we simply check for the condition $\lambda |Q| \log|D| < |D|$ to decide whether we should use double binary search or the merge algorithm.
This
check
is
applied
in
the
recursive
cases
as
well
as
for
the
initial
inputs
.
The
variable
λ
can
be
adjusted
for
performance
.
We
determined
experimentally
that
a
good
value
for
this
parameter
is
0.3
.
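A sketch of the plain-intersection variant (the collocation variant of footnote 6 changes only the matching test); the $\lambda |Q| \log|D| < |D|$ check selects between the recursive strategy and a linear merge, and is re-applied at every recursive call.

```python
import bisect
import math

LAMBDA = 0.3  # experimentally determined threshold

def intersect(Q, D):
    """Intersect sorted lists Q (the smaller query set) and D (data)."""
    if not Q or not D:
        return []
    if LAMBDA * len(Q) * math.log2(len(D)) < len(D):
        mid = len(Q) // 2
        probe = Q[mid]
        i = bisect.bisect_left(D, probe)
        hit = [probe] if i < len(D) and D[i] == probe else []
        # whether or not the probe is found, both sets divide into two
        # pairs of smaller sets that are processed recursively
        return (intersect(Q[:mid], D[:i]) + hit
                + intersect(Q[mid + 1:], D[i + len(hit):]))
    out, i, j = [], 0, 0            # fall back to a standard merge
    while i < len(Q) and j < len(D):
        if Q[i] == D[j]:
            out.append(Q[i]); i += 1; j += 1
        elif Q[i] < D[j]:
            i += 1
        else:
            j += 1
    return out

print(intersect([3, 7, 20], list(range(0, 1000, 2))))  # -> [20]
```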
5.3
Obtaining
Sorted
Sets
Double
binary
search
requires
that
its
input
sets
be
in
sorted
order
.
However
,
the
suffix
array
returns
matches
in
lexicographical
order
,
not
numeric
order
.
The
algorithm
of
Rahman
et
al.
(
2006
)
deals
with
this
problem
by
inserting
the
unordered
items
into
a
stratified
tree
.
This
requires
O
(
u
log
log
|
T
|
)
time
for
u
items
.
If
we
used
the
same
strategy
,
our
algorithm
would
no
longer
be
sublinear
.
An
alternative
is
to
precompute
all
n-gram
occurrences
in
order
and
store
them
in
an
inverted
index
.
This
can
be
done
in
one
pass
over
the
data
.⁷
This
approach
requires
a
separate
inverted
index
for
each
n
,
up
to
the
maximum
n
used
by
the
model
.
The
memory
cost
is
one
length
-
|
T
|
array
per
index
.
In
order
to
avoid
the
full
$n|T|$
cost
in
memory
,
our
implementation
uses
a
mixed
strategy
.
We
keep
a
precomputed
inverted
index
only
for
unigrams
.
For
bigrams
and
larger
n-grams
,
we
generate
the
index
on
the
fly
using
stratified
trees
.
This
results
in
a
superlinear
algorithm
for
intersection
.
However
,
we
can
exploit
the
fact
that
we
must
compute
collocations
multiple
times
for
each
input
n-gram
by
caching
the
sorted
set
after
we
create
it
(
The
caching
strategy
is
described
in
§
5.4
)
.
Subsequent
computations
involving
this
n-gram
can
then
be
done
in
linear
or
sublinear
time
.
Therefore
,
the
cost
of
building
the
inverted
index
on
the
fly
is
amortized
over
a
large
number
of
computations
.
⁷We
combine
this
step
with
the
other
precomputations
that
require
a
pass
over
the
data
,
thereby
removing
a
redundant
O
(
|
T
|
)
term
from
the
startup
cost
.
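Building the unigram inverted index takes one left-to-right scan, which yields each word's positions already in numeric order; a minimal sketch, assuming the corpus is a flat token array:

```python
from collections import defaultdict

def build_inverted_index(corpus):
    """Map each unigram to the sorted list of its corpus positions."""
    index = defaultdict(list)
    for pos, word in enumerate(corpus):
        index[word].append(pos)
    return dict(index)
```

For longer n-grams the same structure is built on demand using stratified trees and then cached, as described above.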
5.4
Efficient
Enumeration
A
major
difference
between
contiguous
phrase-based
models
and
hierarchical
phrase-based
models
is
the
number
of
rules
that
potentially
apply
to
an
input
sentence
.
To
make
this
concrete
,
on
our
data
,
with
an
average
of
29
words
per
sentence
,
there
were
on
average
133
contiguous
phrases
of
length
5
or
less
that
applied
.
By
comparison
,
there
were
on
average
7557
hierarchical
phrases
containing
up
to
5
words
.
These
patterns
are
obviously
highly
overlapping
and
we
employ
an
algorithm
to
exploit
this
fact
.
We
first
describe
a
baseline
algorithm
used
for
contiguous
phrases
(
§
5.4.1
)
.
We
then
introduce
some
improvements
(
§
5.4.2
)
and
describe
a
data
structure
used
by
the
algorithm
(
§
5.4.3
)
.
Finally
,
we
discuss
some
special
cases
for
discontiguous
phrases
(
§
5.4.4
)
.
5.4.1 Baseline Algorithm

Zhang
and
Vogel
(
2005
)
present
a
clever
algorithm
for
contiguous
phrase
searches
in
a
suffix
array
.
It
exploits
the
fact
that
for
each
m-length
source
phrase
that
we
want
to
look
up
,
we
will
also
want
to
look
up
its
(m − 1)-length prefix
.
They
observe
that
the
region
of
the
suffix
array
containing
all
suffixes
prefixed
by
ua
is
a
subset
of
the
region
containing
the
suffixes
prefixed
by
u.
Therefore
,
if
we
enumerate
the
phrases
of
our
sentence
in
such
a
way
that
we
always
search
for
u
before
searching
for
ua
,
we
can
restrict
the
binary
search
for
ua
to
the
range
containing
the
suffixes
prefixed
by
u.
If
the
search
for
u
fails
,
we
do
not
need
to
search
for
ua
at
all
.
They
show
that
this
approach
leads
to
some
time
savings
for
phrase
search
,
although
the
gains
are
relatively
modest
since
the
search
for
contiguous
phrases
is
not
very
expensive
to
begin
with
.
However
,
the
potential
savings
in
the
discontiguous
case
are
much
greater
.
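In terms of the binary-search sketch from §2, the restriction amounts to reusing the suffix-array band found for u when searching for ua; the following illustration (our own, with a hypothetical helper name) makes this explicit.

```python
import bisect

def narrow_band(corpus, sa, band, phrase):
    """Given band = (lo, hi) with sa[lo:hi] holding exactly the suffixes
    prefixed by u, return the band for phrase = ua by binary search
    restricted to [lo, hi); an empty input band means ua cannot occur,
    so no search is performed at all."""
    lo, hi = band
    if lo == hi:
        return lo, lo
    m = len(phrase)
    key = lambda i: corpus[i:i + m]
    new_lo = bisect.bisect_left(sa, phrase, lo, hi, key=key)
    new_hi = bisect.bisect_right(sa, phrase, new_lo, hi, key=key)
    return new_lo, new_hi
```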
5.4.2
Improvements
and
Extensions
We
can
improve
on
the
Zhang-Vogel
algorithm
.
An
m-length
contiguous
phrase
aub
depends
not
only
on
the
existence
of
its
prefix
au
,
but
also
on
the
existence
of
its
suffix
ub
.
In
the
contiguous
case
,
we
cannot
use
this
information
to
restrict
the
starting
range
of
the
binary
search
,
but
we
can
check
for
the
existence
of
ub
to
decide
whether
we
even
need
to
search
for
aub
at
all
.
This
can
help
us
avoid
searches
that
are
guaranteed
to
be
fruitless
.
Now
consider
the
discontiguous
case
.
As
in
the
analogous
contiguous
case
,
a phrase aαb will only exist in the text if its maximal prefix aα and maximal suffix αb both exist in the corpus and overlap at specific positions.⁸
Searching
for
aαb
is
potentially
very
expensive
,
so
we
put
all
available
information
to
work
.
Before
searching
,
we
require
that
both
aα
and
αb
exist
.
Additionally
,
we
compute
the
location
of
aαb
using
the
locations
of
both
maximal
subphrases
.
To
see
why
the
latter
optimization
is
useful
,
consider
a
phrase
abXcd
.
In
our
baseline
algorithm
,
we
would
search
for
ab
and
cd
,
and
then
perform
a
computation
to
see
whether
these
subphrases
were
collocated
within
an
elastic
window
.
However
,
if
we
instead
use
abXc
and
bXcd
as
the
basis
of
the
computation
,
we
gain
two
advantages
.
First, the number of elements in each set is likely to be smaller than in the former case.
Second
,
the
computation
becomes
simpler
,
because
we
now
only
need
to
check
to
see
whether
the
patterns
exactly
overlap
with
a
starting
offset
of
one
,
rather
than
checking
within
a
window
of
locations
.
We
can
improve
efficiency
even
further
if
we
consider
cases
where
the
same
substring
occurs
more
than
once
within
the
same
sentence
,
or
even
in
multiple
sentences
.
If
the
computation
required
to
look
up
a
phrase
is
expensive
,
we
would
like
to
perform
the
lookup
only
once
.
This
requires
some
mechanism
for
caching
.
Depending
on
the
situation
,
we
might
want
to
cache
only
certain
subsets
of
phrases
,
based
on
their
frequency
or
difficulty
to
compute
.
We
would
also
like
the
flexibility
to
combine
on-the-fly
lookups
with
a
partially
precomputed
phrase
table
,
as
in
the
online
/
offline
mixture
of
Zhang
and
Vogel
(
2005
)
.
We
need
a
data
structure
that
provides
this
flexibility
,
in
addition
to
providing
fast
access
to
both
the
maximal
prefix
and
maximal
suffix
of
any
phrase
that
we
might
consider
.
Our
search
optimizations
are
easily
captured
in
a
prefix
tree
data
structure
augmented
with
suffix
links
.
Formally
,
a
prefix
tree
is
an
unminimized
deterministic
finite-state
automaton
that
recognizes
all
of
the
patterns
in
some
set
.
⁸Except when α = X, in which case a and b must be collocated within a window defined by the phrase length constraints.

[Figure 2: Illustration of prefix tree construction showing a partial prefix tree, including suffix links. Suppose we are interested in pattern abXcd, represented by node (1). Its prefix is represented by node (2), and node (2)'s suffix is represented by node (3). Therefore, node (1)'s suffix is represented by the node pointed to by the d-edge from node (3), which is node (4). There are two cases. In case 1, node (4) is inactive, so we can mark node (1) inactive and stop. In case 2, node (4) is active, so we compute the collocation of abXc and bXcd with information stored at nodes (2) and (4), using either a precomputed intersection, double binary search, or merge, depending on the size of the sets. If the result is empty, we mark the node inactive. Otherwise, we store the results at node (1) and add its successor patterns to the frontier for the next iteration. This includes all patterns containing exactly one more terminal symbol than the current pattern.]

Each node in the tree represents
the
prefix
of
a
unique
pattern
from
the
set
that
is
specified
by
the
concatenation
of
the
edge
labels
along
the
path
from
the
root
to
that
node
.
A
suffix
link
is
a
pointer
from
a
node
representing
path
aα
to
the
node
representing
path
α.
We
will
use
this
data
structure
to
record
the
set
of
patterns
that
we
have
searched
for
and
to
cache
information
for
those
that
were
found
successfully
.
Our
algorithm
generates
the
tree
breadth-first
along
a
frontier
.
In
the
mth
iteration
we
only
search
for
patterns
containing
m
terminal
symbols
.
Regardless
of
whether
we
find
a
particular
pattern
,
we
create
a
node
for
it
in
the
tree
.
If
the
pattern
was
found
in
the
corpus
,
its
node
is
marked
active
.
Otherwise
,
it
is
marked
inactive
.
For
found
patterns
,
we
store
either
the
endpoints
of
the
suffix
array
range
containing
the
phrase
(
if
it
is
contiguous
)
,
or
the
list
of
locations
at
which
the
phrase
is
found
(
if
it
is
discontiguous
)
.
We can also store the extracted rules.⁹ Whenever a pattern is successfully found, we add all patterns with m + 1 terminals that are prefixed by it to the frontier for processing in the next iteration.

⁹Conveniently, the implementation of Chiang (2007) uses a prefix tree grammar encoding, as described in Klein and Manning (2001). Our implementation decorates this tree with additional information required by our algorithms.
To
search
for
a
pattern
,
we
use
location
information
from
its
parent
node
,
which
represents
its
maximal
prefix
.
Assuming
that
the
node
represents
phrase
αb
,
we
find
the
node
representing
its
maximal
suffix
by
following
the
b-edge
from
the
node
pointed
to
by
its
parent
node
's
suffix
link
.
If
the
node
pointed
to
by
this
suffix
link
is
inactive
,
we
can
mark
the
node
inactive
without
running
a
search
.
When
a
node
is
marked
inactive
,
we
discontinue
search
for
phrases
that
are
prefixed
by
the
path
it
represents
.
The
algorithm
is
illustrated
in
Figure
2
.
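The structural core of the node and the extension step can be sketched as follows; `search` is a placeholder for the lookup machinery of §5.1-5.3 (precomputed intersection, double binary search, or merge), and the shadow-subtree and trailing-X special cases described below are omitted.

```python
class Node:
    """Prefix-tree node: `children` maps edge labels (terminals or the
    nonterminal X) to nodes, `suffix` is the suffix link, and
    `locations` caches match information for this node's path."""
    def __init__(self):
        self.children = {}
        self.suffix = None
        self.active = True
        self.locations = None

def extend(root, node, symbol, search):
    """Create the child of `node` along `symbol`, i.e. the node for the
    path αb, where α (node's path) is the new pattern's maximal prefix."""
    child = node.children[symbol] = Node()
    # the maximal suffix is reached by following the b-edge from the
    # node pointed to by the parent's suffix link (root for length 1)
    child.suffix = root if node is root else node.suffix.children.get(symbol)
    if not node.active or child.suffix is None or not child.suffix.active:
        child.active = False   # case 1: the search is skipped entirely
    else:
        # case 2: collocate using the information stored at the
        # maximal prefix and maximal suffix nodes
        child.locations = search(node, child.suffix)
        child.active = child.locations is not None
    return child
```

A breadth-first driver then maintains the frontier: in the mth iteration it calls extend only for candidate patterns with m terminals whose parents are still active.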
A
few
subtleties
arise
in
the
extraction
of
hierarchical
patterns
.
Gaps
are
allowed
to
occur
at
the
beginning
or
end
of
a
phrase
.
For
instance
,
we
may
have
a
source
phrase
Xu
or
uX
or
even
XuX
.
Although
each
of
these
phrases
requires
its
own
path
in
the
prefix
tree
,
they
are
lexically
identical
to
phrase
u.
An
analogous
situation
occurs
with
the
patterns
XuXv
,
uXvX
,
and
uXv
.
There
are
two
cases
that
we
are
concerned
with
.
The
first
case
consists
of
all
patterns
prefixed
with
X.
The
paths
to
nodes
representing
these
patterns
will
all
contain
the
X-edge
originating
at
the
root
node
.
All
of
these
paths
form
the
shadow
subtree
.
Path
construction
in
this
subtree
proceeds
differently
.
Because
they
are
lexically
identical
to
their
suffixes
,
they
are
automatically
extended
if
their
suffix
paths
are
active
,
and
they
inherit
location
information
of
their
suffixes
.
The
second
case
consists
of
all
patterns
suffixed
with
X.
Whenever
we
successfully
find
a
new
pattern
α
,
we
automatically
extend
it
with
an
X
edge
,
provided
that
αX
is
allowed
by
the
model
constraints
.
The
node
pointed
to
by
this
edge
inherits
its
location
information
from
its
parent
node
(
representing
the
maximal
prefix
α
)
.
Note
that
both
special
cases
occur
for
patterns
in
the
form
XuX
.
[Figure 3: Effect of precomputation on memory use and processing time (x-axis: number of frequent subpatterns). Here we show only the memory requirements of the precomputed collocations.]
5.5
Memory
Requirements
As
shown
in
Callison-Burch
et
al.
(
2005
)
,
we
must
keep
an
array
for
the
source
text
F
,
its
suffix
array
,
the
target
text
E
,
and
alignment
A
in
memory
.
Assuming
that
A
and
E
are
roughly
the
size
of
F
,
the
cost
is
4
|
T
|
.
If
we
assume
that
all
data
use
vocabularies
that
can
be
represented
using
32-bit
integers
,
then
our
27M
word
corpus
can
easily
be
represented
in
around
500MB
of
memory
.
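As a rough check of that figure, a worked instance of the $4|T|$ estimate under these assumptions:

$$4 \text{ arrays} \times 27 \times 10^6 \text{ words} \times 4 \text{ bytes} \approx 432\,\text{MB},$$

which lands near 500MB once data-structure overhead is included.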
Adding
the
inverted
index
for
unigrams
increases
this
by
20
%
.
The
main
additional
cost
in
memory
comes
from
the
storage
of
the
precomputed
collocations
.
This
is
dependent
both
on
the
corpus
size
and
the
number
of
collocations
that
we
choose
to
precompute
.
Using
detailed
timing
data
from
our
experiments
we
were
able
to
simulate
the
memory-speed
tradeoff
(
Fig
.
3
)
.
If
we
include
a
trigram
model
trained
on
our
bitext
and
the
Chinese
Gigaword
corpus
,
the
overall
storage
costs
for
our
system
are
approximately
2GB
.
6
Experiments
All
of
our
experiments
were
performed
on
Chinese-English
in
the
news
domain
.
We
used
a
large
training
set
consisting
of
over
1
million
sentences
from
various
newswire
corpora
.
This
corpus
is
roughly
the
same
as
the
one
used
for
large-scale
experiments
by
Chiang
et
al.
(
2005
)
.
To
generate
alignments
,
we
used
GIZA++
(
Och
and
Ney
,
2003
)
.
We
symmetrized
bidirectional
alignments
using
the
grow-diag-final
heuristic
(
Koehn
et
al.
,
2003
)
.
We
used
the
first
50
sentences
of
the
NIST
2003
test
set
to
compute
timing
results
.
All
of
our
algorithms
were
implemented
in
Python
2.4
.¹⁰
Timing
results
are
reported
for
machines
with
8GB
of
memory
and
four
3GHz
Xeon
processors
running
Red
Hat
Linux
2.6.9
.
In
order
to
understand
the
contributions
of
various
improvements
,
we
also
ran
the
system
with
various
ablations
.
In
the
default
setting
,
the
prefix
tree
is
constructed
for
each
sentence
to
guide
phrase
lookup
,
and
then
discarded
.
To
show
the
effect
of
caching
we
also
ran
the
algorithm
without
discarding
the
prefix
tree
between
sentences
,
resulting
in
full
inter-sentence
caching
.
The
results
are
shown
in
Table
1.¹¹
It
is
clear
from
the
results
that
each
of
the
optimizations
is
needed
to
sufficiently
reduce
lookup
time
to
practical
levels
.
Although
this
is
still
relatively
slow
,
it
is
much
closer
to
the
decoding
time
of
10
seconds
per
sentence
than
the
baseline
.
¹⁰Python
is
an
interpreted
language
and
our
implementations
do
not
use
any
optimization
features
.
It
is
therefore
reasonable
to
think
that
a
more
efficient
reimplementation
would
result
in
across-the-board
speedups
.
¹¹The
results
shown
here
do
not
include
the
startup
time
required
to
load
the
data
structures
into
memory
.
In
our
Python
implementation
this
takes
several
minutes
,
which
in
principle
should
be
amortized
over
the
cost
for
each
sentence
.
However
,
just
as
Zens
and
Ney
(
2007
)
do
for
phrase
tables
,
we
could
compile
our
data
structures
into
binary
memory-mapped
files
,
which
can
be
read
into
memory
in
a
matter
of
seconds
.
We
are
currently
investigating
this
option
in
a
C
reimplementation
.
[Table 1: Timing results (secs/sent) and number of collocations computed for various combinations of algorithms. The runs using precomputation use the 1000 most frequent patterns.]
7
Conclusions
and
Future
Work
Our
work
solves
a
seemingly
intractable
problem
and
opens
up
a
number
of
intriguing
potential
applications
.
Both
Callison-Burch
et
al.
(
2005
)
and
Zhang
and
Vogel
(
2005
)
use
suffix
arrays
to
relax
the
length
constraints
on
phrase-based
models
.
Our
work
enables
this
in
hierarchical
phrase-based
models
.
However
,
we
are
interested
in
additional
applications
.
Recent
work
in
discriminative
learning
for
many
natural
language
tasks
,
such
as
part-of-speech
tagging
and
information
extraction
,
has
shown
that
feature
engineering
plays
a
critical
role
in
these
approaches
.
However
,
in
machine
translation
most
features
can
still
be
traced
back
to
the
IBM
Models
of
15
years
ago
(
Lopez
,
2007b
)
.
Recently
,
Lopez
and
Resnik
(
2006
)
showed
that
most
of
the
features
used
in
standard
phrase-based
models
do
not
help
very
much
.
Our
algorithms
enable
us
to
look
up
phrase
pairs
in
context
,
which
will
allow
us
to
compute
interesting
contextual
features
that
can
be
used
in
discriminative
learning
algorithms
to
improve
translation
accuracy
.
Essentially
,
we
can
use
the
training
data
itself
as
an
indirect
representation
of
whatever
features
we
might
want
to
compute
.
This
is
not
possible
with
table-based
architectures
.
Most
of
the
data
structures
and
algorithms
discussed
in
this
paper
are
widely
used
in
bioinformatics
,
including
suffix
arrays
,
prefix
trees
,
and
suffix
links
(
Gusfield
,
1997
)
.
As
discussed
in
§
4.1
,
our
problem
is
a
variant
of
the
approximate
pattern
matching
problem
.
A
major
application
of
approximate
pattern
matching
in
bioinformatics
is
query
processing
in
protein
databases
for
purposes
of
sequencing
,
phylogeny
,
and
motif
identification
.
Current
MT
models
,
including
hierarchical
models
,
translate
by
breaking
the
input
sentence
into
small
pieces
and
translating
them
largely
independently
.
Using
approximate
pattern
matching
algorithms
,
we
imagine
that
machine
translation
could
be
treated
very
much
like
search
in
a
protein
database
.
In
this
scenario
,
the
goal
is
to
select
training
sentences
that
match
the
input
sentence
as
closely
as
possible
,
under
some
evaluation
function
that
accounts
for
both
matched
and
mismatched
sequences
,
as
well
as
possibly
other
data
features
.
Once
we
have
found
the
closest
sentences
we
can
translate
the
matched
portions
in
their
entirety
,
replacing
mismatches
with
appropriate
word
,
phrase
,
or
hierarchical
phrase
translations
as
needed
.
This
model
would
bring
statistical
machine
translation
closer
to
convergence
with
so-called
example-based
translation
,
following
current
trends
(
Marcu
,
2001
;
Och
,
2002
)
.
We
intend
to
explore
these
ideas
in
future
work
.
Acknowledgements
I
would
like
to
thank
Philip
Resnik
for
encouragement
,
thoughtful
discussions
and
wise
counsel
;
David
Chiang
for
providing
the
source
code
for
his
translation
system
;
and
Nitin
Madnani
,
Smaranda
Muresan
and
the
anonymous
reviewers
for
very
helpful
comments
on
earlier
drafts
of
this
paper
.
Any
errors
are
my
own
.
This
research
was
supported
in
part
by
ONR
MURI
Contract
FCPO.810548265
and
the
GALE
program
of
the
Defense
Advanced
Research
Projects
Agency
,
Contract
No.
HR0011-06-2-001
.
Any
opinions
,
findings
,
conclusions
or
recommendations
expressed
in
this
paper
are
those
of
the
author
and
do
not
necessarily
reflect
the
view
of
DARPA
.
