It
is
possible
to
reduce
the
bulk
of
phrasetables
for
Statistical
Machine
Translation
using
a
technique
based
on
the
significance
testing
of
phrase
pair
co-occurrence
in
the
parallel
corpus
.
The
savings
can
be
quite
substantial
(
up
to
90
%
)
and
cause
no
reduction
in
BLEU
score
.
In some cases, an improvement in BLEU is obtained at the same time, although the effect is less pronounced if state-of-the-art phrasetable smoothing is employed.
1
Introduction
An
important
part
of
the
process
of
Statistical
Machine
Translation
(
SMT
)
involves
inferring
a
large
table
of
phrase
pairs
that
are
translations
of
each
other
from
a
large
corpus
of
aligned
sentences
.
These
phrase
pairs
together
with
estimates
of
conditional
probabilities
and
useful
feature
weights
,
called
collectively
a
phrasetable
,
are
used
to
match
a
source
sentence
to
produce
candidate
translations
.
The
choice
of
the
best
translation
is
made
based
on
the
combination
of
the
probabilities
and
feature
weights
,
and
much discussion has been devoted to how to make the estimates of probabilities, how to smooth these estimates, and what features are most useful for discriminating among the translations.
However
,
a
cursory
glance
at
phrasetables
produced
often
suggests
that
many
of
the
translations
are
wrong
or
will
never
be
used
in
any
translation
.
On
the
other
hand
,
most
obvious
ways
of
reducing
the
bulk
usually
lead
to
a
reduction
in
translation
quality
as
measured
by
BLEU
score
.
This
has
led
to
an
impression
that
these
pairs
must
contribute
something
in
the
grand
scheme
of
things
and
,
certainly
,
more
data
is
better
than
less
.
Nonetheless
,
this
bulk
comes
at
a
cost
.
Large
tables
lead
to
large
data
structures
that
require
more
resources
and
more
time
to
process
and
,
more
importantly
,
effort
directed at
handling
large
tables
could
likely
be
more
usefully
employed
in
more
features
or
more
sophisticated
search
.
In
this
paper
,
we
show
that
it
is
possible
to
prune
phrasetables
using
a
straightforward
approach
based
on
significance
testing
,
that
this
approach
does
not
adversely
affect
the
quality
of
translation
as
measured
by
BLEU
score
,
and
that
savings
in
terms
of
number
of
discarded
phrase
pairs
can
be
quite
substantial
.
Even more surprisingly, pruning can actually raise the BLEU score, although this phenomenon is less prominent if state-of-the-art smoothing of phrasetable probabilities is employed.
Section 2 reviews the basic ideas of Statistical Machine Translation, as well as those of testing the significance of associations in two-by-two contingency tables, that is, their departure from independence.
From
this
,
a
filtering
algorithm
will
be
described
that
keeps
only
phrase
pairs
that
pass
a
significance
test
.
Section
3
outlines
a
number
of
experiments
that
demonstrate
the
phenomenon
and
measure
its
magnitude
.
Section
4
presents
the
results
of
these
experiments
.
The
paper
concludes
with
a
summary
of
what
has
been
learned
and
a
discussion
of
continuing
work
that
builds
on
these
ideas
.
2
Background
Theory
2.1
Our
Approach
to
Statistical
Machine
Translation
We define a phrasetable as a set of source phrases (n-grams) s and their translations (m-grams) t, along with associated translation probabilities p(s|t) and p(t|s).
These
conditional
distributions
are
derived
from
the
joint
frequencies
c
(
s
,
t
)
of
source
/
target
n
,
m-grams
observed
in
a
word-aligned
parallel
corpus
.
These
joint
counts
are
estimated
using
the
phrase
induction
algorithm
described
in
(
Koehn
et
al.
,
2003
)
,
with
symmetrized
word
alignments
generated
using
IBM
model
2
(
Brown
et
al.
,
1993
)
.
Phrases
are
limited
to
8
tokens
in
length
(n, m ≤ 8).
Given
a
source
sentence
s
,
our
phrase-based
SMT
system
tries
to
find
the
target
sentence
t
that
is
the
most
likely
translation
of
s.
To
make
search
more
efficient
,
we
use
the
Viterbi
approximation
and
seek
the
most
likely
combination
of
t
and
its
alignment
a
with
s
,
rather
than
just
the
most
likely
t
:

$$\hat{t} = \mathop{\mathrm{argmax}}_{t}\; p(t \mid s) \approx \mathop{\mathrm{argmax}}_{t,a}\; p(t, a \mid s),$$

where $a = (\tilde{s}_1, \tilde{t}_1, j_1), \ldots, (\tilde{s}_K, \tilde{t}_K, j_K)$; $\tilde{t}_k$ are target phrases such that $t = \tilde{t}_1 \ldots \tilde{t}_K$; $\tilde{s}_k$ are source phrases such that $s = \tilde{s}_{j_1} \ldots \tilde{s}_{j_K}$; and $\tilde{s}_k$ is the translation of the $k$th target phrase $\tilde{t}_k$.
To model p(t, a|s), we use a standard loglinear combination of feature functions whose weights are tuned on a development corpus; the features include 4-gram language model probabilities estimated with the SRI Language Modeling Toolkit (Stolcke, 2002).
Phrase translation model probabilities are features of the form:

$$\log p(s \mid t, a) = \sum_{k=1}^{K} \log p(\tilde{s}_k \mid \tilde{t}_k),$$

i.e., we assume that the phrases $\tilde{s}_k$ specified by $a$ are conditionally independent, and depend only on their aligned phrases $\tilde{t}_k$.
The "forward" phrase probabilities $p(\tilde{t} \mid \tilde{s})$ are not used as features, but only as a filter on the set of possible translations: for each source phrase $\tilde{s}$ that matches some n-gram in $s$, only the 30 top-ranked translations $\tilde{t}$ according to $p(\tilde{t} \mid \tilde{s})$ are retained.
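As a concrete illustration, here is a minimal sketch of this filter; the phrasetable representation (a dict from source phrase to a list of (translation, forward probability) pairs) is an assumption for illustration, not our actual data structure.

```python
def filter_top_k(phrasetable, k=30):
    """Keep only the k top-ranked translations per source phrase,
    ranked by the forward probability p(t|s)."""
    return {
        src: sorted(cands, key=lambda ct: ct[1], reverse=True)[:k]
        for src, cands in phrasetable.items()
    }

# Toy example: each source phrase maps to (translation, p(t|s)) pairs.
pt = {"maison": [("house", 0.7), ("home", 0.2), ("building", 0.1)]}
pt = filter_top_k(pt, k=2)  # drops ("building", 0.1)
```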
One of the reviewers correctly pointed out that taking only the top 30 translations will interact with the subject under study; however, this pruning technique has been used as a way of controlling the width of our beam search, and rebalancing the search parameters would have complicated this study and taken it away from our standard practice.
The
phrase
translation
model
probabilities
are
smoothed
according
to
one
of
several
techniques
as
described
in
(
Foster
et
al.
,
2006
)
and
identified
in
the
discussion
below
.
2.2
Significance
testing
using
two
by
two
contingency
tables
Each
phrase
pair
can
be
thought
of
as
an
n
,
m-gram
(
s
,
t
)
where
s
is
an
n-gram
from
the
source
side
of
the
corpus
and
t
is
an
m-gram
from
the
target
side
of
the
corpus
.
We
then
define
:
C
(
s
,
t
)
as
the
number
of
parallel
sentences
that
contain
one
or
more
occurrences
of
s
on
the
source
side
and
t
on
the
target
side
;
C
(
s
)
the
number
of
parallel
sentences
that
contain
one
or
more
occurrences
of
s
on
the
source
side
;
and
C
(
t
)
the
number
of
parallel
sentences
that
contain
one
or
more
occurrences
of
t
on
the
target
side
.
Together
with
N
,
the
number
of
parallel
sentences
,
we
have
enough
information
to
draw
up
a
two
by
two
contingency
table
representing
the
unconditional
relationship
between
s and t. This table is shown in Table 1.
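As a concrete illustration of how these counts can be collected, here is a minimal sketch; the corpus and phrase representations (plain strings with substring matching) and the quadratic scan are simplifying assumptions rather than our actual implementation.

```python
from collections import Counter

def cooccurrence_counts(corpus, phrase_pairs):
    """Sentence-level counts for the contingency tables.

    corpus: iterable of (source_sentence, target_sentence) pairs.
    phrase_pairs: iterable of (s, t) phrase pairs from the phrasetable.
    Returns (C(s,t), C(s), C(t), N); each sentence pair is counted at
    most once per phrase (pair), no matter how often it matches inside.
    """
    pairs = set(phrase_pairs)
    sources = {s for s, _ in pairs}
    targets = {t for _, t in pairs}
    c_st, c_s, c_t = Counter(), Counter(), Counter()
    n = 0
    for src, tgt in corpus:
        n += 1
        s_hits = {s for s in sources if s in src}  # substring match stands
        t_hits = {t for t in targets if t in tgt}  # in for n-gram matching
        for s in s_hits:
            c_s[s] += 1
        for t in t_hits:
            c_t[t] += 1
        for s, t in pairs:
            if s in s_hits and t in t_hits:
                c_st[s, t] += 1
    return c_st, c_s, c_t, n
```

Note that, exactly as discussed below for C(s,t) versus c(s,t), these counts ignore the word alignment and register at most one match per sentence pair.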
A
standard
statistical
technique
used
to
assess
the
importance
of
an
association
represented
by
a
contingency
table
involves
calculating
the
probability
that
the
observed
table
or
one
that
is
more
extreme
could
occur
by
chance
assuming
a
model
of
independence
.
This
is
called
a
significance
test
.
Introductory
statistics
texts
describe
one
such
test
called
the
Chi-squared
test
.
There
are
other
tests
that
more
accurately
apply
to
our
small
tables
with
only
two
rows
and
columns
.
Table 1: Two-by-two contingency table for s and t (cells follow from the definitions above):

                   t present          t absent
    s present      C(s,t)             C(s) - C(s,t)
    s absent       C(t) - C(s,t)      N - C(s) - C(t) + C(s,t)
In particular, Fisher's exact test calculates the probability of the observed table using the hypergeometric distribution.
The
p-value
associated
with
our
observed
table
is
then
calculated
by
summing
probabilities
for
tables
that
have
a
larger
C(s,t).
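Concretely, under the independence model with fixed margins, the probability of a table with joint count $k$ is hypergeometric, and the one-sided p-value sums the tail (the standard textbook formulas, stated here in the notation defined above):

$$p_h(k) = \frac{\binom{C(s)}{k}\binom{N-C(s)}{C(t)-k}}{\binom{N}{C(t)}}, \qquad \text{p-value} = \sum_{k=C(s,t)}^{\min(C(s),\,C(t))} p_h(k).$$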
This
probability
is
interpreted
as
the
probability
of
observing
by
chance
an
association
that
is
at
least
as
strong
as
the
given
one
and
hence
its
significance
.
Agresti
(
1996
)
provides
an
excellent
introduction
to
this
topic
and
the
general
ideas
of
significance
testing
in
contingency
tables
.
Fisher
's
exact
test
of
significance
is
considered
a
gold
standard
since
it
represents
the
precise
probabilities
under
realistic
assumptions
.
Tests
such
as
the
Chi-squared
test
or
the
log-likelihood-ratio
test
(
yet
another
approximate
test
of
significance
)
depend
on
asymptotic
assumptions
that
are
often
not
valid
for
small
counts
.
Note
that
the
count
C
(
s
,
t
)
can
be
larger
or
smaller
than
c(s, t)
discussed
above
.
In
most
cases
,
it
will
be
larger
,
because
it
counts
all
co-occurrences
of
s with t
rather
than
just
those
that
respect
the
word
alignment
.
It
can
be
smaller
though
because
multiple
co-occurrences
can
occur
within
a
single
aligned
sentence
pair
and
be
counted
multiple
times
in
c(s, t)
.
On
the
other
hand
,
C
(
s
,
t
)
will
not
count
all
of
the
possible
ways
that
an
n
,
m-gram
match
can
occur
within
a
single
sentence
pair
;
it
will
count
the
match
only
once
per
sentence
pair
in
which
it
occurs
.
Moore
(
2004
)
discusses
the
use
of
significance
testing
of
word
associations
using
the
log-likelihood-ratio
test
and
Fisher
's
exact
test
.
He
shows
that
Fisher
's
exact
test
is
often
a
practical
method
if
a
number
of
techniques
are
followed
:
1
.
approximating
the
logarithms
of
factorials
using
commonly
available
numerical
approximations
to
the
log
gamma
function
,
2
.
using
a
well-known
recurrence
for
the
hypergeometric
distribution
,
3
.
noting
that
few
terms
usually
need
to
be
summed
,
and
4
.
observing
that
convergence
is
usually
rapid
.
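A minimal sketch of this recipe follows (the function names are ours for illustration, and the counts are the sentence-level ones defined in Section 2.2); math.lgamma supplies the log-factorial approximation, and the loop applies the hypergeometric recurrence with early termination.

```python
import math

def log_hypergeometric(k, cs, ct, n):
    """Log-probability of joint count k under independence, with margins
    cs = C(s), ct = C(t) and n = N parallel sentences (technique 1:
    log-factorials via the log-gamma function)."""
    lf = lambda x: math.lgamma(x + 1)  # log(x!)
    return (lf(cs) - lf(k) - lf(cs - k)                       # log C(cs, k)
            + lf(n - cs) - lf(ct - k) - lf(n - cs - ct + k)   # log C(n-cs, ct-k)
            - lf(n) + lf(ct) + lf(n - ct))                    # - log C(n, ct)

def neg_log_p_value(cst, cs, ct, n):
    """Negative log of the one-sided Fisher p-value P(count >= cst)."""
    log_p0 = log_hypergeometric(cst, cs, ct, n)
    ratio_sum, ratio, k = 1.0, 1.0, cst
    while k < min(cs, ct):
        # technique 2: recurrence p(k+1)/p(k) for the hypergeometric
        ratio *= (cs - k) * (ct - k) / ((k + 1) * (n - cs - ct + k + 1))
        ratio_sum += ratio
        k += 1
        if ratio < 1e-12 * ratio_sum:  # techniques 3 and 4: few terms
            break                      # are needed; convergence is rapid
    return -(log_p0 + math.log(ratio_sum))
```

As a check, for a 1-1-1 pair (cst = cs = ct = 1) the first term is 1/N and the loop contributes nothing, so the function returns log(N), matching Moore's 1/N observation quoted in the next section.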
2.3
Significance
pruning
The
idea
behind
significance
pruning
of
phrasetables
is
that
not
all
of
the
phrase
pairs
in
a
phrasetable
are
equally
supported
by
the
data
and
that
many
of
the
weakly
supported
pairs
could
be
removed
because
:
1
.
the
chance
of
them
occurring
again
might
be
low
,
and
2
.
their
occurrence
in
the
given
corpus
may
be
the
result
of
an
artifact
(
a
combination
of
effects
where
several
estimates
artificially
compensate
for
one
another
)
.
This
concept
is
usually
referred
to
as
overfitting
since
the
model
fits
aspects
of
the
training
data
that
do
not
lead
to
improved
prediction
.
Phrase
pairs
that
cannot
stand
on
their
own
by
demonstrating
a
certain
level
of
significance
are
suspect
and
removing
them
from
the
phrasetable
may
be
beneficial
in
terms
of
reducing
the
size
of
data
structures
.
This
will
be
shown
to
be
the
case
in
rather
general
terms
.
Note
that
this
pruning
may
and
quite
often
will
remove
all
of
the
candidate
translations
for
a
source
phrase
.
This
might
seem
to
be
a
bad
idea
but
it
must
be
remembered
that
deleting
longer
phrases
will
allow
combinations
of
shorter
phrases
to
be
used
and
these
might
have
more
and
better
translations
from
the
corpus
.
Here
is
part
of
the
intuition
about
how
phrasetable
smoothing
may
interact
with
phrasetable
pruning
:
both
are
discouraging
longer
but
infrequent
phrases
from
the
corpus
in
favour
of combinations
of
more
frequent
,
shorter
phrases
.
Because the probabilities involved below will be extremely small, we will work instead with the negative of the natural logs of the probabilities. Thus, instead of selecting phrase pairs with a p-value less than exp(−20), we will select phrase pairs with a negative-log-p-value greater than 20.
This
has
the
advantage
of
working
with
ordinary-sized
numbers
and
the
happy
convention
that
bigger
means
more
pruning
.
An
important
special
case
of
a
table
occurs
when
a
phrase
pair
occurs
exactly
once
in
the
corpus
,
and
each
of
the
component
phrases
occurs
exactly
once
in
its
side
of
the
parallel
corpus
.
These
phrase
pairs
will
be
referred
to
as
1-1-1
phrase
pairs
and
the
corresponding
tables
will
be
called
1-1-1
contingency
tables
because
C(s) = 1, C(t) = 1, and C(s,t) = 1.
Moore
(
2004
)
comments
that
the
p-value
for
these
tables
under
Fisher
's
exact
test
is
1
/
N.
Since
we
are
using
thresholds
of
the
negative
logarithm
of
the
p-value
,
the
value
α = log(N)
is
a
useful
threshold
to
consider
.
In
particular
,
α + ε (where ε is an appropriately small positive number)
is
the
smallest
threshold
that
results
in
none
of
the
1-1-1
phrase
pairs
being
included
.
Similarly
,
α − ε
is
the
largest
threshold
that
results
in
all
of
the
1-1-1
phrase
pairs
being
included
.
Because 1-1-1 phrase pairs can make up a large part of the phrasetable, this is an important observation in its own right.
Moreover, because the table having the greatest significance (lowest p-value) among phrase pairs occurring exactly once is the 1-1-1 table, the threshold α + ε can be used to exclude all of the phrase pairs occurring exactly once (C(s,t) = 1).
The
common
strategy
of
deleting
all
of
the
1-count
phrase
pairs
is
very
similar
in
effect
to
the
use
of
the
α + ε
threshold
.
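To make the threshold logic concrete, here is a sketch of the resulting filter, reusing neg_log_p_value from the sketch in Section 2.2; the entry format and the handling of the α boundary via a small ε are illustrative assumptions.

```python
import math

def prune_phrasetable(entries, counts, n, threshold):
    """Keep only phrase pairs whose association is highly significant.

    entries:   iterable of (s, t) phrase pairs.
    counts:    (c_st, c_s, c_t) Counters as computed in Section 2.2.
    n:         number of parallel sentences, N.
    threshold: negative-log-p-value cutoff; bigger means more pruning.
    """
    c_st, c_s, c_t = counts
    return [(s, t) for (s, t) in entries
            if neg_log_p_value(c_st[s, t], c_s[s], c_t[t], n) > threshold]

# Every 1-1-1 pair has negative-log-p-value exactly alpha = log(N), so:
n = 1_000_000
alpha, eps = math.log(n), 1e-6
# prune_phrasetable(entries, counts, n, alpha + eps)  # excludes all 1-1-1 pairs
# prune_phrasetable(entries, counts, n, alpha - eps)  # includes all 1-1-1 pairs
```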
3
Experiments
The
corpora
used
for
most
of
these
experiments
are
publicly
available
and
have
been
used
for
a
number
of
comparative
studies
(
Workshop
on
Statistical
Machine
Translation
,
2006
)
.
Provided
as
part
of
the
materials
for
the
shared
task
are
parallel
corpora
for
French-English
,
Spanish-English
,
and
German-English
as
well
as
language
models
for
English
,
French
,
Spanish
,
and
German
.
These
are
all
based
on
the
Europarl
resources
(
Europarl
,
2003
)
.
The
only
change
made
to
these
corpora
was
to
convert
them
to
lowercase
and
to
Unicode
UTF-8
.
Phrasetables
were
produced
by
symmetrizing
IBM2
conditional
probabilities
as
described
above
.
The
phrasetables
were
then
used
as
a
list
of
n
,
m-grams
for
which
counts
C(s,t), C(s), and C(t)
were
obtained
.
Negative-log-p-values
under
Fisher
's
exact
test
were
computed
for
each
of
the
phrase
pairs
in
the
phrasetable
and
the
entry
was
censored
if
the
negative-log-p-value
for
the
test
was
below
the
pruning
threshold
.
The
entries
that
are
kept
are
ones
that
are
highly
significant
.
A number of combinations involving many different pruning thresholds were considered: no pruning, 10, α − ε, α + ε, 15, 20, 25, 50, 100, and 1000. In addition, a number of different phrasetable smoothing algorithms were used: no smoothing, Good-Turing smoothing, Kneser-Ney 3-parameter smoothing, and the loglinear mixture of two features called Zens-Ney (Foster et al., 2006).
To
test
the
effects
of
significance
pruning
on
larger
corpora
,
a
series
of
experiments
was
run
on
a
much
larger
corpus
based
on
that
distributed
for
MT06
Chinese-English
(
NIST
MT
,
2006
)
.
Since the objective was to assess how the method scaled, we used our preferred phrasetable smoothing technique, Zens-Ney, and separated our corpus into two phrasetables: one based on the UN corpus and the other based on the best of the remaining parallel corpora available to us.

Table 2: Corpus sizes (number of parallel sentences, N) and α values

Figure 1 (referenced below): BLEU by pruning threshold; phrasetable size by pruning threshold; BLEU by phrasetable size (curves: no smoothing, Zens-Ney)
Different
pruning
thresholds
were
considered
:
no
pruning
,
14
,
16
,
18
,
20
,
and
25
.
In
addition
,
another
more
aggressive
method
of
pruning
was
attempted
.
Moore
points
out
,
correctly
,
that
phrase
pairs
that
occur
in
only
one
sentence
pair
,
(C(s,t) = 1)
,
are
less
reliable
and
might
require
more
special
treatment
.
These
are
all
pruned
automatically
at
thresholds
of
16
and
above
but
not
at a threshold of
14
.
A
special
series
of
runs
was
done
for
threshold
14
with
all
of
these
singletons
removed
to
see
whether
at
these
thresholds
it
was
the
significance
level
or
the
pruning
of
phrase
pairs
with
(
C
(
s
,
t
)
=
1
)
that
was
more
important
.
This
is
identified
as
14
'
in
the
results
.
4
Results
The
results
of
the
experiments
are
described
in
Tables
2
through
6
.
Table
2
presents
the
sizes
of
the
various
parallel
corpora
showing
the
number
of
parallel
sentences
,
N
,
for
each
of
the
experiments
,
together
with
the
α thresholds (α = log(N))
.
Table
3
shows
the
sizes
of
the
phrasetables
that
result
from
the
various
pruning
thresholds
described
for
the
WMT06
data
.
It
is
clear
that
this
is
extremely
aggressive
pruning
at
the
given
levels
.
Table
4
shows
the
corresponding
phrasetable
sizes
for
the
large
corpus
Chinese-English
data
.
The
pruning
is
not
as
aggressive
as
for
the
WMT06
data
but
still
quite
sizeable
.
Tables
5
and
6
show
the
main
results
for
the
WMT06
and
the
Chinese-English
large
corpus
experiments
.
To
make
these
results
more
graphic
,
Figure
1
shows
the
French→English
data
from
the
WMT06
results
in
the
form
of
three
graphs
.
Table 3: WMT06: Distinct phrase pairs by pruning threshold

Table 4: Chinese-English: Distinct phrase pairs by pruning threshold

Note that an artificial separation of 1 BLEU point has been introduced into these graphs to separate them.
Without
this
,
they
lie
on
top
of
each
other
and
hide
the
essential
point
.
In
compensation
,
the
scale
for
the
BLEU
co-ordinate
has
been
removed
.
These
results
are
summarized
in
the
following
subsections
.
In
Tables
5
and
6
,
the
largest
BLEU
score
for
each
set
of
runs
has
been
marked
in
bold
font
.
In
addition
,
to
highlight
that
there
are
many
near
ties
for
largest
BLEU
,
all
BLEU
scores
that
are
within
0.1
of
the
best
are
also
marked
in
bold
.
When
this
is
done
it
becomes
clear
that
pruning
at
a
level
of
20
for
the
WMT06
runs
would
not
reduce
BLEU
in
most
cases
and
in
many
cases
would
actually
increase
it
.
A
pruning
threshold
of
20
corresponds
to
discarding
roughly
90
%
of
the
phrasetable
.
For
the
Chinese-English
large
corpus
runs
,
a
level
of
16
seems
to
be
about
the
best
with
a
small
increase
in
BLEU
and
a
60-70% reduction
in
the
size
of
the
phrasetable
.
Another
view
of
this
can
be
taken
from
Tables
5
and
6
.
The
fraction
of
the
phrasetable
retained
is
a
more
or
less
simple
function
of
pruning
threshold
as
shown
in
Tables
3
and
4
.
By
including
the
percentages
in
Tables
5
and
6
,
we
can
see
that
BLEU goes up as the fraction retained approaches 20% to 30%.
This
seems
to
be
a
relatively
stable
observation
across
the
experiments
.
It
is
also
easily
explained
by
its
strong
relationship
to
pruning
threshold
.
Table
6
shows
that
this
is
not
just
a
small
corpus
phenomenon
.
There is a sizeable benefit in phrasetable reduction and a modest improvement in BLEU even in this case.
4.4
Is
this
just
the
same
as
phrasetable
smoothing
?
One
question
that
occurred
early
on
was
whether
this
improvement
in
BLEU
is
somehow
related
to
the
improvement
in
BLEU
that
occurs
with
phrasetable
smoothing
.
It
appears
that
the
answer
is
,
in
the
main
,
yes
,
although
there
is
definitely
something
else
going
on
.
It
is
true
that
the
benefit
in
terms
of
BLEU
is
lessened
for
better
types
of
phrasetable
smoothing
but
the
benefit
in
terms
of
the
reduction
in
bulk
holds
.
It
is
reassuring
to
see
that
no
harm
to
BLEU
is
done
by
removing
even
80
%
of
the
phrasetable
.
Another
question
that
came
up
is
the
role
of
phrase
pairs
that
occur
only
once
:
C
(
s
,
t
)
=
1
.
In particular, as discussed above,
the
most
significant
of
these
are
the
1-1-1
phrase
pairs
whose
components
also
only
occur
once
:
C
(
s
)
=
1
,
and
C
(
t
)
=
1
.
These
phrase
pairs
are
remarkably frequent in the phrasetables
and
are
pruned
in
all
of
the
experiments
except
when
pruning
threshold
is
equal
to
14
.
The
Chinese-English
large
corpus
experiments
give
us
a
good
opportunity
to
show
that
the significance level seems to matter more than whether C(s,t) = 1.
Note
that
we
could
have
kept
the
phrase
pairs
whose
marginal
counts
were
greater
than
one
but
most
of
these
are
of
lower
significance
and
likely
are
pruned
already
by
the
threshold
.
The
given
configuration
was
considered
the
most
likely
to
yield
a
benefit
and
its
poor
performance
led
to
the
whole
idea
being
put
aside
.
5
Conclusions
and
Continuing
Work
To
sum
up
,
the
main
conclusions
are
five
in
number
:
Phrasetables
produced
by
the
standard
Diag-And
method
(
Koehn
et
al.
,
2003
)
can
be
aggressively
pruned
using
significance
pruning
without
worsening
BLEU
.
If
phrasetable
smoothing
is
not
done
,
the
BLEU
score
will
improve
under
aggressive
significance
pruning
.
If
phrasetable
smoothing
is
done
,
the
improvement
is
small
or
negligible
but
there
is
still
no
loss
on
aggressive
pruning
.
The
preservation
of
BLEU
score
in
the
presence
of
large-scale
pruning
is
a
strong
effect
in
small
and
moderate
size
phrasetables
,
but
occurs
also
in
much
larger
phrasetables
.
In
larger
phrasetables
based
on
larger
corpora
,
the
percentage
of
the
table
that
can
be
discarded
appears
to
decrease
.
This
is
plausible
since
a
similar
effect
(
a
decrease
in
the
benefit
of
smoothing
)
has
been
noted
with
phrasetable
smoothing
(
Foster
et
al.
,
2006
)
.
Together
these
results
suggest
that
,
for
these
corpus
sizes
,
the
increase
in
the
number
of
strongly
supported
phrase
pairs
is
greater
than
the
increase
in
the
number
of
poorly
supported
pairs
,
which
agrees
with
intuition
.
Although
there
may
be
other
approaches
to
pruning
that
achieve
a
similar
effect
,
the
use
of
Fisher
's
exact
test
is
mathematically
and
conceptually
one
of
the
simplest
since
it
asks
a
question
separately
for
each
phrase
pair
:
"
Considering
this
phrase
pair
in
isolation
of
any
other
analysis
on
the
corpus
,
could
it
have
occurred
plausibly
by
purely
random
processes
inherent
in
the
corpus
construction
?
"
If
the
answer
is
"
Yes
"
,
then
it
is
hard
to
argue
that
the
phrase
pair
is
an
association
of
general
applicability
from
the
evidence
in
this
corpus
alone
.
Note
that
the
removal
of
1-count
phrase
pairs
is
subsumed
by
significance
pruning
with a threshold greater than α,
and
many
of
the
other
simple
approaches
(
from
an
implementation
point
of
view
)
are
more
difficult
to
justify
as
simply
as
the
above
significance
test
.
Nonetheless
,
there
remains
work
to
do
in
determining
if
computationally
simpler
approaches
do
as
well
.
Moore
's
work
suggests
that
log-likelihood-ratio
would
be
a cheaper and sufficiently accurate alternative
,
for
example
.
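For reference, the log-likelihood-ratio statistic for a two-by-two contingency table takes the standard form (a sketch in the usual notation, with observed cell counts $O_{ij}$ and expected counts $E_{ij}$ under independence):

$$G^2 = 2\sum_{i,j} O_{ij}\,\log\frac{O_{ij}}{E_{ij}}, \qquad E_{ij} = \frac{(\text{row total}_i)(\text{column total}_j)}{N},$$

which is asymptotically distributed as chi-squared with one degree of freedom, although, as noted earlier, the asymptotic assumptions can fail for small counts.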
We now return to the interaction of significance pruning with the selection, in our beam search, of the top 30 candidates based on forward conditional probabilities.
This
will
affect
our
results
but
most
likely
in
the
following
manner
:
For
very
small
thresholds
,
the
beam
will
become
much
wider
and
the
search
will
take
much
longer
.
In
order
to
allow
the
experiments
to
complete
in
a
reasonable
time
,
other
means
will
need
to
be
employed
to
reduce
the
choices
.
This
reduction
will
also
interact
with
the
significance
pruning
but
in
a
less
understandable
manner
.
For large thresholds, significance pruning will typically leave fewer than 30 choices, and so there will be no effect.
For
intermediate
thresholds
,
the
extra
pruning
might
reduce
BLEU
score
but
by
a
small
amount
because
most
of
the
best
choices
are
included
in
the
search
.
Thresholds that remove most of the phrasetable would no doubt qualify as large, so the question concerns the true shape of the curve for smaller thresholds, not the expected operating levels.
Nonetheless
,
this
is
a
subject
for
further
study
,
especially
as
we
consider
alternatives
to
our
"
filter
30
"
approach
for
managing
beam
width
.
There
are
a
number
of
important
ways
that
this
work
can
and
will
be
continued
.
The
code
base
for
taking
a
list
of
n
,
m-grams
and
computing
the
required
frequencies
for
significance
evaluation
can
be
applied
to
related
problems
.
For
example
,
skip-n-grams
(
n-grams
that
allow
for
gaps
of
fixed
or
variable
size
)
may be better studied using this approach, leading
to
insight
about
methods
that
weakly
approximate
patterns
.
The
original
goal
of
this
work
was
to
better
understand
the
character
of
phrasetables
,
and
it
remains
a
useful
diagnostic
technique
.
It
will
hopefully
lead
to
more
understanding
of
what
it
takes
to
make
a
good
phrasetable
especially
for
languages
that
require
morphological
analysis
or
segmentation
to
produce
good
tables
using
standard
methods
.
The
negative-log-p-value
promises
to
be
a
useful
feature
and
we
are
currently
evaluating
its
merits
.
6
Acknowledgement
This
material
is
based
upon
work
supported
by
the
Defense
Advanced
Research
Projects
Agency
(
DARPA
)
under
Contract
No.
HR0011-06-C-0023
.
Any
opinions
,
findings
and
conclusions
or
recommendations
expressed
in
this
material
are
those
of
the
authors
and
do
not
necessarily
reflect
the
views
of
the
Defense
Advanced
Research
Projects
Agency.

References

Alan Agresti. 1996.
An
Introduction
to
Categorical
Data
Analysis
.
Wiley
.
Peter
F.
Brown
,
Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer
.
1993
.
The
Mathematics
of
Statistical
Machine
Translation
:
Parameter
estimation
.
Computational
Linguistics
,
19
(
2
)
:
263-312
,
June
.
Philipp Koehn. 2003.
Europarl
:
A
Multilingual
Corpus
for
Evaluation
of
Machine
Translation
.
Unpublished
draft
.
See http://www.iccs.inf.ed.ac.uk/~pkoehn/publications/europarl.pdf
George
Foster
,
Roland
Kuhn
,
and
Howard
Johnson
.
2006
.
Phrasetable
Smoothing
for
Statistical
Machine
Translation
.
In
Proceedings
of
the
2006
Conference
on
Empirical
Methods
in
Natural
Language
Processing
,
Sydney
,
Australia
.
Reinhard
Kneser
and
Hermann
Ney
.
1995
.
Improved
backing-off
for
m-gram
language
modeling
.
In
Proceedings
of
the
International
Conference
on
Acoustics
,
Speech
,
and
Signal
Processing
(
ICASSP
)
1995
,
pages
181-184
,
Detroit
,
Michigan
.
IEEE
.
Philipp
Koehn
,
Franz
Josef
Och
,
and
Daniel
Marcu. 2003.
Statistical
phrase-based
translation
.
In
Eduard
Hovy
,
editor
,
Proceedings
of
the
Human
Language
Technology
Conference
of
the
North
American
Chapter
of the
Association
for
Computational
Linguistics
,
pages
127-133
,
Edmonton
,
Alberta
,
Canada
,
May
.
NAACL
.
Robert
C.
Moore. 2004.
On
Log-Likelihood-Ratios
and
the
Significance
of
Rare
Events
.
In
Proceedings
of
the
2004
Conference
on
Empirical
Methods
in
Natural
Language
Processing
,
Barcelona
,
Spain
.
Franz
Josef
Och
.
2003
.
Minimum
error
rate
training
for
statistical
machine
translation
.
In
Proceedings
of the 41st Annual Meeting of the
Association
for
Computational
Linguistics
(
ACL
)
,
Sapporo
,
July
.
Kishore
Papineni
,
Salim
Roukos
,
Todd
Ward
,
and
Wei-Jing
Zhu
.
2001
.
BLEU
:
A
method
for
automatic
evaluation
of
Machine
Translation
.
Technical
Report
RC22176
,
IBM
,
September
.
Andreas
Stolcke
.
2002
.
SRILM
-
an
extensible
language
modeling
toolkit
.
In
Proceedings
of
the
7th
International
Conference
on
Spoken
Language
Processing
(
ICSLP
)
2002
,
Denver
,
Colorado
,
September
.
Richard
Zens
and
Hermann
Ney
.
2004
.
Improvements
in
phrase-based
statistical
machine
translation
.
In
Proceedings
of the Human
Language
Technology
Conference
/
North
American
Chapter
of
the
ACL
,
Boston
,
May
.
Table 5: WMT06 Results: BLEU by type of smoothing (Good-Turing, Zens-Ney) and pruning threshold

Table 6: Chinese Results: BLEU by pruning threshold. Zens-Ney smoothing applied to all phrasetables.