This
paper
reports
on
the
benefits
of
large-scale
statistical
language
modeling
in
machine
translation
.
A
distributed
infrastructure
is
proposed
which
we
use
to
train
on
up
to
2
trillion
tokens
,
resulting
in
language
models
having
up
to
300
billion
n-grams
.
It
is
capable
of
providing
smoothed
probabilities
for
fast
,
single-pass
decoding
.
We
introduce
a
new
smoothing
method
,
dubbed
Stupid
Backoff
,
that
is
inexpensive
to
train
on
large
data
sets
and
approaches
the
quality
of
Kneser-Ney
Smoothing
as
the
amount
of
training
data
increases
.
1
Introduction
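Given a source-language sentence $f$, machine translation can be formulated as the search for the target-language sentence $\hat{e}$ that maximizes a weighted combination of feature functions:

$$\hat{e} = \arg\max_{e} \sum_{m=1}^{M} \lambda_m h_m(e, f) \quad (1)$$

where $\{h_m(e, f)\}$ is a set of $M$ feature functions and $\{\lambda_m\}$ a set of weights.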
One
or
more
feature
functions
may
be
of
the
form
h
(
e
,
f
)
=
h
(
e
)
,
in
which
case
it
is
referred
to
as
a
language
model
.
We
focus
on
n-gram
language
models
,
which
are
trained
on
unlabeled
monolingual
text
.
As
a
general
rule
,
more
data
tends
to
yield
better
language
models
.
Questions
that
arise
in
this
context
include
:
(
1
)
How
might
one
build
a
language
model
that
allows
scaling
to
very
large
amounts
of
training
data
?
(
2
)
How
much
does
translation
performance
improve
as
the
size
of the
language
model
increases
?
(
3
)
Is
there
a
point
of
diminishing
returns
in
performance
as
a
function
of
language
model
size
?
This
paper
proposes
one
possible
answer
to
the
first
question
,
explores
the
second
by
providing
learning
curves
in
the
context
of
a
particular
statistical
machine
translation
system
,
and
hints
that
the
third
may
yet
be
some
time
in
answering
.
In
particular
,
it
proposes
a
distributed
language
model
training
and
deployment
infrastructure
,
which
allows
direct
and
efficient
integration
into
the
hypothesis-search
algorithm
rather
than
a
follow-on
re-scoring
phase
.
While
it
is
generally
recognized
that
two-pass
decoding
can
be
very
effective
in
practice
,
single-pass
decoding
remains
conceptually
attractive
because
it
eliminates
a
source
of
potential
information
loss
.
2
N-gram
Language
Models
Traditionally
,
statistical
language
models
have
been
designed
to
assign
probabilities
to
strings
of
words
(
or
tokens
,
which
may
include
punctuation
,
etc.
)
.
Let $w_1^L = (w_1, \ldots, w_L)$ denote a string of $L$ tokens over a fixed vocabulary. An n-gram language model assigns a probability to $w_1^L$ according to

$$P(w_1^L) = \prod_{i=1}^{L} P(w_i \mid w_1^{i-1}) \approx \prod_{i=1}^{L} \hat{P}(w_i \mid w_{i-n+1}^{i-1}) \quad (2)$$

where the approximation reflects a Markov assumption that only the most recent $n-1$ tokens are relevant when predicting the next word.
Proceedings
of
the
2007
Joint
Conference
on
Empirical
Methods
in
Natural
Language
Processing
and
Computational
Natural
Language
Learning
,
pp.
858-867
,
Prague
,
June
2007
.
©
2007
Association
for
Computational
Linguistics
For any substring $w_i^j$ of $w_1^L$, let $f(w_i^j)$ denote the frequency of occurrence of that substring in another given, fixed, usually very long target-language string called the training data. The maximum-likelihood (ML) probability estimates for the n-grams are given by their relative frequencies

$$r(w_i \mid w_{i-n+1}^{i-1}) = \frac{f(w_{i-n+1}^{i})}{f(w_{i-n+1}^{i-1})} \quad (3)$$
While
intuitively
appealing
,
Eq
.
(
3
)
is
problematic
because
the
denominator
and
/
or
numerator
might
be
zero
,
leading
to
inaccurate
or
undefined
probability
estimates
.
This
is
termed
the
sparse
data
problem
.
For
this
reason
,
the
ML
estimate
must
be
modified
for
use
in
practice
;
see
(
Goodman
,
2001
)
for
a
discussion
of
n-gram
models
and
smoothing
.
In
principle
,
the
predictive
accuracy
of
the
language
model
can
be
improved
by
increasing
the
order
of
the
n-gram
.
However
,
doing
so
further
exacerbates
the
sparse
data
problem
.
The
present
work
addresses
the
challenges
of
processing
an
amount
of
training
data
sufficient
for
higher-order
n-gram
models
and
of
storing
and
managing
the
resulting
values
for
efficient
use
by
the
decoder
.
3
Related
Work
on
Distributed
Language
Models
The
topic
of
large
,
distributed
language
models
is
relatively
new
.
Recently
a
two-pass
approach
has
been
proposed
(
Zhang
et
al.
,
2006
)
,
wherein
a
lower-order
n-gram
is
used
in
a
hypothesis-generation
phase
,
then
later
the
K-best
of
these
hypotheses
are
re-scored
using
a
large-scale
distributed
language
model
.
The
resulting
translation
performance
was
shown
to
improve
appreciably
over
the
hypothesis
deemed
best
by
the
first-stage
system
.
The
amount
of
data
used
was
3
billion
words
.
More
recently
,
a
large-scale
distributed
language
model
has
been
proposed
in
the
contexts
of
speech
recognition
and
machine
translation
(
Emami
et
al.
,
2007
)
.
The
underlying
architecture
is
similar
to
(
Zhang
et
al.
,
2006
)
.
The
difference
is
that
they
integrate
the
distributed
language
model
into
their
machine
translation
decoder
.
However
,
they
do not
report
details
of
the
integration
or
the
efficiency
of
the
approach
.
The
largest
amount
of
data
used
in
the
experiments
is
4
billion
words
.
Both
approaches
differ
from
ours
in
that
they
store
corpora
in
suffix
arrays
,
one
sub-corpus
per
worker
,
and
serve
raw
counts
.
This
implies
that
all
workers
need
to
be
contacted
for
each
n-gram
request
.
In
our
approach
,
smoothed
probabilities
are
stored
and
served
,
resulting
in
exactly
one
worker
being
contacted
per
n-gram
for
simple
smoothing
techniques
,
and
in
exactly
two
workers
for
smoothing
techniques
that
require
context-dependent
backoff
.
Furthermore
,
suffix
arrays
require
on
the
order
of
8
bytes
per
token
.
Directly
storing
5-grams
is
more
efficient
(
see
Section
7.2
)
and
allows
applying
count
cutoffs
,
further
reducing
the
size
of
the
model
.
4
Stupid
Backoff
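State-of-the-art smoothing uses variations of context-dependent backoff with the following scheme:

$$P(w_i \mid w_{i-k+1}^{i-1}) = \begin{cases} \rho(w_{i-k+1}^{i}) & \text{if } (w_{i-k+1}^{i}) \text{ is found} \\ \lambda(w_{i-k+1}^{i-1})\, P(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise} \end{cases} \quad (4)$$

where $\rho(\cdot)$ are pre-computed and stored probabilities, and $\lambda(\cdot)$ are back-off weights.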
As
examples
,
Kneser-Ney
Smoothing
(
Kneser
and
Ney
,
1995
)
,
Katz
Backoff
(
Katz
,
1987
)
and
linear
interpolation
(
Jelinek
and
Mercer
,
1980
)
can
be
expressed
in
this
scheme
(
Chen
and
Goodman
,
1998
)
.
The
recursion
ends
at
either
unigrams
or
at
the
uniform
distribution
for
zero-grams
.
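We introduce a similar but simpler scheme, named Stupid Backoff [1], that does not generate normalized probabilities. The main difference is that we do not apply any discounting and instead directly use the relative frequencies ($S$ is used instead of $P$ to emphasize that these are not probabilities but scores):

$$S(w_i \mid w_{i-k+1}^{i-1}) = \begin{cases} \dfrac{f(w_{i-k+1}^{i})}{f(w_{i-k+1}^{i-1})} & \text{if } f(w_{i-k+1}^{i}) > 0 \\ \alpha\, S(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise} \end{cases} \quad (5)$$

Here $\alpha$ is the backoff factor; a single value, $\alpha = 0.4$, is used in the experiments. The recursion ends at unigrams:

$$S(w_i) = \frac{f(w_i)}{N}$$

with $N$ being the size of the training corpus.

[1] The name originated at a time when we thought that such a simple scheme cannot possibly be good. Our view of the scheme changed, but the name stuck.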
Stupid
Backoff
is
inexpensive
to
calculate
in
a
distributed
environment
while
approaching
the
quality
of
Kneser-Ney
smoothing
for
large
amounts
of
data
.
The
lack
of
normalization
in
Eq
.
(
5
)
does
not
affect
the
functioning
of
the
language
model
in
the
present
setting
,
as
Eq
.
(
1
)
depends
on
relative
rather
than
absolute
feature-function
values
.
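The scoring rule of Section 4 can be sketched in a few lines of Python. This is a minimal single-machine illustration, not the distributed implementation described in the paper; the helper names `count_ngrams` and `stupid_backoff` and the in-memory `Counter` are assumptions made for the example, while the backoff factor of 0.4 is the value reported for the experiments.

```python
from collections import Counter

ALPHA = 0.4  # backoff factor reported for the experiments


def count_ngrams(tokens, max_order):
    """Count all n-grams of order 1..max_order in a token list."""
    counts = Counter()
    for i in range(len(tokens)):
        for n in range(1, max_order + 1):
            if i + n <= len(tokens):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def stupid_backoff(word, context, counts, total_tokens, alpha=ALPHA):
    """Score S(word | context); context is a tuple of preceding words."""
    factor = 1.0
    while context:
        ngram = context + (word,)
        if counts[ngram] > 0:
            # Seen n-gram: plain relative frequency, no discounting.
            return factor * counts[ngram] / counts[context]
        # Unseen n-gram: drop the oldest context word and multiply by alpha.
        context = context[1:]
        factor *= alpha
    # The recursion ends at unigrams: f(w) / N.
    return factor * counts[(word,)] / total_tokens
```

Because the scores are not normalized, they should only be compared with each other (as in Eq. (1)), not interpreted as probabilities.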
5
Distributed
Training
We
use
the
MapReduce
programming
model
(
Dean
and
Ghemawat
,
2004
)
to
train
on
terabytes
of
data
and
to
generate
terabytes
of
language
models
.
In
this
programming
model
,
a
user-specified
map
function
processes
an
input
key
/
value
pair
to
generate
a
set
of
intermediate
key
/
value
pairs
,
and
a
reduce
function
aggregates
all
intermediate
values
associated
with
the
same
key
.
Typically
,
multiple
map
tasks
operate
independently
on
different
machines
and
on
different
parts
of
the
input
data
.
Similarly
,
multiple
reduce
tasks
operate
independently
on
a
fraction
of
the
intermediate
data
,
which
is
partitioned
according
to
the
intermediate
keys
to
ensure
that
the
same
reducer
sees
all
values
for
a
given
key
.
For
additional
details
,
such
as
communication
among
machines
,
data
structures
and
application
examples
,
the
reader
is
referred
to
(
Dean
and
Ghemawat
,
2004
)
.
Our
system
generates
language
models
in
three
main
steps
,
as
described
in
the
following
sections
.
5.1
Vocabulary
Generation
Vocabulary
generation
determines
a
mapping
of
terms
to
integer
IDs
,
so
n-grams
can
be
stored
using
IDs
.
This
allows
better
compression
than
the
original
terms
.
We
assign
IDs
according
to
term
frequency
,
with
frequent
terms
receiving
small
IDs
for
efficient
variable-length
encoding
.
All words that occur less often than a pre-determined threshold are mapped to a special ID marking the unknown word.

[2] The value of 0.4 for the backoff factor was chosen empirically based on good results in earlier experiments. Using multiple values depending on the n-gram order slightly improves results.
The
vocabulary
generation
map
function
reads
training
text
as
input
.
Keys
are
irrelevant
;
values
are
text
.
It
emits
intermediate
data
where
keys
are
terms
and
values
are
their
counts
in
the
current
section
of
the
text
.
A
sharding
function
determines
which
shard
(
chunk
of
data
in
the
MapReduce
framework
)
the
pair
is
sent
to
.
This
ensures
that
all
pairs
with
the
same
key
are
sent
to
the
same
shard
.
The
reduce
function
receives
all
pairs
that
share
the
same
key
and
sums
up
the
counts
.
Simplified
,
the
map
,
sharding
and
reduce
functions
do
the
following
:
    // End of the Map function: emit each term with its count
    // in the current section of the text.
    Emit(iter.first, iter.second);

  int ShardForKey(string key, int nshards) {
    return Hash(key) % nshards;
  }
Note
that
the
Reduce
function
emits
only
the
aggregated
value
.
The
output
key
is
the
same
as
the
intermediate
key
and
automatically
written
by
MapReduce
.
The
computation
of
counts
in
the
map
function
is
a
minor
optimization
over
the
alternative
of
simply
emitting
a
count
of
one
for
each
tokenized
word
in
the
array
.
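The three roles described above (map, sharding, reduce) can be imitated on a single machine. The following Python sketch is an illustration under stated assumptions: the function names are hypothetical, `zlib.crc32` stands in for the unspecified `Hash` function, and the shuffle is simulated with in-memory dictionaries rather than a real MapReduce framework.

```python
from collections import Counter, defaultdict
import zlib


def map_vocab(document):
    """Map: tokenize one document and emit (term, local count) pairs."""
    return list(Counter(document.split()).items())


def shard_for_key(key, nshards):
    """Sharding: a deterministic hash sends equal keys to the same shard."""
    return zlib.crc32(key.encode("utf-8")) % nshards


def reduce_counts(values):
    """Reduce: sum all partial counts that share one key."""
    return sum(values)


def run_vocab_job(documents, nshards):
    # Shuffle phase: route each intermediate pair to its shard.
    shards = [defaultdict(list) for _ in range(nshards)]
    for doc in documents:
        for term, count in map_vocab(doc):
            shards[shard_for_key(term, nshards)][term].append(count)
    # Reduce phase: each shard aggregates its own keys independently.
    return [{term: reduce_counts(vals) for term, vals in shard.items()}
            for shard in shards]
```

Counting inside `map_vocab` corresponds to the minor optimization mentioned in the text: fewer intermediate pairs are emitted than with one pair per token.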
Figure
1
shows
an
example
for
3
input
documents
and
2
reduce
shards
.
Which
reducer
a
particular
term
is
sent
to
is
determined
by
a
hash
function
,
indicated
by
text
color
.
The exact partitioning of the keys is irrelevant; what matters is that all pairs with the same key are sent to the same reducer.
[Figure 1: Distributed vocabulary generation. Counting in the map phase is a minor optimization; see text.]

5.2 Generation of n-Grams

The process of n-gram generation is similar to vocabulary generation. The main differences are that now words are converted to IDs, and we emit n-grams up to some maximum order instead of single words. A simplified map function does the following:
Again, one may optimize the Map function by first aggregating counts over some section of the data and then emitting the aggregated counts instead of emitting "1" each time an n-gram is encountered.
The
reduce
function
is
the
same
as
for
vocabulary
generation
.
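A possible shape of the n-gram map step, written as a hedged Python sketch: the names `to_ids`, `map_ngrams` and `map_ngrams_aggregated` and the use of ID 0 for the unknown word are assumptions for illustration, not the paper's code. The first function emits a count of one per n-gram occurrence; the second shows the local aggregation mentioned in the text.

```python
from collections import Counter


def to_ids(tokens, vocab, unk_id=0):
    """Convert words to integer IDs; unknown words map to unk_id (assumed 0)."""
    return [vocab.get(token, unk_id) for token in tokens]


def map_ngrams(token_ids, max_order):
    """Emit (n-gram, 1) for every n-gram of order 1..max_order."""
    pairs = []
    for i in range(len(token_ids)):
        for n in range(1, max_order + 1):
            start = i - n + 1
            if start < 0:
                break  # not enough history for this order
            pairs.append((tuple(token_ids[start:i + 1]), 1))
    return pairs


def map_ngrams_aggregated(token_ids, max_order):
    """Optimized variant: aggregate counts locally before emitting."""
    local = Counter()
    for ngram, one in map_ngrams(token_ids, max_order):
        local[ngram] += one
    return list(local.items())
```

The reduce step can then sum the values per key exactly as in vocabulary generation.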
The subsequent step of language model generation will calculate relative frequencies $r(w_i \mid w_{i-n+1}^{i-1})$ (see Eq. 3).
In
order
to
make
that
step
efficient
we
use
a
sharding
function
that
places
the
values
needed
for
the
numerator
and
denominator
into
the
same
shard
.
Computing
a
hash
function
on
just
the
first
words
of
n-grams
achieves
this
goal
.
The required n-grams $w_{i-n+1}^{i}$ and $w_{i-n+1}^{i-1}$ always share the same first word $w_{i-n+1}$, except for unigrams. For unigrams, we need to communicate the total count $N$ to all shards.
Unfortunately
,
sharding
based
on
the
first
word
only
may
make
the
shards
very
imbalanced
.
Some
terms
can
be
found
at
the
beginning
of
a
huge
number
of
n-grams
,
e.g.
stopwords
,
some
punctuation
marks
,
or
the
beginning-of-sentence
marker
.
As
an
example
,
the
shard
receiving
n-grams
starting
with
the
beginning-of-sentence
marker
tends
to
be
several
times
the
average
size
.
Making
the
shards
evenly
sized
is
desirable
because
the
total
runtime
of
the
process
is
determined
by
the
largest
shard
.
The
shards
are
made
more
balanced
by
hashing
based
on
the
first
two
words
:
  int ShardForKey(string key, int nshards) {
    string prefix = FirstTwoWords(key);
    return Hash(prefix) % nshards;
  }
This
requires
redundantly
storing
unigram
counts
in
all
shards
in
order
to
be
able
to
calculate
relative
frequencies
within
shards
.
That
is
a
relatively
small
amount
of
information
(
a
few
million
entries
,
compared
to
up
to
hundreds
of
billions
of
n-grams
)
.
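The effect of hashing on the first two words can be checked with a small Python sketch. The function names and the use of `zlib.crc32` as the hash are assumptions for illustration. The sketch shows why unigram counts must be duplicated: for a bigram the denominator of Eq. (3) is a unigram, which has no two-word prefix of its own.

```python
import zlib


def shard_for_ngram(ngram, nshards):
    """Hash on the first two words (or the only word) of an n-gram."""
    prefix = " ".join(ngram[:2])
    return zlib.crc32(prefix.encode("utf-8")) % nshards


def shards_storing(ngram, nshards):
    """Unigrams are duplicated to every shard; longer n-grams go to one."""
    if len(ngram) == 1:
        return set(range(nshards))
    return {shard_for_ngram(ngram, nshards)}


def same_shard(ngram, nshards):
    """True if numerator and denominator of the relative frequency meet."""
    context = ngram[:-1]
    return shards_storing(ngram, nshards) <= shards_storing(context, nshards)
```

For orders of three and above, the n-gram and its context share their first two words, so both land in the same shard without any duplication.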
5.3
Language
Model
Generation
The
input
to
the
language
model
generation
step
is
the
output
of
the
n-gram
generation
step
:
n-grams
and
their
counts
.
All
information
necessary
to
calculate
relative
frequencies
is
available
within
individual
shards
because
of
the
sharding
function
.
That
is
everything
we
need
to
generate
models
with
Stupid
Backoff
.
More
complex
smoothing
methods
require
additional
steps
(
see
below
)
.
Backoff
operations
are
needed
when
the
full
n-gram
is
not
found
.
If $r(w_i \mid w_{i-n+1}^{i-1})$ is not found, then we will successively look for $r(w_i \mid w_{i-n+2}^{i-1})$, $r(w_i \mid w_{i-n+3}^{i-1})$, etc.
The
language
model
generation
step
shards
n-grams
on
their
last
two
words
(
with
unigrams
duplicated
)
,
so
all
backoff
operations
can
be
done
within
the
same
shard
(
note
that
the
required
n-grams
all
share
the
same
last
word
wi
)
.
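The claim that all backoff lookups stay within one shard can be illustrated as follows. This is a sketch with hypothetical helper names, again using `zlib.crc32` as a stand-in hash; unigrams are excluded from the check because they are assumed to be duplicated in every shard, as the text states.

```python
import zlib


def backoff_chain(ngram):
    """The n-gram followed by its successively shorter suffixes."""
    return [ngram[k:] for k in range(len(ngram))]


def shard_for_lookup(ngram, nshards):
    """Hash on the last two words of an n-gram."""
    suffix = " ".join(ngram[-2:])
    return zlib.crc32(suffix.encode("utf-8")) % nshards


def shards_needed(ngram, nshards):
    """Shards touched by a full backoff chain, ignoring duplicated unigrams."""
    return {shard_for_lookup(g, nshards)
            for g in backoff_chain(ngram) if len(g) > 1}
```

Every member of the chain with at least two words ends in the same two words, so `shards_needed` always contains a single shard.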
5.4
Other
Smoothing
Methods
State-of-the-art
techniques
like
Kneser-Ney
Smoothing
or
Katz
Backoff
require
additional
,
more
expensive
steps
.
At
runtime
,
the
client
needs
to
additionally
request
up
to
4
backoff
factors
for
each
5-gram
requested
from
the
servers
,
thereby
multiplying
network
traffic
.
We
are
not
aware
of
a
method
that
always
stores
the
history
backoff
factors
on
the
same
shard
as
the
longer
n-gram
without
duplicating
a
large
fraction
of
the
entries
.
This
means
one
needs
to
contact
two
shards
per
n-gram
instead
of
just
one
for
Stupid
Backoff
.
Training
requires
additional
iterations
over
the
data
.
[Table 1: Extra steps needed for training Interpolated Kneser-Ney Smoothing. Columns: context counting; unsmoothed probs and interpol. weights; interpolated probabilities. Rows: input key, input value, intermediate key, sharding (with unigrams duplicated in the last step), intermediate value, output value.]
Kneser-Ney
Smoothing
counts
lower-order
n-grams
differently
.
Instead of the frequency of the $(n-1)$-gram, it uses the number of unique single word contexts the $(n-1)$-gram appears in. We use $f_{KN}(\cdot)$ to jointly denote original frequencies for the highest order and context counts for lower orders.
After
the
n-gram
counting
step
,
we
process
the
n-grams
again
to
produce
these
quantities
.
This
can
be
done
similarly
to
the
n-gram
counting
using
a
MapReduce
(
Step
0
in
Table
1
)
.
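The context-counting step can be sketched in Python as below. This is a simplified in-memory illustration rather than the MapReduce of Step 0: the function name `kn_counts` is hypothetical, and the sketch assumes that a "single word context" means the word immediately preceding the n-gram, as in the standard formulation of Kneser-Ney smoothing.

```python
from collections import defaultdict


def kn_counts(counts, max_order):
    """Compute f_KN from raw counts.

    counts maps n-gram tuples (orders 1..max_order) to raw frequencies.
    Highest-order n-grams keep their frequency; lower orders receive the
    number of distinct words observed directly before them.
    """
    left_contexts = defaultdict(set)
    for ngram in counts:
        if len(ngram) > 1:
            # ngram[0] is one preceding-word context of ngram[1:].
            left_contexts[ngram[1:]].add(ngram[0])
    f_kn = {}
    for ngram, freq in counts.items():
        if len(ngram) == max_order:
            f_kn[ngram] = freq
        else:
            f_kn[ngram] = len(left_contexts[ngram])
    return f_kn
```

Handling of n-grams at sentence boundaries differs between implementations and is not modeled here.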
The
most
commonly
used
variant
of
Kneser-Ney
smoothing
is
interpolated
Kneser-Ney
smoothing
,
defined
recursively
as
(
Chen
and
Goodman
,
1998
)
:
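$$P_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max(f_{KN}(w_{i-n+1}^{i}) - D,\, 0)}{f_{KN}(w_{i-n+1}^{i-1})} + \lambda(w_{i-n+1}^{i-1})\, P_{KN}(w_i \mid w_{i-n+2}^{i-1})$$

where $D$ is a discount constant and $\{\lambda(w_{i-n+1}^{i-1})\}$ are interpolation weights that ensure probabilities sum to one.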
Two
additional
major
MapReduces
are
required
to
compute
these
values
efficiently
.
Table
1
describes
their
input
,
intermediate
and
output
keys
and
values
.
Note
that
output
keys
are
always
the
same
as
intermediate
keys
.
The
map
function
of
MapReduce
1
emits
n-gram
histories
as
intermediate
keys
,
so
the
reduce
function
gets
all
n-grams
with
the
same
history
at
the
same
time
,
generating
unsmoothed
probabilities
and
interpolation
weights
.
MapReduce
2
computes
the
interpolation
.
Its
map
function
emits
reversed
n-grams
as
intermediate
keys
(hence we use $w_i^{i-n+1}$ in the table).
All
unigrams
are
duplicated
in
every
reduce
shard
.
Because
the
reducer
function
receives
intermediate
keys
in
sorted
order
it
can
compute
smoothed
probabilities
for
all
n-gram
orders
with
simple
book-keeping
.
Katz
Backoff
requires
similar
additional
steps
.
The
largest
models
reported
here
with
Kneser-Ney
Smoothing
were
trained
on
31
billion
tokens
.
For
Stupid
Backoff
,
we
were
able
to
use
more
than 60 times that amount.
6
Distributed
Application
Our
goal
is
to
use
distributed
language
models
integrated
into
the
first
pass
of
a
decoder
.
This
may
yield
better
results
than
n-best
list
or
lattice
rescoring
(
Ney
and
Ortmanns
,
1999
)
.
Doing
that
for
language
models
that
reside
in
the
same
machine
as
the
decoder
is
straight-forward
.
The
decoder
accesses
n-grams
whenever
necessary
.
This
is
inefficient
in
a
distributed
system
because
network
latency
causes
a
constant
overhead
on
the
order
of
milliseconds
.
Onboard
memory
is
around
10,000
times
faster
.
We
therefore
implemented
a
new
decoder
architecture
.
The
decoder
first
queues
some
number
of
requests
,
e.g.
1,000
or
10,000
n-grams
,
and
then
sends
them
together
to
the
servers
,
thereby
exploiting
the
fact
that
network
requests
with
large
numbers
of
n-grams
take
roughly
the
same
time
to
complete
as
requests
with
single
n-grams
.
The
n-best
search
of
our
machine
translation
decoder
proceeds
as
follows
.
It
maintains
a
graph
of
the
search
space
up
to
some
point
.
It
then
extends
each
hypothesis
by
advancing
one
word
position
in
the
source
language
,
resulting
in
a
candidate
extension
of
the
hypothesis
of
zero
,
one
,
or
more
additional
target-language
words
(
accounting
for
the
fact
that
variable-length
source-language
fragments
can
correspond
to
variable-length
target-language
fragments
)
.
In a traditional setting with a local language model, the decoder immediately obtains the necessary probabilities and then (together with scores from other features) decides which hypotheses to keep in the search graph.

[Figure 2: Illustration of decoder graph and batch-querying of the language model.]
When
using
a
distributed
language
model
,
the
decoder
first
tentatively
extends
all
current
hypotheses
,
taking
note
of
which
n-grams
are
required
to
score
them
.
These
are
queued
up
for
transmission
as
a
batch
request
.
When
the
scores
are
returned
,
the
decoder
re-visits
all
of
these
tentative
hypotheses
,
assigns
scores
,
and
re-prunes
the
search
graph
.
It
is
then
ready
for
the
next
round
of
extensions
,
again
involving
queuing
the
n-grams
,
waiting
for
the
servers
,
and
pruning
.
The
process
is
illustrated
in
Figure
2
assuming
a
trigram
model
and
a
decoder
policy
of
pruning
to
the
four
most
promising
hypotheses
.
The
four
active
hypotheses
(
indicated
by
black
disks
)
at
time
t
are
:
There
is
,
There
may
,
There
are
,
and
There
were
.
The
decoder
extends
these
to
form
eight
new
nodes
at
time
t
+
1
.
Note
that
one
of
the
arcs
is
labeled
$\epsilon$
,
indicating
that
no
target-language
word
was
generated
when
the
source-language
word
was
consumed
.
The
n-grams
necessary
to
score
these
eight
hypotheses
are
There
is
lots
,
There
is
many
,
There
may
be
,
There
are
lots
,
are
lots
of
,
etc.
These
are
queued
up
and
their
language-model
scores
requested
in
a
batch
manner
.
After
scoring
,
the
decoder
prunes
this
set
as
indicated
by
the
four
black
disks
at
time
t
+
1
,
then
extends
these
to
form
five
new
nodes
(
one
is
shared
)
at
time
t
+
2
.
The
n-grams
necessary
to
score
these
hypotheses
are
lots
of people
,
lots
of
reasons
,
There
are
onlookers
,
etc.
Again
,
these
are
sent
to
the
server
together
,
and
again
after
scoring
the
graph
is
pruned
to
four
active
(
most
promising
)
hypotheses
.
The
alternating
processes
of
queuing
,
waiting
and
scoring
/
pruning
are
done
once
per
word
position
in
a
source
sentence
.
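The queue, wait, score and prune cycle can be sketched as follows. This Python example is a simplified illustration, not the authors' decoder: the class and function names are hypothetical, every extension is assumed to add exactly one target word, scores are assumed to be additive, and the language model server is replaced by a plain callable.

```python
class BatchingLmClient:
    """Collects n-gram requests and resolves them with one lookup call."""

    def __init__(self, lookup_batch):
        # lookup_batch: callable taking a list of n-grams, returning scores.
        self._lookup_batch = lookup_batch
        self._pending = []
        self.round_trips = 0

    def enqueue(self, ngram):
        self._pending.append(ngram)

    def flush(self):
        """Send all queued n-grams in one request; return {ngram: score}."""
        unique = list(dict.fromkeys(self._pending))  # de-duplicate, keep order
        self._pending = []
        if not unique:
            return {}
        self.round_trips += 1
        return dict(zip(unique, self._lookup_batch(unique)))


def extend_and_prune(hypotheses, extensions, client, order, beam_size):
    """One decoding round: tentatively extend, batch-score, then prune.

    hypotheses is a list of (words tuple, score) pairs.
    """
    tentative = []
    for words, score in hypotheses:
        for word in extensions:
            new_words = words + (word,)
            ngram = new_words[-order:]
            client.enqueue(ngram)  # note which n-gram is needed
            tentative.append((new_words, score, ngram))
    scores = client.flush()  # a single round trip for the whole batch
    scored = [(w, s + scores[g]) for w, s, g in tentative]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:beam_size]
```

The design point is that the number of round trips grows with the number of decoding rounds, not with the number of n-grams requested.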
The
average
sentence
length
in
our
test
data
is
22
words
(
see
section
7.1
)
,
thus
we
have
23
rounds [3]
per
sentence
on
average
.
The
number
of
n-grams
requested
per
sentence
depends
on
the
decoder
settings
for
beam
size
,
re-ordering
window
,
etc.
As
an
example
for
larger
runs
reported
in
the
experiments
section
,
we
typically
request
around
150,000
n-grams
per
sentence
.
The
average
network
latency
per
batch
is
35
milliseconds
,
yielding
a
total
latency
of
0.8
seconds
caused
by
the
distributed
language
model
for
an
average
sentence
of
22
words
.
If
a
slight
reduction
in
translation
quality
is
allowed
,
then
the
average
network
latency
per
batch
can
be
brought
down
to
7
milliseconds
by
reducing
the
number
of
n-grams
requested
per
sentence
to
around
10,000
.
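The latency figures above follow from simple arithmetic, reproduced here as a small Python check. The function name is hypothetical; the numbers (22 words, one extra round for the sentence end marker, 35 ms or 7 ms per batch) are taken from the text.

```python
def lm_latency_seconds(sentence_length, latency_ms_per_batch):
    """Total network latency caused by the distributed LM for one sentence."""
    rounds = sentence_length + 1  # one extra round for the sentence end marker
    return rounds * latency_ms_per_batch / 1000.0


full_quality = lm_latency_seconds(22, 35)  # 23 rounds at 35 ms each
reduced = lm_latency_seconds(22, 7)        # 23 rounds at 7 ms each
```

With 35 ms per batch this gives about 0.8 seconds per average sentence, matching the reported value; with 7 ms it gives about 0.16 seconds.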
As
a
result
,
our
system
can
efficiently
use
the
large
distributed
language
model
at
decoding
time
.
There
is
no
need
for
a
second
pass
nor
for
n-best
list
rescoring
.
We
focused
on
machine
translation
when
describing
the
queued
language
model
access
.
However
,
it
is
general
enough
that
it
may
also
be
applicable
to
speech
decoders
and
optical
character
recognition
systems
.
7
Experiments
We
trained
5-gram
language
models
on
amounts
of
text
varying
from
13
million
to
2
trillion
tokens
.
The
data
is
divided
into
four
sets
;
language
models
are
trained
for
each
set
separately [4]
.
For
each
training
data
size
,
we
report
the
size
of
the
resulting
language
model
,
the
fraction
of
5-grams
from
the
test
data
that
is
present
in
the
language
model
,
and
the
BLEU
score
(
Papineni
et
al.
,
2002
)
obtained
by
the
machine
translation
system
.
For
smaller
training
sizes
,
we
have
also
computed
test-set
perplexity
using
Kneser-Ney
Smoothing
,
and
report
it
for
comparison
.
7.1 Data Sets

We compiled four language model training data sets, listed in order of increasing size:

target: The English side of Arabic-English parallel data provided by LDC [5] (237 million tokens).

ldcnews: This is a concatenation of several English news data sets provided by LDC [6] (5 billion tokens).

webnews: Data collected over several years, up to December 2005, from web pages containing predominantly English news articles (31 billion tokens).

web: General web data, which was collected in January 2006 (2 trillion tokens).

[Figure 3: Number of n-grams (sum of unigrams to 5-grams) for varying amounts of training data.]

[3] One additional round for the sentence end marker.

[4] Experience has shown that using multiple, separately trained language models as feature functions in Eq. (1) yields better results than using a single model trained on all data.
For
testing
we
use
the
"
NIST
"
part
of
the
2006
Arabic-English
NIST
MT
evaluation
set
,
which
is
not
included
in
the
training
data
listed
above [7]
.
It
consists
of
1797
sentences
of
newswire
,
broadcast
news
and
newsgroup
texts
with
4
reference
translations
each
.
The
test
set
is
used
to
calculate
translation
BLEU
scores
.
The
English
side
of
the
set
is
also
used
to
calculate
perplexities
and
n-gram
coverage
.
[5] http://www.nist.gov/speech/tests/mt/doc/LDCLicense-mt06.pdf contains a list of parallel resources provided by LDC.

[6] The bigger sets included are LDC2005T12 (Gigaword, 2.5B tokens), LDC93T3A (Tipster, 500M tokens) and LDC2002T31 (AQUAINT, 400M tokens), plus many smaller sets.

[7] The test data was generated after 1-Feb-2006; all training data was generated before that date.

7.2 Size of the Language Models

We measure the size of language models in total number of n-grams, summed over all orders from 1 to 5. There is no frequency cutoff on the n-grams. There is, however, a frequency cutoff on the vocabulary. The minimum frequency for a term to be included in the vocabulary is 2 for the target, ldcnews and webnews data sets, and 200 for the web data set. All terms below the threshold are mapped to a special term UNK, representing the unknown word.

[Table 2: Sizes and approximate training times for 3 language models with Stupid Backoff (SB) and Kneser-Ney Smoothing (KN). Surviving row labels include vocab size and # machines.]
Figure
3
shows
the
number
of
n-grams
for
language
models
trained
on
13
million
to
2
trillion
tokens
.
Both
axes
are
on
a
logarithmic
scale
.
The
right
scale
shows
the
approximate
size
of the
served
language
models
in
gigabytes
.
The
numbers
above
the
lines
indicate
the
relative
increase
in
language
model
size
:
x1.8
/
x2
means
that
the
number
of
n-grams
grows
by
a
factor
of
1.8
each
time
we
double
the
amount
of
training
data
.
The
values
are
similar
across
all
data
sets
and
data
sizes
,
ranging
from
1.6
to
1.8
.
The
plots
are
very
close
to
straight
lines
in
the
log
/
log
space
;
linear
least-squares
regression
finds
$r^2 > 0.99$
for
all
four
data
sets
.
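A constant growth factor per doubling is equivalent to a power law, which is why the curves are straight lines in log/log space. The short Python sketch below makes the relation explicit; the function names and the example base values are hypothetical, and only the per-doubling factors (1.6 to 1.8) come from the text.

```python
import math


def loglog_slope(factor_per_doubling):
    """Slope of the line in log/log space for a given per-doubling factor."""
    return math.log2(factor_per_doubling)


def projected_ngrams(base_ngrams, base_tokens, tokens, factor_per_doubling):
    """Extrapolate the n-gram count assuming a constant per-doubling factor."""
    doublings = math.log2(tokens / base_tokens)
    return base_ngrams * factor_per_doubling ** doublings
```

A factor of 1.8 per doubling corresponds to a slope of roughly 0.85, meaning the number of n-grams grows somewhat more slowly than the amount of training data.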
The
web
data
set
has
the
smallest
relative
increase
.
This
can
be
at
least
partially
explained
by
the
higher
vocabulary
cutoff
.
The
largest
language
model
generated
contains
approximately
300
billion
n-grams
.
Table
2
shows
sizes
and
approximate
training
times
when
training
on
the
full
target
,
webnews
,
and
web
data
sets
.
The
processes
run
on
standard
current
hardware
with
the
Linux
operating
system
.
Generating
models
with
Kneser-Ney
Smoothing
takes
6-7
times
longer
than
generating
models
with
Stupid
Backoff
.
We
deemed
generation
of
Kneser-Ney
models
on
the
web
data
as
too
expensive
and
therefore
excluded
it
from
our
experiments
.
The
estimated
runtime
for
that
is
approximately
one
week
on
1500
machines
.
[Figure 4: Perplexities with Kneser-Ney Smoothing (KN PP) and fraction of covered 5-grams (C5).]

[Figure 5: BLEU scores for varying amounts of data using Kneser-Ney (KN) and Stupid Backoff (SB). Curves: target KN, +ldcnews KN, +webnews KN, target SB, +ldcnews SB, +webnews SB, +web SB. Horizontal axis: LM training data size in million tokens.]
7.3
Perplexity
and
n-Gram
Coverage
A
standard
measure
for
language
model
quality
is
perplexity
.
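It is measured on test data $T = w_1^{|T|}$:

$$PP(T) = \exp\left(-\frac{1}{|T|} \sum_{i=1}^{|T|} \log p(w_i \mid w_{i-n+1}^{i-1})\right)$$

This is the inverse of the (geometric) average conditional probability of a next word; lower perplexities are better.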
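As a small worked example, perplexity can be computed from the conditional probabilities a model assigns to each token of the test data. The Python function below is a generic sketch with a hypothetical name; it assumes the input probabilities are normalized and strictly positive, which is why it does not apply to Stupid Backoff scores.

```python
import math


def perplexity(cond_probs):
    """Perplexity from p(w_i | history) for every token of the test data."""
    avg_log = sum(math.log(p) for p in cond_probs) / len(cond_probs)
    # Inverse of the geometric mean of the conditional probabilities.
    return math.exp(-avg_log)
```

If every token received probability 0.25, the perplexity would be 4: the model is as uncertain as a uniform choice among four words.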
Figure
4
shows
perplexities
for
models
with
Kneser-Ney
smoothing
.
Values
range
from
280.96
for
13
million
to
222.98
for
237
million
tokens
target
data
and
drop
nearly
linearly
with
data
size
(
$r^2 = 0.998$
)
.
Perplexities
for
ldcnews
range
from
351.97
to
210.93
and
are
also
close
to
linear
(
$r^2 = 0.987$
)
,
while
those
for
webnews
data
range
from
221.85
to
164.15
and
flatten
out
near
the
end
.
Perplexities
are
generally
high
and
may
be
explained
by
the
mixture
of
genres
in
the
test
data
(
newswire
,
broadcast
news
,
newsgroups
)
while
our
training
data
is
predominantly
written
news
articles
.
Other
held-out
sets
consisting
predominantly
of
newswire
texts
receive
lower
perplexities
by
the
same
language
models
,
e.g.
,
using
the
full
ldcnews
model
we
find
perplexities
of
143.91
for
the
NIST
MT
2005
evaluation
set
,
and
149.95
for
the
NIST
MT
2004
set
.
Note
that
the
perplexities
of
the
different
language
models
are
not
directly
comparable
because
they
use
different
vocabularies
.
We
used
a
fixed
frequency
cutoff
,
which
leads
to
larger
vocabularies
as
the
training
data
grows
.
Perplexities
tend
to
be
higher
with
larger
vocabularies
.
Perplexities
cannot
be
calculated
for
language
models
with
Stupid
Backoff
because
their
scores
are
not
normalized
probabilities
.
In
order
to
nevertheless
get
an
indication
of
potential
quality
improvements
with
increased
training
sizes
we
looked
at
the
5-gram
coverage
instead
.
This
is
the
fraction
of
5-grams
in
the
test
data
set
that
can
be
found
in
the
language
model
training
data
.
A
higher
coverage
will
result
in
a
better
language
model
if
(
as
we
hypothesize
)
estimates
for
seen
events
tend
to
be
better
than
estimates
for
unseen
events
.
This
fraction
grows
from
0.06
for
13
million
tokens
to
0.56
for
2
trillion
tokens
,
meaning
56
%
of
all
5-grams
in
the
test
data
are
known
to
the
language
model
.
Increase
in
coverage
depends
on
the
training
data
set
.
Within
each
set
,
we
observe
an
almost
constant
growth
(
correlation
$r^2 > 0.989$
for
all
sets
)
with
each
doubling
of
the
training
data
as
indicated
by
numbers
next
to
the
lines
.
The
fastest
growth
occurs
for
webnews
data
(
+0.038
for
each
doubling
)
,
the
slowest
growth
for
target
data
(
+0.022
/
x2
)
.
7.4
Machine
Translation
Results
We use a state-of-the-art machine translation system for translating from Arabic to English that achieved a competitive BLEU score of 0.4535 on the Arabic-English translation evaluation [8]. Beam size and re-ordering window were reduced in order to facilitate a large number of experiments. Additionally, our NIST evaluation system used a mixture of 5, 6, and 7-gram models with optimized stupid backoff factors for each order, while the learning curve presented here uses a fixed order of 5 and a single fixed backoff factor. Together, these modifications reduce the BLEU score by 1.49 BLEU points (BP) [9] at the largest training size.

[8] See http://www.nist.gov/speech/tests/mt/mt06eval_official_results.html for more results.
We
then
varied
the
amount
of
language
model
training
data
from
13
million
to
2
trillion
tokens
.
All
other
parts
of
the
system
are
kept
the
same
.
Results
are
shown
in
Figure
5
.
The
first
part
of
the
curve
uses
target
data
for
training
the
language
model
.
With Kneser-Ney smoothing (KN), the BLEU score improves from 0.3559 for 13 million tokens to 0.3832 for 237 million tokens. At such data sizes, Stupid Backoff (SB) with a constant backoff parameter $\alpha = 0.4$ is around 1 BP worse than KN. On average, one gains 0.62 BP for each doubling of the training data with KN, and 0.66 BP per doubling with SB. Differences of more than 0.51 BP are statistically significant at the 0.05 level using bootstrap resampling (Noreen, 1989; Koehn, 2004).

We then add a second language model using ldcnews data. The first point for ldcnews shows a large improvement of around 1.4 BP over the last point for target for both KN and SB, which is approximately twice the improvement expected from doubling the amount of data.
This
seems
to
be
caused
by
adding
a
new
domain
and
combining
two
models
.
After
that
,
we
find
an
improvement
of
0.56-0.70
BP
for
each
doubling
of
the
ldcnews
data
.
The
gap
between
Kneser-Ney
Smoothing
and
Stupid
Backoff
narrows
,
starting
with
a
difference
of
0.85
BP
and
ending
with
a
not
significant
difference
of
0.24
BP
.
Adding a third language model based on webnews data does not show a jump at the start of the curve. We see, however, steady increases of 0.39-0.51 BP per doubling. The gap between Kneser-Ney and Stupid Backoff is gone; all results with Stupid Backoff are actually better than Kneser-Ney, but the differences are not significant.
We
then
add
a
fourth
language
model
based
on
web
data
and
Stupid
Backoff
.
Generating
Kneser-Ney
models
for
these
data
sizes
is
extremely
expensive
and
is
therefore
omitted
.
The fourth model shows a small but steady increase of 0.15 BP per doubling, surpassing the best Kneser-Ney model (trained on less data) by 0.82 BP at the largest size.

[9] 1 BP = 0.01 BLEU. We show system scores as BLEU, differences as BP.
Goodman
(
2001
)
observed
that
Kneser-Ney
Smoothing
dominates
other
schemes
over
a
broad
range
of
conditions
.
Our
experiments
confirm
this
advantage
at
smaller
language
model
sizes
,
but
show
the
advantage
disappears
at
larger
data
sizes
.
The
amount
of
benefit
from
doubling
the
training
size
is
partly
determined
by
the
domains
of
the
data
sets [10]
.
The
improvements
are
almost
linear
on
the
log
scale
within
the
sets
.
Linear
least-squares
regression
shows
correlations
$r^2 > 0.96$
for
all
sets
and
both
smoothing
methods
,
thus
we
expect
to
see
similar
improvements
when
further
increasing
the
sizes
.
8
Conclusion
A
distributed
infrastructure
has
been
described
to
train
and
apply
large-scale
language
models
to
machine
translation
.
Experimental
results
were
presented
showing
the
effect
of
increasing
the
amount
of
training
data
to
up
to
2
trillion
tokens
,
resulting
in
a
5-gram
language
model
size
of up
to
300
billion
n-grams
.
This
represents
a
gain
of
about
two
orders
of
magnitude
in
the
amount
of
training
data
that
can
be
handled
over
that
reported
previously
in
the
literature
(
or
three-to-four
orders
of
magnitude
,
if
one
considers
only
single-pass
decoding
)
.
The
infrastructure
is
capable
of
scaling
to
larger
amounts
of
training
data
and
higher
n-gram
orders
.
The
technique
is
made
efficient
by
judicious
batching
of
score
requests
by
the
decoder
in
a
server-client
architecture
.
A
new
,
simple
smoothing
technique
well-suited
to
distributed
computation
was
proposed
,
and
shown
to
perform
as
well
as
more
sophisticated
methods
as
the
size
of
the
language
model
increases
.
Significantly
,
we
found
that
translation
quality
as
indicated
by
BLEU
score
continues
to
improve
with
increasing
language
model
size
,
at
even
the
largest
sizes
considered
.
This
finding
underscores
the
value
of
being
able
to
train
and
apply
very
large
language
models
,
and
suggests
that
further
performance
gains
may
be
had
by
pursuing
this
direction
further
.
[10] There
is
also
an
effect
of
the
order
in
which
we
add
the
models
.
As
an
example
,
web
data
yields
+0.43
BP
/
x2
when
added
as
the
second
model
.
A
discussion
of
this
effect
is
omitted
due
to
space
limitations
.
