We address the problem of smoothing translation probabilities in a bilingual N-gram-based statistical machine translation system. It is proposed to project the bilingual tuples onto a continuous space and to estimate the translation probabilities in this representation. A neural network is used to perform the projection and the probability estimation. Smoothing probabilities is most important for tasks with a limited amount of training material. We consider here the Btec task of the 2006 Iwslt evaluation. Improvements in all official automatic measures are reported when translating from Italian to English. Using a continuous space model for the translation model and the target language model, an improvement of 1.5 BLEU on the test data is observed.
1 Introduction
The goal of statistical machine translation (SMT) is to produce a target sentence e from a source sentence f. Among all possible target language sentences, the one with the highest probability is chosen:

e* = argmax_e Pr(e|f) = argmax_e Pr(f|e) Pr(e)

where Pr(f|e) is the translation model and Pr(e) is the target language model. This approach is usually referred to as the noisy source-channel approach in statistical machine translation (Brown et al., 1993).
During the last few years, the use of context in SMT systems has provided great improvements in translation. SMT has evolved from the original word-based approach to phrase-based translation systems (Och et al., 1999; Koehn et al., 2003). A phrase is defined as a group of source words f that should be translated together into a group of target words e. The translation model in phrase-based systems includes the phrase translation probabilities in both directions, i.e. P(e|f) and P(f|e).
The use of a maximum entropy approach simplifies the introduction of several additional models explaining the translation process:

e* = argmax_e sum_i lambda_i h_i(e, f)

The feature functions h_i are the system models and the lambda_i weights are typically optimized to maximize a scoring function on a development set (Och and Ney, 2002).
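As a concrete illustration of this log-linear combination, the following sketch scores a hypothesis by summing weighted feature functions. The feature names and weight values are purely illustrative assumptions, not the configuration of the system described here.

    def loglinear_score(hypothesis, feature_functions, weights):
        """Score a hypothesis as sum_i lambda_i * h_i(e, f).

        feature_functions: dict name -> callable returning a (log-)score.
        weights: dict name -> lambda_i, tuned on a development set.
        """
        return sum(weights[name] * h(hypothesis)
                   for name, h in feature_functions.items())

    # Hypothetical feature functions (log-probabilities and a word bonus).
    features = {
        "tm": lambda hyp: hyp["log_p_tm"],                     # translation model
        "lm": lambda hyp: hyp["log_p_lm"],                     # target language model
        "word_bonus": lambda hyp: len(hyp["target"].split()),  # counteracts short outputs
    }
    weights = {"tm": 1.0, "lm": 0.8, "word_bonus": 0.2}        # illustrative values only

    hyp = {"target": "how long does the flight last",
           "log_p_tm": -12.3, "log_p_lm": -20.1}
    print(loglinear_score(hyp, features, weights))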
The phrase translation probabilities P(e|f) and P(f|e) are usually obtained using relative frequency estimates. Statistical learning theory, however, tells us that relative frequency estimates have several drawbacks, in particular high variance and low bias. Phrase tables may contain several millions of entries, most of which appear only once or twice, which means that we are confronted with a data sparseness problem. Surprisingly, there seems to be little work addressing the issue of smoothing of the phrase table probabilities.
On the other hand, smoothing of relative frequency estimates was extensively investigated in the area of language modeling. A systematic comparison can be found, for instance, in (Chen and Goodman, 1999).
Language models and phrase tables have in common that the probabilities of rare events may be overestimated. However, in language modeling, probability mass must be redistributed in order to account for the unseen n-grams. Generalization to unseen events is less important in phrase-based SMT systems since the system searches only for the best segmentation and the best matching phrase pair among the existing ones. We are only aware of one work that performs a systematic comparison of smoothing techniques in phrase-based machine translation systems (Foster et al., 2006).
Two types of phrase-table smoothing were compared: black-box and glass-box methods. Black-box methods do not look inside phrases but instead treat them as atomic objects. In this way, all the methods developed for language modeling can be used. Glass-box methods decompose the phrase probability P(e|f) into a set of word-level lexical distributions. For instance, it was suggested to use IBM-1 probabilities (Och et al., 2004), or other lexical translation probabilities (Koehn et al., 2003; Zens and Ney, 2004). Some form of glass-box smoothing is now used in all state-of-the-art statistical machine translation systems.
Another approach related to phrase table smoothing is the so-called N-gram translation model (Marino et al., 2006). In this model, bilingual tuples are used instead of phrase pairs, and n-gram probabilities are considered rather than relative frequencies. Therefore, smoothing is obtained using the standard techniques developed for language modeling. In addition, a context dependence of the phrases is introduced. On the other hand, some restrictions on the segmentation of the source sentence must be imposed. N-gram-based translation models were extensively compared to phrase-based systems on several tasks and typically achieve comparable performance.
In this paper we propose to investigate improved smoothing techniques in the framework of the N-gram translation model. Despite the undeniable success of n-gram back-off models, these techniques have several drawbacks from a theoretical point of view: the words are represented in a discrete space, the vocabulary. This prevents "true interpolation" of the probabilities of unseen n-grams, since a change in this word space can result in an arbitrary change of the n-gram probability. An alternative approach is based on a continuous representation of the words (Bengio et al., 2003).
The basic idea is to convert the word indices to a continuous representation and to use a probability estimator operating in this space. Since the resulting distributions are smooth functions of the word representation, better generalization to unknown n-grams can be expected. Probability estimation and interpolation in a continuous space is mathematically well understood, and numerous powerful algorithms are available that can perform meaningful interpolations even when only a limited amount of training material is available. This approach was successfully applied to language modeling in large vocabulary continuous speech recognition (Schwenk, 2007) and to language modeling in phrase-based SMT systems (Schwenk et al., 2006).
In this paper, we investigate whether this approach is useful to smooth the probabilities involved in the bilingual tuple translation model. Reliable estimation of unseen n-grams is very important in this translation model. Most of the trigram tuples encountered in the development or test data were never seen in the training data. N-gram hit rates are reported in the results section of this paper.
We report experimental results for the Btec corpus as used in the 2006 evaluations of the International Workshop on Spoken Language Translation, Iwslt (Paul, 2006). This task provides a very limited amount of resources in comparison to other tasks like the translation of journal texts (Nist evaluations) or of parliament speeches (Tc-Star evaluations). Among the language pairs tested in this year's evaluation, Italian to English gave the best BLEU results. The better the baseline translation quality, the more challenging it is to improve upon it without adding more data.
We show that a new smoothing technique for the translation model achieves a significant improvement in the BLEU score for a state-of-the-art statistical translation system. This paper is organized as follows. In the next section we first describe the baseline statistical machine translation systems. Section 3 presents the architecture and training algorithms of the continuous space translation model, and section 4 summarizes the experimental evaluation. The paper concludes with a discussion of future research directions.
2 N-gram-based Translation Model
The N-gram-based translation model has been derived from the finite-state perspective; more specifically, from the work of Casacuberta (2001). However, unlike that work, in which the translation model is implemented as a finite-state transducer, the N-gram-based system implements a bilingual N-gram model.
It actually constitutes a language model of bilingual units, referred to as tuples, which approximates the joint probability between source and target languages by using N-grams, as described by the following equation:

p(e, f) ~ prod_k p( (e, f)_k | (e, f)_{k-n+1}, ..., (e, f)_{k-1} )

where e refers to the target, f to the source, and (e, f)_k to the kth tuple of a given bilingual sentence pair.
Bilingual units (tuples) are extracted from any word-to-word alignment according to the following constraints:
• a monotonic segmentation of each bilingual sentence pair is produced,
• no word inside the tuple is aligned to words outside the tuple, and
• no smaller tuples can be extracted without violating the previous constraints.
As a consequence of these constraints, only one segmentation is possible for a given sentence pair.
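To make the extraction procedure concrete, the following sketch derives tuples from one aligned sentence pair by growing minimal, alignment-consistent segments from left to right. It is a simplified illustration of the constraints above, under assumed function and variable names; in particular, it does not reproduce the paper's handling of NULL-aligned words.

    def extract_tuples(src, trg, links):
        """Greedy monotonic tuple extraction from a word alignment (sketch).

        src, trg: lists of source / target words.
        links: set of (source_index, target_index) alignment points.
        Grows the current target span until no alignment link crosses its
        boundary, then emits the minimal tuple.  NULL-aligned words and
        trailing unaligned source words are not handled as in the paper.
        """
        tuples = []
        s_done, t_start = 0, 0
        for t_end in range(1, len(trg) + 1):
            # source positions aligned to the current target span
            s_in_span = {s for (s, t) in links if t_start <= t < t_end}
            s_max = max(s_in_span) if s_in_span else s_done - 1
            # the span is closed if every source word it would consume
            # aligns only inside the span
            closed = all(t_start <= t < t_end
                         for (s, t) in links if s_done <= s <= s_max)
            if closed:
                tuples.append((src[s_done:s_max + 1], trg[t_start:t_end]))
                s_done, t_start = s_max + 1, t_end
        return tuples

    # Illustrative alignment for "how long does the flight last" / "cuanto dura el vuelo"
    links = {(0, 0), (1, 0), (5, 1), (3, 2), (4, 3)}
    print(extract_tuples("how long does the flight last".split(),
                         "cuanto dura el vuelo".split(), links))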
Two important issues regarding this translation model must be considered. First, it often occurs that a large number of single-word translation probabilities are left out of the model. This happens for all words that are always embedded in tuples containing two or more words; in that case, no translation probability exists for an independent occurrence of these embedded words. To overcome this problem, the tuple trigram model is enhanced by incorporating 1-gram translation probabilities for all the embedded words detected during the tuple extraction step. These 1-gram translation probabilities are computed from the intersection of both the source-to-target and the target-to-source alignments.
The second issue has to do with the fact that some words linked to NULL end up producing tuples with NULL source sides. Since no NULL is actually expected to occur in translation inputs, this type of tuple is not allowed. Any target word that is linked to NULL is attached either to the word that precedes it or to the word that follows it. To determine which, an approach based on the IBM-1 probabilities was used, as described in (Marino et al., 2006).
2.1 Additional features
The following feature functions were used in the N-gram-based translation system:
• A target language model. In the baseline system, this feature consists of a 4-gram back-off model of words, which is trained from the target side of the bilingual corpus.
• A source-to-target lexicon model and a target-to-source lexicon model. These features, which are based on the lexical parameters of IBM Model 1, provide a complementary probability for each tuple in the translation table.
• A word bonus function. This feature introduces a bonus based on the number of target words contained in the partial translation hypothesis. It is used to compensate for the system's preference for short output sentences.
All these models are combined in the decoder. Additionally, the decoder allows for a non-monotonic search with the following distortion model.
• A word distance-based distortion model, where d_k is the distance between the first word of the kth tuple (unit) and the last word + 1 of the (k-1)th tuple.
Figure 1: Comparing regular and unfolded tuples for the sentence pair "how long does the flight last" / "cuanto dura el vuelo".
Distances are measured in words, referring to the source side of the units.
To reduce the computational cost we place limits on the search using two parameters: the distortion limit (the maximum distance, measured in words, that a tuple is allowed to be reordered, m) and the reordering limit (the maximum number of reordering jumps in a sentence, j).
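The following sketch shows one way the distance d_k and the two search limits could be applied to a candidate tuple order; the exact penalty form is not spelled out in the text, so the function name, the absolute-value distance, and the pruning rule are assumptions for illustration only.

    def distortion_check(tuple_src_spans, m=5, j=3):
        """Check a hypothesized tuple order against the search limits (sketch).

        tuple_src_spans: list of (first_idx, last_idx) source-word spans, in the
        order the tuples are hypothesized by the decoder.
        m: distortion limit (max words a tuple may be moved).
        j: reordering limit (max reordering jumps per sentence).
        Returns (total_distance, allowed).
        """
        total, jumps, prev_end = 0, 0, -1
        for first, last in tuple_src_spans:
            d = abs(first - (prev_end + 1))   # d_k: gap to the word after the previous tuple
            if d > 0:
                jumps += 1
                if d > m or jumps > j:
                    return total, False        # hypothesis pruned by the search limits
            total += d
            prev_end = last
        return total, True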
Tuples need to be extracted by an unfolding technique (Marino et al., 2006). This means that the tuples are broken into smaller tuples, and these are sequenced in the order of the target words. In order not to lose the information on the correct order, the decoder performs a non-monotonic search. Figure 1 shows an example of tuple unfolding compared to the monotonic extraction. The unfolding technique produces a different bilingual n-gram language model with reordered source words.
In order to combine the models in the decoder suitably, an optimization tool based on the Simplex algorithm is used to compute log-linear weights for each model.
3 Continuous Space N-gram Models
The architecture of the neural network n-gram model is shown in Figure 2. A standard fully-connected multi-layer perceptron is used.
The inputs to the neural network are the indices of the n-1 previous units (words or tuples) in the vocabulary, h_j = w_{j-n+1}, ..., w_{j-2}, w_{j-1}, and the outputs are the posterior probabilities of all units of the vocabulary:

P(w_j = i | h_j)   for all i in [1, N],

where N is the size of the vocabulary.

Figure 2: Architecture of the continuous space LM. h_j denotes the context w_{j-n+1}, ..., w_{j-1}. P is the size of one projection, and H and N are the sizes of the hidden and output layers, respectively. When short-lists are used, the size of the output layer is much smaller than the size of the vocabulary.
The input uses the so-called 1-of-n coding, i.e., the ith unit of the vocabulary is coded by setting the ith element of the vector to 1 and all the other elements to 0. The ith row of the N x P dimensional projection matrix corresponds to the continuous representation of the ith unit.
Let us denote by c_l these projections, d_j the hidden layer activities, o_i the outputs, p_i their soft-max normalization, and m_{jl}, b_j, v_{ij} and k_i the hidden and output layer weights and the corresponding biases. Using these notations, the neural network performs the following operations:

d_j = tanh( sum_l m_{jl} c_l + b_j )
o_i = sum_j v_{ij} d_j + k_i
p_i = exp(o_i) / sum_{r=1}^{N} exp(o_r)

The value of the output neuron p_i corresponds directly to the probability P(w_j = i | h_j).
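A minimal sketch of this forward pass, assuming a tanh hidden layer and NumPy; the dimensions and random parameters are placeholders, not those of the trained model.

    import numpy as np

    def csm_forward(context_ids, R, M, b, V, k):
        """Forward pass of the continuous space n-gram model (sketch).

        context_ids: indices of the n-1 previous units (the context h_j).
        R: N x P projection matrix (row i = continuous representation of unit i).
        M, b: hidden layer weights (H x (n-1)*P) and biases (H).
        V, k: output layer weights (N_out x H) and biases (N_out).
        Returns the vector p of posterior probabilities P(w_j = i | h_j).
        """
        c = np.concatenate([R[i] for i in context_ids])   # projection layer (1-of-n lookup)
        d = np.tanh(M @ c + b)                             # hidden layer activities
        o = V @ d + k                                      # output layer
        p = np.exp(o - o.max())                            # soft-max (numerically stable)
        return p / p.sum()

    # Toy dimensions, for illustration only.
    N, P, H, n = 50, 8, 16, 3
    rng = np.random.default_rng(0)
    R = rng.normal(size=(N, P))
    M, b = rng.normal(size=(H, (n - 1) * P)), np.zeros(H)
    V, k = rng.normal(size=(N, H)), np.zeros(N)
    print(csm_forward([4, 7], R, M, b, V, k).sum())   # probabilities sum to 1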
Training is performed with the standard back-propagation algorithm minimizing the following error function:

E = - sum_{i=1}^{N} t_i log p_i + beta ( sum_{jl} m_{jl}^2 + sum_{ij} v_{ij}^2 )

where t_i denotes the desired output, i.e., the probability should be 1.0 for the next unit in the training sentence and 0.0 for all the other ones.
The first part of this equation is the cross-entropy between the output and the target probability distributions, and the second part is a regularization term that aims to prevent the neural network from over-fitting the training data (weight decay). The parameter beta has to be determined experimentally.
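As a hedged illustration, the per-example loss and its gradient at the output layer could be computed as follows; this is a sketch of cross-entropy with weight decay, not the training code used for the experiments.

    import numpy as np

    def loss_and_output_grad(p, target_id, M, V, beta=5e-5):
        """Cross-entropy plus weight decay for one training n-gram (sketch).

        p: soft-max outputs of the network.
        target_id: index of the next unit observed in the training sentence.
        M, V: hidden and output weight matrices (regularized).
        beta: weight decay coefficient (Section 4 uses beta = 0.00005).
        """
        t = np.zeros_like(p)
        t[target_id] = 1.0                              # desired output: 1.0 for the next unit
        ce = -np.log(p[target_id])                      # cross-entropy term
        wd = beta * (np.sum(M ** 2) + np.sum(V ** 2))   # weight decay term
        grad_o = p - t                                  # gradient of the cross-entropy w.r.t. the outputs o
        return ce + wd, grad_o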
Training is done using a re-sampling algorithm as described in (Schwenk, 2007). It can be shown that the outputs of a neural network trained in this manner converge to the posterior probabilities. Therefore, the neural network directly minimizes the perplexity on the training data. Note also that the gradient is back-propagated through the projection layer, which means that the neural network learns the projection of the units onto the continuous space that is best for the probability estimation task.
In general, the complexity of calculating one probability with this basic version of the neural network n-gram model is dominated by the dimension of the output layer, since the size of the vocabulary (10k to 64k) is usually much larger than the dimension of the hidden layer (200 to 500). Therefore, in previous applications of the continuous space n-gram model, the output was limited to the s most frequent units, s ranging between 2k and 12k (Schwenk, 2007). This is called a short-list.
Table 1: Available data in the supplied resources of the 2006 Iwslt evaluation.
4 Experimental Evaluation
In this work we report results on the Basic Travel Expression Corpus (Btec) as used in the 2006 evaluations of the International Workshop on Spoken Language Translation (Iwslt). This corpus consists of typical sentences from phrase books for tourists in several languages (Takezawa et al., 2002). We report results on the supplied development corpus of 489 sentences and the official test set of the Iwslt'06 evaluation. The main measure is the BLEU score, using seven reference translations. The scoring is case insensitive and punctuation is ignored. Details on the available data are summarized in Table 1.
We concentrated first on the translation from Italian to English. All participants in the Iwslt evaluation achieved much better performance for this language pair than for the other considered translation directions. This makes it more difficult to achieve additional improvements. A non-monotonic search was performed following the local reordering approach described in Section 2, setting m = 5 and j = 3. We also used histogram pruning in the decoder, i.e. the maximum number of hypotheses in a stack was limited to 50.
4.1 Language-dependent preprocessing
Italian contracted prepositions have been separated into preposition + article, such as 'alla' -> 'a la', 'degli' -> 'di gli' or 'dallo' -> 'da lo', among others.
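A minimal sketch of this preprocessing step, assuming a small hand-written mapping of contracted prepositions; only the entries 'alla', 'degli' and 'dallo' come from the text above, the others are added purely for illustration and the list is in any case not the table used for the experiments.

    # Partial, illustrative mapping of Italian contracted prepositions.
    CONTRACTIONS = {
        "alla": "a la", "degli": "di gli", "dallo": "da lo",   # examples from the text
        "allo": "a lo", "dello": "di lo", "nella": "in la",    # assumed additional entries
    }

    def split_contractions(sentence):
        """Replace contracted prepositions by preposition + article, token-wise."""
        return " ".join(CONTRACTIONS.get(tok, tok) for tok in sentence.split())

    print(split_contractions("vado alla stazione"))   # -> "vado a la stazione"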
The training and development data for the bilingual back-off and neural network translation models were created as follows. Given the alignment of the training parallel corpus, we perform a unique segmentation of each parallel sentence following the criterion of unfolded segmentation described in Section 2. The resulting tuple sequences are used as training text for building the language model.
As an example, given the alignment and the unfolded extraction of Figure 1, we obtain the following training sentence: how_long#cuanto does#NULL last#dura the#el flight#vuelo
The reference bilingual trigram back-off translation model was trained on these bilingual tuples using the SRI LM toolkit (Stolcke, 2002). Different smoothing techniques were tried, and the best results were obtained using Good-Turing discounting. The neural network approach was trained on exactly the same data. A context of two tuples was used (trigram model). The training corpus contains about 21,500 different bilingual tuples.
We decided to limit the output of the neural network to the 8k most frequent tuples (short-list). This covers about 90% of the requested tuple n-grams in the training data.
Similar to previous applications, the neural network is not used alone; interpolation is performed to combine several n-gram models. First of all, the neural network and the reference back-off model are interpolated together; this always improved performance since both seem to be complementary. Second, four neural networks with different sizes of the continuous representation were trained and interpolated together. This usually achieves better generalization behavior than training one larger neural network. The interpolation coefficients were calculated by optimizing perplexity on the development data, using an EM procedure. The obtained values are 0.33 for the back-off translation model and about 0.16 for each of the neural network models. This interpolation is used in all our experiments. For the sake of simplicity we will still call this the continuous space translation model.
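For illustration, the interpolation coefficients can be estimated with the standard EM update for mixture weights, which maximizes the likelihood (i.e. minimizes the perplexity) of the development data; this is a generic sketch, not the exact procedure used in the experiments.

    def em_interpolation_weights(model_probs, n_iter=50):
        """Estimate linear interpolation weights by EM (sketch).

        model_probs: list of per-model probability lists, aligned on the same
        development n-grams, i.e. model_probs[m][t] = P_m(w_t | h_t).
        Returns one coefficient per model (summing to 1).
        """
        n_models = len(model_probs)
        n_tokens = len(model_probs[0])
        w = [1.0 / n_models] * n_models            # start from uniform weights
        for _ in range(n_iter):
            counts = [0.0] * n_models
            for t in range(n_tokens):
                mix = sum(w[m] * model_probs[m][t] for m in range(n_models))
                for m in range(n_models):          # E-step: posterior responsibility of model m
                    counts[m] += w[m] * model_probs[m][t] / mix
            w = [c / n_tokens for c in counts]     # M-step: re-normalized responsibilities
        return w

    # Toy example with two models evaluated on three development tokens.
    print(em_interpolation_weights([[0.2, 0.1, 0.4], [0.05, 0.3, 0.2]]))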
Each network was trained independently using early stopping on the development data. Convergence was achieved after about 10 iterations through the training data (less than 20 minutes of processing on a standard Linux machine).
The other parameters are as follows:
• Context of two tuples (trigram)
• The dimension of the hidden layer was set to
• The weight decay coefficient was set to beta = 0.00005.
N-gram models are usually evaluated using perplexity on some development data. In our case, i.e. using bilingual tuples as basic units ("words"), it is less obvious whether perplexity is a useful measure. Nevertheless, we provide these numbers for completeness. The perplexity on the development data of the trigram back-off translation model is 227.0. This could be reduced to 170.4 using the neural network.
It is also very informative to analyze the n-gram hit rates of the back-off model on the development data: 10% of the probability requests are actually a true trigram, 40% a bigram, and about 49% are finally estimated using unigram probabilities. This means that only a limited amount of phrase context is used in the standard N-gram-based translation model. This makes it an ideal candidate for the continuous space model, since probabilities are interpolated for all possible contexts and never backed off to shorter contexts.
The incorporation of the neural translation model is done using n-best lists. Each hypothesis is composed of a sequence of bilingual tuples and the corresponding scores of all the feature functions. Figure 3 shows an example of such an n-best list.
The neural trigram translation model is used to replace the scores of the trigram back-off translation model. This is followed by a re-optimization of the coefficients of all feature functions, i.e. maximization of the BLEU score on the development data using the numerical optimization tool CONDOR (Berghen and Bersini, 2005). An alternative would be to add a feature function and to combine both translation models under the log-linear model framework, using maximum BLEU training.
Another open question is whether it might be better to already use the continuous space translation model during decoding. The continuous space model has a much higher complexity than a back-off n-gram. However, this can be heavily optimized when rescoring n-best lists, i.e. by grouping together all calls in the whole n-best list with the same context, resulting in only one forward pass through the neural network per distinct context.
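A sketch of this optimization: collect every distinct (context, predicted unit) request over the whole n-best list, run the network once per distinct context, and cache the probabilities. The function and variable names are illustrative, and hypotheses are assumed to be padded sequences of tuple ids.

    import math
    from collections import defaultdict

    def rescore_nbest(nbest, net_forward, n=3):
        """Rescore n-best hypotheses with a neural n-gram model, grouping
        probability requests that share the same context (sketch).

        nbest: list of hypotheses, each a (padded) list of tuple ids.
        net_forward: callable mapping a context tuple to a probability vector.
        """
        # 1) Collect all distinct (context, target) requests over the whole list.
        requests = defaultdict(set)
        for hyp in nbest:
            for t in range(n - 1, len(hyp)):
                requests[tuple(hyp[t - n + 1:t])].add(hyp[t])
        # 2) One forward pass per distinct context; cache the needed probabilities.
        cache = {}
        for ctx, targets in requests.items():
            p = net_forward(ctx)                  # posterior distribution P(. | ctx)
            for w in targets:
                cache[(ctx, w)] = p[w]
        # 3) Sum the cached log-probabilities for every hypothesis.
        return [sum(math.log(cache[(tuple(hyp[t - n + 1:t]), hyp[t])])
                    for t in range(n - 1, len(hyp)))
                for hyp in nbest]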
This is more difficult to perform when the continuous space translation model is used during decoding. Therefore, this was not investigated in this work.
Figure 3: Example of sentences in the n-best list of bilingual tuples. The special character '#' is used to separate the source and target words. Several words in one tuple are grouped together using the character '_'.
In all our experiments 1000-best lists were used. In order to evaluate the quality of these n-best lists, an oracle trigram back-off translation model was built on the development data. Rescoring the n-best lists with this translation model resulted in an increase of the BLEU score of about 10 points (see Table 2).
While there is a decrease of about 6% in the position-dependent word error rate (mWER), a smaller change in the position-independent word error rate (mPER) was observed. This suggests that most of the alternative translation hypotheses result in word reorderings rather than in many alternative word choices. This is one of the major drawbacks of phrase- and N-gram-based translation systems: only translations observed in the training data can be used. There is no generalization to new phrase pairs.
Table 2: Comparison of different N-gram translation models on the development data (BLEU, mWER, mPER).
When the 1000-best lists are rescored with the neural network translation model, the BLEU score increases by 1.5 points (42.34 to 43.87). Similar improvements were observed in the word error rates (see Table 2). For comparison, a 4-gram back-off translation model was also built, but no change in the BLEU score was observed. This suggests that careful smoothing is more important than increasing the context when estimating the translation probabilities in an N-gram-based statistical machine translation system.
In previous work, we have investigated the use of the neural network approach for modeling the target language for the Iwslt task (Schwenk et al., 2006). We also applied this technique to the improved N-gram-based translation system. In our implementation, the neural network target 4-gram language model gives an improvement of 1.3 BLEU points on the development data (42.34 to 43.66), compared to 1.5 points for the neural translation model (see Table 3).
Table 3: Combination of a neural translation model (TM) and a neural language model (LM). BLEU scores on the development data.
The neural translation and target language models were also applied to the test data, using of course the same feature function coefficients as for the development data. The results are given in Table 4 for all the official measures of the Iwslt evaluation. The new smoothing method of the translation probabilities achieves an improvement in all measures. It also gives an additional gain (again in all measures) when used together with a neural target language model. Surprisingly, the neural TM and neural LM improvements almost add up: when both techniques are used together, the BLEU score increases by 1.5 points (36.97 to 38.50). Remember that the reference N-gram-based translation system already uses a local reordering approach.
Table 4: Test set scores for the combination of a neural translation model (TM) and a neural language model (LM) (back-off TM+LM, neural TM, neural LM, neural TM+LM).
5 Discussion
Phrase-based approaches are the de-facto standard in statistical machine translation. The phrases are extracted automatically from the word alignments of parallel texts, and the different possible translations of a phrase are weighted using relative frequency. This can be problematic when the data is sparse. However, there seems to be little work on possible improvements of the relative frequency estimates by some smoothing techniques. It is today common practice to use additional feature functions like IBM-1 scores to obtain some kind of smoothing (Och et al., 2004; Koehn et al., 2003; Zens and Ney, 2004), but better estimation of the phrase probabilities is usually not addressed.
An alternative way to represent phrases is to define bilingual tuples. Smoothing, as well as context dependency, is obtained by using an n-gram model on these tuples. In this work, we have extended this approach by using a new smoothing technique that operates on a continuous representation of the tuples. Our method is distinguished by two characteristics: better estimation of the numerous unseen n-grams, and a discriminative estimation of the tuple probabilities.
Results are provided on the Btec task of the 2006 Iwslt evaluation for the translation direction Italian to English. This task provides a very limited amount of resources in comparison to other tasks. Therefore, new techniques must be deployed to take the best advantage of the limited resources. We have chosen the Italian-to-English task because it is challenging to improve a system that already achieves good translation quality (over 40 BLEU points).
Using the continuous space model for the translation and target language models, an improvement of 2.5 BLEU on the development data and 1.5 BLEU on the test data was observed. Despite these encouraging results, we believe that additional research on improved estimation of probabilities in N-gram- or phrase-based statistical machine translation systems is needed. In particular, the problem of generalization to new translations seems to us to be a promising direction.
This could be addressed by the so-called factored phrase-based model as implemented in the Moses decoder (Koehn et al., 2007). In this approach, words are decomposed into several factors. These factors are translated and a target phrase is generated. This model could be complemented by a factored continuous tuple N-gram. Factored word language models were already successfully used in speech recognition (Bilmes and Kirchhoff, 2003; Alexandrescu and Kirchhoff, 2006), and an extension to machine translation seems to be promising.
The described smoothing method was explicitly developed to tackle the data sparseness problem in tasks like the Btec corpus. It is well known from language modeling that careful smoothing is less important when large amounts of data are available. We plan to investigate whether this also holds for smoothing of the probabilities in phrase- or tuple-based statistical machine translation systems.
6 Acknowledgments
This work has been partially funded by the European Union under the integrated project Tc-Star (IST-2002-FP6-506738), by the French Government under the project Instar (ANR JCJC06_143038), and by the Spanish government under an FPU grant and the project Avivavoz (TEC2006-13964-C03).
