We present an extension of phrase-based statistical machine translation models that enables the straightforward integration of additional annotation at the word level, be it linguistic markup or automatically generated word classes. In a number of experiments we show that factored translation models lead to better translation performance, both in terms of automatic scores and in terms of grammatical coherence.
1 Introduction
The current state-of-the-art approach to statistical machine translation, so-called phrase-based models, is limited to the mapping of small text chunks without any explicit use of linguistic information, be it morphological, syntactic, or semantic. Such additional information has been demonstrated to be valuable when integrated into pre-processing or post-processing steps. However, a tighter integration of linguistic information into the translation model is desirable for two reasons:
• Translation models that operate on more general representations, such as lemmas instead of surface forms of words, can draw on richer statistics and overcome the data sparseness problems caused by limited training data.
• Many aspects of translation can be best explained on a morphological, syntactic, or semantic level. Having such information available to the translation model allows the direct modeling of these aspects. For instance: reordering at the sentence level is mostly driven by general syntactic principles, local agreement constraints show up in morphology, etc.

Figure 1: Factored representations of input and output words incorporate additional annotation into the statistical translation model.
Therefore, we extended the phrase-based approach to statistical translation to tightly integrate additional information. The new approach allows additional annotation at the word level. A word in our framework is not only a token, but a vector of factors that represent different levels of annotation (see Figure 1).
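To make this representation concrete, the following minimal Python sketch (our own illustration, not part of the original formulation; the factor names and values are examples, not a fixed inventory) treats a word as a vector of factors:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FactoredWord:
    """A word as a vector of factors rather than a single token."""
    surface: str
    lemma: str
    pos: str
    morphology: str

# Illustrative German input word from the running example.
haeuser = FactoredWord(surface="häuser", lemma="haus",
                       pos="NN", morphology="plural-nominative-neutral")
```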
We report on experiments with factors such as surface form, lemma, part-of-speech, morphological features such as gender, count and case, automatic word classes, true case forms of words, shallow syntactic tags, as well as dedicated factors to ensure agreement between syntactically related items.
This paper describes the motivation, the modeling aspects and the computationally efficient decoding methods of factored translation models. We briefly present results for a number of language pairs. However, the focus of this paper is the description of the approach. Detailed experimental results will be described in forthcoming papers.
2 Related Work
Many attempts have been made to add richer information to statistical machine translation models. Most of these focus on the pre-processing of the input to the statistical system, or the post-processing of its output. Our framework is more general and goes beyond recent work on models that back off to representations with richer statistics (Nießen and Ney, 2001; Yang and Kirchhoff, 2006; Talbot and Osborne, 2006) by keeping a more complex representation throughout the translation process.
Rich morphology often poses a challenge to statistical machine translation, since a multitude of word forms derived from the same lemma fragment the data and lead to sparse data problems. If the input language is morphologically richer than the output language, it helps to stem or segment the input in a pre-processing step, before passing it on to the translation system (Lee, 2004; Sadat and Habash, 2006). Structural problems have also been addressed by pre-processing: Collins et al. (2005) reorder the input to a statistical system to more closely match the word order of the output language.
On the other end of the translation pipeline, additional information has been used in post-processing. Och et al. (2004) report minor improvements with linguistic features on a Chinese-English task; Koehn and Knight (2003) show some success in re-ranking noun phrases for German-English. In their approaches, first an n-best list with the best translations is generated for each input sentence. Then, the n-best list is enriched with additional features, for instance by syntactically parsing each candidate translation and adding a parse score. The additional features are used to rescore the n-best list, possibly resulting in a better best translation for the sentence.
The goal of integrating syntactic information into the translation model has prompted many researchers to pursue tree-based transfer models (Wu, 1997; Alshawi et al., 1998; Yamada and Knight, 2001; Melamed, 2004; Menezes and Quirk, 2005; Galley et al., 2006), with increasingly encouraging results. Our goal is complementary to these efforts: we are less interested in recursive syntactic structure than in richer annotation at the word level. In future work, these approaches may be combined.
Figure 2: Example factored model: morphological analysis and generation, decomposed into three mapping steps (translation of lemmas, translation of part-of-speech and morphological information, generation of surface forms).
3 Motivating Example: Morphology
One example to illustrate the shortcomings of the traditional surface word approach in statistical machine translation is the poor handling of morphology. Each word form is treated as a token in itself. This means that the translation model treats, say, the word house completely independently of the word houses. Any instance of house in the training data does not add any knowledge to the translation of houses. In the extreme case, while the translation of house may be known to the model, the word houses may be unknown and the system will not be able to translate it.
While this problem does not show up as strongly in English, due to its very limited morphological inflection, it does constitute a significant problem for morphologically rich languages such as Arabic, German, and Czech.
Thus, it may be preferable to model translation between morphologically rich languages on the level of lemmas, thus pooling the evidence for different word forms that derive from a common lemma.
In such a model, we would want to translate lemma and morphological information separately, and combine this information on the output side to ultimately generate the output surface words. Such a model can be defined straightforwardly as a factored translation model. See Figure 2 for an illustration of this model in our framework.
Note that while we illustrate the use of factored translation models on such a linguistically motivated example, our framework also applies to models that incorporate statistically defined word classes, or any other annotation.
4 Decomposition of Factored Translation
The translation of factored representations of input words into the factored representations of output words is broken up into a sequence of mapping steps that either translate input factors into output factors, or generate additional output factors from existing output factors.
Recall the example of a factored model motivated by morphological analysis and generation. In this model the translation process is broken up into the following three mapping steps:

1. Translate input lemmas into output lemmas
2. Translate morphological and POS factors
3. Generate surface forms given the lemma and linguistic factors
Factored translation models build on the phrase-based approach (Koehn et al., 2003) that breaks up the translation of a sentence into the translation of small text chunks (so-called phrases). This approach implicitly defines a segmentation of the input and output sentences into phrases. See an example in Figure 3.
Our current implementation of factored translation models strictly follows the phrase-based approach, with the additional decomposition of phrase translation into a sequence of mapping steps. Translation steps map factors in input phrases to factors in output phrases. Generation steps map output factors within individual output words. To reiterate: all translation steps operate on the phrase level, while all generation steps operate on the word level. Since all mapping steps operate on the same phrase segmentation of the input and output sentence into phrase pairs, we call these synchronous factored models.
Let us now take a closer look at one example, the translation of the one-word phrase häuser into English. The representation of häuser in German is: surface-form häuser | lemma haus | part-of-speech NN | count plural | case nominative | gender neutral.
Figure 3: Example sentence translation by a standard phrase model (neue häuser werden gebaut → new houses are built). Factored models extend this approach.
The three mapping steps in our morphological analysis and generation model may provide the following applicable mappings:

Translation: Mapping lemmas
haus → building, ...

Translation: Mapping morphology
NN|plural-nominative-neutral → NN|plural, NN|singular

Generation: Generating surface forms
building|NN|plural → buildings, ...
We call the application of these mapping steps to an input phrase expansion. Given the multiple choices for each step (reflecting the ambiguity in translation), each input phrase may be expanded into a list of translation options. The German häuser|haus|NN|plural-nominative-neutral may be expanded as follows:
Translation: Mapping lemmas
?|building|?|?, ...

Translation: Mapping morphology
?|building|NN|plural, ...

Generation: Generating surface forms
buildings|building|NN|plural, ...
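For concreteness, the expansion process can be sketched in a few lines of Python; the mapping tables below are toy entries standing in for the learned tables of Section 5, and the helper names are our own:

```python
from itertools import product

# Toy mapping tables for a one-word phrase; real tables are learned
# from data (Section 5), and the entries here are illustrative only.
lemma_table = {"haus": ["building", "house"]}                 # translation step
morph_table = {"NN|plural-nominative-neutral":
               ["NN|plural", "NN|singular"]}                  # translation step
gen_table = {("building", "NN|plural"): "buildings",          # generation step
             ("building", "NN|singular"): "building",
             ("house", "NN|plural"): "houses",
             ("house", "NN|singular"): "house"}

def expand(lemma, morph):
    """Expand one factored input word into a list of translation options."""
    options = []
    for out_lemma, out_morph in product(lemma_table.get(lemma, []),
                                        morph_table.get(morph, [])):
        surface = gen_table.get((out_lemma, out_morph))
        if surface is not None:
            options.append((surface, out_lemma, out_morph))
    return options

# expand("haus", "NN|plural-nominative-neutral") yields, among others,
# ("buildings", "building", "NN|plural").
```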
5 Statistical Model
Factored translation models closely follow the statistical modeling approach of phrase-based models (in fact, phrase-based models are a special case of factored models). The main difference lies in the preparation of the training data and the type of models learned from the data.
The training data (a parallel corpus) has to be annotated with the additional factors. For instance, if we want to add part-of-speech information on the input and output side, we need to obtain part-of-speech tagged training data. Typically this involves running automatic tools on the corpus, since manually annotated corpora are rare and expensive to produce.
Next, we need to establish a word alignment for all the sentences in the parallel training corpus. Here, we use the same methodology as in phrase-based models (typically symmetrized GIZA++ alignments). The word alignment methods may operate on the surface forms of words, or on any of the other factors. In fact, some preliminary experiments have shown that word alignment based on lemmas or stems yields improved alignment quality.
Each mapping step forms a component of the overall model. From a training point of view this means that we need to learn translation and generation tables from the word-aligned parallel corpus and define scoring methods that help us to choose between ambiguous mappings.
Phrase-based translation models are acquired from a word-aligned parallel corpus by extracting all phrase-pairs that are consistent with the word alignment. Given the set of extracted phrase pairs with counts, various scoring functions are estimated, such as conditional phrase translation probabilities based on relative frequency estimation or lexical translation probabilities based on the words in the phrases.
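As an illustrative sketch of the relative frequency estimation (a standard construction, not the Moses code; the function name and example pairs are our own):

```python
from collections import Counter, defaultdict

def phrase_translation_probs(phrase_pairs):
    """Estimate conditional phrase translation probabilities p(e|f) by
    relative frequency over extracted phrase pairs (f_phrase, e_phrase)."""
    pair_counts = Counter(phrase_pairs)
    f_counts = Counter(f for f, _ in phrase_pairs)
    table = defaultdict(dict)
    for (f, e), count in pair_counts.items():
        table[f][e] = count / f_counts[f]
    return table

# With the pair ("das haus", "the house") extracted twice and
# ("das haus", "the building") once, p("the house" | "das haus") = 2/3.
table = phrase_translation_probs([("das haus", "the house"),
                                  ("das haus", "the house"),
                                  ("das haus", "the building")])
```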
In our approach, the models for the translation steps are acquired in the same manner from a word-aligned parallel corpus. For the specified factors in the input and output, phrase mappings are extracted. The set of phrase mappings (now over factored representations) is scored based on relative counts and word-based translation probabilities.
The generation distributions are estimated on the output side only. The word alignment plays no role here. In fact, additional monolingual data may be used. The generation model is learned on a word-for-word basis. For instance, for a generation step that maps surface forms to part-of-speech, a table with entries such as (fish, NN) is constructed. One or more scoring functions may be defined over this table; in our experiments we used both conditional probability distributions, e.g., p(fish|NN) and p(NN|fish), obtained by maximum likelihood estimation.
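A minimal sketch of this estimation, assuming the output side is available as a list of (surface, POS) pairs (the helper name is our own):

```python
from collections import Counter

def generation_probs(output_words):
    """MLE of p(pos|surface) and p(surface|pos) from (surface, pos)
    pairs collected on the output side; no word alignment is needed,
    so additional monolingual data can be included."""
    joint = Counter(output_words)
    surface_counts = Counter(w for w, _ in output_words)
    pos_counts = Counter(p for _, p in output_words)
    p_pos_given_surface = {pair: c / surface_counts[pair[0]]
                           for pair, c in joint.items()}
    p_surface_given_pos = {pair: c / pos_counts[pair[1]]
                           for pair, c in joint.items()}
    return p_pos_given_surface, p_surface_given_pos

# With entries such as ("fish", "NN") and ("fish", "VB"),
# p_pos_given_surface[("fish", "NN")] corresponds to p(NN|fish).
```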
An important component of statistical machine translation is the language model, typically an n-gram model over surface forms of words. In the framework of factored translation models, such sequence models may be defined over any factor, or any set of factors. For factors such as part-of-speech tags, building and using higher order n-gram models (7-gram, 9-gram) is straightforward.
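For illustration, a count-based sketch of such a high-order model over a small tag vocabulary (our own example, without the smoothing a real model would need):

```python
from collections import Counter

def ngram_counts(tag_sequences, n=7):
    """Collect n-gram counts over a factor sequence such as POS tags.
    Because the tag vocabulary is small, high orders like 7 remain
    tractable; a real model adds smoothing on top of these counts."""
    counts = Counter()
    for tags in tag_sequences:
        padded = ["<s>"] * (n - 1) + list(tags) + ["</s>"]
        for i in range(len(padded) - n + 1):
            counts[tuple(padded[i:i + n])] += 1
    return counts

# Example: n-gram counts for a single tagged sentence.
counts = ngram_counts([["DT", "JJ", "NNS", "VBP", "VBN"]])
```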
5.2 Combination of Components
As in phrase-based models, factored translation models can be seen as the combination of several components (language model, reordering model, translation steps, generation steps). These components define one or more feature functions that are combined in a log-linear model:

p(e|f) = \frac{1}{Z} \exp \sum_{i=1}^{n} \lambda_i h_i(e, f)

Z is a normalization constant that is ignored in practice.
To compute the probability of a translation e given an input sentence f, we have to evaluate each feature function h_i. For instance, the feature function for a bigram language model component is (m is the number of words e_i in the sentence e):

h_{LM}(e, f) = p_{LM}(e) = p(e_1) \, p(e_2|e_1) \ldots p(e_m|e_{m-1})
Let us now consider the feature functions introduced by the translation and generation steps of factored translation models. The translation of the input sentence f into the output sentence e breaks down to a set of phrase translations {(\bar{f}_j, \bar{e}_j)}. For a translation step component, each feature function h_T is defined over the phrase pairs (\bar{f}_j, \bar{e}_j) given a scoring function τ:

h_T(e, f) = \sum_j \tau(\bar{f}_j, \bar{e}_j)
For a generation step component, each feature function h_G given a scoring function γ is defined over the output words e_k only:

h_G(e, f) = \sum_k \gamma(e_k)
The feature functions follow from the scoring functions (τ, γ) acquired during the training of translation and generation tables. For instance, recall our earlier example: a scoring function for a generation model component that is a conditional probability distribution between input and output factors, e.g., γ(fish, NN) = p(NN|fish).
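Putting these pieces together, the following sketch shows how the components combine; whether the scoring functions enter as probabilities or log probabilities is a modeling convention, and the helper names are our own:

```python
def log_linear_score(feature_values, weights):
    """sum_i lambda_i * h_i(e, f); the normalization constant Z is the
    same for all candidate translations of a fixed input f, so it can
    be ignored when ranking candidates."""
    return sum(lam * h for lam, h in zip(weights, feature_values))

def h_translation(phrase_pairs, tau):
    """Translation step feature h_T(e, f) = sum_j tau(f_j, e_j);
    tau is the learned scoring function (e.g. a log probability)."""
    return sum(tau[pair] for pair in phrase_pairs)

def h_generation(output_words, gamma):
    """Generation step feature h_G(e, f) = sum_k gamma(e_k);
    gamma is likewise a learned scoring function."""
    return sum(gamma[word] for word in output_words)
```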
The feature weights λ_i in the log-linear model are determined using a minimum error rate training method, typically Powell's method (Och, 2003).
5.3 Efficient Decoding
Compared to phrase-based models, the decomposition of phrase translation into several mapping steps creates additional computational complexity. Instead of a simple table look-up to obtain the possible translations for an input phrase, now multiple tables have to be consulted and their content combined.
In phrase-based models it is easy to identify the entries in the phrase table that may be used for a specific input sentence. These are called translation options. We usually limit ourselves to the top 20 translation options for each input phrase.
The beam search decoding algorithm starts with an empty hypothesis. Then new hypotheses are generated by using all applicable translation options. These hypotheses are used to generate further hypotheses in the same manner, and so on, until hypotheses are created that cover the full input sentence. The highest scoring complete hypothesis indicates the best translation according to the model.
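A greatly simplified sketch of this search (monotone decoding over a fixed segmentation, with no reordering or hypothesis recombination, unlike the real decoder; names and scores are illustrative):

```python
def beam_search(input_phrases, translation_options, beam_size=10):
    """Toy monotone beam search: a hypothesis is (phrases covered,
    output words, score); decoding repeatedly extends every hypothesis
    with all applicable translation options and prunes to the beam."""
    beam = [(0, [], 0.0)]  # start with an empty hypothesis
    for phrase in input_phrases:
        extended = [(covered + 1, output + [option], score + option_score)
                    for covered, output, score in beam
                    for option, option_score in translation_options[phrase]]
        beam = sorted(extended, key=lambda h: h[2], reverse=True)[:beam_size]
    # The highest scoring complete hypothesis is the model-best translation.
    best = max(beam, key=lambda h: h[2])
    return " ".join(best[1])

# Example with toy options and log-probability-like scores:
options = {"neue": [("new", -0.1)],
           "häuser": [("houses", -0.2), ("buildings", -0.5)]}
print(beam_search(["neue", "häuser"], options))  # -> "new houses"
```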
How do we adapt this algorithm for factored translation models? Since all mapping steps operate on the same phrase segmentation, the expansions of these mapping steps can be efficiently pre-computed prior to the heuristic beam search, and stored as translation options. For a given input phrase, all possible translation options are thus computed before decoding (recall the example in Section 4, where we carried out the expansion for one input phrase). This means that the fundamental search algorithm does not change.

Figure 4: Syntactically enriched output: by generating additional linguistic factors on the output side, high-order sequence models over these factors support syntactic coherence of the output.
However, we need to be careful about combinatorial explosion of the number of translation options given a sequence of mapping steps. In other words, the expansion may create too many translation options to handle. If one or more mapping steps result in a vast increase of (intermediate) expansions, this may become unmanageable.
We currently address this problem by early pruning of expansions, and by limiting the number of translation options per input phrase to a maximum number, by default 50.
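A sketch of this early pruning idea (the helper names are illustrative, not Moses internals):

```python
def expand_with_pruning(partial_expansions, apply_step, score, limit=50):
    """Apply one mapping step to all partial expansions, then prune
    early to at most `limit` intermediate results, so that a sequence
    of mapping steps cannot blow up the number of translation options."""
    expanded = [result for partial in partial_expansions
                for result in apply_step(partial)]
    expanded.sort(key=score, reverse=True)
    return expanded[:limit]
```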
This is, however, not a perfect solution. We are currently working on a more efficient search for the top 50 translation options to replace the current brute-force approach.
6 Experiments
We carried out a number of experiments using the factored translation model framework, incorporating both linguistic information and automatically generated word classes. This work is implemented as part of the open source Moses system (Koehn et al., 2007). We used the default settings for this system.
6.1 Syntactically Enriched Output
In the first set of experiments, we translate surface forms of words and generate additional output factors from them (see Figure 4 for an illustration).
By adding morphological and shallow syntactic information, we are able to use high-order sequence models (just like n-gram language models over words) in order to support syntactic coherence of the output.

Table 1: Experimental results with syntactically enriched output (part of speech, morphology). Systems compared: English-German (best published result, surface + POS), English-Spanish (surface + morph), English-Czech (surface + all morph, surface + case/number/gender, surface + CNG/verb/prepositions).
Table 1 summarizes the experimental results. The English-German systems were trained on the full 751,088 sentence Europarl corpus and evaluated on the WMT 2006 test set (Koehn and Monz, 2006).
Adding part-of-speech and morphological factors on the output side and exploiting them with 7-gram sequence models results in minor improvements in BLEU. The model that incorporates both POS and morphology (18.22% BLEU vs. baseline 18.04% BLEU) ensures better local grammatical coherence.
The baseline system often produces phrases such as zur (to) zwischenstaatlichen (inter-governmental) methoden (methods), with a mismatch between the determiner (singular) and the noun (plural), while the adjective is ambiguous.
In a manual evaluation of intra-NP agreement we found that the factored model reduced the disagreement error within noun phrases of length > 3 from 15% to 4%.
English-Spanish systems were trained on a 40,000 sentence subset of the Europarl corpus. Here, we also used morphological and part-of-speech factors on the output side with a 7-gram sequence model, resulting in absolute improvements of 1.25% (only morph) and 0.84% (morph+POS). Improvements on the full Europarl corpus are smaller.
English-Czech systems were trained on a 20,000 sentence Wall Street Journal corpus. Morphological features were exploited with a 7-gram language model. Experimentation suggests that it is beneficial to carefully consider which morphological features to use. Adding all features results in lower performance (27.04% BLEU) than considering only case, number and gender (27.45% BLEU) or additionally verbal (person, tense, and aspect) and prepositional (lemma and case) morphology (27.62% BLEU). All these models score well above the baseline of 25.82% BLEU.
An extended description of these experiments is in the JHU workshop report (Koehn et al., 2006).
6.2 Morphological Analysis and Generation
The next model is the one described in our motivating example in Section 4 (see also Figure 2). Instead of translating surface forms of words, we translate word lemma and morphology separately, and generate the surface form of the word on the output side.
We carried out experiments for the language pair German-English, using the 52,185 sentence News Commentary corpus. We report results on the development test set, which is also the out-of-domain test set of the WMT06 workshop shared task (Koehn and Monz, 2006).
German morphological analysis and POS tagging was done using LoPar (Schmid and Schulte im Walde, 2000), English POS tagging was done with Brill's tagger (Brill, 1995), followed by a simple lemmatizer based on tagging results.
Experimental results are summarized in Table 2. For this data set, we also see an improvement when using a part-of-speech language model (the BLEU score increases from 18.19% to 19.05%), consistent with the results reported in the previous section.
However, moving from a surface word translation mapping to a lemma/morphology mapping leads to a deterioration of performance to a BLEU score of 14.46%.
Note that this model completely ignores the surface forms of input words and only relies on the more general lemma and morphology information.

Table 2: Experimental results with the morphological analysis and generation model (Figure 2) on the News Commentary corpus (German-English): pure lemma/morph model and backoff lemma/morph model.
While this allows the translation of word forms with known lemma and unknown surface form, on balance it seems to be a disadvantage to throw away surface form information.
To overcome this problem, we introduce an alternative path model: translation options in this model may come either from the surface form model or from the lemma/morphology model we just described.
For surface forms with rich evidence in the training data, we prefer surface form mappings, and for surface forms with poor or no evidence in the training data we decompose surface forms into lemma and morphology information and map these separately. The different translation tables form different components in the log-linear model, whose weights are set using standard minimum error rate training methods.
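The following sketch illustrates the alternative path idea (hypothetical helper names, not the Moses implementation): options from both paths are pooled and tagged with their originating component so that their weights can differ:

```python
def alternative_path_options(phrase, surface_table, lemma_morph_expand):
    """Collect translation options from both decoding paths. Options are
    tagged with their originating component, so the log-linear model can
    weight the surface table and the lemma/morphology tables separately;
    MERT then effectively decides which path wins for a given input."""
    options = [("surface-path", option)
               for option in surface_table.get(phrase, [])]
    options += [("lemma-morph-path", option)
                for option in lemma_morph_expand(phrase)]
    return options
```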
The alternative path model outperforms the surface form model with POS LM, with a BLEU score of 19.47% vs. 19.05%.
The test set has 3276 unknown word forms vs. 2589 unknown lemmas (out of 26,898 words). Hence, the lemma/morph model is able to translate 687 additional words.
6.3 Use of Automatic Word Classes
Finally, we went beyond linguistically motivated factors and carried out experiments with automatically trained word classes. By clustering words together by their contextual similarity, we are able to find statistical similarities that may lead to more generalized and robust models.
Table 3: Experimental results with automatic word classes obtained by word clustering (English-Chinese): baseline (surface) vs. surface + word class.
Table 4: Experimental results with integrated recasing (Chinese-English): standard two-pass method (SMT + recase) vs. integrated factored model (optimized), with lower-cased input and mixed-cased output (Shen et al., 2006).
6.4 Integrated Recasing
To demonstrate the versatility of the factored translation model approach, consider the task of recasing: in statistical machine translation, the training data is typically lowercased to generalize over differently cased surface forms (say, the, The, THE), which necessitates a post-processing step to restore case in the output.
With factored translation models, it is possible to integrate this step into the model, by adding a generation step.
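A sketch of such a recasing generation step (toy table; in practice the distribution would be learned from cased text, and the names here are illustrative):

```python
# Toy generation table mapping lowercased forms to cased forms, with
# probabilities as they might be estimated from truecased text by MLE.
recase_table = {"the": {"the": 0.95, "The": 0.049, "THE": 0.001}}

def recase_options(lowercased_word):
    """Generation step for recasing: propose mixed-case surface forms
    for a lowercased output word, best-scoring first."""
    candidates = recase_table.get(lowercased_word, {lowercased_word: 1.0})
    return sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)

# recase_options("the") -> [("the", 0.95), ("The", 0.049), ("THE", 0.001)]
```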
See Table 4 for an illustration of this model and experimental results on the IWSLT 2006 task (Chinese-English).
The integrated recasing model outperforms the standard approach with a BLEU score of 21.08% vs. 20.65%. For more on this experiment, see (Shen et al., 2006).
6.5 Additional Experiments
Factored translation models have also been used for the integration of CCG supertags (Birch et al., 2007), domain adaptation (Koehn and Schroeder, 2007) and for the improvement of English-Czech translation (Bojar, 2007).
7 Conclusion and Future Work
We presented an extension of the state-of-the-art phrase-based approach to statistical machine translation that allows the straightforward integration of additional information, whether it comes from linguistic tools or automatically acquired word classes. We reported on experiments that showed gains over standard phrase-based models, both in terms of automatic scores (gains of up to 2% BLEU) and in terms of a measure of grammatical coherence.
These experiments demonstrate that within the framework of factored translation models additional information can be successfully exploited to overcome some shortcomings of the currently dominant phrase-based statistical approach. The framework of factored translation models is very general.
Many more models that incorporate different factors can be quickly built using the existing implementation. We are currently exploring these possibilities, for instance the use of syntactic information in reordering and models with augmented input information.
We have not addressed all computational problems of factored translation models. In fact, computational problems hold back experiments with more complex factored models that are theoretically possible but too computationally expensive to carry out. Our current focus is to develop a more efficient implementation that will enable these experiments.
Moreover, we expect to overcome the constraints of the currently implemented synchronous factored models by developing a more general asynchronous framework, where multiple translation steps may operate on different phrase segmentations (for instance a part-of-speech model for large-scale reordering).
Acknowledgments
This work was supported in part under the GALE program of the Defense Advanced Research Projects Agency, Contract No. HR0011-06-C-0022, and in part under the EuroMatrix project funded by the European Commission (6th Framework Programme).
We also benefited greatly from a 2006 summer workshop hosted by the Johns Hopkins University and would like to thank the other workshop participants for their support and insights, namely Nicola Bertoldi, Ondrej Bojar, Chris Callison-Burch, Alexandra Constantin, Brooke Cowan, Chris Dyer, Marcello Federico, Evan Herbst, Christine Moran, Wade Shen, and Richard Zens.
