We achieved state-of-the-art performance in statistical machine translation by using a large number of features with an online large-margin training algorithm. The millions of parameters were tuned only on a small development set consisting of less than 1K sentences. Experiments on Arabic-to-English translation indicated that a model trained with sparse binary features outperformed a conventional SMT system with a small number of features.
1 Introduction

The recent advances in statistical machine translation have been achieved by discriminatively training a small number of real-valued features based either on (hierarchical) phrase-based translation (Och and Ney, 2004; Koehn et al., 2003; Chiang, 2005) or syntax-based translation (Galley et al., 2006). However, this approach does not scale well to a large number of features of the order of millions.
Tillmann and Zhang (2006), Liang et al. (2006) and Bangalore et al. (2006) introduced sparse binary features for statistical machine translation trained on a large training corpus. In this framework, the problem of translation is regarded as a sequential labeling problem, in the same way as part-of-speech tagging, chunking or shallow parsing. However, the use of a large number of features did not provide any significant improvements over a conventional small feature set.
Bangalore et al. (2006) trained the lexical choice model by using Conditional Random Fields (CRF) realized on a WFST. Their modeling was reduced to a Maximum Entropy Markov Model (MEMM) to handle a large number of features, which, in turn, faced the label bias problem (Lafferty et al., 2001). Tillmann and Zhang (2006) trained their feature set using an online discriminative algorithm.
Since the decoding is still expensive, their online training approach is approximated by enlarging a merged k-best list one-by-one with a 1-best output. Liang et al. (2006) introduced an averaged perceptron algorithm, but employed only the 1-best translation.
In Watanabe et al. (2006a), binary features were trained only on a small development set using a variant of the voted perceptron for reranking k-best translations. Thus, the improvement is merely relative to the baseline translation system, namely whether or not there is a good translation in their k-best list.
We present a method to estimate a large number of parameters, of the order of millions, using an online training algorithm. Although it was intuitively considered to be prone to overfitting, training on a small development set of less than 1K sentences was sufficient to achieve improved performance.
In this method, each training sentence is decoded and weights are updated at every iteration (Liang et al., 2006). When updating model parameters, we employ a memorization variant of a local updating strategy (Liang et al., 2006) in which parameters are optimized toward a set of good translations found in the k-best list across iterations.
The objective function is an approximated BLEU (Watanabe et al., 2006a) that scales the loss of a sentence-wise BLEU to a document-wise loss.
The parameters are trained using the Margin Infused Relaxed Algorithm (MIRA) (Crammer et al., 2006). MIRA has been successfully employed in dependency parsing (McDonald et al., 2005) and the joint-labeling/chunking task (Shimizu and Haas, 2006).

Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 764-773, Prague, June 2007. © 2007 Association for Computational Linguistics
Experiments were carried out on an Arabic-to-English translation task, and we achieved significant improvements over conventional minimum error rate training with a small number of features.
This paper is organized as follows: First, Section 2 introduces the framework of statistical machine translation. As a baseline SMT system, we use hierarchical phrase-based translation with an efficient left-to-right generation (Watanabe et al., 2006b), originally proposed by Chiang (2005). In Section 3, a set of binary sparse features is defined, including numeric features for our baseline system. Section 4 introduces an online large-margin training algorithm using MIRA with our key components. The experiments are presented in Section 5, followed by discussion in Section 6.
2 Statistical Machine Translation

We use a log-linear approach (Och, 2003) in which a foreign language sentence f is translated into another language, for example English, e, by seeking a maximum solution:

    ê = argmax_e w^T · h(f, e)    (1)

where h(f, e) is a large-dimension feature vector and w is a weight vector that scales the contribution from each feature. Each feature can take any real value, such as the log of the n-gram language model to represent fluency, or a lexicon model to capture the word or phrase-wise correspondence.
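As a minimal sketch, the decision rule above amounts to an argmax over sparse dot products. The dict-based sparse representation and all feature names below are illustrative assumptions, not part of the paper.

```python
# Sketch of the log-linear decision rule: pick the candidate translation e
# that maximizes w . h(f, e). Sparse vectors are represented as dicts.

def score(w, h):
    """Dot product w . h(f, e) over a sparse feature vector."""
    return sum(w.get(name, 0.0) * value for name, value in h.items())

def best_translation(w, candidates):
    """argmax_e w . h(f, e) over (translation, feature-vector) pairs."""
    return max(candidates, key=lambda pair: score(w, pair[1]))[0]

# Hypothetical weights: a language-model feature and one sparse word pair.
w = {"lm": 0.5, "wp:violate|tnthk": 1.0}
candidates = [
    ("translation a", {"lm": -2.0}),
    ("translation b", {"lm": -1.5, "wp:violate|tnthk": 1.0}),
]
print(best_translation(w, candidates))  # prints "translation b"
```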
2.1 Hierarchical Phrase-based SMT

Chiang (2005) introduced the hierarchical phrase-based translation approach, in which non-terminals are embedded in each phrase. A translation is generated by hierarchically combining phrases using the non-terminals. Such a quasi-syntactic structure can naturally capture the reordering of phrases that is not directly modeled by a conventional phrase-based approach (Koehn et al., 2003). The non-terminal embedded phrases are learned from a bilingual corpus without a linguistically motivated syntactic structure.
Based on hierarchical phrase-based modeling, we adopted the left-to-right target generation method (Watanabe et al., 2006b). This method is able to generate translations efficiently, first, by simplifying the grammar so that the target side takes a phrase-prefixed form, namely a target normalized form. Second, a translation is generated in a left-to-right manner, similar to the phrase-based approach, using Earley-style top-down parsing on the source side. Coupled with the target normalized form, n-gram language models are efficiently integrated during the search even with a higher order of n.
2.2 Target Normalized Form

In Chiang (2005), each production rule is restricted to a rank-2 or binarized form in which each rule contains at most two non-terminals. The target normalized form (Watanabe et al., 2006b) further imposes a constraint whereby the target side of the aligned right-hand side is restricted to a Greibach Normal Form-like structure:

    X → ⟨γ, b β, ∼⟩

where X is a non-terminal, γ is a source side string of arbitrary terminals and/or non-terminals, and b β is a corresponding target side in which b is a string of terminals, or a phrase, and β is a (possibly empty) string of non-terminals. ∼ defines a one-to-one mapping between the non-terminals in γ and β.
The use of the phrase b as a prefix maintains the strength of the phrase-based framework. A contiguous English side with a (possibly) discontiguous foreign language side preserves phrase-bounded local word reordering. At the same time, the target normalized framework still combines phrases hierarchically in a restricted manner.
2.3 Left-to-Right Target Generation

Decoding is performed by parsing on the source side and by combining the projected target side. We applied an Earley-style top-down parsing approach (Wu and Wong, 1998; Watanabe et al., 2006b; Zollmann and Venugopal, 2006). The basic idea is to perform top-down parsing so that the projected target side is generated in a left-to-right manner. The search is guided with a push-down automaton, which keeps track of the span of uncovered source word positions. Combined with the rest-cost estimation aggregated in a bottom-up way, our decoder efficiently searches for the most likely translation.
The use of a target normalized form further simplifies the decoding procedure. Since the rule form does not allow any holes on the target side, the integration with an n-gram language model is straightforward: the prefixed phrases are simply concatenated and intersected with the n-gram model.
3 Features

3.1 Baseline Features

The hierarchical phrase-based translation system employs standard numeric value features:

• An n-gram language model to capture the fluency of the target side.
• Hierarchical phrase translation probabilities in both directions, h(γ|b β) and h(b β|γ), estimated by relative counts, count(γ, b β).
• Word-based lexically weighted models h_lex(γ|b β) and h_lex(b β|γ) using lexical translation models.
• Word-based insertion/deletion penalties that penalize through the low probabilities of the lexical translation models (Bender et al., 2004).
• Word/hierarchical-phrase length penalties.
• Backtrack-based penalties inspired by the distortion penalties in phrase-based modeling (Watanabe et al., 2006b).
3.2 Sparse Features

In addition to the baseline features, a large number of binary features are integrated in our MT system. We may use any binary features, such as an indicator that is 1 if the English word "violate" and the Arabic word "tnthk" appear in e and f, and 0 otherwise.
The features are designed by considering the decoding efficiency and are based on the word alignment structure preserved in hierarchical phrase translation pairs (Zens and Ney, 2006). When hierarchical phrases are extracted, the word alignment is preserved. If multiple word alignments are observed with the same source and target sides, only the most frequently observed word alignment is kept to reduce the grammar size.

Figure 1: An example of sparse features for a phrase translation.
Word pair features reflect the word correspondence in a hierarchical phrase. Figure 1 illustrates an example of sparse features for a phrase translation pair (f_j^{j+2}, e_i^{i+3})¹. From the word alignment encoded in this phrase, we can extract word pair features of (e_i, f_{j+1}), (e_{i+2}, f_{j+2}) and (e_{i+3}, f_j). The bigrams of word pairs are also used to capture the contextual dependency. We assume that the word pairs follow the target side ordering. For instance, we define ((e_{i-1}, f_{j-1}), (e_i, f_{j+1})), ((e_i, f_{j+1}), (e_{i+2}, f_{j+2})) and ((e_{i+2}, f_{j+2}), (e_{i+3}, f_j)), indicated by the arrows in Figure 1. Extracting bigram word pair features following the target side ordering implies that the corresponding source side is reordered according to the target side. The reordering of hierarchical phrases is represented by using contextually dependent word pairs across their boundaries, as with the feature ((e_{i-1}, f_{j-1}), (e_i, f_{j+1})) in Figure 1.
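The word-pair and bigram word-pair extraction described above can be sketched as follows. The helper name, alignment encoding, and feature-string format are illustrative assumptions; only the target-side ordering of pairs follows the text.

```python
# Sketch of sparse word-pair features from the word alignment kept inside a
# phrase pair. Bigrams of adjacent word pairs, ordered by the target side,
# capture contextual dependency (and hence source-side reordering).

def word_pair_features(target, source, alignment):
    """alignment: list of (i, j) pairs, meaning target[i] aligns to source[j]."""
    # Word pair features, ordered by target position.
    pairs = [(target[i], source[j]) for i, j in sorted(alignment)]
    feats = ["wp:%s|%s" % p for p in pairs]
    # Bigrams of consecutive word pairs in target order.
    feats += ["wp2:%s|%s__%s|%s" % (p1 + p2) for p1, p2 in zip(pairs, pairs[1:])]
    return feats

# Hypothetical example phrase pair with its internal alignment.
target = ["does", "not", "violate"]
source = ["lA", "tnthk"]
alignment = [(0, 0), (1, 0), (2, 1)]
for f in word_pair_features(target, source, alignment):
    print(f)
```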
The above features are insufficient to capture the translation because spurious words are sometimes inserted in the target side. Therefore, insertion features are integrated in which no word alignment is associated in the target. The inserted words are associated with all the words in the source sentence, such as (e_{i+1}, f_1), ..., (e_{i+1}, f_J) for the non-aligned word e_{i+1} with the source sentence f_1^J in Figure 1.
In the same way, we would be able to include deletion features, where a non-aligned source word is associated with the target sentence. However, this would lead to complex decoding in which all the translated words are memorized for each hypothesis, and thus deletion features are not integrated in our feature set.

¹For simplicity, we show an example of phrase translation pairs, but it is trivial to define the features over hierarchical phrases.

Figure 2: Example hierarchical features.
Target side bigram features are also included to directly capture the fluency, as in the n-gram language model (Roark et al., 2004). For instance, bigram features of (e_{i-1}, e_i), (e_i, e_{i+1}), (e_{i+1}, e_{i+2}), ... are observed in Figure 1.
In addition to the phrase motivated features, we included features inspired by the hierarchical structure. Figure 2 shows an example of hierarchical phrases in the source side, in which the phrase containing f_{j-1} and f_{j+3} embeds a child phrase containing f_j and f_{j+1}, which in turn embeds a phrase containing f_{j+2}. Hierarchical features capture the dependency of the source words in a parent phrase on the source words in child phrases, such as (f_{j-1}, f_j), (f_{j-1}, f_{j+1}), (f_{j+3}, f_j), (f_{j+3}, f_{j+1}), (f_j, f_{j+2}) and (f_{j+1}, f_{j+2}), as indicated by the arrows in Figure 2. The hierarchical features are extracted only for those source words that are aligned with the target side, to limit the feature size.
3.3 Normalization

In order to achieve generalization capability, the following normalized tokens are introduced for each surface form:

• Word class or POS.
• 4-letter prefix and suffix: "violate" is normalized to "viol+" and "+late" by taking the prefix and suffix, respectively.
• Digit sequence normalization: digits are replaced by the symbol "@", yielding forms such as "@@@@/@/@@".

Algorithm 1 Online Training Algorithm

We consider all possible combinations of those token types. For example, the word pair feature (violate, tnthk) is normalized and expanded to (viol+, tnthk), (viol+, tnth+), (violate, tnth+), etc. using the 4-letter prefix token type.
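The prefix/suffix and digit normalizations above, and the expansion of a word pair into all combinations of token types, can be sketched as follows. Word-class lookup is omitted, and the function names and variant format are illustrative assumptions.

```python
# Sketch of token normalization (Section 3.3): each surface form expands
# into normalized variants, and a word-pair feature is emitted for every
# combination of variants of the two words.
import itertools
import re

def token_types(word):
    variants = [word]                       # surface form
    if len(word) > 4:
        variants.append(word[:4] + "+")     # 4-letter prefix
        variants.append("+" + word[-4:])    # 4-letter suffix
    masked = re.sub(r"[0-9]", "@", word)    # digit sequences -> "@"
    if masked != word:
        variants.append(masked)
    return variants

def expanded_word_pairs(e_word, f_word):
    return list(itertools.product(token_types(e_word), token_types(f_word)))

print(expanded_word_pairs("violate", "tnthk"))
```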
4 Online Large-Margin Training

Algorithm 1 is our generic online training algorithm. The algorithm is slightly different from other online training algorithms (Tillmann and Zhang, 2006; Liang et al., 2006) in that we keep and update oracle translations, which are a set of good translations reachable by a decoder according to a metric, i.e. BLEU (Papineni et al., 2002).
In line 3, a k-best list is generated by best_k(·) using the current weight vector w, for the training instance (f_t, e_t). Each training instance has multiple (or, possibly, one) reference translations e_t for the source sentence f_t. Using the k-best list, the m-best oracle translations O_t are updated by oracle_m(·) at every iteration (line 4).
Usually, a decoder cannot generate translations that exactly match the reference translations due to its beam search pruning and OOV words. Thus, we cannot always assign scores to each reference translation. Therefore, possible oracle translations are maintained according to an objective function, i.e. BLEU.
Tillmann and Zhang (2006) avoided the problem by precomputing the oracle translations in advance. Liang et al. (2006) presented a similar updating strategy in which parameters were updated toward an oracle translation found in C_t, but ignored potentially better translations discovered in past iterations.
The new weight vector w^{i+1} is computed using the k-best list C_t with respect to the oracle translations O_t (line 5). After N iterations, the algorithm returns an averaged weight vector to avoid overfitting (line 9). The key to this online training algorithm is the selection of the updating scheme in line 5.
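The loop structure of Algorithm 1 (decode a k-best list, update the oracle set across iterations, update the weights, and finally average) can be sketched as follows. Here `decode_kbest`, `select_oracles`, and `update_weights` are hypothetical stand-ins for best_k(·), oracle_m(·), and the line-5 update; they are assumptions, not the paper's implementation.

```python
# Sketch of the generic online training loop of Algorithm 1. Sparse weight
# vectors are dicts; the three callables are supplied by the caller.

def online_train(data, decode_kbest, select_oracles, update_weights,
                 iterations=10):
    w = {}
    w_sum = {}
    oracles = {t: [] for t in range(len(data))}   # O_t, kept across iterations
    n_updates = 0
    for _ in range(iterations):
        for t, (f, refs) in enumerate(data):
            kbest = decode_kbest(w, f)                             # line 3: C_t
            oracles[t] = select_oracles(oracles[t] + kbest, refs)  # line 4: O_t
            w = update_weights(w, kbest, oracles[t], refs)         # line 5
            for name, value in w.items():                          # accumulate
                w_sum[name] = w_sum.get(name, 0.0) + value
            n_updates += 1
    # line 9: return the averaged weight vector to avoid overfitting
    return {name: value / n_updates for name, value in w_sum.items()}
```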
4.1 Margin Infused Relaxed Algorithm

The Margin Infused Relaxed Algorithm (MIRA) (Crammer et al., 2006) is an online version of the large-margin training algorithm for structured classification (Taskar et al., 2004) that has been successfully used for dependency parsing (McDonald et al., 2005) and joint-labeling/chunking (Shimizu and Haas, 2006).
The basic idea is to keep the norm of the updates to the weight vector as small as possible, while enforcing a margin at least as large as the loss of the incorrect classification.
Line 5 of the weight vector update procedure in Algorithm 1 is replaced by the solution of:

    w^{i+1} = argmin_w ||w - w^i||^2 + C Σ_{e,e'} ξ(e, e')    (3)

    subject to s(f_t, e; w) - s(f_t, e'; w) ≥ L(e, e'; e_t) - ξ(e, e'),
               ξ(e, e') ≥ 0, ∀e ∈ O_t, ∀e' ∈ C_t

where s(f_t, e; w) = w^T · h(f_t, e). ξ(·) is a non-negative slack variable and C > 0 is a constant to control the influence on the objective function. A larger C implies larger updates to the weight vector.
L(·) is a loss function, for instance the difference of BLEU scores, that measures the difference between e and e' according to the reference translations e_t. In this update, a margin is created for each pair of a correct and an incorrect translation, at least as large as the loss of the incorrect translation. A larger error means a larger distance between the scores of the correct and incorrect translations.
Following McDonald et al. (2005), only the k-best translations are used to form the margins in order to reduce the number of constraints in Eq. 3. In the translation task, multiple translations are acceptable. Thus, margins for the m-oracle translations are created, which amount to m × k large-margin constraints.
In this online training, only active features constrained by Eq. 3 are kept and updated, unlike offline training in which all possible features have to be extracted and selected in advance.
The Lagrange dual form of Eq. 3 is:

    max_{α(·)≥0} - (1/2) || Σ_{e,e'} α(e, e') (h(f_t, e) - h(f_t, e')) ||^2
                 + Σ_{e,e'} α(e, e') L(e, e'; e_t)
    subject to Σ_{e,e'} α(e, e') ≤ C    (4)

with the weight vector update:

    w^{i+1} = w^i + Σ_{e,e'} α(e, e') (h(f_t, e) - h(f_t, e'))

Equation 4 is solved using a QP-solver, such as a coordinate ascent algorithm, by heuristically selecting (e, e') and by updating α(·) iteratively. C is used to clip the amount of updates.
A single oracle with the 1-best translation is analytically solved without a QP-solver and is represented as the following perceptron-like update (Shimizu and Haas, 2006):

    w^{i+1} = w^i + α (h(f_t, e) - h(f_t, e'))
    α = min{ C, ( L(e, e'; e_t) - (s(f_t, e; w^i) - s(f_t, e'; w^i)) ) / || h(f_t, e) - h(f_t, e') ||^2 }

Intuitively, the update amount is controlled by the margin and the loss between the correct and incorrect translations, and by the closeness of the two translations in terms of feature vectors.
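The clipped, perceptron-like single-oracle update can be sketched as follows for sparse feature vectors. The dict representation and function name are assumptions; the step-size formula (loss minus current margin, divided by the squared feature distance, clipped to [0, C]) follows the update described above.

```python
# Sketch of the single-oracle MIRA update: move w toward the oracle's
# feature vector by a step alpha clipped to [0, C].

def mira_update(w, h_oracle, h_hyp, loss, C=1.0):
    """One clipped update of sparse weights w toward h_oracle, away from h_hyp."""
    keys = set(h_oracle) | set(h_hyp)
    diff = {k: h_oracle.get(k, 0.0) - h_hyp.get(k, 0.0) for k in keys}
    margin = sum(w.get(k, 0.0) * v for k, v in diff.items())  # s(e) - s(e')
    sq_norm = sum(v * v for v in diff.values())
    if sq_norm == 0.0:
        return dict(w)   # identical feature vectors: nothing to update
    alpha = max(0.0, min(C, (loss - margin) / sq_norm))       # clipped by C
    new_w = dict(w)
    for k, v in diff.items():
        new_w[k] = new_w.get(k, 0.0) + alpha * v
    return new_w
```

With zero initial weights, one oracle feature and one hypothesis feature, and a loss of 1.0, a single step already separates the two candidates by exactly the loss.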
Indeed, Liang et al. (2006) employed an averaged perceptron algorithm in which the α value was always set to one.
Tillmann and Zhang (2006) used a different update style based on a convex loss function, where η > 0 is a learning rate for controlling the convergence.

Table 1: Experimental results obtained by varying the normalized tokens used with the surface form (rows: surface form; w/ prefix/suffix; w/ word class; w/ digits; all token types).
4.2 Approximated BLEU

Here p_n(·) is the n-gram precision of hypothesized translations E given the reference translations. BLEU is computed for a set of sentences, not for a single sentence.
Our algorithm requires frequent updates of the weight vector, which implies a higher cost in computing the document-wise BLEU. Tillmann and Zhang (2006) and Liang et al. (2006) solved the problem by introducing a sentence-wise BLEU. However, the use of the sentence-wise scoring does not translate directly into the document-wise score because of the n-gram precision statistics and the brevity penalty statistics aggregated over a sentence set.
Thus, we use an approximated BLEU score that basically computes BLEU for a sentence set, but accumulates the difference for a particular sentence (Watanabe et al., 2006a).
The approximated BLEU is computed as follows: Given oracle translations O for the training set T, we maintain the best oracle translations Ô_T = {ê_1, ..., ê_T}. The approximated BLEU for a hypothesized translation e for the training instance (f_t, e_t) is computed over Ô_T, except that ê_t is replaced by e. The loss computed by the approximated BLEU measures the document-wise loss of substituting the correct translation ê_t with an incorrect translation e. The score can be regarded as a normalization which scales a sentence-wise score into a document-wise score.
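The substitution idea behind the approximated BLEU can be sketched as follows: document-level n-gram statistics are accumulated over the oracle set, and the statistics of sentence t are swapped for those of the hypothesis before computing corpus BLEU. A simplified unigram/bigram BLEU without smoothing is used here purely for illustration, and all names are assumptions.

```python
# Sketch of approximated BLEU: corpus-level BLEU over the oracle set, with
# one sentence's statistics replaced by the hypothesis being scored.
import math
from collections import Counter

def ngram_stats(hyp, ref, n_max=2):
    """Per-sentence (match_n, total_n) for n = 1..n_max, plus hyp/ref lengths."""
    stats = []
    for n in range(1, n_max + 1):
        h = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        r = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        stats.append((sum((h & r).values()), max(sum(h.values()), 1)))
    return stats, len(hyp), len(ref)

def corpus_bleu(all_stats):
    """Geometric mean of aggregated 1- and 2-gram precisions, with brevity penalty."""
    match = [sum(s[0][n][0] for s in all_stats) for n in range(2)]
    total = [sum(s[0][n][1] for s in all_stats) for n in range(2)]
    hyp_len = sum(s[1] for s in all_stats)
    ref_len = sum(s[2] for s in all_stats)
    if min(match) == 0:
        return 0.0
    precision = sum(math.log(m / t) for m, t in zip(match, total)) / 2
    bp = min(0.0, 1.0 - ref_len / hyp_len)   # log brevity penalty
    return math.exp(precision + bp)

def approx_bleu(oracle_stats, t, hyp, ref):
    """Corpus BLEU with the statistics of sentence t replaced by hyp's."""
    swapped = list(oracle_stats)
    swapped[t] = ngram_stats(hyp, ref)
    return corpus_bleu(swapped)
```

Replacing a poor oracle sentence with a hypothesis that matches its reference raises the document-wise score, which is exactly the signal used as the loss.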
5 Experiments

We employed our online large-margin training procedure for an Arabic-to-English translation task. The training data were extracted from the Arabic/English news/UN bilingual corpora supplied by LDC. The data amount to nearly 3.8M sentences. The Arabic part of the bilingual data is tokenized by isolating Arabic scripts and punctuation marks.
The development set comes from the MT2003 Arabic-English NIST evaluation test set, consisting of 663 sentences in the news domain with four reference translations. The performance is evaluated on the news domain MT2004/MT2005 test sets, consisting of 707 and 1,056 sentences, respectively.
The hierarchical phrase translation pairs are extracted in a standard way (Chiang, 2005): First, the bilingual data are word-alignment annotated by running GIZA++ (Och and Ney, 2003) in two directions. Second, the word alignment is refined by the grow-diag-final heuristic (Koehn et al., 2003). Third, phrase translation pairs are extracted together with hierarchical phrases by considering holes.
In the last step, the hierarchical phrases are constrained so that they follow the target normalized form constraint. A 5-gram language model is trained on the English side of the bilingual data combined with the English Gigaword from LDC.
First, the use of the normalized token types of Section 3.3 is evaluated in Table 1. In this setting, all the structural features of Section 3.2 are used, but differentiated by the normalized tokens combined with surface forms. Our online large-margin training algorithm performed 50 iterations, constrained by 10-oracle and 10-best lists.

Table 2: Experimental results obtained by incrementally adding structural features (rows: word pairs; + target bigram; + insertion; + hierarchical).

Table 3: Experimental results for varying k-best and m-oracle translations (columns: # features; 1-oracle; 10-oracle; sentence-BLEU).
When decoding, a 1000-best list is generated to achieve better oracle translations. The training took nearly 1 day using 8 cores of an Opteron. The translation quality is evaluated by case-sensitive NIST (Doddington, 2002) and BLEU (Papineni et al., 2002).
The table also shows the number of active features, i.e. those assigned nonzero weights. The addition of prefix/suffix tokens greatly increased the number of active features. This setting severely overfit the development data, and therefore resulted in worse results in open tests.
The word class³ combined with the surface form avoided the overfitting problem. The digit sequence normalization provides a similar generalization capability despite the moderate increase in the active feature size. By including all token types, we achieved better NIST/BLEU scores for the 2004 and 2005 test sets. This set of experiments indicates that token normalization is especially useful when training on a small data set.
Second, we used all the normalized token types, but incrementally added structural features in Table 2. Target bigram features account only for the fluency of the target side, without considering the source/target correspondence.
Therefore, the inclusion of target bigram features clearly overfit the development data.

³We induced 50 classes each for English and Arabic.
The problem is resolved by adding insertion features, which can take into account an agreement with the source side that is not directly captured by word pair features. Hierarchical features are somewhat effective on the 2005 test set by considering the dependency structure of the source side.
Finally, we compared our online training algorithm with sparse features against a baseline system in Table 3. The baseline hierarchical phrase-based system is trained using standard max-BLEU training (MERT) without sparse features (Och, 2003). Table 3 shows the results obtained by varying the m-oracle and k-best sizes (k, m = 1, 10) using all structural features and all token types.
We also experimented with sentence-wise BLEU as an objective function, constrained by 10-oracle and 10-best lists. Even the 1-oracle 1-best configuration achieved significant improvements over the baseline system.
The use of a larger k-best list further optimizes toward the development set, but at the cost of degraded translation quality on the 2004 test set. A larger m-oracle size seems to be harmful if coupled with the 1-best list. As indicated by the reduced active feature size, the 1-best translation seems to be updated toward worse translations in the 10-oracles that are "close" in terms of features.
We achieved significant improvements when the k-best list size was also increased. The use of sentence-wise BLEU as an objective provides almost no improvement on the 2005 test set, but is comparable on the 2004 test set.

Table 4: Two-fold cross validation experiments (closed test and open test; NIST and BLEU; including the baseline).
We also ran two-fold cross validation, splitting the data into two sets to observe the effect of optimization, as shown in Table 4⁴.
The MERT baseline system performed similarly in both closed and open tests. Our online large-margin training with 10-oracle and 10-best constraints and the approximated BLEU loss function significantly outperformed the baseline system in the open test. The development data is almost doubled in this setting. The MERT approach seems to be confused by the slightly larger data and by the mixed domains from different epochs.
6 Discussion

In this work, a translation model consisting of millions of features was successfully integrated. In order to avoid overfitting, features are limited to word-based features, but are designed to reflect the structures inside hierarchical phrases.
One of the benefits of MIRA is its flexibility. We may include as many constraints as possible, like the m-oracle constraints in our experiments.
Although we described experiments on hierarchical phrase-based translation, the online training algorithm is applicable to any translation system, such as phrase-based and syntax-based translation.
Online discriminative training has already been studied by Tillmann and Zhang (2006) and Liang et al. (2006). In their approaches, training was performed on a large corpus using the sparse features of phrase translation pairs, target n-grams and/or bag-of-word pairs inside phrases.
In Tillmann and Zhang (2006), k-best list generation is approximated by a step-by-step one-best merging method that separates the decoding and training steps.

⁴We split the data by document, not by sentence.
The weight vector update scheme is very similar to MIRA, but is based on a convex loss function. Our method directly employs the k-best list generated by the fast decoding method (Watanabe et al., 2006b) at every iteration. One of the benefits is that we avoid the rather expensive cost of merging the k-best lists, especially when handling millions of features.
Liang et al. (2006) employed an averaged perceptron algorithm. They decoded each training instance and performed a perceptron update to the weight vector. An incorrect translation was updated toward an oracle translation found in a k-best list, but potentially better translations from past iterations were discarded.
An experiment has been undertaken using a small development set together with sparse features for reranking k-best translations (Watanabe et al., 2006a). They relied on a variant of the voted perceptron and achieved significant improvements. However, their work was limited to reranking; thus the improvement was relative to the performance of the baseline system, namely whether or not there was a good translation in the list. In our work, the sparse features are directly integrated into the DP-based search.
The design of the sparse features was inspired by Zens and Ney (2006). They exploited the word alignment structure inside phrase translation pairs for discriminatively training a reordering model in their phrase-based translation. The reordering model simply classifies whether to perform monotone decoding or not. The trained model is treated as a single feature function integrated in Eq. 1. Our approach differs in that each sparse feature is individually integrated in Eq. 1.
7 Conclusion

We exploited a large number of binary features for statistical machine translation. The model was trained on a small development set. The optimization was carried out by MIRA, which is an online version of the large-margin training algorithm.
Millions of sparse features are intuitively considered prone to overfitting, especially when trained on a small development set. However, our algorithm with millions of features achieved very significant improvements over a conventional method with a small number of features. This result indicates that we can easily experiment with many alternative features even on a small data set, but we believe that our approach can scale well to a larger data set for further improved performance.
Future work involves scaling up to larger data and more features.

Acknowledgements

We would like to thank the reviewers and our colleagues for useful comments and discussion.