Parallel
corpus
is
an
indispensable
resource
for
translation
model
training
in
statistical
machine
translation
(
SMT
)
.
Instead
of
collecting
more
and
more
parallel
training
corpora
,
this
paper
aims
to
improve
SMT
performance
by
exploiting
full
potential
of
the
existing
parallel
corpora
.
Two
kinds
of
methods
are
proposed
:
offline
data
optimization
and
online
model
optimization
.
The
offline
method
adapts
the
training
data
by
redistributing
the
weight
of
each
training
sentence
pairs
.
The
online
method
adapts
the
translation
model
by
redistributing
the
weight
of
each
predefined
submodels
.
Information
retrieval
model
is
used
for
the
weighting
scheme
in
both
methods
.
Experimental
results
show
that
without
using
any
additional
resource
,
both
methods
can
improve
SMT
performance
significantly
.
1
Introduction
Statistical
machine
translation
relies
heavily
on
the
available
training
data
.
Typically
,
the
more
data
is
used
to
estimate
the
parameters
of
the
translation
model
,
the
better
it
can
approximate
the
true
translation
probabilities
,
which
will
obviously
lead
to
a
higher
translation
performance
.
However
,
large
corpora
are
not
easily
available
.
The
collected
corpora
are
usually
from
very
different
areas
.
For
example
,
the
parallel
corpora
provided
by
LDC
come
from
quite
different
domains
,
such
as
Hongkong
laws
,
Hangkong
Hansards
and
Hongkong
news
.
This
results
in
the
problem
that
a
translation
system
trained
on
data
from
a
particular
domain
(
e.g.
Hongkong
Hansards
)
will
perform
poorly
when
translating
text
from
a
different
domain
(
e.g.
news
articles
)
.
Our
experiments
also
show
that
simply
putting
all
these
domain
specific
corpora
together
will
not
always
improve
translation
quality
.
From
another
aspect
,
larger
amount
of
training
data
also
requires
larger
computational
resources
.
With
the
increasing
of
training
data
,
the
improvement
of
translation
quality
will
become
smaller
and
smaller
.
Therefore
,
while
keeping
collecting
more
and
more
parallel
corpora
,
it
is
also
important
to
seek
effective
ways
of
making
better
use
of
available
parallel
training
data
.
There
are
two
cases
when
we
train
a
SMT
system
.
In
one
case
,
we
know
the
target
test
set
or
target
test
domain
,
for
example
,
when
building
a
specific
domain
SMT
system
or
when
participating
the
NIST
MT
evaluation1
.
In
the
other
case
,
we
are
unaware
of
any
information
of
the
testing
data
.
This
paper
presents
two
methods
to
exploit
full
potential
of
the
available
parallel
corpora
in
the
two
cases
.
For
the
first
case
,
we
try
to
optimize
the
training
data
offline
to
make
it
match
the
test
data
better
in
domain
,
topic
and
style
,
thus
improving
the
translation
performance
.
For
the
second
case
,
we
first
divide
the
training
data
into
several
domains
and
train
submodels
for
each
domain
.
Then
,
in
the
translation
process
,
we
try
to
optimize
the
predefined
models
according
to
the
online
input
source
sentence
.
Information
retrieval
model
is
used
for
similar
sentences
retrieval
in
both
methods
.
Our
preliminary
experiments
show
that
both
methods
can
improve
SMT
performance
without
using
any
additional
data
.
Proceedings
of
the
2007
Joint
Conference
on
Empirical
Methods
in
Natural
Language
Processing
and
Computational
Natural
Language
Learning
,
pp.
343-350
,
Prague
,
June
2007
.
©
2007
Association
for
Computational
Linguistics
The
remainder
of
this
paper
is
organized
as
follows
:
Section
2
describes
the
offline
data
selection
and
optimization
method
.
Section
3
describes
the
online
model
optimization
method
.
The
evaluation
and
discussion
are
given
in
section
4
.
Related
work
is
introduced
before
concluding
.
2
Offline
training
data
optimization
In
offline
training
data
optimization
,
we
assume
that
the
target
test
data
or
target
test
domain
is
known
before
building
the
translation
model
.
We
first
select
sentences
similar
to
the
test
text
using
information
retrieval
method
to
construct
a
small
and
adapted
training
data
.
Then
the
extracted
similar
subset
is
used
to
optimize
the
distribution
of
the
whole
training
data
.
The
adapted
and
the
optimized
training
data
will
be
used
to
train
new
translation
models
.
2.1
Similar
data
selection
using
TF-IDF
We
use
information
retrieval
method
for
similar
data
retrieval
.
The
standard
TF-IDF
(
Term
Frequency
and
Inverse
Document
Frequency
)
term
weighting
scheme
is
used
to
measure
the
similarity
between
the
test
sentence
and
the
training
sentence
.
TF-IDF
is
a
similarity
measure
widely
used
in
information
retrieval
.
Each
document
Di
is
represented
as
a
vector
(
wn
,
wi2
,
.
.
.
,
win
)
,
n
is
the
size
of
the
vocabulary
.
Wj
is
calculate
as
follows
:
tf
j
is
the
term
frequency
(
TF
)
of
the
j-th
word
in
the
vocabulary
in
the
document
D
,
i.e.
the
number
of
occurrences
;
dfj
is
the
inverse
document
frequency
(
IDF
)
of
the
j-th
word
calculated
as
below
:
#
documents
idfj
=
-
.
#
documents
containing
j
-
th
term
The
similarity
between
two
documents
is
then
defined
as
the
cosine
of
the
angle
between
the
two
vectors
.
We
perform
information
retrieval
using
the
Lemur
toolkit2
.
The
source
language
part
of
the
parallel
training
data
is
used
as
the
document
collection
.
Each
sentence
represents
one
document
.
Each
sentence
from
the
test
data
or
test
domain
is
used
as
one
separate
query
.
In
the
sentence
retrieval
process
,
both
the
query
and
the
document
are
converted
into
vectors
by
assigning
a
term
weight
to
each
word
.
Then
the
cosine
similarity
is
calculated
proportional
to
the
inner
product
of
the
two
vectors
.
All
retrieved
sentences
are
ranked
according
to
their
similarity
with
the
query
.
We
pair
each
of
the
retrieved
sentences
with
the
corresponding
target
part
and
the
top
N
most
similar
sentences
pairs
are
put
together
to
form
an
adapted
parallel
data
.
N
ranges
from
one
to
several
thousand
in
our
experiments
.
Since
Lemur
toolkit
gives
the
similarity
score
for
each
retrieved
sentences
,
it
is
also
possible
to
select
the
most
similar
sentences
according
to
the
similarity
score
.
Note
that
the
selected
similar
data
can
contain
duplicate
sentences
as
the
top
N
retrieval
results
for
different
test
sentences
can
contain
the
same
training
sentences
.
The
duplicate
sentences
will
force
the
translation
probability
towards
the
more
often
seen
words
.
Intuitively
,
this
could
help
.
In
experiment
section
,
we
will
compare
experimental
results
by
keeping
or
removing
duplicates
to
see
how
the
duplicate
sentences
affect
the
translations
.
The
selected
subset
contains
the
similar
sentences
with
the
test
data
or
test
domain
.
It
matches
the
test
data
better
in
domain
,
topic
and
style
.
Hopefully
,
training
translation
model
using
this
adapted
parallel
data
may
helpful
for
improving
translation
performance
.
In
addition
,
the
translation
model
trained
using
the
selected
subset
is
usually
much
smaller
than
that
trained
using
the
whole
translation
data
.
Limiting
the
size
of
translation
model
is
very
important
for
some
real
applications
.
Since
SMT
systems
usually
require
large
computation
resource
.
The
complexity
of
standard
training
and
decoding
algorithm
depends
mainly
on
the
size
of
the
parallel
training
data
and
the
size
of
the
translation
model
.
Limiting
the
size
of
the
training
data
with
the
similar
translation
performance
would
also
reduce
the
memories
and
speed
up
the
translations
.
In
the
information
retrieval
process
,
we
only
use
the
source
language
part
for
document
indexing
and
query
generating
.
It
is
easy
to
get
source
part
of
the
test
data
.
This
is
different
from
the
common
language
model
adaptation
methods
,
which
have
to
do
at
lease
one
pass
machine
translation
to
get
the
candidate
English
translation
as
query
(
Zhao
2004
,
Zhang
2006
)
.
So
our
method
has
the
advantage
that
it
is
independent
from
the
quality
of
baseline
translation
system
.
2.2
Training
data
optimization
There
are
two
factors
on
training
data
that
influence
the
translation
performance
of
SMT
system
:
the
scale
and
the
quality
.
In
some
sense
,
we
improve
the
quality
of
the
training
data
by
selecting
the
similar
sentence
to
form
an
adapted
training
set
.
However
,
we
also
reduce
the
scale
of
the
training
data
at
the
same
time
.
Although
this
is
helpful
for
some
small
device
applications
,
it
is
also
possible
to
induce
the
data
sparseness
problem
.
Here
,
we
introduce
a
method
to
optimize
between
the
scale
and
the
quality
of
the
training
data
.
The
basic
idea
is
that
we
still
use
all
the
available
training
data
;
by
redistributing
the
weight
of
each
sentence
pairs
we
adapt
the
whole
training
data
to
the
test
domain
.
In
our
experiments
,
we
simply
combine
the
selected
small
similar
subset
and
the
whole
training
data
.
The
weights
of
each
sentence
pairs
are
changed
accordingly
.
Figure
1
shows
the
procedure
of
the
optimization
.
Information
Retrieval
Mode
]
Original
Corpus
Adapted
Corpus
Retrieval
Optimized
Carpus
Figure
1
.
Training
data
optimization
As
can
be
seen
,
through
the
optimization
,
the
weight
of
the
similar
sentence
pairs
are
increased
,
while
the
general
sentence
pairs
still
have
an
ordinary
weight
.
This
make
the
translation
model
inclined
to
give
higher
probabilities
to
the
adapted
words
,
and
at
the
same
time
avoid
the
data
sparse-ness
problem
.
Since
we
only
change
the
weight
of
the
sentence
pairs
,
and
no
new
training
data
is
introduced
,
the
translation
model
size
trained
on
the
optimized
data
will
keep
as
the
same
as
the
original
one
.
We
use
GIZA++
toolkit3
for
word
align
-
ment
training
in
the
training
process
.
The
input
training
file
formats
for
GIZA++
is
as
follows
:
Each
training
sentence
pair
is
stored
in
three
lines
.
The
first
line
is
the
number
of
times
this
sentence
pair
occurred
.
The
second
line
is
the
source
sentence
where
each
token
is
replaced
by
its
unique
integer
id
and
the
third
is
the
target
sentence
in
the
same
format
.
To
deal
with
our
optimized
training
data
,
we
only
need
to
change
the
number
of
sentence
pairs
in
the
first
line
accordingly
.
This
will
not
call
for
extra
training
time
and
memory
for
the
whole
training
process
.
It
might
be
beneficial
to
investigate
other
sophisticated
weighting
schemes
under
the
similar
idea
,
such
as
to
give
more
precise
fractional
weights
to
the
sentences
according
the
retrieval
similarity
scores
.
3
Online
model
optimization
In
most
circumstances
,
we
don
't
know
exactly
the
test
data
or
the
test
domain
when
we
train
a
machine
translation
system
.
This
results
in
the
fact
that
the
performance
of
the
translation
system
highly
depends
on
the
training
data
and
the
test
data
it
is
used
in
.
To
alleviate
this
blindfold
status
and
maximize
the
potential
of
the
available
training
corpora
,
we
propose
a
novel
online
model
optimization
method
.
The
basic
idea
is
that
:
several
candidate
translation
models
are
prepared
in
training
stage
.
In
particularly
,
a
general
model
is
also
prepared
.
Then
,
in
the
translation
process
,
the
similarity
between
the
input
sentence
and
the
predefined
models
is
calculated
online
to
get
the
weights
of
each
model
.
The
optimized
model
is
used
to
translate
the
input
sentence
.
There
are
two
problems
in
the
method
:
how
to
prepare
submodels
in
training
process
and
how
to
optimize
the
model
weight
online
in
translation
process
.
3.1
Prepare
the
submodels
There
are
several
ways
to
prepare
submodels
in
training
process
.
If
the
training
data
comes
from
very
different
sources
,
we
can
divide
the
data
according
to
its
origins
.
Otherwise
,
we
can
use
clustering
method
to
separate
the
training
corpus
into
several
classes
.
In
addition
,
our
offline
data
adaptation
method
can
also
be
used
for
submodel
preparation
.
For
each
candidate
domain
,
we
can
use
the
source
side
of
a
small
corpus
as
queries
to
extract
a
domain
specific
training
set
.
In
this
case
,
a
sentence
pair
in
the
training
data
may
occur
in
several
sub
training
data
,
but
this
doesn
't
matter
.
The
general
model
is
used
when
the
online
input
is
not
similar
to
any
prepared
submodels
.
We
can
use
all
available
training
data
to
train
the
general
model
since
generally
larger
data
can
get
better
model
even
there
are
some
noises
.
3.2
Online
model
weighting
We
also
use
TF-IDF
information
retrieval
method
for
online
model
weighting
.
The
procedure
is
as
follows
:
For
each
input
sentence
:
Do
IR
on
training
data
collection
,
using
the
input
sentence
as
query
.
Determine
the
weights
of
submodels
according
to
the
retrieved
sentences
.
Use
the
optimized
model
to
translate
the
sentence
.
The
information
retrieval
process
is
the
same
as
the
offline
data
selection
except
that
each
retrieved
sentence
is
attached
with
the
sub-corpus
information
,
i.e.
it
belongs
to
which
sub-models
in
the
training
process
.
With
the
sub-corpus
information
,
we
can
calculate
the
weights
of
submodels
.
We
get
the
top
N
most
similar
sentences
,
and
then
calculate
proportions
of
each
submodel
's
sentences
.
The
proportion
can
be
calculated
use
the
count
of
the
sentences
or
the
similarity
score
of
the
sentences
.
The
weight
of
each
submodel
can
be
determined
according
to
the
proportions
.
Our
optimized
model
is
the
log
linear
interpolation
of
the
sub-models
as
follows
:
where
,
p0
is
the
probability
of
general
model
,
pi
is
the
probability
of
submodel
i.
S0
is
the
weight
of
general
model
.
Si
is
the
weight
of
submodel
i.
Each
model
i
is
also
implemented
using
log
linear
model
in
our
SMT
system
.
So
after
the
log
operation
,
the
submodels
are
interpolated
linearly
.
In
our
experiments
,
the
interpolation
factor
Si
is
determined
using
the
following
four
simple
weighting
schemes
:
i^max_model
Weighting
scheme
1
:
c
_
r
\
.
c
_
-
I
.
Weighting
scheme
2
:
Use
weighting
schemel
;
else
Use
weighting
scheme3
;
else
where
,
models
is
the
i-th
submodel
,
i
=
(
1
.
.
.
M
)
.
Proportion
(
modeli
)
is
the
proportion
of
modeli
in
the
retrieved
results
.
We
use
count
for
proportion
calculation
.
max_model
is
the
submodel
with
the
max
proportion
score
.
The
training
and
translation
procedure
of
online
model
optimization
is
illustrated
in
Figure
2
.
Training
procedure
Training
Corpora
Translation
procedure
Figure
2
.
Online
model
optimization
The
online
model
optimization
method
makes
it
possible
to
select
suitable
models
for
each
individual
test
sentence
.
Since
the
IR
process
is
done
on
a
fixed
training
data
,
the
size
of
the
index
data
is
quite
small
compared
with
the
web
IR
.
The
IR
process
will
not
take
much
time
in
the
translation
.
4
Experiments
and
evaluation
4.1
Experimental
setting
We
conduct
our
experiments
on
Chinese-to-English
translation
tasks
.
The
baseline
system
is
a
variant
of
the
phrase-base
SMT
system
,
implemented
using
log-linear
translation
model
(
He
et
al.
2006
)
.
The
baseline
SMT
system
is
used
in
all
experiments
.
The
only
difference
between
them
is
that
they
are
trained
on
different
parallel
training
data
.
In
training
process
,
we
use
GIZA++4
toolkit
for
word
alignment
in
both
translation
directions
,
and
apply
"
grow-diag-final
"
method
to
refine
it
(
Koehn
et
al.
,
2003
)
.
We
change
the
preprocess
part
of
GIZA++
toolkit
to
make
it
accept
the
weighted
training
data
.
Then
we
use
the
same
criterion
as
suggested
in
(
Zens
et
al.
,
2002
)
to
do
phrase
extraction
.
For
the
log-linear
model
training
,
we
take
minimum-error-rate
training
method
as
described
in
(
Och
,
2003
)
.
The
language
model
is
trained
using
Xinhua
portion
of
Gigaword
with
about
190M
words
.
SRI
Language
Modeling
toolkit5
is
used
to
train
a
4-gram
model
with
modified
Kneser-Ney
smoothing
(
Chen
and
Goodman
,
1998
)
.
All
experiments
use
the
same
language
model
.
This
ensures
that
any
differences
in
performance
are
caused
only
by
differences
in
the
parallel
training
data
.
Our
training
data
are
from
three
LDC
corpora
as
shown
in
Table
1
.
We
random
select
200,000
sentence
pairs
from
each
corpus
and
combine
them
together
as
the
baseline
corpus
,
which
includes
16M
Chinese
words
and
19M
English
words
in
total
.
This
is
the
usual
case
when
we
train
a
SMT
system
,
i.e.
we
simply
combine
all
corpora
from
different
origins
to
get
a
larger
training
corpus
.
as
our
development
set
,
and
the
2005
NIST
MT
test
data
as
the
test
set
in
offline
data
optimization
experiments
.
In
both
data
,
each
sentence
has
four
human
translations
as
references
.
The
translation
quality
is
evaluated
by
BLEU
metric
(
Papineni
et
al.
,
2002
)
,
as
calculated
by
mteval-v11b.pl6
with
case-sensitive
matching
of
n-grams
.
LDC
No.
Description
FBIS
Multilanguage
Texts
HK
Hansards
Hong
Kong
Hansards
Text
Hong
Kong
News
Text
All
above
data
4.2
Baseline
experiments
We
first
train
translation
models
on
each
sub
training
corpus
and
the
baseline
corpus
.
The
development
set
is
used
to
tune
the
feature
weights
.
The
results
on
test
set
are
shown
in
Table
2
.
BLEU
on
dev
set
BLEU
on
test
set
HK_Hansards
Table
2
.
Baseline
results
From
the
results
we
can
see
that
although
the
size
of
each
sub
training
corpus
is
similar
,
the
translation
results
from
the
corresponding
system
are
quite
different
on
the
same
test
set
.
It
seems
that
the
FBIS
corpus
is
much
similar
to
the
test
set
than
the
other
two
corpora
.
In
fact
,
it
is
the
case
.
The
FBIS
contains
text
mainly
from
China
mainland
news
stories
,
while
the
2005
NIST
test
set
also
include
lots
of
China
news
text
.
The
results
illustrate
the
importance
of
selecting
suitable
training
data
.
When
combining
all
the
sub
corpora
together
,
the
baseline
system
gets
a
little
better
result
than
the
sub
systems
.
This
indicates
that
larger
data
is
useful
even
it
includes
some
noise
data
.
However
,
compared
with
the
FBIS
corpus
,
the
baseline
corpus
contains
three
times
larger
data
,
while
the
improvement
of
translation
result
is
not
significant
.
This
indicates
that
simply
putting
different
corpora
together
is
not
a
good
way
to
make
use
of
the
available
corpora
.
5
http
:
/
/
www.speech.sri.com
/
projects
/
srilm
/
6http
:
/
/
www.nist.gov
/
speech
/
tests
/
mt
/
resources
/
scoring.htm
4.3
Offline
data
optimization
experiments
We
use
baseline
corpus
as
initial
training
corpus
,
and
take
Lemur
toolkit
to
build
document
index
on
Chinese
part
of
the
corpus
.
The
Chinese
sentences
in
development
set
and
test
set
are
used
as
queries
.
For
each
query
,
N
=
100
,
200
,
500
,
1000
,
2000
similar
sentences
are
retrieved
from
the
indexed
collection
.
The
extracted
similar
sentence
pairs
are
used
to
train
the
new
adapted
translation
models
.
Table
3
illustrates
the
results
.
We
give
the
distinct
pair
numbers
for
each
adapted
set
and
compare
the
size
of
the
translation
models
.
To
illustrate
the
effect
of
duplicate
sentences
,
we
also
give
the
results
with
duplicates
and
without
duplicates
(
distinct
)
.
Size
of
trans
model
BLEU
on
duplicates
BLEU
on
distinct
Table
3
.
Offline
data
adaptation
results
The
results
show
that
:
By
using
similar
data
selection
,
it
is
possible
to
use
much
smaller
training
data
to
get
comparable
or
even
better
results
than
the
baseline
system
.
When
N
=
200
,
using
only
1
/
4
of
the
training
data
and
1
/
3
of
the
model
size
,
the
adapted
translation
model
achieves
comparable
result
with
the
baseline
model
.
When
N
=
500
,
the
adapted
model
outperforms
the
baseline
model
with
much
less
training
data
.
The
results
indicate
that
relevant
data
is
better
data
.
The
method
is
particular
useful
for
SMT
applications
on
small
device
.
In
general
,
using
duplicate
data
achieves
better
results
than
using
distinct
data
.
This
justifies
our
idea
that
give
a
higher
weight
to
more
similar
data
will
benefit
.
With
the
increase
of
training
data
size
,
the
translation
performance
tends
to
improve
also
.
However
,
when
the
size
of
corpus
achieves
a
certain
scale
,
the
performance
may
drop
.
This
maybe
because
that
with
the
increase
of
the
data
,
noisy
data
may
also
be
included
.
More
and
more
included
noises
may
destroy
the
data
.
It
is
necessary
to
use
a
development
set
to
determine
an
optimal
size
of
N.
We
combine
each
adapted
data
with
the
baseline
corpus
to
get
the
optimized
models
.
The
results
are
shown
in
Table
4
.
We
also
compare
the
adapted
models
(
TopN
)
and
the
optimized
models
(
TopN+
)
in
the
table
.
Without
using
any
additional
data
,
the
optimized
models
achieve
significant
better
results
than
the
baseline
model
by
redistributing
the
weight
of
training
sentences
.
The
optimized
models
also
outperform
adapted
models
when
the
size
of
the
adapted
data
is
small
since
they
make
use
of
all
the
available
data
which
decrease
the
influence
of
data
sparseness
.
However
,
with
the
increase
of
the
adapted
data
,
the
performance
of
optimized
models
is
similar
to
that
of
the
adapted
models
.
Distinct
pairs
BLEU
on
TopN
BLEU
on
TopN+
Table
4
.
Offline
data
optimization
results
4.4
Online
model
optimization
experiments
Since
2005
NIST
MT
test
data
tends
bias
to
FBIS
corpus
too
much
,
we
build
a
new
test
set
to
evaluate
the
online
model
optimization
method
.
We
randomly
select
500
sentences
from
extra
part
of
FBIS
,
HKHansards
and
HKNews
corpus
respectively
(
i.e
the
selected
1500
test
sentences
are
not
included
in
any
of
the
training
set
)
.
The
corresponding
English
part
is
used
as
translation
reference
.
Note
that
there
is
only
one
reference
for
each
test
sentence
.
We
also
include
top
500
sentence
and
their
first
reference
translation
of
2005
NIST
MT
test
data
in
the
new
test
set
.
So
in
total
,
the
new
test
contains
2000
test
sentences
with
one
translation
reference
for
each
sentence
.
The
test
set
is
used
to
simulate
SMT
system
's
online
inputs
which
may
come
from
various
domains
.
The
baseline
translation
results
are
shown
in
Table
5
.
We
also
give
results
on
each
sub
test
set
(
denotes
as
Xcorpus_part
)
.
Please
note
that
the
absolute
BLEU
scores
are
not
comparable
to
the
previous
experiments
since
there
is
only
one
reference
in
this
test
set
.
As
expected
,
using
the
same
domain
data
for
training
and
testing
achieves
the
best
results
as
indicate
by
bold
fonts
.
The
results
demonstrate
again
that
relevant
data
is
better
data
.
To
test
our
online
model
optimization
method
,
we
divide
the
baseline
corpus
according
to
the
origins
of
sub
corpus
.
That
is
,
the
FBIS
,
HK_
Hansards
and
HK_News
models
are
used
as
three
submodels
and
the
baseline
model
is
used
as
general
model
.
The
four
weighting
schemes
described
in
section
3.2
are
used
as
online
weighting
schemes
individually
.
The
experimental
results
are
shown
in
Table
6
.
S_i
indicates
the
system
using
weighting
scheme
i.
HK_
Hansards
Baseline
Table
5
.
Baseline
results
on
new
test
set
FBIS-part
HK_Hans_part
HK_News_part
Whole
test
set
Table
6
.
Online
model
optimization
results
Different
weighting
schemes
don
't
show
significant
improvements
from
each
other
.
However
,
all
the
four
weighting
schemes
achieve
better
results
than
the
baseline
system
.
The
improvements
are
shown
not
only
on
the
whole
test
set
but
also
on
each
part
of
the
sub
test
set
.
The
results
justify
the
effectiveness
of
our
online
model
optimization
method
.
5
Related
work
parallel
training
corpora
,
while
our
work
aims
to
make
better
use
of
existing
parallel
corpora
.
Some
research
has
been
conducted
on
parallel
data
selection
and
adaptation
.
Eck
et
al.
(
2005
)
propose
a
method
to
select
more
informative
sentences
based
on
n-gram
coverage
.
They
use
n-grams
to
estimate
the
importance
of
a
sentence
.
The
more
previously
unseen
n-grams
in
the
sentence
the
more
important
the
sentence
is
.
TF-IDF
weighting
scheme
is
also
tried
in
their
method
,
but
didn
't
show
improvements
over
n-grams
.
This
method
is
independent
of
test
data
.
Their
goal
is
to
decrease
the
amount
of
training
data
to
make
SMT
system
adaptable
to
small
devices
.
Similar
to
our
work
,
Hildebrand
et
al.
(
2005
)
also
use
information
retrieval
method
for
translation
model
adaptation
.
They
select
sentences
similar
to
the
test
set
from
available
in-of-domain
and
out-of-domain
training
data
to
form
an
adapted
translation
model
.
Different
from
their
work
,
our
method
further
use
the
small
adapted
data
to
optimize
the
distribution
of
the
whole
training
data
.
It
takes
the
full
advantage
of
larger
data
and
adapted
data
.
In
addition
,
we
also
propose
an
online
translation
model
optimization
method
,
which
make
it
possible
to
select
adapted
translation
model
for
each
individual
sentence
.
Since
large
scale
monolingual
corpora
are
easier
to
obtain
than
parallel
corpora
.
There
has
some
research
on
language
model
adaptation
recent
years
.
Zhao
et
al.
(
2004
)
and
Eck
et
al.
(
2004
)
introduce
information
retrieval
method
for
language
model
adaptation
.
Zhang
et
al.
(
2006
)
and
Mauser
et
al.
(
2006
)
use
adapted
language
model
for
SMT
re-ranking
.
Since
language
model
is
built
for
target
language
in
SMT
,
one
pass
translation
is
usually
needed
to
generate
n-best
translation
candidates
in
language
model
adaptation
.
Translation
model
adaptation
doesn
't
need
a
pre-translation
procedure
.
Comparatively
,
it
is
more
direct
.
Language
model
adaptation
and
translation
model
adaptation
are
good
complement
to
each
other
.
It
is
possible
that
combine
these
two
adaptation
approaches
could
further
improve
machine
translation
performance
.
6
Conclusion
and
future
work
This
paper
presents
two
new
methods
to
improve
statistical
machine
translation
performance
by
making
better
use
of
the
available
parallel
training
corpora
.
The
offline
data
selection
method
adapts
the
training
corpora
to
the
test
domain
by
retrieving
similar
sentence
pairs
and
redistributing
their
weight
in
the
training
data
.
Experimental
results
show
that
the
selected
small
subset
achieves
comparable
or
even
better
performance
than
the
baseline
system
with
much
less
training
data
.
The
optimized
training
data
can
further
improve
translation
performance
without
using
any
additional
resource
.
The
online
model
optimization
method
adapts
the
translation
model
to
the
online
test
sentence
by
redistributing
the
weight
of
each
predefined
submodels
.
Preliminary
results
show
the
effectiveness
of
the
method
.
Our
work
also
demonstrates
that
in
addition
to
larger
training
data
,
more
relevant
training
data
is
also
important
for
SMT
model
training
.
In
future
work
,
we
will
improve
our
methods
in
several
aspects
.
Currently
,
the
similar
sentence
retrieval
model
and
the
weighting
schemes
are
very
simple
.
It
might
work
better
by
trying
other
sophisticated
similarity
measure
models
or
using
some
optimization
algorithms
to
determine
submodel
's
weights
.
Introducing
language
model
optimization
into
our
system
might
further
improve
translation
performance
.
Acknowledgement
This
work
was
supported
by
National
Natural
Science
Foundation
of
China
,
Contract
No.
60603095
and
60573188
.
