This
paper
proposes
a
method
using
the
existing
Rule-based
Machine
Translation
(
RBMT
)
system
as
a
black
box
to
produce
synthetic
bilingual
corpus
,
which
will
be
used
as
training
data
for
the
Statistical
Machine
Translation
(
SMT
)
system
.
With
the
synthetic
bilingual
corpus
,
we
can
build
an
SMT
system
even
if
there
is
no
real
bilingual
corpus
.
In
our
experiments
using
BLEU
as
a
metric
,
the
system
achieves
a
relative
improvement
of
11.7
%
over
the
best
RBMT
system
that
is
used
to
produce
the
synthetic
bilingual
corpora
.
We
also
interpolate
the
model
trained
on
a
real
bilingual
corpus
and
the
models
trained
on
the
synthetic
bilingual
corpora
.
The
interpolated
model
achieves
an
absolute
improvement
of
0.0245
BLEU
score
(
13.1
%
relative
)
as
compared
with
the
individual
model
trained
on
the
real
bilingual
corpus
.
1
Introduction
Within
the
Machine
Translation
(
MT
)
field
,
by
far
the
most
dominant
paradigm
is
SMT
,
but
many
existing
commercial
systems
are
rule-based
.
In
this
research
,
we
are
interested
in
answering
the
question
of
whether
the
existing
RBMT
systems
could
be
helpful
to
the
development
of
an
SMT
system
.
To
find
the
answer
,
let
us
first
consider
the
following
facts
:
•
Existing
RBMT
systems
are
usually
provided
as
a
black
box
.
To
make
use
of
such
systems
,
the
most
convenient
way
might
be
working
on
the
translation
results
directly
.
•
SMT methods rely on bilingual corpora. As a data-driven method, SMT usually needs a large bilingual corpus as training data.
Based
on
the
above
facts
,
in
this
paper
we
propose
a
method
using
the
existing
RBMT
system
as
a
black
box
to
produce
a
synthetic
bilingual
corpus¹
,
which
will
be
used
as
the
training
data
for
the
SMT
system
.
For
a
given
language
pair
,
the
monolingual
corpus
is
usually
much
larger
than
the
real
bilingual
corpus
.
We
use
the
existing
RBMT
system
to
translate
the
monolingual
corpus
into
synthetic
bilingual
corpus
.
Then
,
even
if
there
is
no
real
bilingual
corpus
,
we
can
train
an
SMT
system
with
the
monolingual
corpus
and
the
synthetic
bilingual
corpus
.
If
there
exist
n
available
RBMT
systems
for
the
desired
language
pair
,
we
use
the
n
systems
to
produce
n
synthetic
bilingual
corpora
,
and
n
translation
models
are
trained
with
the
n
corpora
respectively
.
We
name
such
a
model
the
synthetic
model
.
An
interpolated
translation
model
is
built
by
linearly
interpolating
the
n
synthetic
models
.
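The pipeline described above (n black-box RBMT systems, each turning the same monolingual text into its own synthetic bilingual corpus) can be sketched as follows. This is only an illustration under stated assumptions: the `Translator` callable interface and the two toy systems are hypothetical stand-ins, since the paper does not describe how the commercial systems are invoked.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical black-box interface: a function from a source sentence to a
# target sentence. Real RBMT products expose their own interfaces.
Translator = Callable[[str], str]
SentencePair = Tuple[str, str]


def build_synthetic_corpora(
    sentences: Sequence[str], systems: Sequence[Translator]
) -> List[List[SentencePair]]:
    """Return one synthetic bilingual corpus per RBMT system."""
    corpora = []
    for translate in systems:
        # Pair every source sentence with that system's output.
        corpora.append([(src, translate(src)) for src in sentences])
    return corpora


# Toy stand-ins for two RBMT systems (assumptions, not real translators).
def toy_system_1(sentence: str) -> str:
    return "[sys1] " + sentence


def toy_system_2(sentence: str) -> str:
    return "[sys2] " + sentence


monolingual = ["this move helps", "after-sale service"]
corpora = build_synthetic_corpora(monolingual, [toy_system_1, toy_system_2])
```

Each element of `corpora` would then be used to train one synthetic model.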
In
our
experiments
using
BLEU
(
Papineni
et
al.
,
2002
)
as
the
metric
,
the
interpolated
synthetic
model
achieves
a
relative
improvement
of
11.7
%
over
the
best
RBMT
system
that
is
used
to
produce
the
synthetic
bilingual
corpora
.
¹ In this paper, to distinguish it from a real bilingual corpus, the bilingual corpus generated by an RBMT system is called a synthetic bilingual corpus.

Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 287-295, Prague, June 2007. © 2007 Association for Computational Linguistics
Moreover
,
if
a
real
bilingual
corpus
is
available
for
the
desired
language
pair
,
we
build
another
translation
model
,
which
is
named
the
standard
model
.
Then
we
can
build
an
interpolated
model
by
interpolating
the
standard
model
and
the
synthetic
models
.
Experimental
results
show
that
the
interpolated
model
achieves
an
absolute
improvement
of
0.0245
BLEU
score
(
13.1
%
relative
)
as
compared
with
the
standard
model
.
The
remainder
of
this
paper
is
organized
as
follows
.
In
section
2
we
summarize
the
related
work
.
We then describe our method of using RBMT systems to produce a bilingual corpus for SMT in section 3.
Section
4
describes
the
resources
used
in
the
experiments
.
Section
5
presents
the
experiment
result
,
followed
by
the
discussion
in
section
6
.
Finally
,
we
conclude
and
present
the
future
work
in
section
7
.
2
Related
Work
In
the
MT
field
,
by
far
the
most
dominant
paradigm
is
SMT
.
SMT
has
evolved
from
the
original
word-based
approach
(
Brown
et
al.
,
1993
)
into
phrase-based
approaches
(
Koehn
et
al.
,
2003
;
Och
and
Ney
,
2004
)
and
syntax-based
approaches
(
Wu
,
1997
;
Alshawi
et
al.
,
2000
;
Yamada
and
Knight
,
2001
;
Chiang
,
2005
)
.
On
the
other
hand
,
much
important
work
continues
to
be
carried
out
in
Example-Based
Machine
Translation
(
EBMT
)
(
Carl
et
al.
,
2005
;
Way
and
Gough
,
2005
)
,
and
many
existing
commercial
systems
are
rule-based
.
Although
we
are
not
aware
of
any
previous
attempt
to
use
an
existing
RBMT
system
as
a
black
box
to
produce
synthetic
bilingual
training
corpus
for
general
purpose
SMT
systems
,
there
exists
a
great
deal
of
work
on
MT
hybrids
and
Multi-Engine
Machine
Translation
(
MEMT
)
.
framework
with
phrase-based
SMT
for
spoken
language
translation
in
a
limited
domain
.
They
automatically
generated
a
corpus
of
English-Chinese
pairs
from
the
same
interlingual
representation
by
parsing
the
English
corpus
and
then
paraphrasing
each
utterance
into
both
English
and
Chinese
.
Frederking
and
Nirenburg
(
1994
)
produced
the
first
MEMT
system
by
combining
outputs
from
three
different
MT
engines
based
on
their
knowledge
of
the
inner
workings
of
the
engines
.
Nomoto
(
2004
)
used
voted
language
models
to
select
the
best
output
string
at
sentence
level
.
Some
recent
approaches
to
MEMT
used
word
alignment
techniques
for
comparison
between
the
MT
systems
(
Jayaraman
and
Lavie
,
2005
;
Zaanen
and
Somers
,
systems
operate
on
MT
outputs
for
complete
input
sentences
.
Mellebeek
et
al.
(
2006
)
presented
a
different
approach
,
using
a
recursive
decomposition
algorithm
that
produces
simple
chunks
as
input
to
the
MT
engines
.
A
consensus
translation
is
produced
by
combining
the
best
chunk
translation
.
This
paper
uses
RBMT
outputs
to
improve
the
performance
of
SMT
systems
.
Instead
of
RBMT
outputs
,
other
researchers
have
used
SMT
outputs
to
boost
translation
quality
.
Callison-Burch
and
Osborne
(
2003
)
used
co-training
to
extend
existing
parallel
corpora
,
wherein
machine
translations
are
selectively
added
to
training
corpora
with
multiple
source
texts
.
They
also
created
training
data
for
a
language
pair
without
a
parallel
corpus
by
using
multiple
source
texts
.
Ueffing
(
2006
)
explored
monolingual
source-language
data
to
improve
an
existing
machine
translation
system
via
self-training
.
The
source
data
is
translated
by
an
SMT
system
,
and
the
reliable
translations
are
automatically
identified
.
Both
of
the
methods
improved
translation
quality
.
In
this
paper
,
we
use
the
synthetic
and
real
bilingual
corpus
to
train
the
phrase-based
translation
models
.
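According to the translation model presented in (Koehn et al., 2003), given a source sentence $f$, the best target translation $e_{best}$ can be obtained using the following model:

$$e_{best} = \arg\max_{e} p(e \mid f) = \arg\max_{e} p(f \mid e)\, p_{LM}(e)\, \omega^{length(e)} \qquad (1)$$

where $p_{LM}(e)$ is the language model and $\omega^{length(e)}$ is the length factor of Koehn et al. (2003). The translation model $p(f \mid e)$ can be decomposed into

$$p(\bar{f}_1^I \mid \bar{e}_1^I) = \prod_{i=1}^{I} \phi(\bar{f}_i \mid \bar{e}_i)\, d(a_i - b_{i-1})\, p_w(\bar{f}_i \mid \bar{e}_i, a)^{\lambda} \qquad (2)$$

where $\phi(\bar{f}_i \mid \bar{e}_i)$ is the phrase translation probability. $a_i$ denotes the start position of the source phrase that was translated into the $i$th target phrase, and $b_{i-1}$ denotes the end position of the source phrase translated into the $(i-1)$th target phrase. $d(a_i - b_{i-1})$ is the distortion probability. $p_w(\bar{f}_i \mid \bar{e}_i, a)$ is the lexical weight, and $\lambda$ is the strength of the lexical weight.

3.2 Interpolated Models

We train synthetic models with the synthetic bilingual corpora produced by the RBMT systems. If a real bilingual corpus is available, we can also train a translation model on it, namely the standard model. In order to make full use of these two kinds of corpora, we conduct linear interpolation between the models. In this paper, the distortion probability in equation (2) is estimated during decoding, using the same method as described in Pharaoh (Koehn, 2004). For the phrase translation probability and the lexical weight, we interpolate them as shown in (3) and (4):

$$\phi(\bar{f} \mid \bar{e}) = \sum_{i=0}^{n} \alpha_i\, \phi_i(\bar{f} \mid \bar{e}) \qquad (3)$$

$$p_w(\bar{f} \mid \bar{e}, a) = \sum_{i=0}^{n} \beta_i\, p_{w,i}(\bar{f} \mid \bar{e}, a) \qquad (4)$$

where $\phi_0$ and $p_{w,0}$ denote the phrase translation probability and lexical weight trained with the real bilingual corpus, respectively, and $\phi_i$ and $p_{w,i}$ ($i = 1, \ldots, n$) are the phrase translation probability and lexical weight estimated from the $n$ synthetic corpora produced by the RBMT systems. $\alpha_i$ and $\beta_i$ are interpolation coefficients, ensuring $\sum_{i=0}^{n} \alpha_i = 1$ and $\sum_{i=0}^{n} \beta_i = 1$.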
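As a rough illustration of the linear interpolation of phrase translation probabilities, the sketch below merges several phrase tables with weights that sum to one. The dictionary representation, the function name `interpolate_tables`, and the toy probabilities are assumptions made for this example. Treating a phrase pair that is absent from a table as having probability zero in that table is also an assumption, as the paper does not spell out how missing pairs are handled.

```python
from typing import Dict, Sequence, Tuple

PhrasePair = Tuple[str, str]          # (source phrase, target phrase)
PhraseTable = Dict[PhrasePair, float]  # phrase pair -> probability


def interpolate_tables(
    tables: Sequence[PhraseTable], weights: Sequence[float]
) -> PhraseTable:
    """Weighted sum of phrase tables; weights must sum to one."""
    if len(tables) != len(weights):
        raise ValueError("need exactly one weight per table")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    merged: PhraseTable = {}
    for table, weight in zip(tables, weights):
        for pair, prob in table.items():
            # A pair missing from a table contributes nothing (assumed 0).
            merged[pair] = merged.get(pair, 0.0) + weight * prob
    return merged


# Toy tables with invented probabilities (pinyin used for readability).
standard = {("brand name", "pinpai"): 0.6, ("after-sale", "shouhou"): 0.4}
synthetic = {
    ("brand name", "pinpai"): 0.2,
    ("after-sale service", "shouhoufuwu"): 0.8,
}
interpolated = interpolate_tables([standard, synthetic], [0.5, 0.5])
```

The same function could be applied to the lexical weights with a second set of coefficients. Note that the interpolated table contains the union of the phrase pairs of its inputs.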
4
Resources
Used
in
Experiments
In
the
experiments
,
we
take
English-Chinese
translation
as
a
case
study
.
The
real
bilingual
corpus
includes
494,149
English-Chinese
bilingual
sentence
pairs
.
The
monolingual
English
corpus
is
selected
from
the
English
Gigaword
Second
Edition
,
which
is
provided
by
Linguistic
Data
Consortium
(
LDC
)
(
catalog
number
LDC2005T12
)
.
The
selected
monolingual
corpus
includes
1,087,651
sentences
.
For
language
model
training
,
we
use
part
of
the
Chinese
Gigaword
Second
Edition
provided
by
LDC
(
catalog
number
LDC2005T14
)
.
We
use
41,418
documents
selected
from
the
ZaoBao
Newspaper
and
992,261
documents
from
the
XinHua
News
Agency
to
train
the
Chinese
language
model
,
amounting
to
5,398,616
sentences
.
The test set and the development set are from the HTRDP² evaluation of machine translation. The data can be obtained from the Chinese Linguistic Data Consortium (catalog number 2005-863-001).
We
use
the
same
494
sentences
in
the
test
set
and
278
sentences
in
the
development
set
.
Each
source
sentence
in
the
test
set
and
the
development
set
has
4
different
references
.
In
this
paper
,
we
use
two
off-the-shelf
commercial
English
to
Chinese
RBMT
systems
to
produce
the
synthetic
bilingual
corpus
.
We
also
need
a
trainer
and
a
decoder
to
perform
phrase-based
SMT
.
We use Koehn's training scripts³ to train the translation model, and the SRILM toolkit (Stolcke, 2002) to train the language model.
For
the
decoder
,
we
use
Pharaoh
(
Koehn
,
2004
)
.
We
run
the
decoder
with
its
default
settings
(
maximum
phrase
length
7
)
and
then
use
Koehn
's
implementation
of
minimum
error
rate
training
(
Och
,
2003
)
to
tune
the
feature
weights
on
the
development set.

² The full name of HTRDP is the National High Technology Research and Development Program of China, also known as the 863 Program.

³ It is located at http://www.statmt.org/wmt06/shared-task/baseline.html.
The
translation
quality
is
evaluated
using
a
well-established
automatic
measure
:
BLEU
score
(
Papineni
et
al.
,
2002
)
.
We
use
the
same
method
described
in
(
Koehn
and
Monz
,
2006
)
to
perform
the
significance
test
.
5
Experimental
Results
5.1
Results
on
Synthetic
Corpus
Only
With
the
monolingual
English
corpus
and
the
English
side
of
the
real
bilingual
corpus
,
we
translate
them
into
Chinese
using
the
two
commercial
RBMT
systems
and
produce
two
synthetic
bilingual
corpora
.
With
the
corpora
,
we
train
two
synthetic
models
as
described
in
section
3.1
.
Based
on
the
synthetic
models
,
we
also
perform
linear
interpolation
as
shown
in
section
3.2
,
without
the
standard
models
.
We
tune
the
interpolation
weights
using
the
development
set
,
and
achieve
the
best
performance
when
$\alpha_1 = 0.58$, $\alpha_2 = 0.42$, $\beta_1 = 0.58$, and $\beta_2 = 0.42$
.
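The paper does not say how the interpolation weights are searched on the development set. One simple possibility, shown here purely as an assumption, is a grid search over the weight of the first model, with the second weight fixed by the sum-to-one constraint. The `toy_dev_score` function is an artificial stand-in for "decode the development set with this weight and compute BLEU"; its peak position is arbitrary.

```python
from typing import Callable, Tuple


def tune_two_way_weight(
    score: Callable[[float], float], step: float = 0.01
) -> Tuple[float, float]:
    """Grid-search the first weight in [0, 1]; the second is 1 - weight."""
    n_steps = round(1.0 / step)
    best_weight, best_score = 0.0, float("-inf")
    for k in range(n_steps + 1):
        weight = k / n_steps
        current = score(weight)
        if current > best_score:
            best_weight, best_score = weight, current
    return best_weight, best_score


def toy_dev_score(weight: float) -> float:
    # Artificial score with an arbitrary peak at 0.25; a real setup would
    # decode the development set and return its BLEU score instead.
    return -(weight - 0.25) ** 2


best_weight, best_score = tune_two_way_weight(toy_dev_score)
second_weight = 1.0 - best_weight
```

With more than two models the same idea extends to a grid over the weight simplex, at a cost that grows quickly with the number of models.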
The
translation
results
on
the
test
set
are
shown
in
Table
1
.
Synthetic
model
1
and
2
are
trained
using
the
synthetic
bilingual
corpora
produced
by
RBMT
system
1
and
RBMT
system
2
,
respectively
.
[Table 1. Translation Results Using Synthetic Bilingual Corpus. Recovered row labels: RBMT system 1, RBMT system 2, Interpolated Synthetic Model.]
From
the
results
,
it
can
be
seen
that
the
interpolated
synthetic
model
obtains
the
best
result
,
with
an
absolute
improvement
of
0.0197
BLEU
(
11.7
%
relative
)
as
compared
with
RBMT
system
1
,
and
0.0425
BLEU
(
29.2
%
relative
)
as
compared
with
RBMT
system
2
.
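The absolute and relative figures quoted above can be cross-checked with a little arithmetic: dividing an absolute gain by the corresponding relative gain gives the baseline score it implies. The baselines computed below are derived values, not scores quoted from the paper, and the helper name `implied_baseline` is introduced only for this sketch.

```python
def implied_baseline(absolute_gain: float, relative_gain_percent: float) -> float:
    """Baseline score implied by an absolute gain and a relative gain (%)."""
    return absolute_gain / (relative_gain_percent / 100.0)


# Gains of the interpolated synthetic model over each RBMT system.
baseline_rbmt_1 = implied_baseline(0.0197, 11.7)
baseline_rbmt_2 = implied_baseline(0.0425, 29.2)

# Both comparisons should point to the same interpolated-model score.
interpolated_via_1 = baseline_rbmt_1 + 0.0197
interpolated_via_2 = baseline_rbmt_2 + 0.0425
```

Both routes give an interpolated-model BLEU of roughly 0.188, so the two reported comparisons are mutually consistent up to rounding.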
It
is
very
promising
that
our
method
can
build
an
SMT
system
that
significantly
outperforms
both
of
the
two
RBMT
systems
,
using
the
synthetic
bilingual
corpus
produced
by
two
RBMT
systems
.
5.2
Results
on
Real
and
Synthetic
Corpus
With
the
real
bilingual
corpus
,
we
build
a
standard
model
.
We
interpolate
the
standard
model
with
the
two
synthetic
models
built
in
section
5.1
to
obtain
interpolated
models
.
The
translation
results
are
shown
in
Table
2
.
The interpolation coefficients apply to both the phrase table probabilities and the lexical weights.
They
are
also
tuned
using
the
development
set
.
From
the
results
,
it
can
be
seen
that
all
the
three
interpolated
models
perform
not
only
better
than
the
RBMT
systems
but
also
better
than
the
SMT
system
trained
on
the
real
bilingual
corpus
.
The
interpolated
model
combining
the
standard
model
and
the
two
synthetic
models
performs
the
best
,
achieving
a
statistically
significant
improvement
of
about
0.0245
BLEU
(
13.1
%
relative
)
as
compared
with
the
standard
model
with
no
synthetic
corpus
.
It
also
achieves
26.1
%
and
45.8
%
relative
improvement
as
compared
with
the
two
RBMT
systems
respectively
.
The
results
indicate
that
using
the
corpus
produced
by
RBMT
systems
,
the
performance
of
the
SMT
system
can
be
greatly
improved
.
The results also indicate that the more RBMT systems are used, the better the translation quality is.
Table
2
.
Translation
Results
Using
Standard
and
Synthetic
Bilingual
Corpus
5.3
Effect
of
Synthetic
Corpus
Size
To
explore
the
relationship
between
the
translation
quality
and
the
scale
of
the
synthetic
bilingual
corpus
,
we
interpolate
the
standard
model
with
the
synthetic
models
trained
with
synthetic
bilingual
corpus
of
different
sizes
.
In
order
to
simplify
the
procedure
,
we
only
use
RBMT
system
1
to
translate
the
1,087,651
monolingual
English
sentences
to
produce
the
synthetic
bilingual
corpus
.
100
%
of
the
synthetic
bilingual
corpus
to
train
different
synthetic
models
.
The
translation
results
of
the
interpolated
models
are
shown
in
Figure
1
.
The results indicate that the larger the synthetic bilingual corpus is, the better the translation performance is.
[Figure 1. Comparison of Translation Results Using Synthetic Bilingual Corpus of Different Sizes. Recovered axis label: Real Bilingual Corpus (%).]
Another
issue
is
the
relationship
between
the
SMT
performance
and
the
size
of
the
real
bilingual
corpus
.
To
train
different
standard
models
,
we
randomly
build
five
corpora
of
different
sizes
,
which
contain
20
%
,
40
%
,
60
%
,
80
%
,
and
100
%
sentence
pairs
of
the
real
bilingual
corpus
,
respectively
.
As
to
the
synthetic
model
,
we
use
the
same
synthetic
model
1
that
is
described
in
section
5.1
.
Then
we
build
five
interpolated
models
by
performing
linear
interpolation
between
the
synthetic
model
and
the
five
standard
models
respectively
.
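Building the five corpora of different sizes might look like the sketch below. It shuffles the sentence pairs once and takes prefixes, so the subsets are nested; that nesting is an assumption of this example, since the paper only says the corpora are built randomly. The function name, the seed, and the toy data are likewise illustrative.

```python
import random
from typing import Dict, List, Sequence, Tuple

SentencePair = Tuple[str, str]


def build_subcorpora(
    pairs: Sequence[SentencePair],
    fractions: Sequence[float] = (0.2, 0.4, 0.6, 0.8, 1.0),
    seed: int = 0,
) -> Dict[float, List[SentencePair]]:
    """Randomly sample nested subsets covering the given fractions."""
    shuffled = list(pairs)
    # A fixed seed keeps the sampling reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    return {
        fraction: shuffled[: round(len(shuffled) * fraction)]
        for fraction in fractions
    }


# Toy corpus of ten sentence pairs (placeholders, not real data).
toy_pairs = [(f"en {i}", f"zh {i}") for i in range(10)]
subcorpora = build_subcorpora(toy_pairs)
sizes = [len(subcorpora[f]) for f in (0.2, 0.4, 0.6, 0.8, 1.0)]
```

Each subset would then be used to train one standard model, which is interpolated with the fixed synthetic model.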
The
translation
results
are
shown
in
Figure
2
.
From the results, we can see that the larger the real bilingual corpus is, the better both the standard models and the interpolated models perform.
The
relative
improvement
of
BLEU
scores
is
up
to
27.5
%
as
compared
with
the
corresponding
standard
models
.
5.5
Results
without
Additional
Monolingual
Corpus
In
all
the
above
experiments
,
we
use
an
additional
English
monolingual
corpus
to
get
more
synthetic
bilingual
corpus
.
We
are
also
interested
in
the
results
without
the
additional
monolingual
corpus
.
In this case, the only English monolingual corpus is the English side of the real bilingual corpus. We use this smaller monolingual corpus and the real bilingual corpus to conduct experiments similar to those in section 5.2.
The
translation
results
are
shown
in
Table
3
.
From
the
results
,
it
can
be
seen
that
our
method
works
well
even
if
no
additional
monolingual
corpus
is
available
.
We achieve a statistically significant improvement of about 0.01 BLEU (5.2% relative) as compared with the standard model without using the synthetic corpus.

[Figure 2. Comparison of Translation Results Using Real Bilingual Corpus of Different Sizes]

[Table 3. Translation Results without Additional Monolingual Corpus. Recovered header fragments: Interpolation Coefficients, Standard.]

[Table 4. Numbers of Phrase Pairs. Recovered label fragments: Synthetic Model 1, Synthetic Model 2, Synthetic.]
In
order
to
further
analyze
the
translation
results
,
we
examine
the
overlap
and
the
difference
among
the
phrase
tables
.
The
analytic
results
are
shown
in
Table
4
.
The synthetic models extract more phrase pairs than the standard model; synthetic model 1 in particular extracts about twice as many. The overlap between the models is very low. For example, only about 6% of the phrase pairs extracted by the standard model also appear in synthetic model 1. This also explains why the interpolated model outperforms the standard model in Table 3.
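The overlap statistic discussed above can be computed with plain set operations. The sketch below assumes each phrase table is reduced to a set of (source phrase, target phrase) pairs; the toy pairs and the helper name `overlap_ratio` are invented for illustration and do not reproduce the counts in Table 4.

```python
from typing import Set, Tuple

PhrasePair = Tuple[str, str]


def overlap_ratio(reference: Set[PhrasePair], other: Set[PhrasePair]) -> float:
    """Fraction of the reference phrase pairs that also occur in `other`."""
    if not reference:
        return 0.0
    return len(reference & other) / len(reference)


# Toy phrase-pair sets (pinyin used for readability).
standard_pairs = {
    ("brand name", "pinpai"),
    ("after-sale", "shouhou"),
    ("and the creation of a", "he jianli"),
    ("technical innovation", "jishu chuangxin"),
}
synthetic_pairs = {
    ("brand name", "pinpai"),
    ("after-sale service", "shouhoufuwu"),
    ("and the creation of", "he chuangzao"),
}
shared = overlap_ratio(standard_pairs, synthetic_pairs)
```

A low ratio means the synthetic model contributes mostly phrase pairs that the standard model has never seen, which is the situation the paragraph above describes.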
[Table 5. Translation Example. Columns: Methods; English Sentence / Chinese Translations; BLEU. Recovered row labels: Standard model, RBMT System 1, RBMT System 2. English sentence: "This move helps spur the enterprise to strengthen technical innovation, management innovation and the creation of a brand name and to strengthen marketing, after-sale service, thereby fundamentally enhance the enterprise's competitiveness;"]

[Figure 3. Phrase Pairs Used for Translation. Panels: (a) Results Produced by the Standard Model; (b) Results Produced by the Interpolated Model. Recovered pinyin labels: shouhou, he chuangzao, shouhoufuwu.]
6
Discussion
6.1
Model
Interpolation
vs.
Corpus
Merge
In
section
5
,
we
make
use
of
the
real
bilingual
corpus
and
the
synthetic
bilingual
corpora
by
performing
model
interpolation
.
Another
available
way
is
directly
combining
these
two
kinds
of
corpora
to
train
a
translation
model
,
namely
corpus
merge
.
In
order
to
compare
these
two
methods
,
we
use
RBMT
system
1
to
translate
the
1,087,651
monolingual
English
sentences
to
produce
synthetic
bilingual
corpus
.
Then
we
train
an
SMT
system
with
the
combination
of
this
synthetic
bilingual
corpus
and
the
real
bilingual
corpus
.
The
BLEU
score
of
such
system
is
0.1887
,
while
that
of
the
model
interpolation
system
is
0.2020
.
This
indicates
that
the
model
interpolation
method
is
significantly
better
than
the
corpus
merge
method
.
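To make the difference between the two strategies concrete, the toy sketch below estimates one phrase translation probability both ways. It uses relative frequencies conditioned on the source phrase as a simplification, and all counts, the 0.5/0.5 weights, and the pinyin target phrases are invented. The paper itself does not give this explanation, so read it as one possible intuition: when counts are merged, the larger corpus dominates the estimate, whereas interpolation lets each model's weight be set independently of corpus size.

```python
from collections import Counter
from typing import Tuple

PhrasePair = Tuple[str, str]


def relative_freq(counts: Counter, source: str, target: str) -> float:
    """Relative frequency of `target` among all translations of `source`."""
    total = sum(c for (src, _), c in counts.items() if src == source)
    return counts[(source, target)] / total if total else 0.0


# Invented phrase-pair counts: a small real corpus and a larger synthetic one.
real = Counter({("brand name", "pinpai"): 9, ("brand name", "mingpai"): 1})
synthetic = Counter({("brand name", "pinpai"): 20, ("brand name", "mingpai"): 80})

# Corpus merge: add the counts first, then estimate one distribution.
merged = real + synthetic
p_merge = relative_freq(merged, "brand name", "pinpai")

# Model interpolation: estimate per corpus, then mix with chosen weights.
p_interp = 0.5 * relative_freq(real, "brand name", "pinpai") + 0.5 * relative_freq(
    synthetic, "brand name", "pinpai"
)
```

Here the merged estimate is 29/110, pulled toward the synthetic corpus, while the interpolated estimate is 0.55.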
As
discussed
in
Section
5.5
,
the
number
of
the
overlapped
phrase
pairs
among
the
standard
model
and
the
synthetic
models
is
very
small
.
The newly added phrase pairs from the synthetic models can help improve the translation results of the interpolated model.
In
this
section
,
we
will
use
an
example
to
further
discuss
the
reason
behind
the
improvement
of
the
SMT
system
by
using
synthetic
bilingual
corpus
.
Table
5
shows
an
English
sentence
and
its
Chinese
translations
produced
by
different
methods
.
Figure 3 shows the phrase pairs used for translation. The results show that even imperfect translations from RBMT systems can be used to boost the performance of an SMT system.
[Table 6. Statistics of Phrase Pairs. Recovered labels: Phrase Pairs, New Pairs, Standard Model, Interpolated Model.]
Further
analysis
is
shown
in
Table
6
.
After adding the synthetic corpus produced by the RBMT systems, the interpolated model outperforms the standard model mainly for the following two reasons: (1) Some new phrase pairs are added into the interpolated model. 37.6% of the phrase pairs (1993 out of 5306) are newly learned and used for translation. For example, the phrase pair "after-sale service ↔ 售后服务 (shouhoufuwu)" is added. (2) The probability distribution of the phrase pairs is changed. For example, the probabilities of the two pairs "a brand name ↔ 品牌 (pinpai)" and "and the creation of ↔ 和创造 (he chuangzao)" increase. The probabilities of the other two pairs "brand name ↔ 品牌 (pinpai)" and "and the creation of a ↔ 和建立 (he jianli)" decrease.
We
found
that
930
phrase
pairs
,
which
are
also
in
the
phrase
table
of
the
standard
model
,
are
used
by
the
interpolated
model
for
translation
but
not
used
by
the
standard
model
.
According
to
(
Koehn
and
Monz
,
2006
;
Callison-Burch
et
al.
,
2006
)
,
the
RBMT
systems
are
usually
not
adequately
appreciated
by
BLEU
.
We
also
manually
evaluated
the
RBMT
systems
and
SMT
systems
in
terms
of
both
adequacy
and
fluency
as
defined
in
(
Koehn
and
Monz
,
2006
)
.
The
evaluation
results
show
that
the
SMT
system
with
the
interpolated
model
,
which
achieves
the
highest
BLEU
scores
in
Table
2
,
achieves
slightly
better
adequacy
and
fluency
scores
than
the
two
RBMT
systems
.
7
Conclusion
and
Future
Work
We
presented
a
method
using
the
existing
RBMT
system
as
a
black
box
to
produce
synthetic
bilingual
corpus
,
which
was
used
as
training
data
for
the
SMT
system
.
We
used
the
existing
RBMT
system
to
translate
the
monolingual
corpus
into
a
synthetic
bilingual
corpus
.
With
the
synthetic
bilingual
corpus
,
we
could
build
an
SMT
system
even
if
there
is
no
real
bilingual
corpus
.
In
our
experiments
using
BLEU
as
the
metric
,
such
a
system
achieves
a
relative
improvement
of
11.7
%
over
the
best
RBMT
system
that
is
used
to
produce
the
synthetic
bilingual
corpora
.
It
indicates
that
using
the
existing
RBMT
systems
to
produce
a
synthetic
bilingual
corpus
,
we
can
build
an
SMT
system
that
outperforms
the
existing
RBMT
systems
.
We also interpolated the model trained on a real bilingual corpus and the models trained on the synthetic bilingual corpora. The interpolated model achieves an absolute improvement of 0.0245 BLEU score (13.1% relative) as compared with the individual model trained on the real bilingual corpus.
It
indicates
that
we
can
build
a
better
SMT
system
by
leveraging
the
real
and
the
synthetic
bilingual
corpus
.
Further
result
analysis
shows
that
after
adding
the
synthetic
corpus
produced
by
the
RBMT
systems
,
the
interpolated
model
outperforms
the
standard
models
mainly
because
of
two
reasons
:
(
1
)
some
new
phrase
pairs
are
added
to
the
interpolated
model
;
(
2
)
the
probability
distribution
of
the
phrase
pairs
is
changed
.
In future work, we will investigate the possibility of training a reverse SMT system with the RBMT systems. For example, we will investigate training a Chinese-to-English SMT system based on natural English and RBMT-generated synthetic Chinese.
