1 Introduction
We propose a domain specific model for statistical machine translation. It is well known that domain specific language models perform well in automatic speech recognition. We show that domain specific language and translation models also benefit statistical machine translation. However, there are two problems with using domain specific models. The first is the data sparseness problem. We employ an adaptation technique to overcome this problem. The second issue is domain prediction. In order to perform adaptation, the domain must be provided; however, in many cases, the domain is not known or changes dynamically. For these cases, not only the translation target sentence but also the domain must be predicted. This paper focuses on the domain prediction problem for statistical machine translation. In the proposed method, a bilingual training corpus is automatically clustered into sub-corpora. Each sub-corpus is deemed to be a domain. The domain of a source sentence is predicted by using its similarity to the sub-corpora. The predicted domain (sub-corpus) specific language and translation models are then used for the translation decoding. This approach gave an improvement of 2.7 in BLEU (Papineni et al., 2002) score on the IWSLT05 Japanese to English evaluation corpus (improving the score from 52.4 to 55.1). This is a substantial gain and indicates the validity of the proposed bilingual cluster based models.
Statistical models, such as n-gram models, are widely used in natural language processing, for example in speech recognition and statistical machine translation (SMT). The performance of a statistical model has been shown to improve when domain specific models are used, since the statistical characteristics of the model and the target are then more similar. To utilize domain specific models, the training data sparseness and target domain estimation problems must be resolved.

In this paper, we try to estimate the target domain sentence by sentence, considering cases where the domain changes dynamically. After sentence-by-sentence domain estimation, domain specific models are used for translation via the adaptation technique (Seymore et al., 1997). In order to train a classifier to predict the domain, we used an unsupervised clustering technique on an unlabelled bilingual training corpus. We regarded each cluster (sub-corpus) as a domain.

Prior to translation, the domain of the source sentence is first predicted, and this prediction is then used for model selection. The sub-corpus most similar to the translation source sentence is used to represent its domain. After the prediction is made, domain specific language and translation models are used for the translation.

In Section 2 we present the formal basis for our domain specific translation method. In Section 3 we provide a general overview of the two sub-tasks of domain specific translation: domain prediction and domain specific decoding. Section 4 presents the domain prediction task in depth. Section 5 offers a more detailed description of domain specific decoding. Section 6 gives details of the experiments and presents the results.
Finally, Section 7 offers a summary and some concluding remarks.

Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 514-523, Prague, June 2007. © 2007 Association for Computational Linguistics
2 Domain Specific Models in SMT

The purpose of statistical machine translation is to find the most probable translation in the target language e of a given source language sentence f. This search process can be expressed formally by:
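Formula (1) does not survive in this copy; assuming the standard search over target sentences described in the following sentence, it is presumably:

```latex
\hat{e} = \operatorname*{argmax}_{e} P(e \mid f) \tag{1}
```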
In this formula, the target word sequence (sentence) e is determined only by the source language word sequence f. However, e is heavily dependent not only on f but also on the domain D. When the domain D is given, formula (1) can be rewritten as the following formula with the introduction of a new probabilistic variable D.
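The rewritten formula is missing from this copy; conditioning the search on the given domain D, it is presumably:

```latex
\hat{e} = \operatorname*{argmax}_{e} P(e \mid f, D) \tag{2}
```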
This formula can be re-expressed using Bayes' Law.
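The Bayes' Law expansion itself is not reproduced here; consistent with the translation and language models named in the next sentence, it is presumably:

```latex
\hat{e} = \operatorname*{argmax}_{e} P(f \mid e, D)\, P(e \mid D) \tag{3}
```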
Here, P(f|e, D) represents the domain D specific translation model and P(e|D) represents the domain D specific language model. When the domain D is known, domain specific models can be created and used in the translation decoding process.

However, in many cases, the domain D is unknown or changes dynamically. In these cases, both the translation target language sentence e and the domain D must be predicted at the same time. The following equation represents the process of domain specific translation when the domain D is being dynamically predicted.
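The equation is missing from this copy; given that the next paragraph identifies P(D|f) as the domain prediction term and P(e|f, D) as the domain specific translation term, it is presumably:

```latex
(\hat{e}, \hat{D}) = \operatorname*{argmax}_{e,\, D} P(D \mid f)\, P(e \mid f, D) \tag{4}
```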
The major difference between this equation and formula (3) is that the probabilistic variable D is also a prediction target in equation (4). In this equation, P(D|f) represents the domain prediction and P(e|f, D) represents the domain specific translation.
3 Outline of the Proposed Method

Our method can be analysed into two processes: an off-line process and an on-line process. The processes are depicted in Figure 1. In the off-line process, bilingual sub-corpora are created by clustering, and these clusters represent domains. Domain specific models are then created from the data contained in the sub-corpora in a batch process. In the on-line process, the domain of the source sentence is first predicted, and following this the sentence is translated using models built on data from the appropriate domain.

In the off-line process, the training corpus is clustered into sub-corpora, which are regarded as domains. In SMT, a bilingual corpus is used to create the translation model, and typically, bilingual data together with additional monolingual corpora are used to create the language model. In our method, both the bilingual and monolingual corpora are clustered. After clustering, cluster dependent (domain specific) language and translation models are created from the data in the clusters.
A bilingual corpus, comprising the training data for the translation model (or equivalently the bilingual part of the training data for the language model), is clustered (see Section 4.2). Each sentence of the additional monolingual corpora (if any) is assigned to a bilingual cluster (see Section 4.3). For each cluster, the domain specific (cluster dependent) language models are created. The domain specific translation model is created using only the clusters formed from clustering the bilingual data.

The on-line process comprises the domain prediction and domain specific translation components. The following steps are taken for each source sentence:

1. Select the cluster to which the source sentence belongs.
2. Translate the source sentence using the appropriate domain specific language and translation models.
4 Domain Prediction

This section details the domain prediction process. To satisfy equation (4), both the domain D and the translation target word sequence e that maximize both P(D|f) and P(e|f, D) must be calculated at the same time.
However, it is difficult to make the calculations without an approximation. Therefore, in the first step, we find the best candidates for D given the input sentence f. In the next step, P(e|f, D) is maximized over the candidates for D using the following formula.
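The formula is not reproduced here; from the two-step description above, it is presumably:

```latex
\hat{e} = \operatorname*{argmax}_{e} \max_{D} P(D \mid f)\, P(e \mid f, D) \tag{5}
```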
Equation (5) is an approximation of the following equation, in which D is regarded as a hidden variable.
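The hidden-variable form is missing from this copy; marginalizing over the domain D, it is presumably:

```latex
\hat{e} = \operatorname*{argmax}_{e} \sum_{D} P(D \mid f)\, P(e \mid f, D) \tag{6}
```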
When the following assumptions are introduced into equation (6), equation (5) is obtained as an approximation: for only one domain Di, P(Di|f) is nearly equal to one, and for the other domains, P(D|f) is almost zero. P(D|f) can be rewritten as the following equation.
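The rewritten form does not survive here; by Bayes' law it is presumably:

```latex
P(D \mid f) = \frac{P(f \mid D)\, P(D)}{P(f)}
```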
Therefore, we can confirm the reasonableness of this assumption by calculating P(f|D)P(D) for all domains (P(f) is constant).
4.1 Domain Definition

When the domain is known in advance, it is usually expressible; for example, it could be a topic that matches a human-defined category like "sport". On the other hand, when the domain is delimited in an unsupervised manner, it is used only as a probabilistic variable and does not need to be expressed.

Equation (4) illustrates that a good model will assign high probabilities to P(D|f)P(e|f, D) for bilingual sentence pairs (f, e). For the same reason, a good domain definition will lead to a higher probability for the term P(D|f)P(e|f, D).
Therefore, we define the domain D as that which maximizes P(D|f)P(e|D) (an approximation of P(D|f)P(e|f, D)). This approximation means that the domain definition is optimal for only the language model rather than for both the language and translation models.
P(D|f)P(e|D) can be rewritten as the following equation using Bayes' Law.
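The expanded form is not reproduced here; applying Bayes' law to the first factor, it is presumably:

```latex
P(D \mid f)\, P(e \mid D) = \frac{P(f \mid D)\, P(D)}{P(f)}\, P(e \mid D)
```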
Here, P(f) is independent of the domain D. Furthermore, we assume P(D) to be constant. The following formula embodies the search for the optimal domain.
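The search formula itself is missing here; dropping the constant factors P(f) and P(D), it is presumably:

```latex
\hat{D} = \operatorname*{argmax}_{D} P(f \mid D)\, P(e \mid D)
```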
This formula ensures that the search for the domain maximizes the domain specific probabilities of both e and f simultaneously.

4.2 Clustering of the bilingual corpus

As mentioned above, we maximize the domain specific probabilities of e and f to ascertain the domain. We define our domains as sub-corpora of the bilingual corpus, and these sub-corpora are formed by clustering bilingually by entropy reduction. For this clustering, the following extension of monolingual corpus clustering (Carter, 1994) is employed.
1. The total number of clusters (domains) is given by the user.
2. Each bilingual sentence pair is randomly assigned to a cluster.
3. For each cluster, language models for e and f are created using the bilingual sentence pairs that belong to the cluster.
4. For each cluster, the entropy for e and f is calculated by applying the language models from the previous step to the sentences in the cluster. The total entropy is defined as the sum of the entropies (for both source and target) over all clusters.
Figure 1: Outline of the Proposed Method (off-line process: the bilingual and monolingual corpora are clustered into bilingual clusters with source and target language models; on-line process: decoding produces the translation result)
5. Each bilingual sentence pair is re-assigned to a cluster such that the assignment minimizes the total entropy.
6. The process is repeated from step (3) until the entropy reduction is smaller than a given threshold.
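The clustering loop above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it pools the source and target sides into a single add-alpha smoothed unigram model per cluster, and greedily re-assigns each pair to the cluster under which its entropy is lowest (the names `cluster_bilingual` and `unigram_nll` are our own).

```python
import math
import random
from collections import Counter

def unigram_nll(words, counts, total, vocab, alpha=1.0):
    # Negative log-likelihood (entropy contribution) of a sentence under an
    # add-alpha smoothed unigram model.
    return -sum(math.log((counts[w] + alpha) / (total + alpha * vocab))
                for w in words)

def cluster_bilingual(pairs, k, iters=20, seed=0):
    # pairs: list of (f_words, e_words); returns one cluster id per pair.
    rng = random.Random(seed)
    assign = [rng.randrange(k) for _ in pairs]                    # step (2)
    vocab = len({w for f, e in pairs for w in f + e}) or 1
    for _ in range(iters):
        # step (3): per-cluster unigram counts over both language sides
        counts = [Counter() for _ in range(k)]
        for (f, e), c in zip(pairs, assign):
            counts[c].update(f)
            counts[c].update(e)
        totals = [sum(c.values()) for c in counts]
        # step (5): greedy re-assignment to the entropy-minimizing cluster
        changed = False
        for i, (f, e) in enumerate(pairs):
            best = min(range(k),
                       key=lambda c: unigram_nll(f + e, counts[c],
                                                 totals[c], vocab))
            if best != assign[i]:
                assign[i] = best
                changed = True
        if not changed:                                           # step (6)
            break
    return assign
```

A per-sentence greedy re-assignment only approximates minimizing the total entropy, and the paper keeps separate models for e and f; both simplifications are for brevity.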
4.3 Clustering the monolingual corpus

Any additional monolingual corpora used to train the language model are also clustered. For this clustering, the following process is used. First, bilingual clusters are created using the above process. For each monolingual sentence, its entropy is calculated using all of the bilingual cluster dependent language models and also the general language model (see Figure 1 for a description of the general language model).
If the entropy of the general language model is the lowest, the sentence is not used in the cluster dependent language models. Otherwise, the monolingual sentence is added to the bilingual cluster that results in the lowest entropy.
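A sketch of this assignment rule, assuming each language model is supplied as a function that returns a sentence's entropy (the name `assign_monolingual` and the callable interface are our own illustration):

```python
def assign_monolingual(sentences, cluster_entropies, general_entropy):
    # cluster_entropies: one function per bilingual cluster, mapping a
    # sentence to its entropy under that cluster's language model.
    # general_entropy: the same for the general language model.
    # Returns a cluster index per sentence, or None for discarded sentences.
    assignments = []
    for s in sentences:
        scores = [h(s) for h in cluster_entropies]
        best = min(range(len(scores)), key=scores.__getitem__)
        if general_entropy(s) <= scores[best]:
            assignments.append(None)  # general LM wins: sentence not used
        else:
            assignments.append(best)  # add to the lowest-entropy cluster
    return assignments
```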
4.4 Domain prediction

The previous sections described how the clusters are created, and we define our domains in terms of these clusters. In this step, the domain D is predicted from the given source sentence f. This prediction is equivalent to finding the D that maximizes P(D|f).
P(D|f) can be rewritten as P(f|D)P(D)/P(f) using Bayes' law. Here, P(f) is a constant, and if P(D) is assumed to be constant (this approximation is also used in the clustering of the bilingual corpus), maximizing the target reduces to the maximization of P(f|D).
To maximize P(f|D), we simply select the cluster D that gives the highest likelihood to the given source sentence f.
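This selection can be sketched directly, assuming smoothed per-cluster unigram probabilities stand in for the cluster source language models (the function name `predict_domain` and the dictionary interface are illustrative, not the paper's code):

```python
import math

def predict_domain(f_words, cluster_unigrams, floor=1e-9):
    # cluster_unigrams: one dict per cluster mapping word -> smoothed
    # probability. Returns the index of the cluster D maximizing log P(f|D),
    # i.e. the cluster whose model likes the source sentence best.
    def loglik(d):
        probs = cluster_unigrams[d]
        return sum(math.log(probs.get(w, floor)) for w in f_words)
    return max(range(len(cluster_unigrams)), key=loglik)
```

Selecting the highest-likelihood cluster is equivalent to selecting the cluster with the lowest perplexity for the sentence, which is how the prediction is described in the conclusion.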
5 Domain specific decoding

After domain prediction, domain specific decoding to maximize P(e|f, D) is conducted. P(e|f, D) can be rewritten as the following equation using Bayes' law.
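The rewritten equation does not survive in this copy; from the terms discussed below, it is presumably:

```latex
\hat{e} = \operatorname*{argmax}_{e} P(e \mid f, D)
        = \operatorname*{argmax}_{e} \frac{P(f \mid e, D)\, P(e \mid D)}{P(f \mid D)} \tag{10}
```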
Here, f is a given constant and D has already been selected by the domain prediction process. Therefore, maximizing P(f|e, D)P(e|D) is equivalent to maximizing the above equation. In P(f|e, D)P(e|D), P(f|e, D) is the domain specific translation model and P(e|D) is the domain specific language model. Equation (10) represents the whole process of translation of f into e using the domain D specific models P(f|e, D) and P(e|D).
5.1 Differences from previous methods

5.1.1 Cluster language model

Hasan et al. (2005) proposed a cluster language model for finding the domain D. This method has three steps. In the first step, the translation target language corpus is clustered using human-defined regular expressions. In the second step, a regular expression is created from the source sentence f. In the last step, the cluster that corresponds to the extracted regular expression is selected, and the cluster specific language model built from the data in this cluster is used for the translation.
The points of difference are:

• In the cluster language model, clusters are defined by human-defined regular expressions. On the other hand, with the proposed method, clusters are automatically (without human knowledge) defined and created by the entropy reduction based method.

• In the cluster language model, only the translation target language corpus is clustered. In the proposed method, both the translation source and target language corpora are clustered (bilingual clusters).

• In the cluster language model, only a domain (cluster) specific language model is used. In the proposed method, both a domain specific language model and a domain specific translation model are used.
5.1.2 Sentence mixture language model

When the general translation model P(f|e) is used instead of the domain specific translation model P(f|e, D), the resulting equation represents the process of translation using sentence mixture language models (Iyer et al., 1993) as follows:
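The equation is missing from this copy; combining the general translation model with a mixture of cluster language models weighted by constants λ_D, it is presumably:

```latex
\hat{e} = \operatorname*{argmax}_{e} P(f \mid e) \sum_{D} \lambda_{D}\, P(e \mid D)
```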
The points that differ from the proposed method are as follows:

• In the sentence mixture model, the mixture weight parameters λ_D are constant. On the other hand, in the proposed method, the weight parameters P(D|f) are estimated separately for each sentence.

• In the sentence mixture model, the probabilities of all cluster dependent language models are summed. In the proposed model, only the cluster that gives the highest probability is considered, as an approximation.

• In the proposed method, a domain specific translation model is also used.
6 Experiments

6.1 Japanese to English translation

To evaluate the proposed model, we conducted experiments based on a travel conversation task corpus. The experimental corpus was the travel arrangements task of the BTEC corpus (Takezawa et al., 2002; Kikui et al., 2003), and the language pair was Japanese and English. The training, development, and evaluation corpora are shown in Table 1. The development and evaluation corpora each had sixteen reference translations for each sentence. This training corpus was also used for the IWSLT06 Evaluation Campaign on Spoken Language Translation (Paul, 2006) J-E open track, and the evaluation corpus was used as the IWSLT05 evaluation set.
6.1.2 Experimental conditions

For bilingual corpus clustering, the sentence entropy must be calculated. Unigram language models were used for this calculation. The translation models were phrase-based (Zens et al., 2002), created using the GIZA++ toolkit (Och et al., 2003). The language models for the domain prediction and translation decoding were word trigram models with Good-Turing backoff (Katz, 1987).

Table 1: Japanese to English experimental corpus (Japanese and English training, development, and evaluation sets)
Ten cluster specific source language models and a general language model were used for the domain prediction. If the general language model provided the lowest perplexity for an input sentence, the domain specific models were not used for that sentence. The SRI language modeling toolkit (Stolcke, 2002) was used for the creation of all language models. The PHARAOH phrase-based decoder (Koehn, 2004) was used for the translation decoding.
For tuning of the decoder's parameters, including the language model weight, minimum error rate training (Och, 2003) with respect to the BLEU score was conducted using the development corpus. These parameters were used for the baseline conditions.
During translation decoding, the domain specific language model was used as an additional feature in the log-linear combination, via the PHARAOH decoder's option. That is, the general and domain specific language models were combined by log-linear rather than linear interpolation. The weight parameters for the general and domain specific language models were manually tuned using the development corpus. The sum of these language model weights was equal to the language model weight in the baseline. For the translation model, the general translation model (phrase table) and the domain specific translation model were linearly combined. The interpolation parameter was again manually tuned using the development corpus.
In our bilingual clustering, the number of clusters must be fixed in advance. Based on the results of preliminary experiments to estimate the model order, ten clusters were used. If fewer than ten clusters were used, domain specific characteristics could not be represented; if more than ten clusters were used, data sparseness problems became severe, especially in the translation models. The number of sentences in each cluster did not differ greatly; therefore, the approximation that P(D) is constant is reasonable.
Two samples of bilingual clusters are recorded in the appendix "Sample of Cluster". Cluster A.1 includes many interrogative sentences. The reason is that the special words at the end of Japanese interrogative sentences have no corresponding word in English. Cluster A.2 includes numeric expressions in both English and Japanese.
Next, we confirm the reasonableness of the assumption used in equation (5). For this confirmation, we calculated P(D|f) for all D for each f (P(D) is approximated as constant). For almost every f, only one domain Di had a very large value compared with the other domains. Therefore, this approximation is confirmed to be reasonable.
In these experiments, we compared three ways of deploying our domain specific models against a baseline. In the first method, only the domain specific language model was used. The ratio of the weight parameter for the general model to that for the domain specific model was 6:4 for all of the domain specific language models. In the second method, only the domain specific translation model was used. The ratio of the interpolation parameter of the general model to that of the domain specific model was 3:7 for all of the domain specific models. In the last method, both the domain specific language and translation models (LM+TM) were used. The weights and interpolation parameters were the same as in the first and second methods.
The experimental results are shown in Table 2. Under all of the conditions and for all of the evaluation measures, the proposed domain specific models gave better performance than the baseline. The highest performance came from the system that used both the domain specific language and translation models, resulting in a 2.7-point BLEU score gain over the baseline, which is a very respectable improvement. The appendix "Sample of Different Translation Results" records samples of translation results with and without the domain specific language and translation models. In many cases, better word order is obtained with the domain specific models.
6.2 Translation of ASR output

In this experiment, the source sentence used as input to the machine translation system was the direct textual output from an automatic speech recognition (ASR) decoder that was a component of a speech-to-speech translation system. The input to our system therefore contained the kinds of recognition errors and disfluencies typically found in ASR output. This experiment serves to determine the robustness of the domain prediction to real-world speech input. The speech recognition process in this experiment had a word accuracy of 88.4% and a sentence accuracy of 67.2%. The results shown in Table 3 clearly demonstrate that the proposed method is able to improve the translation performance, even when speech recognition errors are present in the input sentence.
6.3 Comparison with previous methods

In this section we compare the proposed method to other contemporary methods: the cluster language model (CLM) and the sentence mixture model (SMix). The experimental results for these methods were reported by RWTH Aachen University in IWSLT06 (Mauser et al., 2006). We evaluated our method using the same training and evaluation corpora. These corpora were used as the training and development corpora in the IWSLT06 Chinese to English open track; the details are given in Table 4. The English side of the training corpus was the same as that used in the earlier Japanese to English experiments reported in this paper. Each sentence in the evaluation corpus had seven reference translations.
Our baseline performance was slightly different from that reported in the RWTH experiments (a BLEU score of 21.9 for RWTH's system and 21.7 for our system). Therefore, the improvement over each system's own baseline is shown for comparison. The results are shown in Table 5. The improvements over the baseline achieved by our method, in terms of both the BLEU and NIST (Doddington, 2002) scores, were greater than those for both CLM and SMix. In particular, our method showed improvement in both the BLEU and NIST scores; this is in contrast to the CLM and SMix methods, which both degraded the translation performance in terms of the NIST score.
Table 5: Comparison results with previous methods
6.4 Clustering of the monolingual corpus

Finally, we evaluated the proposed method when an additional monolingual corpus was incorporated. For this experiment, we used the Chinese and English bilingual corpora that were used in the NIST MT06 evaluation (NIST, 2006). The size of the bilingual training corpus was 2.9M sentence pairs. For the language model training, an additional monolingual corpus of 1.5M English sentences was used. The NIST 2006 development set (the evaluation set for NIST 2005) was used for evaluation.
In this experiment, the test-set language model perplexity of a model built on only the monolingual corpus was considerably lower than that of a model built from only the target language sentences of the bilingual corpus. Therefore, we would expect the use of this monolingual corpus to be an important factor affecting the quality of the translation system. These perplexities were 299.9 for the model built on only the bilingual corpus, 200.1 for the model built on only the monolingual corpus, and 192.5 for the model built on a combination of the bilingual and monolingual corpora.
For the domain specific models, 50 clusters were created from the bilingual and monolingual corpora. In this experiment, only the domain specific language model was used. The experimental results are shown in Table 6. The results in the table show that the incorporation of the additional monolingual data has a pronounced beneficial effect on performance; the performance improved according to all of the evaluation measures.
Table 2: Japanese to English translation evaluation scores

Table 3: Evaluation using ASR output (conditions: Domain Specific LM, Domain Specific TM, Domain Specific LM+TM)
7 Conclusion

We have proposed a technique that utilizes domain specific models based on bilingual clustering for statistical machine translation. It is well known that domain specific modeling can result in better performance. However, in many cases, the target domain is not known or can change dynamically. In such cases, domain determination and domain specific translation must be performed simultaneously during the translation process.

In the proposed method, a bilingual corpus was clustered using an entropy reduction based method. The resulting bilingual clusters are regarded as domains. Domain specific language and translation models are created from the data within each bilingual cluster. When a source sentence is to be translated, its domain is first predicted. The domain prediction method selects the cluster that assigns the lowest language model perplexity to the given source sentence. Translation then proceeds using a language model and translation model that are specific to the domain predicted for the source sentence.

In our experiments we used a corpus from the travel domain (the subset of the BTEC corpus that was used in IWSLT06). Our experimental results clearly demonstrate the effectiveness of our method. In the Japanese to English translation experiments, the use of our proposed method improved the BLEU score by 2.7 points (from 52.4 to 55.1). We compared our approach to two previous methods, the cluster language model and the sentence mixture model. In our experiments the proposed method yielded higher scores than either of the competing methods in terms of both BLEU and NIST. Moreover, our method can also be augmented when an additional monolingual corpus is available for building the language model. Using this approach we were able to further improve translation performance on the data from the NIST MT06 evaluation task.
A Sample of Cluster

(shinshoku wa dore desu ka)
• E: are there any baseball games today (yakyu no shiai wa ari masu ka)
• E: where's the nearest perfumery (no kousui ten wa doko desu ka)
J: (choshoku wa ikura
Table 4: Training and evaluation corpora used for comparison with previous methods (# of sentences, total words, and vocabulary size for the English training, Chinese training, and Chinese evaluation sets)

Table 6: Experimental results with monolingual corpus (Baseline vs. Proposed)
• E: i'd like extension twenty four please (furaitonanba
• E: delta airlines flight one one two boarding is delayed
B Sample of Different Translation Results

Ref: where is a police station where japanese is understood
Base: japanese where 's the police station
LM: japanese where 's the police station
TM: where 's the police station where someone understands japanese
LM+TM: where 's the police station where someone understands japanese