This paper proposes a method for learning and extracting word sequence correspondences from non-aligned parallel corpora with Support Vector Machines, which have a high generalization ability, rarely over-fit the training samples, and can learn dependencies between features by using a kernel function. Our method uses features for the translation model based on a translation dictionary, the number of words, part-of-speech, constituent words, and neighbor words. Experimental results using Japanese and English parallel corpora achieved an 81.1% precision rate and a 69.0% recall rate for the extracted word sequence correspondences.
1 Introduction
Translation dictionaries used in multilingual natural language processing, such as machine translation, have been constructed manually, but this work requires a great deal of labor and it is difficult to keep the descriptions in the dictionaries consistent. Therefore, research on automatically extracting translation pairs from parallel corpora has recently become active (Gale and Church, 1991; Kaji and Aizono, 1996; Tanaka and Iwasaki, 1996; Kitamura and Matsumoto, 1996; Fung, 1997; Melamed, 1997; Sato and Nakanishi, 1998).
This paper proposes a method for learning and extracting bilingual word sequence correspondences from non-aligned parallel corpora with Support Vector Machines (SVMs) (Vapnik, 1999). SVMs are large margin classifiers (Smola et al., 2000) based on the strategy of maximizing the margin between the separating boundary and the vectors whose elements express the features of the training samples. Therefore, SVMs have a higher generalization ability than other learning models such as decision trees, and rarely over-fit the training samples.
In addition, by using kernel functions, they can learn a non-linear separating boundary and dependencies between the features. Consequently, SVMs have recently been used for natural language processing tasks such as text categorization (Joachims, 1998; Taira and Haruno, 1999), chunk identification (Kudo and Matsumoto, 2000b), and dependency structure analysis (Kudo and Matsumoto, 2000a).
The method proposed in this paper does not require aligned parallel corpora, of which few exist at present. Therefore, word sequence correspondences can be extracted without limiting the applicable domains.
2 Support Vector Machines
SVMs are binary classifiers which linearly separate $d$-dimensional vectors into two classes. Each vector represents a sample which has $d$ features. Whether a given sample $X = (x_1, x_2, \ldots, x_d)$ belongs to $X_1$ or $X_2$ is determined by equation (1): $X \in X_1$ if $g(X) \ge 0$ and $X \in X_2$ otherwise, where $g(X) = W \cdot X + b$ is the hyperplane which separates the two classes, and $W$ and $b$ are decided by optimization.
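As a minimal sketch, the decision rule of equation (1) can be written directly; the hyperplane parameters below are toy values, not learned ones.

```python
# Sketch of the decision rule in equation (1): a sample X is assigned to
# class X1 when g(X) = W . X + b is non-negative, and to X2 otherwise.
# W and b are assumed to have been obtained by optimization already.

def g(W, X, b):
    """Hyperplane value W . X + b for a d-dimensional sample X."""
    return sum(w_i * x_i for w_i, x_i in zip(W, X)) + b

def classify(W, X, b):
    """Return +1 (class X1) or -1 (class X2)."""
    return 1 if g(W, X, b) >= 0 else -1

W, b = [1.0, -1.0], 0.5              # toy hyperplane x1 - x2 + 0.5 = 0
print(classify(W, [2.0, 1.0], b))    # 2 - 1 + 0.5 > 0 -> 1
print(classify(W, [0.0, 2.0], b))    # 0 - 2 + 0.5 < 0 -> -1
```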
Let the supervision signals for the training samples be expressed as $y_i = +1$ if $X_i \in X_1$ and $y_i = -1$ if $X_i \in X_2$ (2), where $X_1$ is the set of positive samples and $X_2$ is the set of negative samples.
If the training samples can be separated linearly, there can exist two or more pairs of $W$ and $b$ that satisfy equation (1). Therefore, the following constraints are given: $y_i (W \cdot X_i + b) \ge 1$ (3).

Figure 1: A separating hyperplane
Figure 1 shows the hyperplane which separates the samples. In this figure, the solid line shows the separating hyperplane $W \cdot X + b = 0$ and the two dotted lines show the hyperplanes expressed by $W \cdot X + b = \pm 1$. The constraints (3) mean that no vectors may exist inside the two dotted lines.
The vectors on the dotted lines are called support vectors, and the distance between the dotted lines is called the margin, which equals $2 / \|W\|$.
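The relation between the weight vector and the margin can be checked numerically; the weight vector below is an invented example.

```python
import math

def margin(W):
    """Margin between the hyperplanes W . X + b = +1 and -1: 2 / ||W||."""
    return 2.0 / math.sqrt(sum(w * w for w in W))

# For W = (3, 4), ||W|| = 5, so the margin is 2/5 = 0.4.
print(margin([3.0, 4.0]))  # 0.4
```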
The learning algorithm for SVMs optimizes $W$ and $b$ so as to maximize the margin $2 / \|W\|$ or, equivalently, to minimize $\|W\|^2 / 2$ subject to the constraints (3).
According to Lagrange's theory, the optimization problem is transformed into minimizing the Lagrangian $L = \|W\|^2 / 2 - \sum_{i=1}^{n} \lambda_i \{ y_i (W \cdot X_i + b) - 1 \}$ (4) with respect to $W$ and $b$. Setting the derivative of $L$ with respect to $W$ to zero yields $W = \sum_{i=1}^{n} \lambda_i y_i X_i$ (5). Consequently, the optimization problem is transformed into maximizing the objective function $D(\lambda) = \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i,j} \lambda_i \lambda_j y_i y_j (X_i \cdot X_j)$ (6) subject to $\sum_{i=1}^{n} \lambda_i y_i = 0$ and $\lambda_i \ge 0$.
For the optimal parameters $\lambda^* = \arg\max_{\lambda} D$, each training sample $X_i$ with $\lambda_i^* > 0$ corresponds to a support vector. $W$ can be obtained from equation (5), and $b$ can be obtained from $b = y_i - W \cdot X_i$ (7), where $X_i$ is an arbitrary support vector.
From equations (2) and (5), the optimal hyperplane can be expressed with the optimal parameters as the following equation: $f(X) = \mathrm{sign}(\sum_{i=1}^{n} \lambda_i^* y_i (X_i \cdot X) + b^*)$ (8). The training samples can be allowed, to some degree, to enter the inside of the margin by changing equation (3) to $y_i (W \cdot X_i + b) \ge 1 - \xi_i$ (9), where $\xi_i \ge 0$ are called slack variables.
At this time, the maximal margin problem is extended to minimizing $\|W\|^2 / 2 + C \sum_{i=1}^{n} \xi_i$, where $C$ expresses the weight of errors.
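A small sketch of this soft-margin objective, using the hinge-style slack $\xi_i = \max(0, 1 - y_i (W \cdot X_i + b))$ implied by constraint (9); the samples and parameters are invented.

```python
def slack(W, b, X, y):
    """Slack variable xi = max(0, 1 - y (W . X + b)) from constraint (9)."""
    g = sum(w_i * x_i for w_i, x_i in zip(W, X)) + b
    return max(0.0, 1.0 - y * g)

def soft_margin_objective(W, b, samples, C):
    """||W||^2 / 2 + C * (sum of slacks): the quantity being minimized."""
    norm_sq = sum(w * w for w in W)
    total_slack = sum(slack(W, b, X, y) for X, y in samples)
    return norm_sq / 2.0 + C * total_slack

samples = [([2.0, 0.0], 1), ([0.5, 0.0], 1), ([-2.0, 0.0], -1)]
W, b, C = [1.0, 0.0], 0.0, 10.0
# Only ([0.5, 0], +1) violates the margin: xi = 1 - 0.5 = 0.5.
print(soft_margin_objective(W, b, samples, C))  # 0.5 + 10 * 0.5 = 5.5
```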
As a result, the problem is to maximize the objective function $D$ subject to $\sum_{i=1}^{n} \lambda_i y_i = 0$ and $0 \le \lambda_i \le C$.
Training samples which cannot be separated linearly might be separated linearly in a higher dimension by mapping them with a nonlinear function $\phi : R^d \to R^{d'}$ (10). A linear separation in $R^{d'}$ for $\phi(X)$ is the same as a nonlinear separation in $R^d$ for $X$. Let $\phi$ satisfy $\phi(X) \cdot \phi(X') = K(X, X')$, where $K(X, X')$ is called a kernel function.
As a result, the objective function is rewritten as $D(\lambda) = \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i,j} \lambda_i \lambda_j y_i y_j K(X_i, X_j)$ (11) and the optimal hyperplane is rewritten as $f(X) = \mathrm{sign}(\sum_{i=1}^{n} \lambda_i^* y_i K(X_i, X) + b^*)$ (12). Note that $\phi$ does not appear in equations (11) and (12); therefore, we need not calculate $\phi$ in the higher dimension. Well-known kernel functions are the polynomial kernel function $K(X, X') = (X \cdot X' + 1)^p$ (13) and the Gaussian kernel function $K(X, X') = \exp(-\|X - X'\|^2 / 2\sigma^2)$ (14).
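Both kernels can be sketched directly; for the squared polynomial kernel on two-dimensional inputs, the explicit mapping $\phi$ is small enough to verify the identity $\phi(X) \cdot \phi(X') = K(X, X')$ numerically.

```python
import math

def poly_kernel(X, Xp, p=2):
    """Polynomial kernel (13): K(X, X') = (X . X' + 1)^p."""
    return (sum(a * b for a, b in zip(X, Xp)) + 1.0) ** p

def gaussian_kernel(X, Xp, sigma=1.0):
    """Gaussian kernel (14): K(X, X') = exp(-||X - X'||^2 / (2 sigma^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(X, Xp))
    return math.exp(-sq / (2.0 * sigma ** 2))

def phi(X):
    """Explicit map for the p = 2 polynomial kernel on 2-dim inputs:
    (x1^2, x2^2, sqrt(2) x1 x2, sqrt(2) x1, sqrt(2) x2, 1)."""
    x1, x2 = X
    r2 = math.sqrt(2.0)
    return (x1 * x1, x2 * x2, r2 * x1 * x2, r2 * x1, r2 * x2, 1.0)

X, Xp = (1.0, 2.0), (3.0, 0.5)
dot_phi = sum(a * b for a, b in zip(phi(X), phi(Xp)))
print(poly_kernel(X, Xp))  # 25.0, equal to dot_phi
```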
A non-linear separation using one of these kernel functions corresponds to a separation in $R^d$ which takes the dependencies between the features into consideration.
3 Extracting Word Sequence Correspondences with SVMs
The method proposed in this paper obtains word sequence correspondences (translation pairs) from parallel corpora which include Japanese and English sentences. It consists of the following three steps:
1. Make training samples, which include positive samples (translation pairs) and negative samples (non-translation pairs), manually from the training corpora, and learn a translation model from them with SVMs.

2. Make a set of candidate translation pairs, which are pairs of phrases obtained by parsing both the Japanese and the English sentences.

3. Extract translation pairs from the candidates by inputting them to the translation model learned in step 1.
3.2 Features for the Translation Model
To apply SVMs to extracting translation pairs, the candidate translation pairs must be converted into feature vectors. In our method, they are composed of the following features:
1. Features which use an existing translation dictionary.
   (a) Bilingual word pairs in the translation dictionary which are included in the candidate translation pair.
   (b) Bilingual word pairs in the translation dictionary which co-occur in the context in which the candidate appears.

2. Features which use the number of words.
   (a) The number of words in the Japanese phrase.
   (b) The number of words in the English phrase.

3. Features which use part-of-speech.
   (a) The ratios of appearance of nouns, verbs, adjectives and adverbs in the Japanese phrase.
   (b) The ratios of appearance of nouns, verbs, adjectives and adverbs in the English phrase.

4. Features which use constituent words.
   (a) Constituent words in the Japanese phrase.
   (b) Constituent words in the English phrase.

5. Features which use neighbor words.
   (a) Neighbor words which appear just before or after the Japanese phrase.
   (b) Neighbor words which appear just before or after the English phrase.
Two types of features which use an existing translation dictionary are employed because an improvement in accuracy can be expected from effectively using existing knowledge in the features.
For features (1a), the words included in a candidate translation pair are looked up in the translation dictionary, and the bilingual word pairs found in the candidate become features. These features are based on the idea that a translation pair will include many bilingual word pairs.
Each bilingual word pair included in the dictionary is allocated one dimension of the feature vectors. If the bilingual word pair appears in the candidate translation pair, the value of the corresponding dimension of the vector is set to 1; otherwise it is set to 0.
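A minimal sketch of how features (1a) could be vectorized; the dictionary entries and romanized Japanese words are invented for illustration.

```python
# Hypothetical sketch of features (1a): each bilingual word pair in the
# dictionary gets one dimension; the dimension is 1 when both sides of
# the pair occur in the candidate translation pair.

DICTIONARY = [("kaigi", "meeting"), ("keiyaku", "contract"), ("kabu", "stock")]

def feature_1a(ja_words, en_words):
    """Binary vector: one dimension per dictionary pair."""
    ja, en = set(ja_words), set(en_words)
    return [1 if j in ja and e in en else 0 for j, e in DICTIONARY]

# A candidate whose Japanese side contains "keiyaku" and whose English
# side contains "contract" fires the second dimension only.
print(feature_1a(["keiyaku", "sho"], ["the", "contract"]))  # [0, 1, 0]
```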
For features (1b), all pairs of words which co-occur with a candidate translation pair are looked up in the translation dictionary, and the bilingual word pairs found in the dictionary become features.
These features are based on the idea that, for translation pairs, the contexts of the words which appear in their neighborhoods resemble each other even though they are expressed in two different languages (Kaji and Aizono, 1996). The candidates are converted into feature vectors just as for (1a).
Features (2a) and (2b) are based on the idea that there is a correlation between the numbers of constituent words of the phrases of the two languages in a translation pair. The number of constituent words for each language is used in the feature vector.
Features (3a) and (3b) are based on the idea that there is a correlation between the ratios of content words (nouns, verbs, adjectives and adverbs) which appear in the phrases of the two languages in a translation pair. The ratios of the numbers of nouns, verbs, adjectives and adverbs to the number of words of the phrase of each language are used in the feature vector.
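Features (3a)/(3b) reduce to simple ratios; a sketch assuming POS tags for a phrase are already available from the parser.

```python
# Sketch of features (3a)/(3b): ratios of nouns, verbs, adjectives and
# adverbs to the number of words in a phrase. The tag names are assumed.

CONTENT_POS = ("noun", "verb", "adjective", "adverb")

def pos_ratio_features(pos_tags):
    """Four ratios, one per content-word class, over the phrase length."""
    n = len(pos_tags)
    return [pos_tags.count(p) / n for p in CONTENT_POS]

# "sign the contract" -> verb, determiner, noun:
# noun 1/3, verb 1/3, adjective 0, adverb 0
print(pos_ratio_features(["verb", "determiner", "noun"]))
```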
For features (4a) and (4b), each content word (noun, verb, adjective or adverb) is allocated one dimension of the feature vectors for each language. If the word appears in the candidate translation pair, the value of the corresponding dimension of the vector is set to 1; otherwise it is set to 0.
For features (5a) and (5b), each content word (noun, verb, adjective or adverb) is likewise allocated one dimension of the feature vectors for each language. If the word appears just before or after the candidate translation pair, the value of the corresponding dimension of the vector is set to 1; otherwise it is set to 0.
3.3 Learning the Translation Model
Training samples, which include positive samples (translation pairs) and negative samples (non-translation pairs), are made manually from the training corpora and converted into feature vectors by the method described in section 3.2.
For the supervision signals $y_i$, each positive sample is assigned +1 and each negative sample is assigned -1. The translation model is learned from them by the SVMs described in section 2. As a result, the optimal parameters $\lambda^*$ for the SVMs are obtained.
3.4 Making the Candidates of the Translation Pairs
A set of candidate translation pairs is made from the combinations of phrases obtained by parsing both the Japanese and the English sentences. Making the combinations does not require sentence alignments between the two languages. Because the set grows too big if all combinations are taken, the phrases used for the combinations are limited by an upper bound on the number of constituent words and to noun phrases and verb phrases only.
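The candidate construction described above can be sketched as a filtered cross product; the phrase lists, the chunk labels, and the length bound of 8 (the bound used in the experiments of section 4) are illustrative assumptions.

```python
from itertools import product

# Sketch of section 3.4: candidates are all Japanese-English phrase
# combinations, limited to NPs/VPs within a length bound. Phrases are
# represented as (word list, chunk label) pairs; all values are invented.

MAX_WORDS = 8

def make_candidates(ja_phrases, en_phrases):
    """Cross product of phrases, keeping NPs/VPs under the length bound."""
    def ok(phrase):
        words, label = phrase
        return label in ("NP", "VP") and len(words) < MAX_WORDS
    return [(j[0], e[0]) for j, e in product(ja_phrases, en_phrases)
            if ok(j) and ok(e)]

ja = [(["keiyaku", "sho"], "NP"), (["sorede"], "CONJ")]
en = [(["the", "contract"], "NP"), (["sign", "the", "contract"], "VP")]
print(len(make_candidates(ja, en)))  # 1 Japanese NP x 2 English phrases = 2
```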
3.5 Extracting the Translation Pairs
The candidate translation pairs are converted into feature vectors by the method described in section 3.2. By inputting them to equation (8) with the optimal parameters $\lambda^*$ obtained in section 3.3, +1 or -1 is obtained as the output for each vector. If the output is +1, the candidate corresponding to the input vector is judged to be a translation pair; otherwise it is not.
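The extraction step can be sketched with the kernelized decision function of equation (12) (here with the polynomial kernel); the support vectors and parameters are toy values, not a trained model.

```python
# Sketch of the extraction step: f(X) = sign(sum_i lambda_i* y_i
# K(X_i, X) + b*), applied to a candidate feature vector. The support
# vectors, multipliers, and bias below are invented toy values.

def poly_kernel(X, Xp, p=2):
    """Polynomial kernel (13)."""
    return (sum(a * b for a, b in zip(X, Xp)) + 1.0) ** p

def svm_output(support_vectors, lambdas, ys, b, X):
    """+1: the candidate is judged a translation pair; -1: it is not."""
    s = sum(l * y * poly_kernel(sv, X)
            for sv, l, y in zip(support_vectors, lambdas, ys)) + b
    return 1 if s >= 0 else -1

svs = [(1.0, 0.0), (0.0, 1.0)]
lambdas, ys, b = [0.5, 0.5], [1, -1], 0.0
# K((1,0),(2,0)) = 9, K((0,1),(2,0)) = 1, so s = 4.5 - 0.5 = 4 -> +1.
print(svm_output(svs, lambdas, ys, b, (2.0, 0.0)))  # 1
```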
4 Experiments
To confirm the effectiveness of the method described in section 3, we conducted experiments using the English Business Letter Example Collection published by Nihon Keizai Shimbun Inc. as the parallel corpora. It includes Japanese and English sentences which are examples of business letters, with the translation pairs marked up. As both the training and the test corpora, 1,000 sentences were used.
The translation pairs already marked up in the corpora were converted to the form described in section 3.4 to be used as positive samples. The Japanese sentences were parsed by KNP^1 and the English sentences were parsed by the Apple Pie Parser^2.
Negative samples, of the same number as the positive samples, were randomly chosen from the combinations of phrases which were made by parsing and whose numbers of constituent words were below 8.
As a result, 2,000 samples (1,000 positive and 1,000 negative) were prepared for both training and test.
The obtained samples were then converted into feature vectors by the method described in section 3.2.
For features (1a) and (1b), 94,511 bilingual word pairs included in EDICT^3 were prepared.
For features (4a), (4b), (5a) and (5b), 1,009 Japanese words and 890 English words which appeared in the training corpora more than 3 times were used.
Therefore, the number of dimensions of the feature vectors was 94,511 × 2 + 1 × 2 + 4 × 2 + 1,009 + 890 + 1,009 + 890 = 192,830.
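The quoted dimensionality can be reproduced from the feature inventory of section 3.2:

```python
# Check of the feature-vector dimensionality: two dictionary-based blocks
# (1a)(1b), two word counts, two sets of four POS ratios, and per-language
# vocabularies used twice, for constituent words (4) and neighbor words (5).

dict_pairs = 94_511
ja_vocab, en_vocab = 1_009, 890

dims = (dict_pairs * 2          # (1a) and (1b)
        + 1 * 2                 # (2a) and (2b): word counts
        + 4 * 2                 # (3a) and (3b): four POS ratios each
        + ja_vocab + en_vocab   # (4a) and (4b): constituent words
        + ja_vocab + en_vocab)  # (5a) and (5b): neighbor words
print(dims)  # 192830
```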
SVM^light^4 was used as the learner and the classifier for the SVMs. For the kernel function, the squared polynomial kernel (p = 2 in equation (13)) was used, and the error weight C was set to 0.01.
The translation model was learned from the training samples, and the translation pairs were extracted from the test samples by the method described in section 3.
^4 http://svmlight.joachims.org/
Figure 2: Transition of the precision rate and the recall rate as the number of training samples is increased
Table 1 shows the precision rate and the recall rate of the extracted translation pairs, and Table 2 shows examples of the extracted translation pairs.
Table 1: Precision rate and recall rate
5 Discussion
Figure 2 shows the transition of the precision rate and the recall rate as the number of training samples is increased from 100 to 2,000 in steps of 100.
The recall rate rose with the number of training samples, while the precision rate leveled off after 1,300 samples. This suggests that the recall rate can be improved, without lowering the precision rate too much, by increasing the number of training samples.
Figure 3 shows the transition of the precision rate and the recall rate as the number of bilingual word pairs in the translation dictionary is increased from 0 to 90,000 in steps of 5,000.
The precision rate rose almost linearly with the number of pairs, while the recall rate leveled off after 30,000. This suggests that the precision rate can be improved, without lowering the recall rate too much, by increasing the number of bilingual word pairs in the translation dictionary.
Table 3 shows the precision rate and the recall rate when each kind of features described in section 3.2 is removed.
The values in parentheses in the columns of the precision rate and the recall rate are differences from the values when all the features are used.

Figure 3: Transition of the precision rate and the recall rate as the number of bilingual word pairs in the translation dictionary is increased
The fall of the precision rate when the features which use the translation dictionary, (1a) and (1b), were removed, and the fall of the recall rate when the features which use the number of words, (2a) and (2b), were removed, were especially large.
It is clear that features (1a) and (1b) constrain the translation model the most strongly of all the features.
Therefore, when features (1a) and (1b) are removed, a good translation model cannot be learned from the remaining features alone because their constraints are weak; wrong outputs increase and the precision rate falls.
Only features (2a) and (2b) are certain to appear in all samples, while some other features which appeared in the training samples may not appear in the test samples.
Consequently, in the test samples, the relative importance of features (2a) and (2b) for the coverage of the samples increases.
Therefore, when features (2a) and (2b) are removed, the recall rate falls because of the low coverage of the samples.
6 Related Work
Unlike our method, there has been research based on the assumption of sentence alignments for the parallel corpora (Gale and Church, 1991; Kitamura and Matsumoto, 1996; Melamed, 1997).
(Gale and Church, 1991) used the φ² statistic as the correspondence level of word pairs and showed that it was more effective than mutual information.
(Kitamura and Matsumoto, 1996) used the Dice coefficient (Kay and Röscheisen, 1993), weighted by the logarithm of the frequency of a word pair, as the correspondence level of the word pairs.

Table 2: Examples of translation pairs extracted by our method (the English sides include "chairman of a special program committee", "officially retired as", "would like to say an official farewell", "my thirty years of experience" and "sharpen up on my golf")

Table 3: Precision rate and recall rate when each kind of features is removed
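A sketch of a log-frequency-weighted Dice coefficient of this kind; the exact weighting used by Kitamura and Matsumoto (1996) may differ in detail, so this is an illustrative assumption.

```python
import math

# Hypothetical sketch: the Dice coefficient of the co-occurrence counts of
# a word pair, weighted by the logarithm of the pair frequency.

def weighted_dice(pair_freq, freq_ja, freq_en):
    """log2(pair_freq) * 2 * pair_freq / (freq_ja + freq_en)."""
    if pair_freq == 0:
        return 0.0
    return math.log2(pair_freq) * 2.0 * pair_freq / (freq_ja + freq_en)

# A pair seen 8 times, each word seen 10 times on its own side:
print(weighted_dice(8, 10, 10))  # 3 * 16 / 20 = 2.4
```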
(Melamed, 1997) proposed the Competitive Linking Algorithm for linking word pairs, and a method which calculates the optimal correspondence level of the word pairs by hill climbing.
These methods achieve high accuracy because of the assumption of sentence alignments for the parallel corpora, but they suffer from narrow applicable domains because not many parallel corpora with sentence alignments exist at present.
However, because our method does not require sentence alignments, it can be applied to wider domains.
Like our method, research which is not based on the assumption of sentence alignments for parallel corpora has also been done (Kaji and Aizono, 1996; Tanaka and Iwasaki, 1996; Fung, 1997).
These studies are based on the idea that, for translation pairs, the contexts of the words which appear in their neighborhoods resemble each other even though they are expressed in two different languages.
(Kaji and Aizono, 1996) proposed a correspondence level calculated from the size of the intersection between the co-occurrence sets of words included in an existing translation dictionary.
(Tanaka and Iwasaki, 1996) proposed a method for obtaining bilingual word pairs by optimizing the matrix of translation probabilities so that the distance between the matrices of the co-occurrence probabilities of the words appearing in each language becomes small.
(Fung, 1997) calculated vectors whose elements were the weighted mutual information between a word in the corpora and a word included in an existing translation dictionary, and used their inner products as the correspondence level of word pairs.
There is a common point between these methods and ours in the idea that the contexts of the words which appear in the neighborhoods of a translation pair resemble each other, because features (1b) are based on the same idea.
However, since our method treats extracting translation pairs as a statistical machine learning problem, it can be expected to improve its performance through the addition of new features to the translation model.
In addition, once the translation model has been learned from the training samples with our method, the model need not be learned again for new samples, although our method does need positive and negative samples as training data.
However, the methods introduced above must learn a new model for each new corpus.
(Sato and Nakanishi, 1998) proposed a method for learning a probabilistic translation model with Maximum Entropy (ME) modeling, the same statistical machine learning approach as SVMs, in which co-occurrence information and morphological information were used as features, and achieved a 58.25% accuracy with 4,119 features.
ME modeling is similar to SVMs in its use of features for learning a model, but feature selection for ME modeling is more difficult because ME modeling over-fits the training samples more easily than SVMs do.
In addition, ME modeling cannot learn dependencies between features, but SVMs can learn them automatically by using a kernel function.
Therefore, SVMs can learn a more complex and effective model than ME modeling.
7 Conclusion
In this paper, we proposed a method for learning and extracting bilingual word sequence correspondences from non-aligned parallel corpora with SVMs.
Our method used features for the translation model based on a translation dictionary, the number of words, part-of-speech, constituent words and neighbor words.
Experimental results using Japanese and English parallel corpora achieved an 81.1% precision rate and a 69.0% recall rate for the extracted translation pairs.
This demonstrates that our method can reduce the cost of making translation dictionaries.
Acknowledgments

We would like to thank Nihon Keizai Shimbun Inc. for granting us permission to use the English Business Letter Example Collection in our research.
