We propose a novel method for Japanese dependency analysis, which is usually reduced to the construction of a dependency tree. In deterministic approaches to this task, dependency trees are constructed by a series of actions that attach a bunsetsu chunk to one of the nodes in the tree being constructed. Conventional techniques select the node based on whether the new bunsetsu chunk and each node in the tree are in a parent-child relation or not. However, tree structures include relations between two nodes other than the parent-child relation. We therefore use ancestor-descendant relations in addition to parent-child relations, so that the added redundancy helps errors be corrected. Experimental results show that the proposed method achieves higher accuracy.
1 Introduction
Japanese dependency analysis has been recognized as one of the basic techniques in Japanese processing, and a number of techniques have been proposed over the years. Japanese dependency is usually represented by the relation between phrasal units called 'bunsetsu' chunks, which are the smallest meaningful sequences consisting of an independent word and accompanying words (e.g., a noun and a particle). Hereafter, a 'chunk' means a bunsetsu chunk in this paper.
The relation between two chunks has a direction from the modifier to the modifiee.

[Footnote: Akihiro Tamura belonged to Tokyo Institute of Technology when this work was done.]

Figure 1: Example of a dependency tree. (Translation: He ate pizza and salad at lunchtime.)
All dependencies in a sentence are represented by a dependency tree, where a node indicates a chunk, and node B is the parent of node A when chunk B is the modifiee of chunk A. Figure 1 shows an example of a dependency tree. The task of Japanese dependency analysis is to find the modifiee for each chunk in a sentence. The task is usually regarded as the construction of a dependency tree.
In primitive approaches, the probabilities of dependencies are given by manually constructed rules, and the modifiee of each chunk is determined accordingly. However, such rule-based approaches have problems in coverage and consistency. Therefore, a number of statistical techniques using machine learning algorithms have recently been proposed.
In most conventional statistical techniques, the probabilities of dependencies between two chunks are learned in the learning phase, and then the modifiee of each chunk is determined using the learned models in the analysis phase. In terms of dependency trees, the parent node of each node is determined based on the likeliness of parent-child relations between two nodes.
We here take notice of the characteristics of dependencies which cannot be captured well only by
the parent-child relation.
In Figure 1, ID 3 (pizza-and) and ID 4 (salad-accusative) are in a parallel structure. In this structure, node 4 is a child of node 5 (ate), but node 3 is not a child of node 5, although 3 and 4 are both foods and should share a tendency to be subcategorized by the verb "eat". A number of conventional models use the pair of 3 (pizza-and) and 5 (ate) as a negative instance because 3 does not modify 5. Consequently, those models cannot learn and use the subcategorization preference of verbs well in parallel structures.
We focus on ancestor-descendant relations to compensate for this weakness. Two nodes are in the ancestor-descendant relation when one of the two nodes is included in the path from the root node to the other node. The upper node of the two is called an 'ancestor node' and the lower node a 'descendant node'. When the ancestor-descendant relation is used, both of the above two instances for nodes 3 and 4 can be considered positive instances.
Therefore, it is expected that the ancestor-descendant relation helps the algorithm capture the characteristics that cannot be captured well by the parent-child relation. We aim to improve the performance of Japanese dependency analysis by taking the ancestor-descendant relation into account.
In exploiting ancestor-descendant information, it came to us that redundant information is effectively utilized in coding problems in communications (Mackay, 2003). Therefore, we propose a method in which the problem of determining the modifiee of a chunk is regarded as a kind of coding problem: a dependency is expressed as a sequence of values, each of which denotes whether a parent-child relation or an ancestor-descendant relation holds between two chunks.
In Section 2, we present related work. In Section 3, we explain our method. In Section 4, we describe our experiments and their results, which show the effectiveness of the proposed method. In Section 5, we discuss the experimental results. Finally, we summarize this paper and describe future work in Section 6.
2 Conventional Statistical Methods for Japanese Dependency Analysis
First, we describe the general formulation of the probability model for dependency analysis. We denote a sequence of chunks, "b1, b2, ..., bm", by B, and a sequence of dependency patterns, "Dep(1), Dep(2), ..., Dep(m)", by D, where Dep(i) = j means that bi modifies bj. Given the sequence B of chunks as an input, dependency analysis is defined as the problem of finding the sequence D of dependency patterns that maximizes the conditional probability P(D | B). A number of the conventional methods assume that dependency probabilities are independent of each other and approximate P(D | B) with ∏_{i=1}^{m−1} P(Dep(i) | B).
P(Dep(i) | B) is estimated using machine learning algorithms.
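As a minimal illustration of this decomposition, the following Python sketch selects, for each chunk, the candidate modifiee that maximizes the estimated probability independently of the other chunks (the interface prob is an assumption, standing in for any learned estimator):

    # Minimal sketch: per-chunk modifiee selection under the independence
    # assumption P(D|B) ~ prod_i P(Dep(i)|B).
    # `prob(i, j, chunks)` is an assumed interface returning P(Dep(i)=j | B).
    def analyze(chunks, prob):
        deps = {}
        m = len(chunks)
        for i in range(m - 1):            # the last chunk has no modifiee
            candidates = range(i + 1, m)  # Japanese dependencies point left to right
            deps[i] = max(candidates, key=lambda j: prob(i, j, chunks))
        return deps

Note that this greedy sketch ignores the non-crossing constraint discussed in Section 3; actual analyzers enforce it during search.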
For example, Haruno et al. (1999) used Decision Trees, Sekine (2000) used Maximum Entropy Models, and Kudo and Matsumoto (2000) used Support Vector Machines.
Another notable method is the Cascaded Chunking Model by Kudo and Matsumoto (2002). In their model, a sentence is parsed by a series of the following processes: whether or not the current chunk modifies the following chunk is estimated, and if so, the two chunks are merged together.
Sassano (2004) parsed a sentence efficiently using a stack, which controls the modifier being analyzed.
These conventional methods determine the modifiee of each chunk based on the likeliness of dependencies between two chunks (in terms of the dependency tree, the likeliness of parent-child relations between two nodes).
The difference between the conventional methods and the proposed method is that the proposed method determines the modifiees based on the likeliness of ancestor-descendant relations in addition to parent-child relations, while the conventional methods tried to capture the characteristics that cannot be captured by parent-child relations by adding ad-hoc features, such as features of "the chunk modified by the candidate modifiee", to the features of the candidate modifiee and the modifier. However, these methods do not deal with ancestor-descendant relations between two chunks directly, whereas our method uses that information directly.
In Section 5, we empirically show that our method uses the ancestor-descendant relation more effectively than the conventional ones, and explain why our method is justifiable in terms of a coding problem.
3 Proposed Method
The methods explained in this section construct a dependency tree by a series of actions that attach a node to one of the nodes in the tree being constructed. Hence, when the parent node of a certain node is being determined, the parent node must already be included in the tree being constructed.
To satisfy this requirement, we exploit a characteristic of Japanese dependencies: dependencies are directed from left to right (i.e., the parent node is closer to the end of a sentence than its child node).
Therefore, our methods analyze a sentence backwards, as in Sekine (2000) and Kudo and Matsumoto (2000).
Consider, for example, Figure 1. First, our methods determine the parent node of ID 4 (salad-accusative), and then that of ID 3 (pizza-and). Next, the parent node of ID 2 (at lunchtime) is determined, and finally that of ID 1 (he-nominative), at which point all dependencies in the sentence have been identified.
Please note that our methods are applicable only to dependency structures of languages that have a consistent head-direction, like Japanese.
We explain three methods that differ in the information used to determine the modifiee of each chunk. In Section 3.1, we explain PARENT METHOD and ANCESTOR METHOD, which determine the modifiee of each chunk based on the likeliness of only one type of relation.
PARENT METHOD uses the parent-child relation, which is used in conventional Japanese dependency analysis. ANCESTOR METHOD is novel in that it uses the ancestor-descendant relation, which has not been used in existing methods.
In Section 3.2, we explain our method, PARENT-ANCESTOR METHOD, which determines the modifiees based on the likeliness of both ancestor-descendant and parent-child relations.
When the modifiee is determined using the ancestor-descendant relation, it is necessary to take into account the relations with every node in the tree.
Consider, for example, the case where the modifiee of ID 1 (he-nominative) is determined in Figure 1. When using the parent-child relation, the modifiee can be determined based only on the relation between IDs 1 and 5. On the other hand, when using the ancestor-descendant relation, the modifiee cannot be determined based only on the relation between IDs 1 and 5.
This is because if one of IDs 2, 3, and 4 is the modifiee of ID 1, the relation between IDs 1 and 5 is still ancestor-descendant. ID 5 is determined to be the modifiee of ID 1 only after the relations with each of nodes 2, 3, and 4 are recognized not to be ancestor-descendant.
An elegant way to use the ancestor-descendant relation, which we propose in this paper, is to represent a dependency as a codeword in which each bit indicates the relation with a node in the tree, and to determine the modifiee based on the relations with every node in the tree (see the next section for details).
3.1 Methods with a single relation: PARENT METHOD and ANCESTOR METHOD
Figure 2 shows the pseudo code of the algorithm that constructs a dependency tree using PARENT METHOD or ANCESTOR METHOD. As mentioned above, the two methods analyze a sentence backwards.
MODEL_PARENT(nodei, nodej) indicates the prediction of whether nodej is the parent of nodei or not, which is the output of the learned model. MODEL_ANCESTOR(nodei, nodej) indicates the prediction of whether nodej is the ancestor of nodei or not.
String_output indicates the sequence of the i − 1 predictions stored in step 3.
The codeword denoted by string[k] is the binary sequence assigned to the action of attaching nodei to nodek. Parent[nodei] indicates the node to which nodei is attached, and Dis indicates a distance function.
Thus, our method predicts the correct actions by measuring the distance between the codewords string[k] and the predicted binary (later extended to real-valued) sequence string_output. In other words, our method selects the action that is closest to the outputs of the learned model.
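To make the selection concrete, here is a minimal Python sketch of one attachment step (the interfaces are assumptions: model stands for MODEL_PARENT or MODEL_ANCESTOR, dis for the distance function, and the non-crossing filter of step 5 is omitted):

    # One attachment step: choose the candidate parent whose codeword
    # string[k] is closest to the model's prediction sequence.
    def attach(target, tree_nodes, parent, model, make_codeword, dis):
        # Step 3: predictions for the target against every node in the tree.
        string_output = [model(target, j) for j in tree_nodes]
        # Steps 4-5: pick the action whose codeword is closest to the output.
        best = min(tree_nodes,
                   key=lambda k: dis(make_codeword(k, tree_nodes, parent),
                                     string_output))
        parent[target] = best
        return best

    # Codeword for PARENT METHOD: bit j is 1 iff node j is the chosen parent.
    def parent_codeword(k, tree_nodes, parent):
        return [1 if j == k else -1 for j in tree_nodes]

    # Codeword for ANCESTOR METHOD: bit j is 1 iff node j would become an
    # ancestor of the target, i.e. j is k or one of k's ancestors.
    def ancestor_codeword(k, tree_nodes, parent):
        anc, node = set(), k
        while node is not None:
            anc.add(node)
            node = parent.get(node)
        return [1 if j in anc else -1 for j in tree_nodes]

Here dis can be any of the distance functions compared in Section 5.4.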
Both models are learned from the dependency trees given as training data, as shown in Figure 3. Each relation is learned from ordered pairs of two nodes in the trees.
However, our algorithm in Figure 2 targets dependencies directed from left to right. Therefore, the instances with a right-to-left dependency are excluded from the training data. For example, the instance with node4 being the candidate parent (or ancestor) of node1 is excluded in Figure 3.

Figure 2: Pseudo code of PARENT, ANCESTOR, and PARENT-ANCESTOR METHODS (excerpt):
3: result_parent[j] = MODEL_PARENT(nodei, nodej) (in case of PARENT and PARENT-ANCESTOR METHOD)
3: result_ancestor[j] = MODEL_ANCESTOR(nodei, nodej) (in case of ANCESTOR and PARENT-ANCESTOR METHOD)
4: end

Figure 3: Example of training instances (MODEL_PARENT and MODEL_ANCESTOR)
MODEL_PARENT uses ordered pairs of a parent node and a child node as positive instances and the other ordered pairs as negative instances. MODEL_ANCESTOR uses ordered pairs of an ancestor node and a descendant node as positive instances and the other ordered pairs as negative instances.
From the above description and Figure 3, the number of training instances used in learning MODEL_PARENT is the same as the number used in learning MODEL_ANCESTOR. However, the number of positive instances in learning MODEL_ANCESTOR is larger than in learning MODEL_PARENT, because the set of parent-child relations is a subset of the ancestor-descendant relations.
As mentioned above, the two methods analyze a sentence backwards. We should note that node1 to noden in the algorithm correspond, respectively, to the last chunk through the first chunk of a sentence.
Next, we illustrate the process of determining the parent node of a certain node nodem (with Figures 4 and 5). Hereafter, nodem is called the target node.
The parent node is determined based on the likeliness of a relation; the parent-child relation is used in PARENT METHOD and the ancestor-descendant relation in ANCESTOR METHOD.
Our methods regard a dependency between the target node and its parent node as a set of relations between the target node and each node in the tree. Each relation corresponds to one bit, which becomes 1 if the relation holds, and −1 otherwise.
For example, the sequence (−1, −1, −1, 1) represents that the parent of node5 is node4 in PARENT METHOD (Figure 4), since the relation holds only between nodes 4 and 5.
First, the learned model judges whether the target node and each node in the current tree are in a certain relation or not; PARENT METHOD uses MODEL_PARENT as the learned model and ANCESTOR METHOD uses MODEL_ANCESTOR.
The sequence of the m − 1 predictions by the learned model is stored in string_output.
The codeword string[k] is the binary (−1 or 1) sequence that is to be output when the target node is attached to nodek.
In Figures 4 and 5, the set of string[k] (for node5) is shown in the dashed square.
For example, string[2] in ANCESTOR METHOD (Figure 5) is (1, 1, −1, −1), since nodes 1 and 2 are the ancestors of node5 if node5 is attached to node2.
Next, among the set of string[k], the codeword that is closest to string_output is selected. The target node is then attached to the node corresponding to the selected codeword.
In Figure 4, string[4], (−1, −1, −1, 1), is selected, and then node5 is attached to node4.
Japanese dependencies have the non-crossing constraint: dependencies do not cross one another. To satisfy this constraint, we remove the nodes that would break the non-crossing constraint from the candidate parent nodes in step 5 of the algorithm.
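One way to realize this filtering, sketched under the projectivity assumption (not necessarily the authors' exact implementation of step 5): when a sentence is analyzed backwards, the parents that keep the tree non-crossing are the immediately following chunk and its transitive heads in the partial tree.

    # Hypothetical candidate filter for the non-crossing constraint:
    # admissible parents of the target are the next chunk and the chain
    # of its ancestors in the partial tree (`parent` maps node -> head).
    def noncrossing_candidates(next_node, parent):
        cands, k = [], next_node
        while k is not None:
            cands.append(k)
            k = parent.get(k)
        return cands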
PARENT METHOD differs from conventional methods, such as Sekine (2000) and Kudo and Matsumoto (2000), in the process of determining the parent node. These conventional methods select the node given by argmax_j P(nodej | nodei) as the parent node of nodei, setting the beam width to 1.
However, their processes are essentially the same as the process in PARENT METHOD.
Figure 4: Analysis example using PARENT METHOD

Figure 5: Analysis example using ANCESTOR METHOD (MODEL_ANCESTOR; sequences obtained in the judgment step)
3.2 Proposed method: PARENT-ANCESTOR METHOD
The proposed method determines the parent node of a target node based on the likeliness of ancestor-descendant relations in addition to parent-child relations. The use of ancestor-descendant relations makes it possible to capture the characteristics that cannot be captured by parent-child relations alone.
The pseudo code of the proposed method, PARENT-ANCESTOR METHOD, is shown in Figure 2. MODEL_PARENT and MODEL_ANCESTOR are learned as described in Section 3.1.
String_output is the concatenation of the predictions by both MODEL_PARENT and MODEL_ANCESTOR. In addition, string[k] is built based not only on parent-child relations but also on ancestor-descendant relations.
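Building on the single-relation sketch in Section 3.1, the combined codeword and prediction sequence could be formed by simple concatenation (an illustrative sketch, not the authors' code):

    # PARENT-ANCESTOR METHOD: concatenate the parent-child bits and the
    # ancestor-descendant bits, for both the codewords and the model outputs.
    def parent_ancestor_codeword(k, tree_nodes, parent):
        return (parent_codeword(k, tree_nodes, parent)
                + ancestor_codeword(k, tree_nodes, parent))

    def parent_ancestor_output(target, tree_nodes, model_parent, model_ancestor):
        return ([model_parent(target, j) for j in tree_nodes]
                + [model_ancestor(target, j) for j in tree_nodes])

Because the codewords are twice as long, the pairwise distances between candidate codewords grow, which is exactly the redundancy argument developed in Section 5.3.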
An analysis example using PARENT-ANCESTOR METHOD is shown in Figure 6.
Figure 6: Analysis example using PARENT-ANCESTOR METHOD
4 Experiment

4.1 Experimental settings
We used the Kyoto University text corpus, the standard dataset used in previous work (Kudo and Matsumoto, 2002; Sassano, 2004).
We used SVMs as the algorithm for learning and analyzing the relations between nodes.
We used the third-degree polynomial kernel function and set the soft margin parameter C to 1, which is exactly the same setting as in Kudo and Matsumoto (2002).
We can obtain a real-valued score in step 3 of the algorithm, which is the output of the separating function. The score can be regarded as the likeliness of the two nodes being in the parent-child (or ancestor-descendant) relation.
Therefore, we used the sequence of the outputs of the SVMs as string_output, instead of converting the scores into binary values indicating whether a certain relation holds or not.
Two feature sets are used: static features and dynamic features. The static features used in the experiments are shown in Table 1.
The features are the same as those used in Kudo and Matsumoto (2002).
In Table 1, HeadWord means the rightmost content word in the chunk whose part-of-speech is not a functional category. FunctionalWord means the rightmost functional word, or the inflectional form of the rightmost predicate if there is no functional word in the chunk.

Table 1: Static features used in experiments
Modifier / Modifiee: Head Word (surface-form, POS, POS-subcategory, inflection-type, inflection-form), Functional Word (surface-form, POS, POS-subcategory, inflection-type, inflection-form), brackets, quotation-marks, punctuation-marks, position in sentence (beginning, end)
Between two chunks: ...

Figure 7: Dynamic features
Next, we explain the dynamic features used in the experiments.
Three types of dynamic features were used in Kudo and Matsumoto (2002): (A) the chunks modifying the current candidate modifiee, (B) the chunk modified by the current candidate modifiee, and (C) the chunks modifying the current candidate modifier.
Type C is not available in the proposed method, because the proposed method analyzes a sentence backwards, unlike Kudo and Matsumoto (2002). Therefore, we did not use type C.
We used types A' and B', which are recursive expansions of types A and B, as the dynamic features (Figure 7). The form of functional words or inflection was used as a type A' feature, and the POS and POS-subcategory of the HeadWord as a type B' feature.
4.2 Experimental results
In this section, we show the effectiveness of the proposed method. First, we compare the three methods described in Section 3: PARENT METHOD, ANCESTOR METHOD, and PARENT-ANCESTOR METHOD. The results are shown in Table 2.
Here, dependency accuracy is the percentage of correct dependencies (correct parent-child relations in the trees in the test data), and sentence accuracy is the percentage of sentences in which all the modifiees are determined correctly (correctly constructed trees in the test data).
Table 2 shows that PARENT-ANCESTOR METHOD is more accurate than the other two methods. In other words, the accuracy of dependency analysis improves by utilizing the redundant information.

Table 2: Result of dependency analysis using the methods described in Section 3

Table 3: Comparison to conventional methods (Dynamic A, B; Original)
Next, we compare the proposed method with conventional methods, in particular with Kudo and Matsumoto (2002) under the same feature set.
The reasons are that the Cascaded Chunking Model proposed in Kudo and Matsumoto (2002) is used in a popular Japanese dependency analyzer, CaboCha, and that the comparison can highlight the effectiveness of our approach, because we can experiment under the same conditions (e.g., dataset, feature set, learning algorithm).
A summary of the comparison is shown in Table 3. Table 3 shows that the proposed method outperforms the conventional methods except Sassano (2004) (we have not tested the improvement statistically because we do not have access to the conventional methods), while Sassano (2004) used richer features which are not used in the proposed method, such as features for conjunctive structures based on Kurohashi and Nagao (1994) and features concerning the leftmost content word in the candidate modifiee.
Table 4: Accuracy of dependency analysis on parallel structures (parallel structures / other than parallel structures; rows include PARENT-ANCESTOR)

The comparison of the proposed method with Sassano (2004)'s method without the features of
conjunctive structures (w/o Conj) and without the richer features derived from the words in chunks (w/o Rich) suggests that the proposed method is better than or comparable to Sassano (2004)'s method.
5 Discussion

5.1 Performance on parallel structures
As mentioned in Section 1, the ancestor-descendant relation is supposed to help capture parallel structures. In this section, we discuss the performance of dependency analysis on parallel structures.
Parallel structures, such as those of nouns (e.g., Tom and Ken eat hamburgers.) and those of verbs (e.g., Tom eats hamburgers and drinks water.), are marked in the Kyoto University text corpus. We investigate the accuracy of dependency analysis on parallel structures using this information.
Table 4 shows that the accuracy on parallel structures improves by adding the ancestor-descendant relation. The improvement is statistically significant under a sign test at the 1% significance level.
Table 4 also shows that the error reduction rate on parallel structures from adding the ancestor-descendant relation is 8.3%, while the rate on the others is 4.7%. These results show that the ancestor-descendant relation works especially well for parallel structures.
In Table 4, the accuracy on parallel structures using PARENT METHOD is slightly better than that using ANCESTOR METHOD, although the difference is not statistically significant under the sign test. This shows that the parent-child relation is also necessary for capturing the characteristics of parallel structures.
Consider the following two instances in Figure 1 as an example: the ordered pair of ID 3 (pizza-and) and ID 5 (ate), and the ordered pair of ID 4 (salad-accusative) and ID 5.
In ANCESTOR METHOD, both instances are positive instances. On the other hand, only the ordered pair of ID 4 and ID 5 is a positive instance in PARENT METHOD.
Table 5: Comparison between usages of the ancestor-descendant relation (columns: dependency accuracy / sentence accuracy)
Hence, PARENT METHOD can learn the appropriate case particles in a modifier of a verb. For example, the particle which means "and" does not modify verbs.
However, it is difficult for ANCESTOR METHOD to learn this characteristic. Therefore, both parent-child and ancestor-descendant relations are necessary for capturing parallel structures.
5.2 Discussion on usages of the ancestor-descendant relation
In the proposed method, MODEL_ANCESTOR, which judges whether the relation between two nodes is ancestor-descendant or not, is prepared, and the information on the ancestor-descendant relation is utilized directly.
On the other hand, conventional methods add features regarding the ancestor or descendant chunk to capture the ancestor-descendant relation.
In this section, we empirically show that the proposed method utilizes the information on the ancestor-descendant relation more effectively than conventional methods.
The results in the previous sections could not show this effectiveness, because MODEL_PARENT and MODEL_ANCESTOR in the proposed method use the features regarding the ancestor-descendant relation.
Table 5 shows the results of dependency analysis using the two types of usage of the information on the ancestor-descendant relation. "Feature" indicates the conventional usage and "Model" indicates our usage.
Please note that MODEL_PARENT and MODEL_ANCESTOR used in "Model" do not use the features regarding the ancestor-descendant relation.
Table 5 shows that our usage is more effective than the conventional usage. This is because our usage takes advantage of redundancy in terms of a coding problem, as described in the next section.
Moreover, the features learned through the proposed method would include more information than ad-hoc features that were manually added.
5.3 Proposed method in terms of a coding problem
In a coding problem, redundancy is effectively utilized so that information can be transmitted more reliably (Mackay, 2003). This idea is the same as the main point of the proposed method.
In this section, we discuss the proposed method in terms of a coding problem.
In a coding problem, when encoding information, redundant bits are attached so that the added redundancy helps errors be corrected.
Moreover, the following fact is known (Mackay, 2003): the error-correcting ability is higher when the distances between the codewords are longer. (1)
For example, consider the following three types of encodings: (A) two events are encoded into the codewords −1 and 1, respectively (the simplest encoding); (B) into the codewords (−1, −1, 1) and (1, 1, 1) (Hamming distance: 2); and (C) into the codewords (−1, −1, −1) and (1, 1, 1) (Hamming distance: 3).
Please note that the Hamming distance is defined as the number of bits that differ between two codewords.
In (A), the correct information is not transmitted if a one-bit error occurs. In (B), if an error occurs in the third bit, the error can be corrected by assuming that the original codeword is the one closest to the received codeword.
In (C), any one-bit error can be corrected. Thus, (B) has higher error-correcting ability than (A), and (C) has higher error-correcting ability than (B).
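The following toy Python sketch makes the nearest-codeword decoding in encoding (C) concrete (illustrative only; the event names are hypothetical):

    # Nearest-codeword decoding under encoding (C): any single-bit error
    # is corrected because the two codewords are Hamming distance 3 apart.
    def hamming(x, y):
        return sum(a != b for a, b in zip(x, y))

    codewords = {(-1, -1, -1): "event A", (1, 1, 1): "event B"}

    received = (-1, 1, -1)                  # (-1, -1, -1) with a one-bit error
    decoded = min(codewords, key=lambda c: hamming(c, received))
    assert codewords[decoded] == "event A"  # the error is corrected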
We now explain the problem of determining the parent node of a target node in the proposed method in terms of coding theory. A sequence of numbers corresponds to a codeword.
It is assumed that the codeword which expresses the correct parent node of the target node is transmitted. The codeword is transmitted to the receiver through a channel, which corresponds to the learned model.
The receiver infers the parent node from the received sequence (string_output), in consideration of the codewords that can be transmitted (string[k]).
Therefore, the error-correcting ability, i.e., the ability to correct the errors in the predictions in step 3, depends on the distances between the codewords (string[k]).
The codewords in PARENT-ANCESTOR METHOD are the concatenation of the bits based on both parent-child relations and ancestor-descendant relations. Consequently, the distances between codewords in PARENT-ANCESTOR METHOD are longer than those in PARENT METHOD or ANCESTOR METHOD.
From (1), the error-correcting ability is therefore expected to be higher. In terms of a coding problem, the proposed method exploits the essence of (1) and thereby utilizes ancestor-descendant relations effectively.
The above discussion assumes that every bit added as redundancy is transmitted correctly. However, some of these added bits may be transmitted wrongly in the proposed method.
In that case, the added redundancy may cause an error rather than help errors be corrected.
In the experiments on dependency analysis, the advantage prevails over the disadvantage, because the accuracy of each bit of the codeword is 94.5%, which is a high value.
Discussion on applicability of existing codes
A number of approaches use Error Correcting Output Coding (ECOC) (Dietterich and Bakiri, 1995; Ghani, 2000) to solve multiclass classification problems as a coding problem.
These approaches assign a unique n-bit codeword to each class, and then n classifiers are trained to predict each bit.
The predicted class is the one whose codeword is closest to the codeword produced by the classifiers.
The codewords in these approaches are designed to be well separated from one another and to have sufficient error-correcting ability (e.g., a BCH code).
However, these existing codewords are not applicable to the proposed method.
In the proposed method, we have two models, respectively derived from the parent-child and the ancestor-descendant relation, which can be interpreted in terms of both linguistic aspects and tree structures.
If we use ECOC, however, pairs of nodes are divided into positive and negative instances arbitrarily.
Since this division lacks linguistic or structural meaning, the training instances would lose consistency and no proper model would be obtained.
Moreover, we would have to prepare different models for each stage of tree construction, because the length of the codewords varies according to the number of nodes in the current tree.
Table 6: Result of dependency analysis using various distance functions (rows: PARENT, ANCESTOR, and the proposed method, each with (n) and (f) variants, grouped by the Hamming, Euclidean, cosine, and Manhattan distances; columns: dependency accuracy and sentence accuracy)
5.4 Influence of distance functions
In this section, we compare the performance of dependency analysis with various distance functions: the Hamming distance, the Euclidean distance, the cosine distance, and the Manhattan distance.
These distance functions between sequences X = "x1 x2 ... xn" and Y = "y1 y2 ... yn" are defined as follows:

• Ham(X, Y) = the number of positions i at which xi ≠ yi
• Euc(X, Y) = √(Σi (xi − yi)²)
• Cos(X, Y) = 1 − (Σi xi·yi) / (‖X‖ ‖Y‖)
• Man(X, Y) = Σi |xi − yi|
For the Hamming distance, string_output is converted to a binary sequence whose elements are −1 or 1.
The cosine distance is equivalent to the Euclidean distance under the condition that the absolute value of every component of string[k] is 1.
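For concreteness, the four functions could be implemented as follows (a straightforward sketch of the standard definitions; math.hypot with multiple arguments requires Python 3.8+):

    import math

    # The four distance functions compared in this section.
    def ham(x, y):                       # x binarized to +/-1 beforehand
        return sum(a != b for a, b in zip(x, y))

    def euc(x, y):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    def cos(x, y):
        dot = sum(a * b for a, b in zip(x, y))
        return 1 - dot / (math.hypot(*x) * math.hypot(*y))

    def man(x, y):
        return sum(abs(a - b) for a, b in zip(x, y))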
The results of dependency analysis using these distance functions are shown in Table 6.
In Table 6, '(n)' means that the nearest chunk in the sentence is selected as the modifiee in order to break a tie, which happens when the number of sequences satisfying the condition in step 5 is two or more, while '(f)' means that the furthest chunk is selected.
If the results in the cases of (n) and (f) are the same, (n) and (f) are omitted and only one result is shown.
Table 6 shows that the proposed method outperforms PARENT METHOD and ANCESTOR METHOD for every distance function. This means that the effectiveness of the proposed method does not depend on the distance function.
The result using the Hamming distance is much worse than those using the other distance functions. This means that using the scores output by the SVMs as the likeliness of a certain relation improves the accuracy.
The results of (n) and (f) with the Hamming distance are different. This is because Hamming distances are always integers, so ties are more likely to happen.
Table 6 also shows that the results with the cosine and Euclidean distances are better than that with the Manhattan distance.
6 Conclusions
We proposed a novel method for Japanese dependency analysis which determines the modifiee of each chunk based on the likeliness not only of the parent-child relation but also of the ancestor-descendant relation in a dependency tree.
The ancestor-descendant relation makes it possible to capture parallel structures in more depth.
In terms of coding theory, the proposed method boosts the error-correcting ability by adding redundant bits based on ancestor-descendant relations, thereby increasing the distance between codewords.
Experimental results showed the effectiveness of the proposed method. In addition, the results showed that the proposed method outperforms conventional methods.
Future work includes the following. In this paper, we used the features proposed in Kudo and Matsumoto (2002). By extracting new features that are more suitable for the ancestor-descendant relation, we could further improve our method.
The features used by Sassano (2004) are promising as well.
We are also planning to apply the proposed method to other tasks which require the construction of tree structures. For example, (zero-)anaphora resolution is considered a good candidate task for application.
