This
paper
proposes
a
tree
kernel
with
context-sensitive
structured
parse
tree
information
for
relation
extraction
.
It
resolves
two
critical
problems
in
previous
tree
kernels
for
relation
extraction
in
two
ways
.
First
,
it
automatically
determines
a
dynamic
context-sensitive
tree
span
for
relation
extraction
by
extending
the
widely-used
Shortest
Path-enclosed
Tree
(
SPT
)
to
include
necessary
context
information
outside
SPT
.
Second
,
it
proposes
a
context-sensitive
convolution
tree
kernel
,
which
enumerates
both
context-free
and
context-sensitive
sub-trees
by
considering
their
ancestor
node
paths
as
their
contexts
.
Evaluation
on
the
ACE
RDC
corpora
shows
that
our
dynamic
context-sensitive
tree
span
is
much
more
suitable
for
relation
extraction
than
SPT
and
our
tree
kernel
outperforms
the
state-of-the-art
Collins
and
Duffy
's
convolution
tree
kernel
.
It
also
shows
that
our
tree
kernel
achieves
much
better
performance
than
the
state-of-the-art
linear
kernels
.
Finally, it shows that feature-based and tree kernel-based methods complement each other well and that the composite kernel can effectively integrate both flat and structured features.
1
Introduction
Relation
extraction
is
to
find
various
predefined
semantic
relations
between
pairs
of
entities
in
text
.
The
research
in
relation
extraction
has
been
promoted
by
the
Message
Understanding
Conferences
(
MUCs
)
(
MUC
,
1987-1998
)
and
the
NIST
Automatic
Content
Extraction
(
ACE
)
program
(
ACE
,
2002-2005
)
.
According
to
the
ACE
Program
,
an
entity
is
an
object
or
a
set
of
objects
in
the
world
and
a
relation
is
an
explicitly
or
implicitly
stated
relationship
among
entities
.
For
example
,
the
sentence
"
Bill
Gates
is
the
chairman
and
chief
software
architect
of
Microsoft
Corporation
.
"
conveys
the
ACE-style
relation
"
EMPLOYMENT.exec
"
between
the
entities
"
Bill
Gates
"
(
person
name
)
and
"
Microsoft
Corporation
"
(
organization
name
)
.
Extraction
of
semantic
relations
between
entities
can
be
very
useful
in
many
applications
such
as
question
answering
,
e.g.
to
answer
the
query
"
Who
is
the
president
of
the
United
States
?
"
,
and
information
retrieval
,
e.g.
to
expand
the
query
"
George
W.
Bush
"
with
"
the
president
of
the
United
States
"
via
his
relationship
with
"
the
United
States
"
.
Much research has been done in relation extraction.
Among
them
,
feature-based
methods
(
Kambhatla
2004
;
Zhou
et
al.
,
2005
)
achieve
certain
success
by
employing
a
large
amount
of
diverse
linguistic
features
,
varying
from
lexical
knowledge
,
entity-related
information
to
syntactic
parse
trees
,
dependency
trees
and
semantic
information
.
However
,
it
is
difficult
for
them
to
effectively
capture
structured
parse
tree
information
(
Zhou
et
al
2005
)
,
which
is
critical
for
further
performance
improvement
in
relation
extraction
.
As
an
alternative
to
feature-based
methods
,
tree
kernel-based
methods
provide
an
elegant
solution
to
explore
implicitly
structured
features
by
directly
computing
the
similarity
between
two
trees
.
Although earlier research (Zelenko et al 2003; Culotta and Sorensen 2004; Bunescu and Mooney 2005a) only achieved success on simple tasks and failed on complex tasks, such as the ACE RDC task, tree kernel-based methods have made much progress recently.
As
the
state-of-the-art
,
Zhang
et
al
(
2006
)
applied
the
convolution
tree
kernel
(
Collins
and
Duffy
2001
)
and
achieved
comparable
performance
with
a
state-of-the-art
linear
kernel
(
Zhou
et
al
2005
)
on
the
5
relation
types
in
the
ACE
RDC
2003
corpus
.
However
,
there
are
two
problems
in
Collins
and
Duffy
's
convolution
tree
kernel
for
relation
extraction
.
The
first
is
that
the
sub-trees
enumerated
in
the
tree
kernel
computation
are
context-free
.
That is, each sub-tree enumerated in the tree kernel computation does not consider the context information outside the sub-tree.

[Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 728-736, Prague, June 2007. © 2007 Association for Computational Linguistics]
The
second
is
to
decide
a
proper
tree
span
in
relation
extraction
.
Zhang
et
al
(
2006
)
explored
five
tree
spans
in
relation
extraction
and
it
was
a
bit
surprising
to
find
that
the
Shortest
Path-enclosed
Tree
(
SPT
,
i.e.
the
sub-tree
enclosed
by
the
shortest
path
linking
two
involved
entities
in
the
parse
tree
)
performed
best
.
This is contrary to our intuition.
For
example
,
"
got
married
"
is
critical
to
determine
the
relationship
between
"
John
"
and
"
Mary
"
in
the
sentence
"
John
and
Mary
got
married
...
"
as
shown
in
Figure
1
(
e
)
.
It
is
obvious
that
the
information
contained
in
SPT
(
"
John
and
Mary
"
)
is
not
enough
to
determine
their
relationship
.
This
paper
proposes
a
context-sensitive
convolution
tree
kernel
for
relation
extraction
to
resolve
the
above
two
problems
.
It
first
automatically
determines
a
dynamic
context-sensitive
tree
span
for
relation
extraction
by
extending
the
Shortest
Path-enclosed
Tree
(
SPT
)
to
include
necessary
context
information
outside
SPT
.
Then
it
proposes
a
context-sensitive
convolution
tree
kernel
,
which
not
only
enumerates
context-free
sub-trees
but
also
context-sensitive
sub-trees
by
considering
their
ancestor
node
paths
as
their
contexts
.
Moreover
,
this
paper
evaluates
the
complementary
nature
of
different
linear
kernels
and
tree
kernels
via
a
composite
kernel
.
The
layout
of
this
paper
is
as
follows
.
In
Section
2
,
we
review
related
work
in
more
detail
.
Then
,
the
dynamic
context-sensitive
tree
span
and
the
context-sensitive
convolution
tree
kernel
are
proposed
in
Section
3
while
Section
4
shows
the
experimental
results
.
Finally
,
we
conclude
our
work
in
Section
5
.
2
Related
Work
The
relation
extraction
task
was
first
introduced
as
part
of
the
Template
Element
task
in
MUC6
and
then
formulated
as
the
Template
Relation
task
in
MUC7
.
Since
then
,
many
methods
,
such
as
feature-based
(
Kambhatla
2004
;
Zhou
et
al
2005
,
2006
)
,
tree
kernel-based
(
Zelenko
et
al
2003
;
Culotta
and
Sorensen
2004
;
Bunescu
and
Mooney
2005a
;
Zhang
et
al
2006
)
and
composite
kernel-based
(
Zhao
and
Grishman
2005
;
Zhang
et
al
2006
)
,
have
been
proposed
in
literature
.
For
the
feature-based
methods
,
Kambhatla
(
2004
)
employed
Maximum
Entropy
models
to
combine
diverse
lexical
,
syntactic
and
semantic
features
in
relation
extraction
,
and
achieved
the
F-measure
of
52.8
on
the
24
relation
subtypes
in
the
ACE
RDC
2003
corpus
.
Zhou
et
al
(
2005
)
further
systematically
explored
diverse
features
through
a
linear
kernel
and
Support
Vector
Machines
,
and
achieved
the
F
-
measures
of
68.0
and
55.5
on
the
5
relation
types
and
the
24
relation
subtypes
in
the
ACE
RDC
2003
corpus
respectively
.
One
problem
with
the
feature-based
methods
is
that
they
need
extensive
feature
engineering
.
Another
problem
is
that
,
although
they
can
explore
some
structured
information
in
the
parse
tree
(
e.g.
Kambhatla
(
2004
)
used
the
non-terminal
path
connecting
the
given
two
entities
in
a
parse
tree
while
Zhou
et
al.
(
2005
)
introduced
additional
chunking
features
to
enhance
the
performance
)
,
it has been found difficult to preserve structured information in the parse trees well using the feature-based methods.
Zhou
et
al
(
2006
)
further
improved
the
performance
by
exploring
the
commonality
among
related
classes
in
a
class
hierarchy
using
hierarchical
learning
strategy
.
As
an
alternative
to
the
feature-based
methods
,
the
kernel-based
methods
(
Haussler
,
1999
)
have
been
proposed
to
implicitly
explore
various
features
in
a
high
dimensional
space
by
employing
a
kernel
to
calculate
the
similarity
between
two
objects
directly
.
In
particular
,
the
kernel-based
methods
could
be
very
effective
at
reducing
the
burden
of
feature
engineering
for
structured
objects
in
NLP
researches
,
e.g.
the
tree
structure
in
relation
extraction
.
Zelenko
et
al.
(
2003
)
proposed
a
kernel
between
two
parse
trees
,
which
recursively
matches
nodes
from
roots
to
leaves
in
a
top-down
manner
.
For
each
pair
of
matched
nodes
,
a
subsequence
kernel
on
their
child
nodes
is
invoked
.
They
achieved
considerable success
on
two
simple
relation
extraction
tasks
.
Culotta
and
Sorensen
(
2004
)
extended
this
work
to
estimate
similarity
between
augmented
dependency
trees
and
achieved
the
F-measure
of
45.8
on
the
5
relation
types
in
the
ACE
RDC
2003
corpus
.
One
problem
with
the
above
two
tree
kernels
is
that
matched
nodes
must
be
at
the
same
height
and
have
the
same
path
to
the
root
node
.
Bunescu
and
Mooney
(
2005a
)
proposed
a
shortest
path
dependency
tree
kernel
,
which
just
sums
up
the
number
of
common
word
classes
at
each
position
in
the
two
paths
,
and
achieved
the
F-measure
of
52.5
on
the
5
relation
types
in
the
ACE
RDC
2003
corpus
.
They
argued
that
the
information
to
model
a
relationship
between
two
entities
can
be
typically
captured
by
the
shortest
path
between
them
in
the
dependency
graph
.
While the shortest path may not be able to preserve structured dependency tree information well, another problem with their kernel is that the two paths must have the same length.
This makes it suffer from behavior similar to that of Culotta and Sorensen (2004): high precision but very low recall.
As
the
state-of-the-art
tree
kernel-based
method
,
Zhang
et
al
(
2006
)
explored
various
structured
feature
spaces
and
used
the
convolution
tree
kernel
over
parse
trees
(
Collins
and
Duffy
2001
)
to
model
syntactic
structured
information
for
relation
extraction
.
They
achieved
the
F-measures
of
61.9
and
63.6
on
the
5
relation
types
of
the
ACE
RDC
2003
corpus
and
the
7
relation
types
of
the
ACE
RDC
2004
corpus
respectively
without
entity-related
information
while
the
F-measure
on
the
5
relation
types
in
the
ACE
RDC
2003
corpus
reached
68.7
when
entity-related
information
was
included
in
the
parse
tree
.
One
problem
with
Collins
and
Duffy
's
convolution
tree
kernel
is
that
the
sub-trees
involved
in
the
tree
kernel
computation
are
context-free
,
that
is
,
they
do
not
consider
the
information
outside
the
sub-trees
.
This
is
different
from
the
tree
kernel
in
Culotta
and
Sorensen
(
2004
)
,
where
the
sub-trees
involved
in
the
tree
kernel
computation
are
context-sensitive
(
that
is
,
with
the
path
from
the
tree
root
node
to
the
sub-tree
root
node
in
consideration
)
.
Zhang
et
al
(
2006
)
also
showed
that
the
widely-used
Shortest
Path-enclosed
Tree
(
SPT
)
performed
best
.
One
problem
with
SPT
is
that
it
fails
to
capture
the
contextual
information
outside
the
shortest
path
,
which
is
important
for
relation
extraction
in
many
cases
.
Our
random
selection
of
100
positive
training
instances
from
the
ACE
RDC
2003
training
corpus
shows
that
~
25
%
of
the
cases
need
contextual
information
outside
the
shortest
path
.
Among
other
kernels
,
Bunescu
and
Mooney
(
2005b
)
proposed
a
subsequence
kernel
and
applied
it
in
protein
interaction
and
ACE
relation
extraction
tasks
.
In
order
to
integrate
the
advantages
of
feature-based
and
tree
kernel-based
methods
,
some
researchers
have
turned
to
composite
kernel-based
methods
.
Zhao
and
Grishman
(
2005
)
defined
several
feature-based
composite
kernels
to
integrate
diverse
features
for
relation
extraction
and
achieved
the
F-measure
of
70.4
on
the
7
relation
types
of
the
ACE
RDC
2004
corpus
.
Zhang
et
al
(
2006
)
proposed
two
composite
kernels
to
integrate
a
linear
kernel
and
Collins
and
Duffy
's
convolution
tree
kernel
.
It
achieved
the
F-measure
of
70.9
/
57.2
on
the
5
relation
types
/
24
relation
subtypes
in
the
ACE
RDC
2003
corpus
and
the
F-measure
of
72.1
/
63.6
on
the
7
relation
types
/
23
relation
subtypes
in
the
ACE
RDC
2004
corpus
.
The
above
discussion
suggests
that
structured
information
in
the
parse
tree
may
not
be
fully
utilized
in
the
previous
works
,
regardless
of
feature-based
,
tree
kernel-based
or
composite
kernel-based
methods
.
Compared
with
the
previous
works
,
this
paper
proposes
a
dynamic
context-sensitive
tree
span
trying
to
cover
necessary
structured
information
and
a
context-sensitive
convolution
tree
kernel
considering
both
context-free
and
context-sensitive
sub-trees
.
Furthermore,
a
composite
kernel
is
applied
to
combine
our
tree
kernel
and
a
state-of-the-art
linear
kernel
for
integrating
both
flat
and
structured
features
in
relation
extraction
as
well
as
validating
their
complementary
nature
.
3
Context
Sensitive
Convolution
Tree
Kernel
for
Relation
Extraction
In
this
section
,
we
first
propose
an
algorithm
to
dynamically
determine
a
proper
context-sensitive
tree
span
and
then
a
context-sensitive
convolution
tree
kernel
for
relation
extraction
.
3.1
Dynamic
Context-Sensitive
Tree
Span
in
Relation
Extraction
A
relation
instance
between
two
entities
is
encapsulated
by
a
parse
tree
.
Thus
,
it
is
critical
to
understand
which
portion
of
a
parse
tree
is
important
in
the
tree
kernel
calculation
.
Zhang
et
al
(
2006
)
systematically
explored
seven
different
tree
spans
,
including
the
Shortest
Path-enclosed
Tree
(
SPT
)
and
a
Context-Sensitive
Path-enclosed
Tree1
(
CSPT
)
,
and
found
that
SPT
performed
best
.
That
is
,
SPT
even
outperforms
CSPT
.
This
is
contrary
to
our
intuition
.
For
example
,
"
got
married
"
is
critical
to
determine
the
relationship
between
"
John
"
and
"
Mary
"
in
the
sentence
"
John
and
Mary
got
married
.
"
as
shown
in
Figure
1
(
e
)
,
and
the
information
contained
in
SPT
(
"
John
and
Mary
"
)
is
not
enough
to
determine
their
relationship
.
Obviously
,
context-sensitive
tree
spans
should
have
the
potential
for
better
performance
.
One
problem
with
the
context-sensitive
tree
span
explored
in
Zhang
et
al
(
2006
)
is
that
it
only
considers
the
availability
of
entities
'
siblings
and
fails
to
consider
the following
two
factors
:
1
)
Is the information contained in SPT enough to determine the relationship between two entities?
It
depends
.
In
the
embedded
cases
,
SPT
is
enough
.
For
example
,
"
John
's
wife
"
is
enough
to
determine
the
relationship
between
"
John
"
and
"
John
's
wife
"
in
the
sentence
"
John
's
wife
got
a
good
job
.
"
as
shown
in
Figure
1
(
a
)
.
However
,
SPT
is
not
enough
in
the
coordinated
cases
,
e.g.
to
determine
the
relationship
between
"
John
"
and
"
Mary
"
in
the
sentence
"
John
and
Mary
got
married
.
"
as
shown
in
Figure
1
(
e
)
.
1 CSPT means SPT extended with the 1st left sibling of the node of entity 1 and the 1st right sibling of the node of entity 2. In the case of no available sibling, it moves to the parent of the current node and repeats the same process until a sibling is available or the root is reached.
Based
on
the
above
observations
,
we
implement
an
algorithm
to
determine
the
necessary
tree
span
for
the
relation
extraction
task
.
The
idea
behind
the
algorithm
is
that
the
necessary
tree
span
for
a
relation
should
be
determined
dynamically
according
to
its
tree
span
category
and
context
.
Given
a
parsed
tree
and
two
entities
in
consideration
,
it
first
determines
the
tree
span
category
and
then
extends
the
tree
span
accordingly
.
By
default
,
we
adopt
the
Shortest
Path-enclosed
Tree
(
SPT
)
as
our
tree
span
.
We
only
expand
the
tree
span
when
the
tree
span
belongs
to
the
"
predicate-linked
"
category
.
This
is
based
on
our
observation
that
the
tree
spans
belonging
to
the
"
predicate-linked
"
category
vary
much
syntactically
and
the majority
(
~
70
%
)
of
them
need
information
outside
SPT
while
it
is
quite
safe
(>90%)
to
use
SPT
as
the
tree
span
for
the
remaining
categories
.
In
our
algorithm
,
the
expansion
is
done
by
first
moving
up
until
a
predicate-headed
phrase
is
found
and
then
moving
down
along
the
predicated-headed
path
to
the
predicate
terminal
node
.
Figure
1
(
e
)
shows
an
example
for
the
"
predicate-linked
"
category
where
the
lines
with
arrows
indicate
the
expansion
path
.
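The expansion step just described can be sketched in a few lines of Python. This is a hypothetical illustration, not the authors' implementation: spans are represented simply as lists of ancestor labels from the tree root down to the span root, and a VP label stands in for "predicate-headed phrase".

```python
# Hypothetical sketch of the dynamic tree-span decision described above:
# keep SPT by default and expand it only for the "predicate-linked"
# category. Spans are lists of ancestor labels from the tree root down
# to the span root; names are illustrative, not the authors' code.

def is_predicate_headed(label):
    # a verb phrase stands in for "predicate-headed phrase" in this sketch
    return label.startswith("VP")

def expand_span(path_to_spt_root):
    """Move up from the SPT root until a predicate-headed phrase is found
    and make that ancestor the new span root."""
    for i in range(len(path_to_spt_root) - 1, -1, -1):
        if is_predicate_headed(path_to_spt_root[i]):
            return path_to_spt_root[: i + 1]
    return path_to_spt_root  # no predicate-headed ancestor: keep SPT

def choose_span(category, path_to_spt_root):
    if category == "predicate-linked":
        return expand_span(path_to_spt_root)
    return path_to_spt_root  # SPT is safe for the other categories

# "John and Mary got married": SPT covers only the coordinated NP,
# so the span is expanded up to the VP containing "got married".
print(choose_span("embedded", ["S", "NP"]))                # ['S', 'NP']
print(choose_span("predicate-linked", ["S", "VP", "NP"]))  # ['S', 'VP']
```

In a real parse tree the second move ("down along the predicate-headed path to the predicate terminal node") would then collect the nodes on that path into the span; the sketch only shows the span-root decision.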
[Figure 1: Different tree span categories with SPT (dotted circle) and an example of the dynamic context-sensitive tree span (solid circle): a) embedded, b) PP-linked, c) semi-structured, d) descriptive, e) predicate-linked (SPT and the dynamic context-sensitive tree span).]

[Figure 2: Examples of context-free and context-sensitive sub-trees related to Figure 1(b). Note: the bold node is the root of a sub-tree.]
A
problem
with
our
algorithm
is
how
to
determine
whether
an
entity
pair
belongs
to
the
"
predicate-linked
"
category
.
In
this
paper
,
a
simple
method
is
applied
by
regarding
the
"
predicate-linked
"
category
as
the
default
category
.
That is, those entity pairs, which do not belong to the four well-defined and easily detected categories (i.e. embedded, PP-linked, semi-structured and descriptive), are classified into the "predicate-linked" category.
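This default-category rule can be sketched as follows. The detector predicates and the span-based feature names here are hypothetical placeholders for the paper's category tests; only the fall-through logic is taken from the text.

```python
# A toy sketch of the default-category rule described above: an entity
# pair is tested against the four well-defined categories, and anything
# left over falls into "predicate-linked". The detector functions are
# hypothetical placeholders, not the authors' actual detectors.

def categorize(pair, detectors):
    """detectors: mapping of category name -> predicate over the pair."""
    for name, detect in detectors.items():
        if detect(pair):
            return name
    return "predicate-linked"  # the default category

detectors = {
    # e1 nested inside e2's token span, as in "John" / "John's wife"
    "embedded": lambda p: p["e1_span"][0] >= p["e2_span"][0]
    and p["e1_span"][1] <= p["e2_span"][1],
    "PP-linked": lambda p: p.get("linked_by_pp", False),
    "semi-structured": lambda p: p.get("in_list_or_title", False),
    "descriptive": lambda p: p.get("descriptive", False),
}

print(categorize({"e1_span": (0, 1), "e2_span": (0, 3)}, detectors))
# -> embedded
print(categorize({"e1_span": (0, 1), "e2_span": (2, 3)}, detectors))
# -> predicate-linked (none of the four detectors fired)
```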
The fact that "predicate-linked" instances occupy only ~20% of the cases explains why SPT performs better than the Context-Sensitive Path-enclosed Tree (CSPT) as described in Zhang et al (2006): consistently adopting CSPT may introduce too much noise/unnecessary information into the tree kernel.
3.2
Context-Sensitive
Convolution
Tree
Kernel
Given
any
tree
span
,
e.g.
the
dynamic
context-sensitive
tree
span
in
the
last
subsection
,
we
now
study
how
to
measure
the
similarity
between
two
trees
,
using
a
convolution
tree
kernel. A
convolution
kernel
(
Haussler
D.
,
1999
)
aims
to
capture
structured
information
in
terms
of
substructures
.
As
a
specialized
convolution
kernel
,
Collins
and
Duffy
's
convolution
tree
kernel
K_C(T1, T2) ('C' for convolution) counts the number of common sub-trees (substructures) as the syntactic structure similarity between two parse trees T1 and T2 (Collins and Duffy 2001). It is computed by summing a recursive match Δ(n1, n2) over all node pairs n1 in T1 and n2 in T2, where Δ(n1, n2) is defined as follows:
1) If the context-free productions (Context-Free Grammar (CFG) rules) at n1 and n2 are different, Δ(n1, n2) = 0; otherwise go to 2.

2) If both n1 and n2 are POS tags, Δ(n1, n2) = 1 × λ; otherwise go to 3.

3) Otherwise, Δ(n1, n2) = λ ∏_k (1 + Δ(ch(n1, k), ch(n2, k))), where ch(n, k) is the kth child of node n.

Here, λ (0 < λ < 1) is the decay factor in order to make the kernel value less variable with respect to different sub-tree sizes.
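The recursion above can be sketched over a toy nested-tuple representation of parse trees. The representation and names are illustrative assumptions, not the authors' implementation:

```python
# A minimal sketch of the Delta recursion above, over parse trees written
# as nested tuples (label, child, ...); leaves are (POS, word) pairs.
# This representation is an assumption for illustration only.

LAMBDA = 0.5  # decay factor, 0 < lambda < 1

def production(node):
    """The CFG rule at a node: its label plus its children's labels/words."""
    return (node[0],
            tuple(ch[0] if isinstance(ch, tuple) else ch for ch in node[1:]))

def is_pos_tag(node):
    """A POS-tag node has a single word string as its only child."""
    return len(node) == 2 and isinstance(node[1], str)

def delta(n1, n2):
    # Step 1: different productions -> no common sub-tree rooted here
    if production(n1) != production(n2):
        return 0.0
    # Step 2: matching POS tags contribute 1 * lambda
    if is_pos_tag(n1) and is_pos_tag(n2):
        return 1 * LAMBDA
    # Step 3: otherwise recurse over the aligned children
    result = LAMBDA
    for c1, c2 in zip(n1[1:], n2[1:]):
        result *= 1 + delta(c1, c2)
    return result

t1 = ("NP", ("DT", "the"), ("NN", "man"))
t2 = ("NP", ("DT", "a"), ("NN", "dog"))
print(delta(t1, t1))  # 1.125 = lambda * (1 + lambda)^2
print(delta(t1, t2))  # 0.5: only the bare NP -> DT NN production matches
```

The full kernel K_C(T1, T2) then sums delta over all node pairs of the two trees.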
This
convolution
tree
kernel
has
been
successfully
applied
by
Zhang
et
al
(
2006
)
in
relation
extraction
.
However
,
there
is
one
problem
with
this
tree
kernel
:
the
sub-trees
involved
in
the
tree
kernel
computation
are
context-free
(
That
is
,
they
do
not
consider
the
information
outside
the
sub-trees
)
.
This
is
contrast
to
the
tree
kernel
proposed
in
Culota
and
Sorensen
(
2004
)
which
is
context-sensitive
,
that
is
,
it
considers
the
path
from
the
tree
root
node
to
the
sub-tree
root
node
.
In
order
to
integrate
the
advantages
of
both
tree
kernels
and
resolve
the
problem
in
Collins
and
Duffy
's
convolution
tree
kernel
,
this
paper
proposes
a
context-sensitive
convolution
tree
kernel
.
It
works
by
taking
ancestral
information
(
i.e.
the
root
node
path
)
of
sub-trees
into
consideration
:
• N1^i[j] is the set of root node paths with length i in tree T[j], while the maximal length of a root node path is defined by m. Each node in a root node path n1^i[j] is augmented with the POS tag of its head word.
1) If the context-sensitive productions (Context-Sensitive Grammar (CSG) rules) with root node paths n1^i[1] and n1^i[2] are different, return Δ(n1^i[1], n1^i[2]) = 0; otherwise go to Step 2.

2) If both n1^i[1] and n1^i[2] end at POS tag nodes, Δ(n1^i[1], n1^i[2]) = λ; otherwise go to Step 3.
That
is
,
each
node
n
encodes
the
identity
of
a
subtree
rooted
at
n
and
,
if
there
are
two
nodes
in
the
tree
with
the
same
label
,
the
summation
will
go
over
both
of
them
.
3
That
is
,
each
root
node
path
n1i
encodes
the
identity
of
a
context-sensitive
sub-tree
rooted
at
n1i
and
,
if
there
are
two
root
node
paths
in
the
tree
with
the
same
label
sequence
,
the
summation
will
go
over
both
of
them
.
where ch(n1^i[j], k) is the kth context-sensitive child of the context-sensitive sub-tree rooted at n1^i[j], with #ch(n1^i[j]) the number of context-sensitive children.
Here, λ (0 < λ < 1) is the decay factor in order to make the kernel value less variable with respect to different sizes of the context-sensitive sub-trees.
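The root node paths that the context-sensitive kernel sums over can be enumerated as sketched below, reusing the nested-tuple tree representation assumed earlier (an illustration, not the authors' code). With m = 1 the enumeration degenerates to the context-free sub-tree roots of Collins and Duffy's kernel.

```python
# An illustrative sketch of root-node-path enumeration: for every node,
# collect the chains of ancestor labels (length 1..m) ending at that
# node. Trees are nested tuples (label, child, ...); leaves are words.

def root_node_paths(tree, m):
    """Collect all ancestor-label paths of length 1..m ending at each node."""
    paths = []

    def walk(node, ancestors):
        if not isinstance(node, tuple):
            return  # skip word leaves
        chain = ancestors + [node[0]]
        for length in range(1, min(m, len(chain)) + 1):
            paths.append(tuple(chain[-length:]))
        for child in node[1:]:
            walk(child, chain)

    walk(tree, [])
    return paths

tree = ("S", ("NP", ("NNP", "John")), ("VP", ("VBD", "slept")))
print(sorted(set(root_node_paths(tree, 2))))
# m = 1 keeps only the context-free sub-tree roots:
print(sorted(set(root_node_paths(tree, 1))))
```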
It
is
worth
comparing
our
tree
kernel
with
previous
tree
kernels
.
Obviously
,
our
tree
kernel
is
an
extension
of
Collins
and
Duffy
's
convolution
tree
kernel
,
which
is
a
special
case
of
our
tree
kernel
(
if
m
=
1
in
Equation
(
3
)
)
.
Our
tree
kernel
not
only
counts
the
occurrence
of
each
context-free
sub-tree
,
which
does
not
consider
its
ancestors
,
but
also
counts
the
occurrence
of
each
context-sensitive
sub-tree
,
which
considers
its
ancestors
.
As
a
result
,
our
tree
kernel
is
not
limited
by
the
constraints
in
previous
tree
kernels
(
as
discussed
in
Section
2
)
,
such
as
Collins
and
Duffy
(
2001
)
,
Zhang
et
al
(
2006
)
,
Culotta
and
Sorensen
(
2004
)
and
Bunescu
and
Mooney
(
2005a
)
.
Finally
,
let
's
study
the
computational
issue
with
our
tree
kernel
.
Although
our
tree
kernel
takes
the
context-sensitive
sub-trees
into
consideration
,
it
only
slightly
increases
the
computational
burden
,
compared
with
Collins
and
Duffy
's
convolution
tree
kernel
.
This is due to the fact that Δ(n1[1], n1[2]) = 0 holds for the majority of context-free sub-tree pairs (Collins and Duffy 2001) and that computation for context-sensitive sub-tree pairs is necessary only when Δ(n1[1], n1[2]) ≠ 0 and the context-sensitive sub-tree pairs have the same root node path.
4 Experimentation

This paper uses the ACE RDC 2003 and 2004 corpora provided by LDC in all our experiments.
4.1
Experimental
Setting
The ACE RDC 2003 corpus defines 5 major relation types and 24 relation subtypes. All the reported performances in this paper on the ACE RDC 2004 data are based on 348 documents and 4400 relation instances.
That
is
,
all
the
reported
performances
in
this
paper
on
the
ACE
RDC
2004
corpus
are
evaluated
using
5-fold
cross
validation
on
the
entire
corpus
.
Both
corpora
are
parsed
using
Charniak
's
parser
(
Charniak
,
2001
)
with
the
boundaries
of
all
the
entity
mentions
kept
.
We
iterate
over
all
pairs
of
entity
mentions
occurring
in
the
same
sentence
to
generate
potential
relation
instances
.
In
our
experimentation
,
SVM
(
SVMLight
,
Joachims
(
1998
)
)
is
selected
as
our
classifier
.
For
efficiency
,
we
apply
the
one
vs.
others
strategy
,
which
builds
K
classifiers
so
as
to
separate
one
class
from
all
others
.
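The one vs. others strategy can be sketched as follows. A trivial centroid-based scorer stands in for the SVM here, purely for illustration (the paper trains K SVMs with SVMLight); only the one-scorer-per-class structure and the highest-score decision rule are taken from the text.

```python
# A minimal sketch of the one-vs-others strategy: one binary scorer per
# relation class, and prediction picks the class with the highest score.
# A centroid-based scorer is an illustrative stand-in for the SVM.

def train_one_vs_others(X, y, classes):
    scorers = {}
    for c in classes:
        members = [x for x, label in zip(X, y) if label == c]
        centroid = [sum(col) / len(members) for col in zip(*members)]
        # score = negative squared distance to this class's centroid
        scorers[c] = lambda x, cen=centroid: -sum(
            (a - b) ** 2 for a, b in zip(x, cen)
        )
    return scorers

def predict(scorers, x):
    # one classifier per class; the highest-scoring class wins
    return max(scorers, key=lambda c: scorers[c](x))

X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
y = ["EMPLOYMENT", "EMPLOYMENT", "GPE-AFF", "GPE-AFF"]
scorers = train_one_vs_others(X, y, ["EMPLOYMENT", "GPE-AFF"])
print(predict(scorers, [0.1, 0.5]))  # EMPLOYMENT
```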
The
training
parameters
are
chosen
using
cross-validation
on
the
ACE
RDC
2003
training
data
.
In
particular
,
l
in
our
tree
kernel
is
fine-tuned
to
0.5
.
This
suggests
that
about
50
%
discount
is
done
as
our
tree
kernel
moves down one level.

First
,
we
systematically
evaluate
the
context-sensitive
convolution
tree
kernel
and
the
dynamic
context-sensitive
tree
span
proposed
in
this
paper
.
Then
,
we
evaluate
the
complementary
nature
between
our
tree
kernel
and
a
state-of-the-art
linear
kernel
via
a
composite
kernel
.
Generally,
different
feature-based
methods
and
tree
kernel-based
methods
have
their
own
merits
.
It
is
usually
easy
to
build
a
system
using
a
feature-based
method
and
achieve
the
state-of-the-art
performance
,
while
tree
kernel-based
methods
hold
the
potential
for
further
performance
improvement
.
Therefore
,
it
is
always
a
good
idea
to
integrate
them
via
a
composite
kernel
.
This
can
be
done
by
first
representing
all
entity
mentions
with
their
head
words
and
then
restoring
all
the
entity
mentions
after
parsing
.
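The head-word trick just described can be sketched as below. The mention list and head words are illustrative; a real implementation would track character offsets rather than use plain string substitution (which is ambiguous when a head word recurs elsewhere in the sentence).

```python
# A small sketch of the pre-parsing step described above: each entity
# mention is collapsed to its head word so the parser keeps its boundary
# as a single token, and the full mention is restored after parsing.

def replace_mentions(sentence, mentions):
    """mentions: list of (mention_text, head_word) pairs."""
    for mention, head in mentions:
        sentence = sentence.replace(mention, head)
    return sentence

def restore_mentions(sentence, mentions):
    # restore only the first occurrence of each head word (toy assumption)
    for mention, head in mentions:
        sentence = sentence.replace(head, mention, 1)
    return sentence

sent = "the chairman of Microsoft Corporation resigned"
mentions = [("Microsoft Corporation", "Microsoft")]
reduced = replace_mentions(sent, mentions)
print(reduced)                            # the chairman of Microsoft resigned
print(restore_mentions(reduced, mentions) == sent)  # True
```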
Moreover
,
please
note
that
the
final
performance
of
relation
extraction
may
change
much
with
different
range
of
parsing
errors
.
We
will
study
this
issue
in
the
near
future
.
In
this
paper
,
we
only
measure
the
performance
of
relation
extraction
on
"
true
"
mentions
with
"
true
"
chaining
of
co-reference
(
i.e.
as
annotated
by
LDC
annotators
)
.
Moreover
,
we
only
model
explicit
relations
and
explicitly
model
the
argument
order
of
the
two
mentions
involved
.
Finally
,
we
compare
our
system
with
the
state-of-the-art
systems
in
the
literature
.
Context-Sensitive
Convolution
Tree
Kernel
In
this
paper
,
the
m
parameter
of
our
context-sensitive
convolution
tree
kernel
as
shown
in
Equation
(
3
)
indicates
the
maximal
length
of
root
node
paths
and
is
optimized
to
3
using
5fold
cross
validation
on
the
ACE
RDC
2003
training
data
.
Table
1
compares
the
impact
of
different
m
in
context-sensitive
convolution
tree
kernels
using
the
Shortest
Path-enclosed
Tree
(
SPT
)
(
as
described
in
Zhang
et
al
(
2006
)
)
on
the
major
relation
types
of
the
ACE
RDC
2003
and
2004
corpora
in detail.
It
also
shows
that
our
tree
kernel
achieves the best performance
on
the
test
data
using
SPT
with
m
=
3
,
which
outperforms
the
one
with
m
=
1
by
~
2.3
in
F-measure
.
This suggests that the parent and grandparent nodes of a sub-tree contain much information for relation extraction, while considering more ancestral nodes may not help.
This may be due to the fact that,
although
our
experimentation
on
the
training
data
indicates
that
more
than
80
%
(
on
average
)
of
sub-trees have
a
root
node
path
longer
than
3
(
since
most
of
the
subtrees
are
deep
from
the
root
node
and
more
than
90
%
of
the
parsed
trees
in
the
training
data
are
deeper
than
6
levels
)
,
including
a
root
node
path
longer
than
3
may
be
vulnerable
to
the
full
parsing
errors
and
have
negative
impact
.
Table
1
also
evaluates
the
impact
of
entity-related
information
in
our
tree
kernel
by
attaching
entity
type
information
(
e.g.
"
PER
"
in
the
entity
node
1
of
Figure
1
(
b
)
)
into
both
entity
nodes
.
It
shows
that
such
information
can
significantly
improve
the
performance
by
~
6.0
in
F-measure
.
In
all
the
following
experiments
,
we
will
apply
our
tree
kernel
with
m
=
3
and
entity-related
information
by
default
.
Table
2
compares
the
dynamic
context-sensitive
tree
span
with
SPT
using
our
tree
kernel
.
It
shows
that
the
dynamic
tree
span
can further improve the performance by ~1.2 in F-measure
.
This
suggests
the
usefulness
of
extending
the
tree
span
beyond
SPT
for
the
"
predicate-linked
"
tree
span
category
.
In
the
future
work
,
we
will
further
explore
expanding
the
dynamic
tree
span
beyond
SPT
for
the
remaining
tree
span
categories
.
Significance
test
shows
that
the
dynamic
tree
span
performs
statistically
significantly
better
than
SPT
with
p-values
smaller
than
0.05
.
[Table 1: Evaluation of context-sensitive convolution tree kernels using SPT on the major relation types of the ACE RDC 2003 (inside the parentheses) and 2004 (outside the parentheses) corpora: a) without entity-related information; b) with entity-related information.]
[Table 2: Comparison of the dynamic context-sensitive tree span ("Dynamic Context-Sensitive Tree") with SPT ("Shortest Path-enclosed Tree") using our context-sensitive convolution tree kernel on the major relation types of the ACE RDC 2003 (inside the parentheses) and 2004 (outside the parentheses) corpora.]
18
%
of
positive
instances
in
the
ACE
RDC
2003
test
data
belong
to
the
predicate-linked
category
.
In
this
paper
,
a
composite
kernel
via
polynomial
interpolation
,
as
described
Zhang
et
al
(
2006
)
,
is
applied
to
integrate
the
proposed
context-sensitive
convolution
tree
kernel
with
a
state-of-the-art
linear
kernel
(
Zhou
et
al
2005
)
:

K(•,•) = α · K_p(K_L(•,•)) + (1 − α) · K_CS(•,•)

where K_L(•,•) and K_CS(•,•) are the linear kernel and our context-sensitive convolution tree kernel respectively, while K_p(•) is the polynomial expansion of K(•) with degree d = 2, i.e. K_p(•,•) = (K(•,•) + 1)^2, and α is the coefficient (α is set to 0.3 using cross-validation).
Here
,
we
use
the
same
set
of
flat
features
(
i.e.
word
,
entity
type
,
mention
level
,
overlap
,
base
phrase
chunking
,
dependency
tree
,
parse
tree
and
semantic
information
)
as
Zhou
et
al
(
2005
)
.
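A minimal numeric sketch of this combination, assuming (as the surrounding text describes) that the degree-2 polynomial expansion is applied to the linear-kernel value with α = 0.3; the kernel values fed in are purely illustrative and normalization of the individual kernels is omitted:

```python
# Sketch of the composite kernel via polynomial interpolation: combine a
# linear (flat-feature) kernel value with the tree-kernel value, with the
# polynomial expansion K_p(x) = (x + 1)^2 applied to the linear kernel.
# alpha = 0.3 follows the paper; the inputs below are illustrative.

ALPHA = 0.3
DEGREE = 2

def poly_expand(k_value, d=DEGREE):
    """Polynomial expansion of a kernel value with degree d."""
    return (k_value + 1) ** d

def composite_kernel(k_linear, k_tree, alpha=ALPHA):
    return alpha * poly_expand(k_linear) + (1 - alpha) * k_tree

# e.g. a linear-kernel value of 2.0 and a tree-kernel value of 1.5:
print(composite_kernel(2.0, 1.5))  # 0.3 * 9 + 0.7 * 1.5 = 3.75
```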
Table
3
evaluates
the
performance
of
the
composite
kernel
.
It
shows that the composite kernel further improves the performance well beyond
that
of
either
the
state-of-the-art
linear
kernel
or
our
tree
kernel
and
achieves
the
F-measures
of
74.1
and
75.8
on
the
major
relation
types
of
the
ACE
RDC
2003
and
2004
corpora
respectively
.
This
suggests
that
our
tree
kernel
and
the
state-of-the-art
linear
kernel
are
quite
complementary
,
and
that
our
composite
kernel
can
effectively
integrate
both
flat
and
structured
features
.
[Table 3: Performance of the linear kernel, the context-sensitive convolution tree kernel and the composite kernel via polynomial interpolation on the major relation types of the ACE RDC 2003 (inside the parentheses) and 2004 (outside the parentheses) corpora.]
Comparison with Other Systems

[Table 4: Comparison of different systems (shortest path dependency kernel, feature-based, composite kernel, and ours: the context-sensitive convolution tree kernel) on the ACE RDC 2003 corpus over both 5 types (outside the parentheses) and 24 subtypes (inside the parentheses).]

[Table 5: Comparison of different systems on the ACE RDC 2004 corpus over both 7 types (outside the parentheses) and 23 subtypes (inside the parentheses).]
Finally
,
Tables
4
and
5
compare our system with other state-of-the-art systems
on
the
ACE
RDC
2003
and
2004
corpora
,
respectively
.
They
show
that
our
tree
kernel-based
system
outperforms
previous
tree
kernel-based
systems
.
This
is
largely
due
to
the
context-sensitive
nature
of
our
tree
kernel
which
resolves
the
limitations
of
the
previous
tree
kernels
.
They
also
show
that
our
tree
kernel-based
system
outperforms
the
state-of-the-art
feature-based
system
.
This
proves
the
great
potential
inherent
in
the
parse
tree
structure
for
relation
extraction
and
our
tree
kernel
takes
a
big
stride
in
the
right
direction
.
Finally
,
they
also
show
that
our
composite
kernel-based
system
outperforms
other
composite
kernel-based
systems
.
5
Conclusion
Structured
parse
tree
information
holds
great
potential
for
relation
extraction
.
This
paper
proposes
a
context-sensitive
convolution
tree
kernel
to
resolve
two
critical
problems
in
previous
tree
kernels
for
relation
extraction
by
first
automatically
determining
a
dynamic
context-sensitive
tree
span
and
then
applying
a
context-sensitive
convolution
tree
kernel
.
Moreover
,
this
paper
evaluates
the
complementary
nature
between
our
tree
kernel
and
a
state-of-the-art
linear
kernel
.
Evaluation
on
the
ACE
RDC
corpora
shows
that
our
dynamic
context-sensitive
tree
span
is
much
more
suitable
for
relation
extraction
than
the
widely-used
Shortest
Path-enclosed
Tree
and
our
tree
kernel
outperforms
the
state-of-the-art
Collins
and
Duffy
's
convolution
tree
kernel
.
It also shows that feature-based and tree kernel-based methods complement each other well and that the composite kernel can effectively integrate both flat and structured features.

[There might be some typing errors for the performance reported in Zhao and Grishman (2005) since P, R and F do not match. All the state-of-the-art systems apply the entity-related information. It is not surprising: our experiments show that using the entity-related information gives a large performance improvement.]
To
our
knowledge
,
this
is
the
first
research
to
demonstrate
that
,
without
extensive
feature
engineering
,
an
individual
tree
kernel
can
achieve
much
better
performance
than
the
state-of-the-art
linear
kernel
in
relation
extraction
.
This
shows
the
great
potential
of
structured
parse
tree
information
for
relation
extraction
and
our
tree
kernel
takes
a
big
stride
in
the
right
direction
.
For
the
future
work
,
we
will
focus
on
improving
the
context-sensitive
convolution
tree
kernel
by
exploring
more
useful
context
information
.
Moreover
,
we
will
explore
more
entity-related
information
in
the
parse
tree
.
Our
preliminary
work
of
including
the
entity
type
information
significantly
improves
the
performance
.
Finally
,
we
will
study
how
to
resolve
the
data
imbalance
and
sparseness
issues
from
the
learning
algorithm
viewpoint
.
Acknowledgement
This
research
is
supported
by
Project
60673041
under
the
National
Natural
Science
Foundation
of
China
and
Project
2006AA01Z147
under
the
"
863
"
National
High-Tech
Research
and
Development
of
China
.
We
would
also
like
to
thank
the
critical
and
insightful
comments
from
the
four
anonymous
reviewers
.
Kambhatla
N.
(
2004
)
.
Combining
lexical
,
syntactic
and
semantic
features
with
Maximum
Entropy
models
for
extracting
relations
.
ACL'2004
(
Poster
)
.
178-181
.
21-26
July
2004
.
Barcelona
,
Spain
.
Zelenko
D.
,
Aone
C.
and
Richardella A.
(
2003
)
.
Kernel
methods
for
relation
extraction
.
Journal
of
Machine
Learning
Research
.
3
(
Feb
)
:
1083-1106
.
Zhou
G.D.
,
Su
J.
and
Zhang
M.
(
2006
)
.
Modeling
commonality
among
related
classes
in
relation
extraction
,
COLING-ACL'2006
:
121-128
.
Sydney
,
Australia
.
