We
investigate
methods
to
improve
the
recall
in
coreference
resolution
by
also
trying
to
resolve
those
definite
descriptions
where
no
earlier
mention
of
the
referent
shares
the
same
lexical
head
(
coreferent
bridging
)
.
The
problem
,
which
is
notably
harder
than
identifying
coreference
relations
among
mentions
which
have
the
same
lexical
head
,
has
been
tackled
with
several
rather
different
approaches
,
and
we
attempt
to
provide
a
meaningful
classification
along
with
a
quantitative
comparison
.
Based
on
the
different
merits
of
the
methods
,
we
discuss
possibilities
to
improve
them
and
show
how
they
can
be
effectively
combined
.
1
Introduction
Coreference
resolution
,
the
task
of
grouping
mentions
in
a
text
that
refer
to
the
same
referent
in
the
real
world
,
has
been
shown
to
be
beneficial
for
a
number
of
higher-level
tasks
such
as
information
extraction
(
McCarthy
and
Lehnert
,
1995
)
,
question
answering
(
Morton
,
2000
)
and
summarisation
(
Steinberger
et
al.
,
2005
)
.
While
the
resolution
of
pronominal
anaphora
and
tracking
of
named
entities
are
possible
with
good
accuracy
,
the
resolution
of
definite
NPs
(
having
a
common
noun
as
their
head
)
is
usually
limited
to
the
cases
that
Vieira
and
Poesio
(
2000
)
call
direct
coreference
,
where
both
coreferent
mentions
have
the
same
head
.
The
other
cases
,
called
coreferent
bridging
by
Vieira
and
Poesio[1]
,
are
notably
harder
because
the
number
of
potential
candidates
is
much
larger
when
it
is
no
longer
possible
to
rely
on
surface
similarity
.
To
overcome
the
limit
of
recall
that
is
encountered
when
only
relying
on
surface
features
,
newer
systems
for
coreference
resolution
(
Daume
III
and
Marcu
,
2005
;
Ponzetto
and
Strube
,
2006
;
Versley
,
2006
;
Ng
,
2007
,
inter
alia
)
use
lexical
semantic
information
as
an
indication
for
semantic
compatibility
in
the
absence
of
head
equality
.
Most
current
systems
integrate
the
identification
of
discourse-new
definites
(
i.e.
,
cases
like
"
the
sun
"
or
"
the
man
that
Ben
met
yesterday
"
,
which
are
definite
,
but
not
anaphoric
)
with
the
antecedent
selection
proper
,
which
implies
that
the
gain
obtained
for
new
features
depends on the feature's usefulness both for finding semantically related mentions and for detecting discourse-new definites.
One
goal
of
this
paper
is
to
provide
a
better
understanding
of
these
information
sources
by
comparing
proposed
(
and
partly
new
)
approaches
for
resolving
coreferent
bridging
by
separately
considering
the
task
of
antecedent
selection
(
i.e.
,
presupposing
that
discourse-new
markables
have
been
identified
beforehand
)
.
Although state-of-the-art methods for modular discourse-new detection (Uryupina, 2003; Poesio et al., 2005) do not achieve near-perfect accuracy,
the
results
we
give
for
antecedent
selection
represent
an
upper
bound
on
recall
and
precision
for
the
full
coreference
task
,
and
we
think
that
this
upper
bound
will
be
useful
for
the design of features in both kinds of systems: those using a modular approach, such as (Poesio et al., 2005), where the decision on discourse-newness is taken beforehand, and those that integrate discourse-new classification with the actual resolution of coreferent bridging cases.

[1] Since bridging in the sense of (Asher and Lascarides, 1998) is a much broader concept, the term 'coreferent bridging' is potentially confusing, as many cases are examples of perfectly well-behaved anaphoric definite noun phrases. Because we want to emphasise the important difference to the more easily resolved cases of same-head coreference, we will stick with 'coreferent bridging' as the only term that has been established for this in the literature.
In
contrast
to
earlier
investigations
(
Markert
and
Nissim
,
2005
;
Garera
and
Yarowsky
,
2006
)
,
we
provide
a
more
extensive
overview
of features and also discuss properties that influence their combinability
.
Several
approaches
have
been
proposed
for
the
treatment
of
coreferent
bridging
.
Poesio
et
al.
(
1997
)
use
WordNet
,
looking
for
a
synonymy
or
hypernymy
relation
(
additionally
,
for
coordinate
sisters
in
WordNet
)
.
The
system
of
Cardie
and
Wagstaff
(
1999
)
uses
the
node
distance
in
WordNet
(
with
an
upper
limit
of
4
)
as
one
component
in
the
distance
measure
that
guides
their
clustering
algorithm
.
Harabagiu
et
al.
(
2001
)
use
paths
through
WordNet
,
using
not
only
synonym
and
is-a
relations
,
but
also
parts
,
morphological
derivations
,
gloss
texts
and
polysemy
,
which
are
weighted
with
a
measure
based
on
the
relation
types
and
number
of
path
elements
.
Other
approaches
use
large
corpora
to
get
an
indication
for
bridging
relations
:
Poesio
et
al.
(
1998
)
use
a
general
word
association
metric
based
on
common
terms
occurring
in
a
fixed-width
window
,
Gasperin
and
Vieira
(
2004
)
use
syntactic
contexts
of
words
in
a
large
corpus
to
induce
a
semantic
similarity
measure
(
similar
to
the
one
introduced
by
Lin
,
1998
)
,
and
then
use
lists
of
the
n
nouns
that
are
(
globally
)
most
similar
to
a
given
noun
.
Markert
and
Nissim
(
2005
)
mine
the
World
Wide
Web
for
shallow
patterns
like
"
China
and
other
countries
"
,
indicating
an
is-a
relationship
.
Finally
,
Garera
and
Yarowsky
(
2006
)
propose
an
association-based
approach
using
nouns
that
occur
in
a
2-sentence
window
before
a
definite
description
that
has
no
same-head
antecedent
.
1.1
Lexical
vs.
Referential
Relations
One
important
property
of
these
information
sources
is
the
kind
of
lexical
relations
that
they
detect
.
The
lexical
relations
that
we
expect
in
coreferent
bridging
cases
are
:
•
instance
:
The
antecedent
is
an
instance
of
the
concept
denoted
by
the
anaphor
•
synonymy
:
The
antecedent
and
the
anaphor
are
synonyms
•
hypernymy
:
The
anaphor
is
a
strict
generalisation
of
the
antecedent
•
near-synonymy
:
The
anaphor
and
antecedent
are
semantically
related
but
not
synonyms
in
the
strict
sense
Of
course
,
not
all
cases
of
coreferent
bridging
realise
such
a
lexical
relation
,
as
sometimes
the
anaphor
takes
up
information
introduced
elsewhere
than
in
the
lexical
noun
phrase
head
(
Peter
was
found
dead
in
his
flat
.
.
.
the
deceased
)
,
or
the
coreference
relation
is
forced
by
the
discourse
structure
,
without
the
items
being
lexically
related
.
As an illustrative example, consider (1) and the continuations in (2):

(1) John walked towards [1 the house].

(2) a. [1 The building] was illuminated.
    b. [1 The manor] was guarded by dogs.
    c. [2 The door] was open.
Typical
cases
of
coreference
include
cases
like
1,2a
(
hypernym
)
or
1,2b
(
compatible
but
non-synonymous
term
)
.
The
discourse
in
1,2c
is
an
example
of
associative
bridging
between
the
NP
"
the
door
"
and
its antecedent
"
the
house
"
;
it
is
inferred
that
the
door
must
be
part
of
the
house
mentioned
earlier
(
since
doors
are
typically
part
of
a
house
)
,
which is not a case of coreferent bridging (the door and the house do not corefer), but is also ranked highly by association measures.
While
hypernym
relations
(
as
found
by
hypernym
lookup
in
WordNet
,
or
patterns
indicating
such
relations
in
unannotated
texts
)
are
usually
a
strong
indicator
of
coreference
,
they
can
only
cover
some
of
the
cases
,
while
the
near-synonymous
cases
are
left
undiscovered
.
Similarity
and
association
measures
can
help
for
the
cases
of
near-synonymy
.
However
,
while
similarity
measures
(
such
as
WordNet
distance
or
Lin
's
similarity
metric
)
only
detect
cases
of
semantic
similarity
,
association
measures
(
such
as
the
ones
used
by
Poesio
et
al.
,
or
by
Garera
and
Yarowsky
)
also
find
cases
of
associative
bridging like 1,2c above.

[Table 1: Similarity and association measures: most similar items. For the headwords Land (country/state/land) and Medikament (medical drug), the table lists the highest-ranked words under each measure, with very rare words removed. The measures are the similarity measures Lin98 and RFF and the association measures TheY (introduced by Garera and Yarowsky, 2006), TheY:G2 (a similar method using a log-likelihood-based statistic, see Dunning 1993, which has a preference for higher-frequency terms) and PL03 (the semantic space association measure proposed by Pado and Lapata, 2003).]

The result of this can be seen in Table 1:
while
the
similarity
measures
(
Lin98
,
RFF
)
list
substitutable
terms
(
which
behave
like
synonyms
in
many
contexts
)
,
the
association
measures
(
Garera
and
Yarowsky
's
TheY
measure
,
Pado
and
Lapata
's
association
measure
)
also
find
non-compatible
associations
such
as
country-capital
or
drug-treatment
,
which
is
why
they
are
commonly
called
relation-free
.
For
the
purpose
of
coreference
resolution
,
however,
we
do
not
want
to
resolve
"
the
door
"
to
the
antecedent
"
the
house
"
as
the
two
descriptions
do
not
corefer
,
and
it
may
be
useful
to
filter
out
non-similar
associations
.
1.2
Information
Sources
Different
resources
may
be
differently
suited
for
the
recognition
of
the
various
relations
.
Generally
,
it
would
be
expected
that
using
a
wordnet
is
the
best
solution
if
we
are
interested
in
an
is-a-like
relation
between
two
words
.
On
the
other
hand
,
wordnets
usually
have
limited
coverage
both
in
terms
of
lexical
items
and
in
terms
of
relations
encoded
(
as
their
construction
is
necessarily
labour-intensive
)
,
and
-
as
Markert
and
Nissim
remark
-
they
do
not
(
and
arguably
should
not
)
contain
context-dependent
relations
that
do
not
hold
generally
but
only
in
some
rather
specific
context
,
for
example
steel
being
anaphorically
described
as
a
commodity
in
a
financial
text
.
Context-dependent
relations
,
Markert
and
Nissim
argue
,
can
be
found
using
shallow
patterns
(
for
example
,
steel
and
other
commodities
)
,
since
a
use
in
such
a
context
would
mean
that
the
idiosyncratic
conceptual
relation
holds
in
that
context
.
Wordnets also usually have poor (or non-existent)
coverage
of
named
entities
,
which
are
especially
relevant
for
instance
relations
;
this
kind
of
instance
relations
can
often
be
found
in
large
text
corpora
.
The
high-precision
patterns
that
Markert
and
Nissim
use
only
occur
infrequently
,
but
the
approach
using
shallow
patterns
makes it possible to search the World Wide Web
,
which
somewhat
alleviates
the
sparse
data
problem
.
While
some
near-synonyms
can
be
found
by
looking
at
the
distance
in
a
wordnet
,
they
may
be
far
apart
from
each
other
because
of
ontological
modeling
decisions
,
or
lexical
items
not
covered
by
the
wordnet
.
Similarity
and
association
measures
can
provide
greater
coverage
for
these
near-synonym
relations
.
The
measures
both
of
Lin
(
1998
)
and
of
Pado
and
Lapata
(
2003
,
2007
)
are
distributional
methods
;
for
each
word
,
they
create
a
distribution
of
the
contexts
they
occur
in
,
and
similarity
between
two
words
is
calculated
as
the
similarity
of
these
distributions.[2]
The
difference
in
these
two
methods
is
the
representation
of
the
contexts
.
While
Lin
uses
contexts
that
are
expected
to
determine
semantic
preferences
(
like
being
in
the
direct
object
position
of
one
verb
)
,
Pado
and
Lapata
only
use
the
co-occurring
words
,
weighted
by
syntax-based
distance
.
For example, in

(3) Peter likes ice-cream.

Lin's approach would yield subj:like for Peter and dobj:like for ice-cream, while Pado and Lapata's approach would yield the contexts like (with a weight of 1.0) and ice-cream (with a weight of 0.5) for Peter.
As
a
consequence
,
Pado
and
Lapata
's
measure
is
more
robust
against
data
sparseness
but
also
finds
related
non-similar
terms
(
which
are
ultimately
unwanted
for
coreference
resolution
)
.
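To make the difference concrete, the following sketch (Python; a simplified illustration rather than the implementation used here, with lemmatisation omitted and all names ours) shows how the two kinds of context features could be collected from a dependency-parsed sentence such as (3):

    from collections import Counter

    # Toy dependency parse of "Peter likes ice-cream":
    # each token is (word, index of its head, relation to that head).
    SENTENCE = [
        ("Peter", 1, "subj"),      # Peter  --subj-->  likes
        ("likes", None, "root"),
        ("ice-cream", 1, "dobj"),  # ice-cream  --dobj-->  likes
    ]

    def lin_contexts(parse):
        """Lin (1998)-style features: grammatical slot plus governing word,
        e.g. subj:likes for Peter and dobj:likes for ice-cream
        (lemmatisation, which would give subj:like, is omitted here)."""
        feats = Counter()
        for word, head, rel in parse:
            if head is not None:
                feats[(word, "%s:%s" % (rel, parse[head][0]))] += 1
        return feats

    def pado_lapata_contexts(parse, decay=0.5):
        """Pado and Lapata (2003)-style features: co-occurring words, weighted
        by a stand-in for syntax-based distance (the governing word gets weight
        1.0, other dependents of the same head get the decayed weight)."""
        feats = Counter()
        for word, head, rel in parse:
            if head is None:
                continue
            feats[(word, parse[head][0])] += 1.0
            for other, other_head, _ in parse:
                if other != word and other_head == head:
                    feats[(word, other)] += decay
        return feats

    print(lin_contexts(SENTENCE))
    print(pado_lapata_contexts(SENTENCE))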
Pado
and
Lapata
show
their
dependency-based
measure
to
perform
better
in
a
word
sense
disambiguation
task
than
the
measure
of
Lund
et
al.
(
1995
)
,
on
which
Poesio
et
al.
(
1998
)
based
their
experiments
and
which
is
based
on
the
surface
distance
of
words
.
We
also
reimplemented
the
approach
of
Garera
and
Yarowsky
(
2006
)
,
who
extract
potential
anaphor-antecedent
pairs
from
unlabeled
texts
and
rank
these
potentially
related
pairs
by
the
mutual
information
statistic
.
As
an
example
,
in
a
text
like
(
4
)
Peter
likes
ice-cream
.
The
boy
devours
tons
of
it
.
we would extract the pairs (boy, (person)) and (boy, ice-cream), in the hope that the former pair occurs comparatively more often and gets a higher mutual information value.

[2] Both measures use a weighted Jaccard metric on mutual information vectors to calculate the similarity. See Weeds and Weir (2005) for an overview of other measures.
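A minimal sketch of this extraction and ranking step (our own simplification; corpus preprocessing, NP detection and lemmatisation are assumed to be done elsewhere, and pointwise mutual information stands in for the exact statistic):

    import math
    from collections import Counter

    def extract_pairs(documents, window=2):
        """documents: list of documents; each document is a list of sentences,
        and each sentence a list of (lemma, is_definite_np_head, is_noun) tuples.
        For every definite NP head that has no same-head mention in the preceding
        `window` sentences, pair it with every noun in that window."""
        pairs = Counter()
        for doc in documents:
            for i, sent in enumerate(doc):
                context = [w for s in doc[max(0, i - window):i]
                           for (w, _, is_noun) in s if is_noun]
                for lemma, is_def_head, _ in sent:
                    if is_def_head and lemma not in context:
                        for cand in context:
                            pairs[(lemma, cand)] += 1
        return pairs

    def rank_by_mutual_information(pairs):
        """Score each (anaphor head, candidate) pair by pointwise mutual information."""
        if not pairs:
            return {}
        total = float(sum(pairs.values()))
        ana, ante = Counter(), Counter()
        for (a, c), f in pairs.items():
            ana[a] += f
            ante[c] += f
        return {(a, c): math.log((f / total) / ((ana[a] / total) * (ante[c] / total)))
                for (a, c), f in pairs.items()}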
2
Experiments
on
Antecedent
Selection
In
a
setting
similar
to
Markert
and
Nissim
(
2005
)
,
we
evaluate
the
precision
(
proportion
of
correct
cases
in
the
resolved
cases
)
and
recall
(
correct
cases
to
all
cases
)
for
the
resolution
of
discourse-old
definite
noun
phrases
.
Before
trying
to
resolve
coreferent
bridging
cases
,
we
look
for
compatible
antecedent
candidates
with
the
same
lexical
head
and
resolve
to
the
nearest
such
candidate
if
there
is
one
.
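Schematically, the selection procedure can be pictured as follows (a sketch only; same_head, compatible_modifiers and bridging_resolver are placeholders for the components discussed in this section):

    def select_antecedent(anaphor, candidates, bridging_resolver,
                          same_head, compatible_modifiers):
        """candidates are assumed to be ordered from nearest to farthest.
        First try same-head resolution (with a modifier compatibility check);
        only if that fails, hand the decision to a coreferent-bridging resolver."""
        for cand in candidates:
            if same_head(anaphor, cand) and compatible_modifiers(anaphor, cand):
                return cand                            # nearest same-head match
        return bridging_resolver(anaphor, candidates)  # may also return None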
For
our
experiments
,
we
used
the
first
125
articles
of
the
coreferentially
annotated
TüBa-D/Z
corpus
of
written
newspaper
text
(
Hinrichs
et
al.
,
2005
)
,
totalling
2239
sentences
with
633
discourse-old
definite
descriptions
,
and
the
latest
release
of
GermaNet
(
Kunze
and
Lemnitzer
,
2002
)
,
which
is
the
German-language
part
of
EuroWordNet
.
Unlike
Markert
and
Nissim
,
we
did
not
limit
the
evaluation
to
discourse-old
noun
phrases
where
an
antecedent
is
in
the
4
preceding
sentences
,
but
also
included
cases
where
the
antecedent
is
further
away
.
As
a
real
coreference
resolution
system
would
have
to
either
resolve
them
correctly
or
leave
them
unresolved
,
we
feel
that
this
is
less
unrealistic
and
thus
preferable
even
when
it
gives
less
optimistic
evaluation
results
.
Because
overall
precision
is
a
mixture
of
the
precision
of
the
same-head
resolver
and
the
precision
of
the
resolution
for
coreferent
bridging
,
which
is
lower
than
that
for
same-head
cases
,
we
necessarily get lower precision if we resolve more coreferent bridging
cases
.
As
it
is
always
possible
to
improve
overall
precision
by
resolving
fewer
cases
of
coreferent
bridging
,
we
separately
mention
the
precision
for
coreferent
bridging
cases
alone
(
i.e.
,
number
of
correct
coreferent
bridging
cases
by
all
resolved
coreferent
bridging
cases
)
,
which
we
deem
more
informative
.
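In terms of counts, the scores used here can be written out as follows (a small sketch with our own variable names):

    def scores(correct_same_head, resolved_same_head,
               correct_bridging, resolved_bridging, total_discourse_old):
        """Overall precision mixes same-head and bridging decisions; the separate
        bridging precision (Prec.NSH) looks at the non-same-head resolutions alone."""
        resolved = resolved_same_head + resolved_bridging
        correct = correct_same_head + correct_bridging
        return {
            "precision": correct / resolved if resolved else 0.0,
            "recall": correct / total_discourse_old,
            "precision_bridging": (correct_bridging / resolved_bridging
                                   if resolved_bridging else 0.0),
        }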
In
our
evaluation
,
we
included
hypernymy
search
and
a
simple
edge-based
distance
based
on
GermaNet
,
as
well
as
a
baseline
using
semantic
classes
(
automatically
determined
by
a
combination
of
simple
named
entity
classification
and
GermaNet
subsumption
)
,
and an evolved version of Markert and Nissim's approach, which is presented in (Versley, 2007).

[Table 2: Baseline results. Legend: same-head = same-head resolution; Prec.NSH = precision for coreferent bridging cases; (1) = consider candidates in the 4 preceding sentences.]
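The GermaNet-based components can be sketched roughly as follows, over a generic hypernym graph represented as a dictionary (this is an illustration only, not the actual GermaNet interface):

    from collections import deque

    def is_hypernym_of(anaphor, antecedent, hypernyms):
        """True if the anaphor's lemma lies on a hypernym chain above the
        antecedent's lemma; `hypernyms` maps a lemma to its direct hypernyms."""
        seen, queue = set(), deque([antecedent])
        while queue:
            node = queue.popleft()
            if node == anaphor:
                return True
            if node not in seen:
                seen.add(node)
                queue.extend(hypernyms.get(node, []))
        return False

    def edge_distance(a, b, neighbours, limit=4):
        """Number of hyper-/hyponymy edges on the shortest path between a and b,
        or None if it exceeds `limit` (the edge-based distance baseline)."""
        frontier, seen, dist = {a}, {a}, 0
        while frontier and dist <= limit:
            if b in frontier:
                return dist
            frontier = {n for x in frontier for n in neighbours.get(x, [])
                        if n not in seen}
            seen |= frontier
            dist += 1
        return None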
For
the
methods
based
on
similarity
and
association
measures
,
we
implemented
a
simple
ranking
by
the
respective
similarity
or
relatedness
value
.
Additionally
,
we
included
an
approach
due
to
Gasperin
and
Vieira
(
2004
)
,
who
tackle
the
problem
of
similarity
by
using
lists
of
most
similar
words
to
a
certain
word
,
based
on
a
similarity
measure
closely
related
to
Lin
's
.
They
allow
resolution
if
either
(
i
)
the
candidate
is
among
the
words
most
similar
to
the
anaphor
,
(
ii
)
the
anaphor
is
among
the
words
most
similar
to
the
candidate
,
(
iii
)
the
similarity
lists
of
anaphor
and
candidate
share
a
common
item
.
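A compact sketch of these tests over precomputed similarity lists (our own formulation; sim_list is an assumed helper returning the n globally most similar words):

    def similar_enough(anaphor, candidate, sim_list, n=15, use_common_item=False):
        """sim_list(word, n) is assumed to return the n globally most similar words.
        Conditions (i) and (ii) check list membership in either direction; condition
        (iii), a shared item in both lists, is optional since it produced many false
        positives in our setting."""
        ana_list = sim_list(anaphor, n)
        cand_list = sim_list(candidate, n)
        if candidate in ana_list or anaphor in cand_list:
            return True
        return use_common_item and bool(set(ana_list) & set(cand_list))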
We
tried
out
several
variations
in
the
length
of
the
similar
words
list
(
Gasperin
and
Vieira
used
15
,
we
also
tried
lists
with
25
,
50
and
100
items
)
.
The
third
possibility
that
Gasperin
and
Vieira
mention
(
a
common
item
in
the
similarity
lists
of
both
anaphor
and
antecedent
)
resolves
some
correct
cases
,
but
leads
to
a
much
larger
number
of
false
positives
,
which
is
why
we
did
not
include
it
in
our
evaluation
.
To
induce
the
similarity
and
association
measures
presented
earlier
,
we
used
texts
from
the
German
newspaper
die
tageszeitung
,
comprising
about
11M
sentences
.
For
the
extraction
of
anaphor-antecedent
candidates
,
we
used
a
chunked
version
of
the
corpus
(
Müller
and
Ule
,
2002
)
.
The
identification
of
grammatical
relations was
carried
out
on
a
subset
of
all
sentences
(
those
with
length
<
30
)
,
with
an
unlexicalised
PCFG
parser
and
subsequent
extraction
of
dependency
relations
(
Versley
,
2005
)
.
For
the
last
approach
,
where
dependency
relations
were
needed
but
labeling
accuracy
was
not
as
important
,
we
used
a
deterministic
shift-reduce
parser
that
Foth
and
Menzel
(
2006
)
used
as
input
source
in
hybrid
dependency
parsing.[3]
For
all
three
approaches
,
we
lemmatised
the
words
by
using
a
combination
of
SMOR
(
Schmid
et
al.
,
2004
)
,
a
derivational
finite-state
morphology
for
German
,
and
lexical
information
derived
from
the
lexicon
of
a
German
dependency
parser
(
Foth
and
Menzel
,
2006
)
.
We
mitigated
the
problem
of vocabulary
growth
in
the
lexicon
,
due
to
German
synthetic
compounds
,
by
using
a
frequency-sensitive
unsupervised
compound
splitting
technique
,
and
(
for
semantic
similarity
)
normalised
common
person
and
location
names
to
'
(
person
)
'
and
'
(
location
)
'
,
respectively
.
Same-head
resolution
(
including
a
check
for
modifier
compatibility
)
correctly resolves
49.8
%
of
all
cases
,
with
a
precision
of
86.5
%
.
The
simplest approach
for
coreferent
bridging
,
just
resolving
coreferent
bridging
cases
to
the
nearest
possible
antecedent
(
only
checking
for
number
agreement
)
,
yields
very
poor
precision
(
12
%
for
the
coreferent
bridging
cases
)
,
and
as
a
result
,
the
recall
gain
is
very
limited
.
If
we
use
semantic
classes
(
based
on
both
GermaNet
and
a
simple
classification
for
named
entities
)
to
constrain
the
candidates
and
then
use
the
nearest
number
-
and
gender-compatible
antecedent[4]
,
we
get
a
much
better
precision
(
35
%
for
coreferent
bridging
cases
)
,
and
a
much
better
recall
of
61.1
%
.
Hyponymy
lookup
in
GermaNet
,
without
a
limit
on
sentence
distance
,
achieves
a
recall
of
57.5
%
(
with
a
precision
of
67
%
for
the
resolved
coreferent
bridging
cases
)
,
whereas
using
the
best
single
pattern
(
Y
wie
X
,
which
corresponds
to
the English Ys such as X), with a distance limit of 4 sentences,[5] on the Web only improves the recall to 54.3% (with a lower precision of 55% for coreferent bridging cases).

[3] Arguably, it would have been more convenient to use a single parser for all three approaches, but differing tradeoffs between speed on one hand and accuracy for relevant information and/or fitness of representation on the other hand made the respective parser or chunker a compelling choice.

[4] In German, grammatical gender is not as predictive as in English, as it does not reproduce ontological distinctions. For persons, grammatical and natural gender almost always coincide, and we check gender equality iff the anaphor is a person.
This
is
in
contrast
to
the
results
of
Markert
and
Nissim
,
who
found
that
Web
pattern
search
performs
better
than
wordnet
lookup
;
see
(
Versley
,
2007
)
for
a
discussion
.
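The pattern-based test itself can be sketched as follows; hit_count is an assumed wrapper around whatever Web search interface is used, the threshold is ours, and only the single German pattern 'Y wie X' is instantiated (the combined-pattern scoring of Versley (2007) is not shown):

    def pattern_evidence(anaphor, antecedent, hit_count, threshold=1):
        """Instantiate the German pattern 'Y wie X' (roughly: 'Ys such as X') with
        the anaphor as Y and the antecedent as X, and accept the pair if the
        assumed hit_count(query) wrapper reports enough Web matches."""
        query = '"%s wie %s"' % (anaphor, antecedent)
        return hit_count(query) >= threshold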
Ranking
all
candidates
that
are
within
a
distance
of
4
hyper
-
/
hyponymy
edges
in
GermaNet
by
their
edge
distance
,
we
get
a
relatively
good
recall
of
60.5
%
,
but
the
precision
(
for
the
coreferent
bridging
cases
)
is
only
at
39
%
,
which
is
quite
poor
in
comparison
.
The
results
for
Garera
and
Yarowsky
's
TheY
algorithm
are
quite
disconcerting
-
recall
and
the
precision
on
coreferent
bridging
cases
are
lower
than
the
respective
baseline
using
(
wordnet-based
)
semantic
class
information
or
Pado
and
Lapata
's
association
measure
.
The
technique
based
on
Lin
's
similarity
measure
does
outperform
the
baseline
,
but
still
suffers
from
bad
precision
,
along
with
Pado
and
Lapata
's
association
measure
.
In
other
words
,
the
similarity
and
association
measures
seem
to
be
too
noisy
to
be
used
directly
for
ranking
antecedents
.
The
approach
of
Gasperin
and
Vieira
performs
comparably
to
the
approach
using
Web-based
pattern
search
(
although
the
precision
is
poorer
than
for
the
best-performing
pattern
for
German
,
"
X
wie
Y
"
-
X
such
as
Y
,
it
is
comparable
to
that
of
other
patterns
)
.
2.1
Improving
Distributional
Similarity
?
While
it
would
be
naive
to
think
that
the
methods
purely
based
on
statistical
similarity
measures
could
reach
the
accuracy
that
can
be
achieved
with
a
hand-constructed
lexicalised
ontology
,
it
would
of
course
be
nice
if
we
could
improve
the
quality
of
the
semantic
similarity
measure
used
in
ranking
and
the
most-similar-word
lists
.
Geffet
and
Dagan
(
2004
)
propose
an
approach
to
improve
the
quality
of
the
feature
vectors
used
in
distributional
similarity
measures
:
instead
of
weighting
features
using
the
mutual
information
value
between
the
word
and
the
feature
,
they
propose
to
use
a
measure
they
call
Relative
Feature
Focus
:
the
sum
of
the
similarities
to
the
(
globally
)
most
similar words that share this feature.

[5] There is a degradation in precision for the pattern-based approach, but not for the GermaNet-based approach, which is why we do not use a distance limit for the GermaNet-based approach.
By
replacing
mutual
information
values
with
RFF
values
in
Lin
's
association
measure
,
Geffet
and
Da-gan
were
able
to
significantly
improve
the
proportion
of
substitutable
words
in
the
list
of
the
most
similar
words
.
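Schematically, the reweighting can be rendered as follows (our own sketch; sim is the first-pass similarity, most_similar returns the globally most similar neighbours, and occurs_with tests whether a word carries a feature):

    def rff_weight(word, feature, occurs_with, most_similar, sim):
        """Relative Feature Focus (after Geffet and Dagan, 2004): instead of the
        mutual information between word and feature, use the summed similarity of
        the word's globally most similar neighbours that also carry this feature."""
        return sum(sim(word, neighbour)
                   for neighbour in most_similar(word)
                   if occurs_with(neighbour, feature))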
In
our
experiments
,
however
,
using
the
RFF-based
similarity
measure
did
not
improve
the
similarity-list-based
resolution
or
the
simple
ranking
; on the contrary, both recall and precision are lower than for the Weighted Jaccard measure that we used originally.[6]
We
attribute
this
to
two
factors
:
Firstly
,
Geffet
and
Dagan
's
evaluation
emphasises
the
precision
in
terms
of
types
,
whereas
the
use
in
resolving
coreferent
bridging
does
not
punish
unrelated
rare
words
being
ranked
high
-
since
these
are
rare
,
the
likelihood
that
they
occur
together
,
changing
a
resolution
decision
,
is
quite
low
,
whereas
rare
related
words
that
are
ranked
high
can
allow
a
correct
resolution
.
Secondly
,
Geffet
and
Dagan
focus
on
high-frequency
words
,
which
makes
sense
in
the
context
of
ontology
learning
,
but
the
applicability
for
tasks
like
coreference
resolution
(
directly
or
in
the
approach
of
Gasperin
and
Vieira
)
also
depends
on
a
sensible
treatment
of
lower-frequency
words
.
Using
the
framework
of
Weeds
et
al.
(
2004
)
,
we
found
that
the
bias
of
lower
frequency
words
for
preferring
high-frequency
neighbours
was
higher
for
RFF
(
0.58
against
0.35
for
Lin
's
measure
)
.
Weeds
and
Weir
(
2005
)
discuss
the
influence
of
bias
towards
high
-
or
low-frequency
items
for
different
tasks
(
correlation
with
WordNet-derived
neighbour
sets
and
pseudoword
disambiguation
)
,
and
it
would
not
be
surprising
if
the
different
high-frequency
bias
were
leading
to
different
results
.
2.2
Combining
Information
Sources
The
information
sources
that
we
presented
earlier
and
the
corpus-based
methods
based
on
similarity
or
association
measures
draw
from
different
kinds
of
evidence
and
thus
should
be
rather
complementary
.
To
put
it
another
way
,
it
should
be
possible
to
get
the
best
from
all
methods
,
achieving
the
recall
of
the
high-recall
methods
(
like
using
semantic
class
information, or similarity and association measures), with a precision closer to the most precise method using GermaNet.

[6] Simple ranking with RFF gives a precision of 33% for coreferent bridging cases, against 39% for Lin's original measure; for an approach based on similarity lists, we get 39% against 44%.

[Table 3: Combination-based approaches: recall (total) and precision for coreferent bridging for sem. class + gender checking, TheY+sem+gend, TheY+sem+gend+Bnd, Lin, LinBnd, Lin+sem+gend+Bnd, GermaNet, and pattern-based combinations.]
In
the
case
of
web-based
patterns
,
Versley
(
2007
)
combines
several
pattern
searches
on
the
web
and
uses
the
combined
positive
and
negative
evidence
to
compute
a
composite
score
-
with
a
suitably
chosen
cutoff
,
it
outperforms
all
single
patterns
both
in
terms
of
precision
and
recall
.
First
resolving
via
hyponymy
in
GermaNet
and
then
using
the
pattern-combination
approach
outperforms
the
semantic
class-based
baseline
in
terms
of
recall
and
is
reasonably
close
to
the
GermaNet-based
approach
in
terms
of
precision
(
i.e.
,
much
better
than
the
approach
based
only
on
the
semantic
class
)
.
As
a
first
step
to
improve
the
precision
of
the
corpus-based
approaches
,
we
added
filtering
based
on
automatically
assigned
semantic
classes
(
persons
,
organisations
,
events
,
other
countable
objects
,
and
everything
else
)
.
Very
surprisingly
,
Garera
and
Yarowsky
's
TheY
approach
,
despite
starting
out
at
a
lower
precision
(
31
%
,
against
39
%
for
Lin
and
42
%
for
PL03
)
,
profits
much
more
from
the
semantic
filter
and
reaches
the
best
precision
(
47
%
)
,
whereas
Lin
's
semantic
similarity
measure
profits
the
least
.
Since
limiting
the
distance
to
the
4
previous
sentences
had
quite
a
devastating
effect
for
the
approach
based
on
Lin
's
similarity
measure
(
which
achieves
39
%
precision
when
all
the
candidates
are
available
and
30
%
precision
if
it
chooses
the
most
semantically
similar
out
of
the
candidates
that
are
in
the
last
4
sentences
)
,
we
also
wanted
to
try
and
apply
the
distance-based
filtering
after
finding
semantically
related
candidates
.
The
approach
we
tried
was
as
follows
:
we
rank
all
candidates
using
the
similarity
function
,
and
keep
only
the
3
top-rated
candidates
.
From
these
3
top-rated
candidates
,
we
keep
only
those
within
the
last
4
sentences
.
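As a sketch of this distance-bounded variant (our own naming; sim is the similarity or association score, sent gives a mention's sentence index, and the optional semantic-class filter corresponds to the filtering described above):

    def distance_bounded_resolve(anaphor, candidates, sim, sent,
                                 top_n=3, max_dist=4, semclass_ok=None):
        """Rank all candidates by similarity, keep only the top_n best, and only
        then discard those more than max_dist sentences away (and, if a semantic
        class filter is given, those with an incompatible class)."""
        ranked = sorted(candidates, key=lambda c: sim(anaphor, c), reverse=True)[:top_n]
        kept = [c for c in ranked
                if sent(anaphor) - sent(c) <= max_dist
                and (semclass_ok is None or semclass_ok(anaphor, c))]
        return kept[0] if kept else None    # highest-ranked surviving candidate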
Without
filtering
by
semantic
class
,
this
improves
the
precision
to
41
%
(
from
30
%
for
limiting
the
distance
beforehand
,
or
39
%
without
limiting
the
distance
)
.
Adding
filtering
based
on
semantic
classes
to
this
(
only
keeping
those
from
the
3
top-rated
candidates
which
have
a
compatible
semantic
class
and
are
within
the
last
4
sentences
)
,
we
get
a
much
better
precision
of
53
%
,
with
a
recall
that
can
still
be
seen
as
good
(
57.8
%
)
.
In
comparison
with
the
similarity-list-based
approach
,
we
get
a
much
better
precision
than
we
would
get
for
methods
with
comparable
recall
(
the
version
with
the
100
most
similar
items
has
44
%
precision
,
the
version
with
50
most
similar
items
and
matching
both
ways
has
46
%
precision
)
.
Applying
this
distance-bounding
method
to
Garera
and
Yarowsky
's
association
measure
still
leads
to
an
improvement
over
the
case
with
only
semantic
and
gender
checking
,
but
the
improvement
(
from
47
%
to
50
%
)
is
not
as
large
as
with
the
semantic
similarity
measure
or
Pado
and
Lapata
's
association
measure
(
from
45
%
to
57
%
)
.
For
the
final
system
,
we
back
off
from
the
most
precise
information
sources
to
the
less
precise
.
Starting
with
the
combination
of
GermaNet
and
pattern-based
search
on
the
World
Wide
Web
,
we
begin
by
adding
the
distance-bounded
semantic
similarity-based
resolver
(
LinBnd
)
and
resolution
based
on
the
list
of
25
most
similar
words
(
following
the
approach
of
Gasperin
and
Vieira
2004
)
.
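The back-off scheme can be sketched as a simple cascade (our own rendering; each resolver returns an antecedent or None):

    def backoff_resolve(anaphor, candidates, resolvers):
        """Try the most precise resolvers first (e.g. GermaNet hypernymy, combined
        Web patterns), then fall back to less precise ones (distance-bounded
        similarity, similarity lists, plain similarity or TheY ranking); stop at
        the first resolver that commits to an antecedent."""
        for resolve in resolvers:
            antecedent = resolve(anaphor, candidates)
            if antecedent is not None:
                return antecedent
        return None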
This
results
in
visibly
improved
recall
(
from
62
%
to
68
%
)
,
while
the
precision
for
coreferent
bridging
cases
does
not
suffer
much
.
Adding
resolution
based
on
Lin
's
semantic
similarity
measure
and
Garera
and
Yarowsky
's
TheY
value
leads
to
a
further
improvement
in
recall
to
69.7
%
,
but
also
leads
to
a
larger
loss
in
precision
.
3
Conclusion
In
this
paper
,
we
compared
several
approaches
to
resolve
cases
of
coreferent
bridging
in
open-domain
newspaper
text
.
While
none
of
the
information
sources
can
match
the
precision
of
the
hypernymy
information
encoded
in
GermaNet
,
or
that
of
using
a
combination
of
high-precision
patterns
with
the
World
Wide
Web
as
a
very
large
corpus
,
it
is
possible
to
achieve
a
considerable
improvement
in
terms
of
recall
without
sacrificing
too
much
precision
by
combining
these
methods
.
Very
interestingly
,
the
distributional
methods
based
on
intra-sentence
relations
(
Lin
,
1998
;
Pado
and
Lapata
,
2003
)
outperformed
Garera
and
Yarowsky
's
(
2006
)
association
measure
when
used
for
ranking
,
which
may
be due
to
sparse
data
problems
or
simply
too
much
noise
for
the
latter
.
For
the
association
measures
,
the
fact
that
they
are
relation-free
also
means
that
they
can
profit
from
added
semantic
filtering
.
The
novel
distance-bounded
semantic
similarity
method
(
where
we
use
the
most
similar
words
in
the
previous
discourse
together
with
a
semantic
class-based
filter
and
a
distance
limit
)
comes
near
the
precision
of
using
surface
patterns
,
and
offers
better
accuracy
than
Gasperin
and
Vieira
's
method
of
using
the
globally
most
similar
words
.
By
combining
existing
higher-precision
information
sources
such
as
hypernym
search
in
GermaNet
and
the
Web-based
approach
presented
in
(
Versley
,
2007
)
together
with
similarity
-
and
association-based
resolution
,
it
is
possible
to
get
a
large
improvement
in
recall
even
compared
to
the
combined
GermaNet+Web
approach
or
an
approach
combining
GermaNet
with
a
semantically
filtered
version
of
Garera
and
Yarowsky
's
TheY
approach
.
In
independent
research
,
Goecke
et
al.
(
2006
)
combined
the
original
LSA-based
method
of
Lund
et
al.
(
1995
)
with
wordnet
relations
and
pattern
search
on
a
fixed-size
corpus.[7]
However
,
they
evaluate
only
on
a
small
subset
of
discourse-old
definite
descriptions
(
those
where
a
wordnet-compatible
semantic
relation
was
identified
and
which
were
reasonably
close
to
their
antecedent
)
,
and
they
did
not
distinguish
coreferent
from
associative
bridging
antecedents
.
Although
the
different
evaluation
method
disallows
a
meaningful
comparison
,
we
think
that
the
more
evolved
information
sources
we
use
(
Pado
and
Lapata
's
association
measure
instead
of
Lund
et
al
's
,
combined
pattern
search
on
the
World
Wide
Web
instead
of
search
for
patterns
in
a
fixed-size
corpus
)
,
as
well
as
the
additional
information
based
on
semantic
similarity
,
lead
to
superior
results
when
evaluated
in
a
comparable
task
.
Both
the
distributional
similarity
statistics
and
the
association
measure
can
profit
from
more
training
data
,
something
which
is
bound
by
availability
of
similar
text
(
Gasperin
et
al.
,
2004
point
out
that
using
texts
from
a
different
genre
strongly
limits
the
usefulness
of
the
learned
semantic
similarity
measure
)
,
and
by
processing
costs
(
which
are
more
serious
for
distributional
similarity
measures
than
for
non-grammar-related
association
measures
,
as
the
former
necessitate
parsed
input
)
.
Based
on
existing
results
for
named
entity
coreference
,
a
hypothetical
coreference
resolver
combining
our
information
sources
with
a
perfect
detector
for
discourse-new
mentions
would
be
able
to
achieve
a
precision
of
88
%
and
a
recall
of
83
%
considering
all
full
noun
phrases
(
i.e.
,
including
names
,
but
not
pronouns
)
.
This
is
much higher than state-of-the-art
results
for
the
same
data
set
(
Versley
,
2006
,
gets
62
%
precision
and
70
%
recall
)
,
but
such
accuracy
may
be
very
difficult
to
achieve
in
practice
,
as
perfect
(
or
even
near-perfect
)
discourse-new
detection
does
not
seem
to
achievable
in
the
near
future
.
Preliminary
experiments
show
that
the
integration
of
pattern-based
information
leads
to
an
increase
in
recall
of
0.6
%
for
the
whole
system
(
or
46
%
more
coreferent
bridging
cases
)
,
but
the
integration
of
distributional
similarity
(
loosely
based
on
the
approach
by
Gasperin
and
Vieira
)
does
not
lead
to a noticeable improvement over GermaNet alone; in isolation, the distributional similarity information did improve the recall, albeit less than information from GermaNet did.

[7] Thanks to Tonio Wandmacher for pointing this out to me at GLDV'07.
The
fact
that
only
a
small
fraction
of
the
achievable
recall
gain
is
currently
attained
seems
to
suggest
that
better
identification
of
discourse-old
mentions
could
potentially
lead
to
larger
improvements
.
It
also
seems
that
firstly
,
it
makes
more
sense
to
combine
information
sources
that
cover
different
relations
(
e.g.
GermaNet
for
hypernymy
and
synonymy
and
the
pattern-based
approach
for
instance
relations
)
than
those
that
yield
independent
evidence
for
the
same
relation
(
s
)
,
as
GermaNet
and
the
Gasperin
and
Vieira
approach
do
for
(
near
-
)
synonymy
;
and
secondly
,
that
good
precision
is
especially
important
in
the
context
of
integrating
antecedent
selection
and
discourse-new
identification
,
which
means
that
the
finer
view
that
we
get
using
antecedent
selection
experiments
(
compared
to
direct
use
in
a
coreference
resolver
)
is
indeed
helpful
.
Acknowledgements
I
am
very
grateful
to
Sabine
Schulte
im
Walde
,
Piklu
Gupta
and
Sandra
Kübler
for
useful
criticism
of
an
earlier
version
,
and
to
Simone
Ponzetto
and
Michael
Strube
for
feedback
on
a
talk
related
to
this
paper
.
The
research
reported
in
this
paper
was
supported
by
the
Deutsche
Forschungsgemeinschaft
(
DFG
)
as
part
of
Collaborative
Research
Centre
(
Sonderforschungsbereich
)
441
"
Linguistic
Data
Structures
"
.
