A
new
approach
to
large-scale
information
extraction
exploits
both
Web
documents
and
query
logs
to
acquire
thousands
of
open-domain
classes
of
instances
,
along
with
relevant
sets
of
open-domain
class
attributes
at
precision
levels
previously
obtained
only
on
small-scale
,
manually-assembled
classes
.
1
Introduction
Current
methods
for
large-scale
information
extraction
take
advantage
of
unstructured
text
available
from
either
Web
documents
(
Banko
et
al.
,
2007
;
Snow
et
al.
,
2006
)
or
,
more
recently
,
logs
of
Web
search
queries
(
Pa
§
ca
,
2007
)
to
acquire
useful
knowledge
with
minimal
supervision
.
Given
a
manually-specified
target
attribute
(
e.g.
,
birth
years
for
people
)
and
starting
from
as
few
as
10
seed
facts
such
as
(
e.g.
,
John
Lennon
,
1941
)
,
as
many
as
a
million
facts
of
the
same
type
can
be
derived
from
unstructured
text
within
Web
documents
(
Pa
§
ca
et
al.
,
2006
)
.
Similarly
,
given
a
manually-specified
target
class
(
e.g.
,
Drug
)
with
its
instances
(
e.g.
,
Vi-codin
and
Xanax
)
and
starting
from
as
few
as
5
seed
attributes
(
e.g.
,
side
effects
and
maximum
dose
for
Drug
)
,
other
relevant
attributes
can
be
extracted
for
the
same
class
from
query
logs
(
Pa
§
ca
,
2007
)
.
These
and
other
previous
methods
require
the
manual
specification
of
the
input
classes
of
instances
before
any
knowledge
(
e.g.
,
facts
or
attributes
)
can
be
acquired
for
those
classes
.
'
Contributions
made
during
an
internship
at
Google
.
The
extraction
method
introduced
in
this
paper
mines
a
collection
of
Web
search
queries
and
a
collection
of
Web
documents
to
acquire
open-domain
classes
in
the
form
of
instance
sets
(
e.g.
,
{
whales
,
seals
,
dolphins
,
sea
lions
,
.
.
.
}
)
associated
with
class
labels
(
e.g.
,
marine
animals
)
,
as
well
as
large
sets
ofopen-domain
attributes
for
each
class
(
e.g.
,
circulatory
system
,
life
cycle
,
evolution
,
food
chain
and
scientific
name
for
the
class
marine
animals
)
.
In
this
light
,
the
contributions
of
this
paper
are
fourfold
.
First
,
instead
of
separately
addressing
the
tasks
of
collecting
unlabeled
sets
of
instances
(
Lin
,
1998
)
,
assigning
appropriate
class
labels
to
a
given
set
of
instances
(
Pantel
and
Ravichandran
,
2004
)
,
and
identifying
relevant
attributes
for
a
given
set
of
classes
(
Pa
§
ca
,
2007
)
,
our
integrated
method
from
Section
2
enables
the
simultaneous
extraction
of
class
instances
,
associated
labels
and
attributes
.
Second
,
by
exploiting
the
contents
of
query
logs
during
the
extraction
of
labeled
classes
of
instances
from
Web
documents
,
we
acquire
thousands
(
4,583
,
to
be
exact
)
of
open-domain
classes
covering
a
wide
range
of
topics
and
domains
.
The
accuracy
reported
in
Section
3.2
exceeds
80
%
for
both
instance
sets
and
class
labels
,
although
the
extraction
of
classes
requires
a
remarkably
small
amount
of
supervision
,
in
the
form
of
only
a
few
commonly-used
Is-A
extraction
patterns
.
Third
,
we
conduct
the
first
study
in
extracting
attributes
for
thousands
of
open-domain
,
automatically-acquired
classes
,
at
precision
levels
over
70
%
at
rank
10
,
and
67
%
at
rank
20
as
described
in
Section
3.3
.
The
amount
of
supervision
is
limited
to
five
seed
attributes
provided
for
only
one
reference
class
.
In
comparison
,
the
largest
previous
Knowledge
extracted
from
documents
and
queries
Open-domain
labeled
classes
of
instances
amino
acids
=
{
phenylalanine
,
l-cysteine
,
tryptophan
,
glutamic
acid
,
lysine
,
thr
,
ornithine
,
valine
,
serine
,
isoleucine
,
aspartic
acid
,
aspartate
,
taurine
,
histidine
,
.
.
.
}
marine
animals
=
{
whales
,
seals
,
dolphins
,
turtles
,
sea
lions
,
fishes
,
penguins
,
squids
,
pacific
walrus
,
aquatic
birds
,
comb
jellies
,
starfish
,
florida
manatees
,
walruses
,
.
.
.
}
zoonotic
diseases
=
{
rabies
,
west
nile
virus
,
leptospirosis
,
brucellosis
,
lyme
disease
,
cat
scratch
fever
,
foot
and
mouth
disease
,
venezuelan
equine
encephalitis
,
.
.
.
}
Open-domain
class
attributes_f_
amino
acids
:
[
titration
curve
,
molecular
formula
,
isoelectric
point
,
density
,
extinction
coefficient
,
pi
,
food
sources
,
molecular
weight
,
pka
values
,
.
.
.
]
marine
animals
:
[
circulatory
system
,
life
cycle
,
evolution
,
food
chain
,
eyesight
,
scientific
name
,
skeleton
,
digestion
,
gestation
period
,
reproduction
,
taxonomy
,
.
.
.
]
movies
:
[
opening
song
,
cast
,
characters
,
actors
,
film
review
,
movie
script
,
symbolism
,
special
effects
,
soundboards
,
history
,
screenplay
,
director
,
.
.
.
]
zoonotic
diseases
:
[
scientific
name
,
causative
agent
,
mode
of
transmission
,
life
cycle
,
pathology
,
meaning
,
prognosis
,
incubation
period
,
symptoms
,
.
.
.
]
Figure
1
:
Overview
of
weakly-supervised
extraction
of
class
instances
,
class
labels
and
class
attributes
from
Web
documents
and
query
logs
study
in
attribute
extraction
reports
results
on
a
set
of
40
manually-assembled
classes
,
and
requires
five
seed
attributes
to
be
provided
as
input
for
each
class
.
Fourth
,
we
introduce
the
first
approach
to
information
extraction
from
a
combination
of
both
Web
documents
and
search
query
logs
,
to
extract
open-domain
knowledge
that
is
expected
to
be
suitable
for
later
use
.
In
contrast
,
the
textual
data
sources
used
in
previous
studies
in
large-scale
information
extraction
are
either
Web
documents
(
Mooney
and
Bunescu
,
2005
;
Banko
et
al.
,
2007
)
or
,
recently
,
query
logs
(
Pa
§
ca
,
2007
)
,
but
not
both
.
2
Extraction
from
Documents
and
Queries
2.1
Open-Domain
Labeled
Classes
of
Instances
Figure
1
provides
an
overview
of
how
Web
documents
and
queries
are
used
together
to
acquire
open-domain
,
labeled
classes
ofinstances
(
phase
(
1
)
in
the
figure
)
;
and
to
acquire
attributes
that
capture
quantifiable
properties
of
those
classes
,
by
mining
query
logs
based
on
the
class
instances
acquired
from
the
documents
,
while
guiding
the
extraction
based
on
a
few
attributes
provided
as
seed
examples
(
phase
(
2
)
)
.
As
described
in
Figure
2
,
the
algorithm
for
deriving
labeled
sets
of
class
instances
starts
with
the
acquisition
of
candidate
pairs
{
ME
}
of
a
class
label
and
an
instance
,
by
applying
a
few
extraction
patterns
to
unstructured
text
within
Web
documents
{
D
}
,
while
guiding
the
extraction
by
the
contents
of
query
logs
{
Q
}
(
Step
1
in
Figure
2
)
.
This
is
fol
-
04
.
For
each
cluster
of
phrases
s
in
{
s
}
05
.
06
.
07
.
08
.
09
.
10
.
11
.
12
.
13
.
14
.
For
each
query
Q
of
S
Figure
2
:
Acquisition
of
labeled
sets
of
class
instances
lowed
by
the
generation
of
unlabeled
clusters
{
S
}
of
distributionally
similar
queries
,
by
clustering
vectors
of
contextual
features
collected
around
the
occurrences
of
queries
{
Q
}
within
documents
{
D
}
(
Steps
2
and
3
)
.
Finally
,
the
intermediate
data
{
ME
}
and
{
S
}
is
merged
and
filtered
into
smaller
,
more
accurate
labeled
sets
of
instances
(
Steps
4
through
15
)
.
(
[
.
.
]
C
[
such
as
|
including
]
I
[
and
|
,
|
.
]
)
,
where
I
is
a
potential
instance
(
e.g.
,
Venezuelan
equine
encephalitis
)
and
C
is
a
potential
class
label
for
the
instance
(
e.g.
,
zoonotic
diseases
)
,
for
example
in
the
sentence
:
"
The
expansion
of
the
farms
increased
the
spread
ofzoonotic
diseases
such
as
Venezuelan
equine
encephalitis
[
.
.
]
"
.
During
matching
,
all
string
comparisons
are
case-insensitive
.
In
order
for
a
pattern
to
match
a
sentence
,
two
conditions
must
be
met
.
First
,
the
class
label
C
from
the
sentence
must
be
a
non-recursive
noun
phrase
whose
last
component
is
a
plural-form
noun
(
e.g.
,
zoonotic
diseases
in
the
above
sentence
)
.
Second
,
the
instance
I
from
the
sentence
must
also
occur
as
a
complete
query
somewhere
in
the
query
logs
{
Q
}
,
that
is
,
a
query
containing
the
instance
and
nothing
else
.
This
heuristic
acknowledges
the
difficulty
of
pinpointing
complex
entities
within
documents
(
Downey
et
al.
,
2007
)
,
and
embodies
the
hypothesis
that
,
if
an
instance
is
prominent
,
Web
search
users
will
eventually
ask
about
it
.
In
Steps
4
through
14
from
Figure
2
,
each
cluster
is
inspected
by
scanning
all
labels
attached
to
one
or
more
queries
from
the
cluster
.
For
each
label
CS
,
if
a
)
{
ME
}
indicates
that
a
large
number
of
all
queries
from
the
cluster
are
attached
to
the
label
(
as
controlled
by
the
parameter
J
in
Step
12
)
;
and
b
)
those
queries
are
a
significant
portion
of
all
queries
from
all
clusters
attached
to
the
same
label
in
{
ME
}
(
as
controlled
by
the
parameter
K
in
Step
13
)
,
then
the
label
CS
and
each
query
with
that
label
are
stored
in
the
output
pairs
{
&lt;
C
,
I
&gt;
}
(
Steps
13
and
14
)
.
The
parameters
J
and
K
can
be
used
to
emphasize
precision
(
higher
J
and
lower
K
)
or
recall
(
lower
J
and
higher
K
)
.
The
resulting
pairs
of
an
instance
and
a
class
label
are
arranged
into
sets
of
class
instances
(
e.g.
,
{
rabies
,
westnile
virus
,
leptospirosis
,
.
.
.
}
)
,
each
associated
with
a
class
label
(
e.g.
,
zoonotic
diseases
)
,
and
returned
in
Step
15
.
2.2
Open-Domain
Class
Attributes
The
labeled
classes
of
instances
collected
automatically
from
Web
documents
are
passed
as
input
to
phase
(
2
)
from
Figure
1
,
which
acquires
class
attributes
by
mining
a
collection
of
Web
search
queries
.
The
attributes
capture
properties
that
are
relevant
to
the
class
.
The
extraction
of
attributes
exploits
the
set
of
class
instances
rather
than
the
associated
class
label
,
and
consists
of
four
stages
:
1
)
identification
of
a
noisy
pool
of
candidate
attributes
,
as
remainders
of
queries
that
also
contain
one
of
the
class
instances
.
In
the
case
of
the
class
movies
,
whose
instances
include
jay
and
silent
bob
strike
back
and
kill
bill
,
the
query
"
cast
jay
and
silent
bob
strike
back
"
produces
the
candidate
attribute
cast
;
2
)
construction
ofinternal
search-signature
vector
representations
for
each
candidate
attribute
,
based
on
queries
(
e.g.
,
"
cast
selection
for
kill
bill
"
)
that
contain
a
candidate
attribute
(
cast
)
and
a
class
instance
(
kill
bill
)
.
These
vectors
consist
of
counts
tied
to
the
frequency
with
which
an
attribute
occurs
with
a
given
"
templatized
"
query
.
The
latter
replaces
specific
attributes
and
instances
from
the
query
with
common
placeholders
,
e.g.
,
"
Xfor
Y
"
;
3
)
construction
of
a
reference
internal
search-signature
vector
representation
for
a
small
set
of
seed
attributes
provided
as
input
.
A
reference
vector
is
the
normalized
sum
of
the
individual
vectors
corresponding
to
the
seed
attributes
;
4
)
ranking
of
candidate
attributes
with
respect
to
each
class
(
e.g.
,
movies
)
,
by
computing
similarity
scores
between
their
individual
vector
representations
and
the
reference
vector
of
the
seed
attributes
.
The
result
of
the
four
stages
is
a
ranked
list
of
attributes
(
e.g.
,
[
opening
song
,
cast
,
characters
,
.
.
.
]
)
for
each
class
(
e.g.
,
movies
)
.
In
a
departure
from
previous
work
,
the
instances
of
each
input
class
are
automatically
generated
as
described
earlier
,
rather
than
manually
assembled
.
Furthermore
,
the
amount
of
supervision
is
limited
to
seed
attributes
being
provided
for
only
one
of
the
classes
,
whereas
(
Pa
§
ca
,
2007
)
requires
seed
attributes
for
each
class
.
To
this
effect
,
the
extraction
includes
modifications
such
that
only
one
reference
vector
is
constructed
internally
from
the
seed
attributes
during
the
third
stage
,
rather
one
such
vector
for
each
class
in
(
Pa
§
ca
,
2007
)
;
and
similarity
scores
are
computed
cross-class
by
comparing
vector
representations
of
individual
candidate
attributes
against
the
only
reference
vector
available
during
the
fourth
stage
,
rather
than
with
respect
to
the
reference
vector
of
each
class
in
(
Pa
§
ca
,
2007
)
.
3
Evaluation
The
acquisition
of
open-domain
knowledge
,
in
the
form
of
class
instances
,
labels
and
attributes
,
relies
on
unstructured
text
available
within
Web
documents
maintained
by
,
and
search
queries
submitted
to
,
the
Google
search
engine
.
The
collection
of
queries
is
a
random
sample
of
fully-anonymized
queries
in
English
submitted
by
Web
users
in
2006
.
The
sample
contains
approximately
50
million
unique
queries
.
Each
query
is
Found
in
WordNet
?
Pet
.
Examples
Yes
(
original
)
baseball
players
,
endangered
species
Yes
(
removal
)
Caribbean
countries
,
fundamental
rights
agrochemicals
,
celebs
,
handhelds
,
mangas
Table
1
:
Class
labels
found
in
WordNet
in
original
form
,
or
found
in
WordNet
after
removal
of
leading
words
,
or
not
found
in
WordNet
at
all
accompanied
by
its
frequency
of
occurrence
in
the
logs
.
The
document
collection
consists
of
approximately
100
million
Web
documents
in
English
,
as
available
in
a
Web
repository
snapshot
from
2006
.
The
textual
portion
of
the
documents
is
cleaned
of
html
,
tokenized
,
split
into
sentences
and
part-of-speech
tagged
using
the
TnT
tagger
(
Brants
,
2000
)
.
3.2
Evaluation
of
Labeled
Classes
of
Instances
Extraction
Parameters
:
The
set
of
instances
that
can
be
potentially
acquired
by
the
extraction
algorithm
described
in
Section
2.1
is
heuristically
limited
to
the
top
five
million
queries
with
the
highest
frequency
within
the
input
query
logs
.
In
the
extracted
data
,
a
class
label
(
e.g.
,
search
engines
)
is
associated
with
one
or
more
instances
(
e.g.
,
google
)
.
Similarly
,
an
instance
(
e.g.
,
google
)
is
associated
with
one
or
more
class
labels
(
e.g.
,
search
engines
and
internet
search
engines
)
.
The
values
chosen
for
the
weighting
parameters
J
and
K
from
Section
2.1
are
0.01
and
30
respectively
.
After
discarding
classes
with
fewer
than
25
instances
,
the
extracted
set
of
classes
consists
of
4,583
class
labels
,
each
of
them
associated
with
25
to
7,967
instances
,
with
an
average
of
189
instances
per
class
.
Accuracy
of
Class
Labels
:
Built
over
many
years
of
manual
construction
efforts
,
lexical
gold
standards
such
as
WordNet
(
Fellbaum
,
1998
)
provide
wide-coverage
upper
ontologies
of
the
English
language
.
Built-in
morphological
normalization
routines
make
it
straightforward
to
verify
whether
a
class
label
(
e.g.
,
faculty
members
)
exists
as
a
concept
in
WordNet
(
e.g.
,
faculty
member
)
.
When
an
extracted
label
(
e.g.
,
central
nervous
system
disorders
)
is
not
found
in
WordNet
,
it
is
looked
up
again
after
iteratively
removing
its
leading
words
(
e.g.
,
nervous
system
dis
-
Parent
in
WordNet
american
composers
=
{
aaron
Copland
,
eric
ewazen
,
george
gershwin
,
.
.
.
}
composers
appliances
area
hospitals
=
{
carolinas
medical
center
,
nyack
hospital
,
.
.
.
}
hospitals
multiple
languages
=
{
chuukese
,
ladino
,
mandarin
,
us
english
,
.
.
.
}
languages
Table
2
:
Correctness
judgments
for
extracted
classes
whose
class
labels
are
found
in
WordNet
only
after
removal
of
their
leading
words
(
C
=
Correctness
,
Y
=
correct
,
S
=
subjectively
correct
,
N
=
incorrect
)
orders
,
system
disorders
and
disorders
)
.
As
shown
in
Table
1
,
less
than
half
of
the
4,583
extracted
class
labels
(
e.g.
,
baseball
players
)
are
found
in
their
original
forms
in
WordNet
.
The
majority
of
the
class
labels
(
2,614
out
of
4,583
)
can
be
found
in
WordNet
only
after
removal
of
one
or
more
leading
words
(
e.g.
,
caribbean
countries
)
,
which
suggests
that
many
of
the
class
labels
correspond
to
finer-grained
,
automatically-extracted
concepts
that
are
not
available
in
the
manually-built
WordNet
.
To
test
whether
that
is
the
case
,
a
random
sample
of
200
class
labels
,
out
of
the
2,614
labels
found
to
be
potentially-useful
specific
concepts
,
are
manually
annotated
as
correct
,
subjectively
correct
or
incorrect
,
as
shown
in
Table
2
.
A
class
label
is
:
correct
,
if
it
captures
a
relevant
concept
although
it
could
not
be
found
in
WordNet
;
subjectively
correct
,
if
it
is
relevant
not
in
general
but
only
in
a
particular
context
,
either
from
a
subjective
viewpoint
(
e.g.
,
modern
appliances
)
,
or
relative
to
a
particular
temporal
anchor
(
e.g.
,
current
players
)
,
or
in
connection
to
a
particular
geographical
area
(
e.g.
,
area
hospitals
)
;
or
incorrect
,
if
it
does
not
capture
any
useful
concept
(
e.g.
,
multiple
languages
)
.
The
manual
analysis
of
the
sample
of
200
class
labels
indicates
that
154
(
77
%
)
are
relevant
concepts
and
27
(
13.5
%
)
are
subjectively
relevant
concepts
,
for
a
total
of
181
(
90.5
%
)
relevant
concepts
,
whereas
19
(
9.5
%
)
of
the
labels
are
incorrect
.
It
is
worth
emphasizing
the
importance
of
automatically-collected
classes
judged
as
relevant
and
not
present
in
WordNet
:
caribbean
countries
,
computer
manufacturers
,
entertainment
companies
,
market
research
firms
are
arguably
very
useful
and
should
probably
be
considered
as
part
of
Class
Label
Size
of
Instance
Sets
E
(
Extracted
)
AircraftModel
NationalPark
nba
teams
BasicFood
Newspaper
newspapers
CarModel
car
models
painters
CartoonChar
cartoon
characters
ProgLanguage
programming
languages
CellPhoneModel
cell
phones
Religion
religions
ChemicalElem
chemicals
river
systems
SearchEngine
search
engines
companies
constellations
countries
Skyscraper
Currency
currencies
SoccerClub
football
clubs
DigitalCamera
digital
cameras
SportEvent
sports
events
diseases
stadiums
TerroristGroup
terrorist
groups
treaties
University
universities
holidays
VideoGame
video
games
Hurricane
Mountain
mountains
WorldWarBattle
Table
3
:
Comparison
between
manually-assembled
instance
sets
of
gold-standard
classes
(
M
)
and
instance
sets
of
automatically-extracted
classes
(
E
)
.
Each
gold-standard
class
(
M
)
was
manually
mapped
into
an
extracted
class
(
E
)
,
unless
no
relevant
mapping
was
found
.
Ratios
(
M^E
)
are
shown
as
percentages
any
refinements
to
hand-built
hierarchies
,
including
any
future
extensions
of
WordNet
.
Accuracy
of
Class
Instances
:
The
computation
of
the
precision
of
the
extracted
instances
(
e.g.
,
fifth
element
and
kill
bill
for
the
class
label
movies
)
relies
on
manual
inspection
of
all
instances
associated
to
a
sample
of
the
extracted
class
labels
.
Rather
than
inspecting
a
random
sample
of
classes
,
the
evaluation
validates
the
results
against
a
reference
set
of
40
gold-standard
classes
that
were
manually
assembled
as
part
of
previous
work
(
Pa
§
ca
,
2007
)
.
A
class
from
the
gold
standard
consists
of
a
manually-created
class
label
(
e.g.
,
AircraftModel
)
associated
with
a
manually-assembled
,
and
therefore
high-precision
,
set
of
representative
instances
of
the
class
.
To
evaluate
the
precision
of
the
extracted
instances
,
the
manual
label
of
each
gold-standard
class
(
e.g.
,
SearchEngine
)
is
mapped
into
a
class
label
extracted
from
text
(
e.g.
,
search
engines
)
.
As
shown
in
the
first
two
columns
of
Table
3
,
the
mapping
into
extracted
class
labels
succeeds
for
37
of
the
40
gold-standard
classes
.
28
of
the
37
mappings
involve
linking
an
abstract
class
label
(
e.g.
,
SearchEngine
)
with
the
corresponding
plural
forms
among
the
extracted
class
labels
(
e.g.
,
search
engines
)
.
The
remaining
9
mappings
link
a
manual
class
label
with
either
an
equivalent
extracted
class
label
(
e.g.
,
Soc-cerClub
with
football
clubs
)
,
or
a
strongly-related
class
label
(
e.g.
,
NationalPark
with
parks
)
.
No
mapping
is
found
for
3
out
of
the
40
classes
,
namely
Air-craftModel
,
Hurricane
and
Skyscraper
,
which
are
therefore
removed
from
consideration
.
The
sizes
of
the
instance
sets
available
for
each
class
in
the
gold
standard
are
compared
in
the
third
through
fifth
columns
of
Table
3
.
In
the
table
,
M
stands
for
manually-assembled
instance
sets
,
and
E
for
automatically-extracted
instance
sets
.
For
example
,
the
gold-standard
class
SearchEngine
contains
25
manually-collected
instances
,
while
the
parallel
class
label
search
engines
contains
133
automatically-extracted
instances
.
The
fifth
column
shows
the
percentage
ofmanually-collected
instances
(
M
)
that
are
also
extracted
automatically
(
E
)
.
In
the
case
of
the
class
SearchEngine
,
16
of
the
25
manually-collected
instances
are
among
the
133
automatically-extracted
instances
of
the
same
class
,
Examples
of
Attributes
investors
:
investment
strategies
religious
leaders
:
coat
of
arms
designers
:
Stephanie
Class
:
Holiday
Table
4
:
Labels
for
assessing
attribute
correctness
which
corresponds
to
a
relative
coverage
of
64
%
of
the
manually-collected
instance
set
.
Some
instances
may
occur
within
the
manually-collected
set
but
not
the
automatically-extracted
set
(
e.g.
,
zoominfo
and
brainbost
for
the
class
SearchEngine
)
or
,
more
frequently
,
vice-versa
(
e.g.
,
surfwax
,
blinkx
,
entireweb
,
web
wombat
,
exalead
etc.
)
.
Overall
,
the
relative
coverage
of
automatically-extracted
instance
sets
with
respect
to
manually-collected
instance
sets
is
26.89
%
,
as
an
average
over
the
37
gold-standard
classes
.
More
significantly
,
the
size
advantage
of
automatically-extracted
instance
sets
is
not
the
undesirable
result
of
those
sets
containing
many
spurious
instances
.
Indeed
,
the
manual
inspection
of
the
automatically-extracted
instances
sets
indicates
an
average
accuracy
of
79.3
%
over
the
37
gold-standard
classes
retained
in
the
experiments
.
To
summarize
,
the
method
proposed
in
this
paper
acquires
open-domain
classes
from
unstructured
text
of
arbitrary
quality
,
without
a-priori
restrictions
to
specific
domains
of
interest
and
with
virtually
no
supervision
(
except
for
the
ubiquitous
Is-A
extraction
patterns
)
,
at
accuracy
levels
of
around
90
%
for
class
labels
and
80
%
for
class
instances
.
3.3
Evaluation
of
Class
Attributes
Extraction
Parameters
:
Given
a
target
class
specified
as
a
set
of
instances
and
a
set
of
five
seed
attributes
for
a
class
(
e.g.
,
{
quality
,
speed
,
number
of
users
,
market
share
,
reliability
}
for
SearchEngine
)
,
the
method
described
in
Section
2.2
extracts
ranked
lists
of
class
attributes
from
the
input
query
logs
.
Internally
,
the
ranking
uses
Jensen-Shannon
(
Lee
,
1999
)
to
compute
similarity
scores
between
internal
representations
of
seed
attributes
,
on
one
hand
,
and
each
of
the
candidate
attributes
,
on
the
other
hand
.
Evaluation
Procedure
:
To
remove
any
possible
bias
towards
higher-ranked
attributes
during
the
assessment
of
class
attributes
,
the
ranked
lists
of
attributes
to
be
evaluated
are
sorted
alphabetically
into
a
merged
list
.
Each
attribute
of
the
merged
list
is
autottia
icall
a
ext
nbled
ins
acted
ins
tb
edlins
actedlins
Class
:
Mountain
Class
:
Average-Class
auttoatnau
tbledlins
acted
ins
aanncceess
Figure
3
:
Accuracy
of
attributes
extracted
based
on
manually
assembled
,
gold
standard
(
M
)
vs.
automatically
extracted
(
E
)
instance
sets
,
for
a
few
target
classes
(
leftmost
graphs
)
and
as
an
average
over
all
(
37
)
target
classes
(
rightmost
graphs
)
.
Seed
attributes
are
provided
as
input
for
each
target
class
(
top
graphs
)
,
or
for
only
one
target
class
(
bottom
graphs
)
manually
assigned
a
correctness
label
within
its
respective
class
.
An
attribute
is
vital
if
it
must
be
present
in
an
ideal
list
of
attributes
of
the
class
;
okay
if
it
provides
useful
but
non-essential
information
;
and
wrong
if
it
is
incorrect
.
To
compute
the
overall
precision
score
over
a
ranked
list
of
extracted
attributes
,
the
correctness
labels
are
converted
to
numeric
values
as
shown
in
Table
4
.
Precision
at
some
rank
N
in
the
list
is
thus
measured
as
the
sum
of
the
assigned
values
of
the
first
N
candidate
attributes
,
divided
by
N.
Accuracy
of
Class
Attributes
:
Figure
3
plots
precision
values
for
ranks
1
through
50
of
the
lists
of
attributes
extracted
through
several
runs
over
the
37
gold-standard
classes
described
in
the
previous
section
.
The
runs
correspond
to
different
amounts
of
supervision
,
specified
through
a
particular
choice
in
the
number
of
seed
attributes
,
and
in
the
source
of
instances
passed
as
input
to
the
system
:
manually
assembled
instances
automatically
extracted
instances
Precision
Top
Ten
Extracted
Attributes
Class
Label
=
{
Set
of
Instances
}
accounting
systems
=
{
flexcube
,
myob
,
oracle
fi
nancials
,
peachtree
accounting
,
sybiz
,
.
.
.
}
overview
,
architecture
,
interview
questions
,
free
downloads
,
Canadian
version
,
passwords
,
modules
,
crystal
reports
,
property
management
,
free
trial
antimicrobials
=
{
azithromycin
,
chloramphenicol
,
fusidic
acid
,
quinolones
,
sulfa
drugs
,
.
.
.
}
chemical
formula
,
chemical
structure
,
history
,
invention
,
inventor
,
defi
nition
,
mechanism
of
action
,
side-effects
,
uses
,
shelf
life
civilizations
=
{
ancient
greece
,
chaldeans
,
etruscans
,
inca
indians
,
roman
republic
,
.
.
.
}
social
pyramid
,
climate
,
geography
,
flag
,
population
,
social
structure
,
natural
resources
,
family
life
,
god
,
goddesses
digestive
system
,
evolution
,
domestication
,
gestation
period
,
scientifi
c
name
,
adaptations
,
coloring
pages
,
p
*
*
,
body
parts
,
selective
breeding
forages
=
{
alsike
clover
,
rye
grass
,
tall
fescue
,
sericea
lespedeza
,
.
.
.
}
types
,
picture
,
weed
control
,
planting
,
uses
,
information
,
herbicide
,
germination
,
care
,
fertilizer
Table
5
:
Precision
of
attributes
extracted
for
a
sample
of
25
classes
.
Seed
attributes
are
provided
for
only
one
class
.
•
source
of
input
instance
sets
:
the
instance
sets
for
each
class
are
either
manually
collected
(
M
from
Table
3
)
,
or
automatically
extracted
(
E
from
Table
3
)
.
The
choices
correspond
to
the
two
curves
plotted
in
each
graph
in
Figure
3
.
The
graphs
in
Figure
3
show
the
precision
over
individual
target
classes
(
leftmost
graphs
)
,
and
as
an
average
over
all
37
classes
(
rightmost
graphs
)
.
As
expected
,
the
precision
of
the
extracted
attributes
as
an
average
over
all
classes
is
best
when
the
input
instance
sets
are
hand-picked
(
M
)
,
as
opposed
to
automatically
extracted
(
E
)
.
However
,
the
loss
of
precision
from
M
to
E
is
small
at
all
measured
ranks
.
Table
5
offers
an
alternative
view
on
the
quality
of
the
attributes
extracted
for
a
random
sample
of
25
classes
out
of
the
larger
set
of
4,583
classes
acquired
from
text
.
The
25
classes
are
passed
as
input
for
attribute
extraction
without
modifications
.
In
particular
,
the
instance
sets
are
not
manually
post-filtered
or
otherwise
changed
in
any
way
.
To
keep
the
time
required
to
judge
the
correctness
of
all
extracted
attributes
within
reasonable
limits
,
the
evaluation
considers
only
the
top
20
(
rather
than
50
)
attributes
extracted
per
class
.
As
shown
in
Table
5
,
the
method
proposed
in
this
paper
acquires
attributes
for
automatically-extracted
,
open-domain
classes
,
without
a-priori
restrictions
to
specific
domains
of
interest
and
relying
on
only
five
seed
attributes
specified
for
only
one
class
,
at
accuracy
levels
reaching
70
%
at
rank
10
,
and
67
%
at
rank
20
.
4
Related
Work
4.1
Acquisition
of
Classes
of
Instances
Although
some
researchers
focus
on
re-organizing
or
extending
classes
of
instances
already
available
explicitly
within
manually-built
resources
such
as
Wikipedia
(
Ponzetto
and
Strube
,
2007
)
or
Word-Net
(
Snow
et
al.
,
2006
)
or
both
(
Suchanek
et
al.
,
2007
)
,
a
large
body
of
previous
work
focuses
on
compiling
sets
of
instances
,
not
necessarily
labeled
,
from
unstructured
text
.
The
extraction
proceeds
either
iteratively
by
starting
from
a
few
seed
extraction
rules
(
Collins
and
Singer
,
1999
)
,
or
by
mining
named
entities
from
comparable
news
articles
(
Shinyama
and
Sekine
,
2004
)
or
from
multilingual
corpora
(
Klementiev
and
Roth
,
2006
)
.
A
bootstrapping
method
(
Riloff
and
Jones
,
1999
)
cautiously
grows
very
small
seed
sets
of
five
instances
of
the
same
class
,
to
fewer
than
300
items
after
50
consecutive
iterations
,
with
a
final
precision
varying
between
46
%
and
76
%
depending
on
the
type
of
semantic
lexicon
.
Experimental
results
from
(
Feldman
and
Rosenfeld
,
2006
)
indicate
that
named
entity
recognizers
can
boost
the
performance
of
weakly
supervised
extraction
of
class
instances
,
but
only
for
a
few
coarse-grained
types
such
as
Person
and
only
if
they
are
simpler
to
recognize
in
text
(
Feldman
and
Rosenfeld
,
2006
)
.
In
(
Cafarella
et
al.
,
2005
)
,
handcrafted
extraction
patterns
are
applied
to
a
collection
of
60
million
Web
documents
to
extract
instances
of
the
classes
Company
and
Country
.
Based
on
the
manual
evaluation
of
samples
of
extracted
instances
,
an
estimated
number
of
1,116
instances
of
Company
are
extracted
at
a
precision
score
of
90
%
.
In
comparison
,
the
approach
ofthis
paper
pursues
a
more
aggressive
goal
,
by
extracting
a
larger
and
more
diverse
number
of
labeled
classes
,
whose
instances
are
often
more
difficult
to
extract
than
country
names
and
most
company
names
,
at
precision
scores
of
almost
80
%
.
The
task
of
extracting
relevant
labels
to
describe
sets
of
documents
,
rather
than
sets
of
instances
,
is
explored
in
(
Treeratpituk
and
Callan
,
2006
)
.
Given
pre-existing
sets
of
instances
,
(
Pantel
and
Ravichan-dran
,
2004
)
investigates
the
task
of
acquiring
appropriate
class
labels
to
the
sets
from
unstructured
text
.
Various
class
labels
are
assigned
to
a
total
of
1,432
sets
of
instances
.
The
accuracy
of
the
class
labels
is
computed
over
a
sample
of
instances
,
by
manually
assessing
the
correctness
of
the
top
five
labels
returned
by
the
system
for
each
instance
.
The
resulting
mean
reciprocal
rank
of
77
%
gives
partial
credit
to
labels
of
an
evaluated
instance
,
even
if
only
the
fourth
or
fifth
assigned
labels
are
correct
.
Our
evaluation
of
the
accuracy
of
class
labels
is
stricter
,
as
it
considers
only
one
class
label
of
a
given
instance
at
a
time
,
rather
than
a
pool
of
the
best
candidate
labels
.
As
a
pre-requisite
to
extracting
relations
among
pairs
of
classes
,
the
method
described
in
(
Davidov
et
al.
,
2007
)
extracts
class
instances
from
unstructured
Web
documents
,
by
submitting
pairs
of
instances
as
queries
and
analyzing
the
contents
of
the
top
1,000
documents
returned
by
a
Web
search
engine
.
For
each
target
class
,
a
small
set
of
instances
must
be
provided
manually
as
seeds
.
As
such
,
the
method
can
be
applied
to
the
task
of
extracting
a
large
set
of
open-domain
classes
only
after
manually
enumerating
through
the
entire
set
of
target
classes
,
and
providing
seed
instances
for
each
.
Furthermore
,
no
attempt
is
made
to
extract
relevant
class
labels
for
the
sets
of
instances
.
Comparatively
,
the
open-domain
classes
extracted
in
our
paper
have
an
explicit
label
in
addition
to
the
sets
of
instances
,
and
do
not
require
identifying
the
range
of
the
target
classes
in
advance
,
or
providing
any
seed
instances
as
input
.
The
evaluation
methodology
is
also
quite
dif
-
ferent
,
as
the
instance
sets
acquired
based
on
the
input
seed
instances
in
(
Davidov
et
al.
,
2007
)
are
only
evaluated
for
three
hand-picked
classes
,
with
precision
scores
of
90
%
for
names
of
countries
,
87
%
for
fish
species
and
68
%
for
instances
of
constellations
.
Our
evaluation
of
the
accuracy
of
class
instances
is
again
stricter
,
since
the
evaluation
sample
is
larger
,
and
includes
more
varied
classes
,
whose
instances
are
sometimes
more
difficult
to
identify
in
text
.
4.2
Acquisition
of
Class
Attributes
Previous
work
on
the
automatic
acquisition
of
attributes
for
open-domain
classes
from
text
is
less
general
than
the
extraction
method
and
experiments
presented
in
our
paper
.
Indeed
,
previous
evaluations
were
restricted
to
small
sets
of
classes
(
forty
classes
in
(
Pa
§
ca
,
2007
)
)
,
whereas
our
evaluations
also
consider
a
random
,
more
diverse
sample
of
open-domain
classes
.
More
importantly
,
by
dropping
the
requirement
of
manually
providing
a
small
set
of
seed
attributes
for
each
target
class
,
and
relying
on
only
a
few
seed
attributes
specified
for
one
reference
class
,
we
harvest
class
attributes
without
the
need
of
first
determining
what
the
classes
should
be
,
what
instances
they
should
contain
,
and
from
which
resources
the
instances
should
be
collected
.
5
Conclusion
In
a
departure
from
previous
approaches
to
large-scale
information
extraction
from
unstructured
text
on
the
Web
,
this
paper
introduces
a
weakly-supervised
extraction
framework
for
mining
useful
knowledge
from
a
combination
of
both
documents
and
search
query
logs
.
In
evaluations
over
labeled
classes
of
instances
extracted
without
a-priori
restrictions
to
specific
domains
of
interest
and
with
very
little
supervision
,
the
accuracy
exceeds
90
%
for
class
labels
,
approaches
80
%
for
class
instances
,
and
exceeds
70
%
(
at
rank
10
)
and
67
%
(
at
rank
20
)
for
class
attributes
.
Current
work
aims
at
expanding
the
number
of
instances
within
each
class
while
retaining
similar
precision
levels
;
extracting
attributes
with
more
consistent
precision
scores
across
classes
from
different
domains
;
and
introducing
confidence
scores
in
attribute
extraction
,
allowing
for
the
detection
of
classes
for
which
it
is
unlikely
to
extract
large
numbers
of
useful
attributes
from
text
.
