This
paper
demonstrates
a
new
method
for
leveraging
free-text
annotations
to
infer
semantic
properties
of
documents
.
Free-text
annotations
are
becoming
increasingly
abundant
,
due
to
the
recent
dramatic
growth
in
semi-structured
,
user-generated
online
content
.
An
example
of
such
content
is
product
reviews
,
which
are
often
annotated
by
their
authors
with
pros
/
cons
keyphrases
such
as
"
a
real
bargain
"
or
"
good
value
.
"
To
exploit
such
noisy
annotations
,
we
simultaneously
find
a
hidden
paraphrase
structure
of
the
keyphrases
,
a
model
of
the
document
texts
,
and
the
underlying
semantic
properties
that
link
the
two
.
This
allows
us
to
predict
properties
of
unannotated
documents
.
Our
approach
is
implemented
as
a
hierarchical
Bayesian
model
with
joint
inference
,
which
increases
the
robustness
of
the
keyphrase
clustering
and
encourages
the
document
model
to
correlate
with
semantically
meaningful
properties
.
We
perform
several
evaluations
of
our
model
,
and
find
that
it
substantially
outperforms
alternative
approaches
.
1
Introduction
A
central
problem
in
language
understanding
is
transforming
raw
text
into
structured
representations
.
Learning-based
approaches
have
dramatically
increased
the
scope
and
robustness
of
this
type
of
automatic
language
processing
,
but
they
are
typically
dependent
on
large
expert-annotated
datasets
,
which
are
costly
to
produce
.
In
this
paper
,
we
show
how
novice-generated
free-text
annotations
available
online
can
be
leveraged
to
automatically
infer
document-level
semantic
properties
.
With
the
rapid
increase
of
online
content
created
by
end
users
,
noisy
free-text
annotations
have
become
widely
available
(
Vickery
and
Wunsch-Vincent
,
2007
;
Sterling
,
2005
)
.
pros / cons : great nutritional value
friendly service , cleanliness , great nutrition . . .
pros / cons : a bit pricey , healthy
. . . is an awesome place to go if you are health conscious . They have some really great low calorie dishes and they publish the calories and fat grams per serving .
Figure 1 : Excerpts from online restaurant reviews with pros / cons phrase lists . Both reviews discuss healthiness , but use different keyphrases .
For
example
,
consider
reviews
of
consumer
products
and
services
.
Often
,
such
reviews
are
annotated
with
keyphrase
lists
of
pros
and
cons
.
We
would
like
to
use
these
keyphrase
lists
as
training
labels
,
so
that
the
properties
of
unannotated
reviews
can
be
predicted
.
Having
such
a
system
would
facilitate
structured
access
and
summarization
of
this
data
.
However
,
novice-generated
keyphrase
annotations
are
incomplete
descriptions
of
their
corresponding
review
texts
.
Furthermore
,
they
lack
consistency
:
the
same
underlying
property
may
be
expressed
in
many
ways
,
e.g.
,
"
healthy
"
and
"
great
nutritional
value
"
(
see
Figure
1
)
.
To
take
advantage
of
such
noisy
labels
,
a
system
must
both
uncover
their
hidden
clustering
into
properties
,
and
learn
to
predict
these
properties
from
review
text
.
This
paper
presents
a
model
that
addresses
both
problems
simultaneously
.
We
assume
that
both
the
document
text
and
the
selection
of
keyphrases
are
governed
by
the
underlying
hidden
properties
of
the
document
.
Each
property
indexes
a
language
model
,
thus
allowing
documents
that
incorporate
the
same
property
to
share
similar
features
.
In
addition
,
each
keyphrase
is
associated
with
a
property
;
keyphrases
that
are
associated
with
the
same
property
should
have
similar
distributional
and
surface
features
.
We
link
these
two
ideas
in
a
joint
hierarchical
Bayesian
model
.
Keyphrases
are
clustered
based
on
their
distributional
and
lexical
properties
,
and
a
hidden
topic
model
is
applied
to
the
document
text
.
Crucially
,
the
keyphrase
clusters
and
document
topics
are
linked
,
and
inference
is
performed
jointly
.
This
increases
the
robustness
of
the
keyphrase
clustering
,
and
ensures
that
the
inferred
hidden
topics
are
indicative
of
salient
semantic
properties
.
Our
model
is
broadly
applicable
to
many
scenarios
where
documents
are
annotated
in
a
noisy
manner
.
In
this
work
,
we
apply
our
method
to
a
collection
of
reviews
in
two
categories
:
restaurants
and
cell
phones
.
The
training
data
consists
of
review
text
and
the
associated
pros
/
cons
lists
.
We
then
evaluate
the
ability
of
our
model
to
predict
review
properties
when
the
pros
/
cons
list
is
hidden
.
Across
a
variety
of
evaluation
scenarios
,
our
algorithm
consistently
outperforms
alternative
strategies
by
a
wide
margin
.
2
Related
Work
Review
Analysis
Our
approach
relates
to
previous
work
on
property
extraction
from
reviews
(
Popescu
et
al.
,
2005
;
Hu
and
Liu
,
2004
;
Kim
and
Hovy
,
2006
)
.
These
methods
extract
lists
of
phrases
,
which
are
analogous
to
the
keyphrases
we
use
as
input
to
our
algorithm
.
However
,
our
approach
is
distinguished
in
two
ways
:
first
,
we
are
able
to
predict
keyphrases
beyond
those
that
appear
verbatim
in
the
text
.
Second
,
our
approach
learns
the
relationships
between
keyphrases
,
allowing
us
to
draw
direct
comparisons
between
reviews
.
Bayesian
Topic
Modeling
One
aspect
of
our
model
views
properties
as
distributions
over
words
in
the
document
.
This
approach
is
inspired
by
methods
in
the
topic
modeling
literature
,
such
as
Latent
Dirichlet
Allocation
(
LDA
)
(
Blei
et
al.
,
2003
)
,
where
topics
are
treated
as
hidden
variables
that
govern
the
distribution
of
words
in
a
text
.
Our
algorithm
extends
this
notion
by
biasing
the
induced
hidden
topics
toward
a
clustering
of
known
keyphrases
.
Tying
these
two
information
sources
together
enhances
the
robustness
of
the
hidden
topics
,
thereby
increasing
the
chance
that
the
induced
structure
corresponds
to
semantically
meaningful
properties
.
Recent
work
has
examined
coupling
topic
models
with
explicit
supervision
(
Blei
and
McAuliffe
,
2007
;
Titov
and
McDonald
,
2008
)
.
However
,
such
approaches
assume
that
the
documents
are
labeled
within
a
predefined
annotation
structure
,
e.g.
,
the
properties
of
food
,
ambiance
,
and
service
for
restaurants
.
In
contrast
,
we
address
free-text
annotations
created
by
end
users
,
without
known
semantic
properties
.
Rather
than
requiring
a
predefined
annotation
structure
,
our
model
infers
one
from
the
data
.
3
Problem
Formulation
We
formulate
our
problem
as
follows
.
We
assume
a
dataset
composed
of
documents
with
associated
keyphrases
.
Each
document
may
be
marked
with
multiple
keyphrases
that
express
unseen
semantic
properties
.
Across
the
entire
collection
,
several
keyphrases
may
express
the
same
property
.
The
keyphrases
are
also
incomplete
—
review
texts
often
express
properties
that
are
not
mentioned
in
their
keyphrases
.
At
training
time
,
our
model
has
access
to
both
text
and
keyphrases
;
at
test
time
,
the
goal
is
to
predict
the
properties
supported
by
a
previously
unseen
document
.
We
can
then
use
this
property
list
to
generate
an
appropriate
set
of
keyphrases
.
4
Model
Description
Our
approach
leverages
both
keyphrase
clustering
and
distributional
analysis
of
the
text
in
a
joint
,
hierarchical
Bayesian
model
.
Keyphrases
are
drawn
from
a
set
of
clusters
;
words
in
the
documents
are
drawn
from
language
models
indexed
by
a
set
of
topics
,
where
the
topics
correspond
to
the
keyphrase
clusters
.
Crucially
,
we
bias
the
assignment
of
hidden
topics
in
the
text
to
be
similar
to
the
topics
represented
by
the
keyphrases
of
the
document
,
but
we
permit
some
words
to
be
drawn
from
other
topics
not
represented
by
the
keyphrases
.
This
flexibility
in
the
coupling
allows
the
model
to
learn
effectively
in
the
presence
of
incomplete
keyphrase
annotations
,
while
still
encouraging
the
keyphrase
clustering
to
cohere
with
the
topics
supported
by
the
text
.
We
train
the
model
on
documents
annotated
with
keyphrases
.
During
training
,
we
learn
a
hidden
topic
model
from
the
text
;
each
topic
is
also
associated
with
a
cluster
of
keyphrases
.
Figure 2 variables : $\psi$ keyphrase cluster model ; $x$ keyphrase cluster assignment ; $s$ keyphrase similarity values ; $h$ document keyphrases ; $\eta$ document keyphrase topics ; $\lambda$ probability of selecting $\eta$ instead of $\phi$ ; $c$ selects between $\eta$ and $\phi$ for word topics ; $\phi$ document topic model ; $z$ word topic assignment ; $\theta$ language models of each topic ; $w$ document words .
Figure 2 distributions : $\psi \sim \mathrm{Dirichlet}(\psi_0)$ ; $x_\ell \sim \mathrm{Multinomial}(\psi)$ ; $\theta_k \sim \mathrm{Dirichlet}(\theta_0)$ ; $w_{d,n} \sim \mathrm{Multinomial}(\theta_{z_{d,n}})$ .
Figure 2 : The plate diagram for our model . Shaded circles denote observed variables , and squares denote hyperparameters . The dotted arrows indicate that $\eta$ is constructed deterministically from $x$ and $h$ .
At
test
time
,
we
are
presented
with
documents
that
do
not
contain
keyphrase
annotations
.
The
hidden
topic
model
of
the
review
text
is
used
to
determine
the
properties
that
a
document
as
a
whole
supports
.
For
each
property
,
we
compute
the
proportion
of
the
document
's
words
assigned
to
it
.
Properties
with
proportions
above
a
set
threshold
(
tuned
on
a
development
set
)
are
predicted
as
being
supported
.
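The test-time decision rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `predict_properties` and the threshold values are hypothetical, and the per-word topic assignments are assumed to come from the trained topic model.

```python
from collections import Counter


def predict_properties(word_topics, thresholds):
    """Return the topics whose share of the document's words exceeds a threshold.

    word_topics: one topic id per word (assumed output of the topic model).
    thresholds: dict mapping topic id -> minimum proportion (hypothetical
    values; the paper tunes the threshold on a development set).
    """
    if not word_topics:
        return set()
    counts = Counter(word_topics)
    total = len(word_topics)
    return {k for k, t in thresholds.items() if counts[k] / total > t}


# Example: topic 0 covers 3/6 of the words, topic 1 covers 2/6, topic 2 covers 1/6.
predicted = predict_properties([0, 0, 1, 2, 0, 1], {0: 0.4, 1: 0.4, 2: 0.1})
```

With these illustrative thresholds, topics 0 and 2 are predicted as supported and topic 1 is not.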
4.1
Keyphrase
Clustering
One
of
our
goals
is
to
cluster
the
keyphrases
,
such
that
each
cluster
corresponds
to
a
well-defined
property
.
We
represent
each
distinct
keyphrase
as
a
vector
of
similarity
scores
computed
over
the
set
of
observed
keyphrases
;
these
scores
are
represented
by
s
in
Figure
2
,
the
plate
diagram
of
our
model
(
see
footnote
1
)
.
Modeling
the
similarity
matrix
rather
than
the
surface
forms
allows
arbitrary
comparisons
between
keyphrases
,
e.g.
,
permitting
the
use
of
both
lexical
and
distributional
information
.
Footnote 1 : We assume that similarity scores are conditionally independent given the keyphrase clustering , though the scores are in fact related . Such simplifying assumptions have been previously used with success in NLP ( e.g. , Toutanova and Johnson , 2007 ) , though a more theoretically sound treatment of the similarity matrix is an area for future research .
The
lexical
comparison
is
based
on
the
cosine
similarity
between
the
keyphrase
words
.
The
distributional
similarity
is
quantified
in
terms
of
the
co-occurrence
of
keyphrases
across
review
texts
.
Our
model
is
inherently
capable
of
using
any
arbitrary
source
of
similarity
information
;
for
a
discussion
of
similarity
metrics
,
see
Lin
(
1998
)
.
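As a rough illustration of the lexical comparison, the sketch below computes cosine similarity between bag-of-words vectors of two keyphrases. The whitespace tokenization and lowercasing are assumptions made here for brevity; the text does not specify them, and the distributional (co-occurrence) score is not shown.

```python
import math
from collections import Counter


def lexical_similarity(phrase_a, phrase_b):
    """Cosine similarity between the word-count vectors of two keyphrases."""
    a = Counter(phrase_a.lower().split())
    b = Counter(phrase_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# The two phrases share one word ("value") out of three and two words.
score = lexical_similarity("great nutritional value", "good value")
```

Phrases with no words in common, such as "healthy" and "a bit pricey", score zero under this measure, which is why a distributional signal is useful alongside it.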
4.2
Document-level
Distributional
Analysis
Our
analysis
of
the
document
text
is
based
on
probabilistic
topic
models
such
as
LDA
(
Blei
et
al.
,
2003
)
.
In
the
LDA
framework
,
each
word
is
generated
from
a
language
model
that
is
indexed
by
the
word
's
topic
assignment
.
Thus
,
rather
than
identifying
a
single
topic
for
a
document
,
LDA
identifies
a
distribution
over
topics
.
Our
word
model
operates
similarly
,
identifying
a
topic
for
each
word
,
written
as
z
in
Figure
2
.
To
tie
these
topics
to
the
keyphrases
,
we
deterministically
construct
a
document-specific
topic
distribution
from
the
clusters
represented
by
the
document
's
keyphrases
;
this
is
$\eta$
in
the
figure
.
$\eta$
assigns
equal
probability
to
all
topics
that
are
represented
in
the
keyphrases
,
and
a
small
smoothing
probability
to
other
topics
.
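The deterministic construction of the keyphrase topic distribution can be sketched as follows. The smoothing value `eps` and the dictionary `cluster_of` (keyphrase to cluster id) are hypothetical names and values chosen for illustration; the text only states that unrepresented topics receive a small probability.

```python
def build_eta(doc_keyphrases, cluster_of, num_topics, eps=0.01):
    """Build the keyphrase topic distribution for one document.

    Topics represented by the document's keyphrases get equal weight;
    every other topic gets a small smoothing weight eps (assumed value).
    """
    represented = {cluster_of[p] for p in doc_keyphrases}
    weights = [1.0 if k in represented else eps for k in range(num_topics)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical clustering: two "healthiness" phrases share cluster 0.
cluster_of = {"healthy": 0, "great nutritional value": 0, "a bit pricey": 2}
eta = build_eta(["a bit pricey", "healthy"], cluster_of, num_topics=4)
```

Because the distribution depends only on the cluster assignments and the observed keyphrases, it has to be rebuilt whenever a keyphrase changes cluster during sampling.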
As
noted
above
,
properties
may
be
expressed
in
the
text
even
when
no
related
keyphrase
appears
.
For
this
reason
,
we
also
construct
a
document-specific
topic
distribution
$\phi$
.
The
auxiliary
variable
c
indicates
whether
a
given
word
's
topic
is
drawn
from
the
set
of
keyphrase
clusters
,
or
from
this
topic
distribution
.
4.3
Generative
Process
In
this
section
,
we
describe
the
underlying
generative
process
more
formally
.
First
we
consider
the
set
of
all
keyphrases
observed
across
the
entire
corpus
,
of
which
there
are
L.
We
draw
a
multinomial
distribution
$\psi$
over
the
K
keyphrase
clusters
from
a
symmetric
Dirichlet
prior
$\psi_0$
.
Then
for
the
$\ell$th
keyphrase
,
a
cluster
assignment
$x_\ell$
is
drawn
from
the
multinomial
$\psi$
.
Finally
,
the
similarity
matrix
$s \in [0,1]^{L \times L}$
is
constructed
.
Each
entry
$s_{\ell,\ell'}$
is
drawn
independently
,
depending
on
the
cluster
assignments
$x_\ell$
and
$x_{\ell'}$
.
Specifically
,
$s_{\ell,\ell'}$
is
drawn
from
a
Beta
distribution
with
parameters
$\alpha_{=}$
if
$x_\ell = x_{\ell'}$
and
$\alpha_{\neq}$
otherwise
.
The
parameters
$\alpha_{=}$
linearly
bias
$s_{\ell,\ell'}$
towards
one
(
$\mathrm{Beta}(\alpha_{=}) = \mathrm{Beta}(2,1)$
)
,
and
the
parameters
$\alpha_{\neq}$
linearly
bias
$s_{\ell,\ell'}$
towards
zero
(
$\mathrm{Beta}(\alpha_{\neq}) = \mathrm{Beta}(1,2)$
)
.
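To see why Beta(2,1) and Beta(1,2) bias the similarity scores linearly, the sketch below evaluates the two densities directly: Beta(2,1) has density 2s and Beta(1,2) has density 2(1 - s). The helper names are illustrative and not taken from the paper.

```python
import math


def beta_pdf(s, a, b):
    """Density of a Beta(a, b) distribution at s in [0, 1]."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * s ** (a - 1) * (1 - s) ** (b - 1)


def similarity_likelihood(s, same_cluster):
    """Likelihood of one similarity score given whether the two
    keyphrases share a cluster: Beta(2,1) if so, Beta(1,2) otherwise."""
    return beta_pdf(s, 2, 1) if same_cluster else beta_pdf(s, 1, 2)


# A high similarity score is far more likely under "same cluster".
same = similarity_likelihood(0.9, True)
different = similarity_likelihood(0.9, False)
```

For a score of 0.9 the same-cluster likelihood is nine times the different-cluster likelihood, so highly similar keyphrases are pulled together without being forced into one cluster.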
Next
,
the
words
in
each
of
the
D
documents
are
generated
.
Document
d
has
$N_d$
words
;
$z_{d,n}$
is
the
topic
for
word
$w_{d,n}$
.
These
latent
topics
are
drawn
either
from
the
set
of
clusters
represented
by
the
document
's
keyphrases
,
or
from
the
document
's
topic
model
$\phi_d$
.
We
deterministically
construct
a
document-specific
keyphrase
topic
model
$\eta_d$
,
based
on
the
keyphrase
cluster
assignments
$x$
and
the
observed
keyphrases
$h_d$
.
The
multinomial
$\eta_d$
assigns
equal
probability
to
each
topic
that
is
represented
by
a
phrase
in
$h_d$
,
and
a
small
probability
to
other
topics
.
As
noted
earlier
,
a
document
's
text
may
support
properties
that
are
not
mentioned
in
its
observed
keyphrases
.
For
that
reason
,
we
draw
a
document
topic
multinomial
$\phi_d$
from
a
symmetric
Dirichlet
prior
$\phi_0$
.
The
binary
auxiliary
variable
$c_{d,n}$
determines
whether
the
word
's
topic
is
drawn
from
the
keyphrase
model
$\eta_d$
or
the
document
topic
model
$\phi_d$
.
$c_{d,n}$
is
drawn
from
a
weighted
coin
flip
,
with
probability
$\lambda$
;
$\lambda$
is
drawn
from
a
Beta
distribution
with
prior
$\lambda_0$
.
We
have
$z_{d,n} \sim \eta_d$
if
$c_{d,n} = 1$
,
and
$z_{d,n} \sim \phi_d$
otherwise
.
Finally
,
the
word
$w_{d,n}$
is
drawn
from
the
multinomial
$\theta_{z_{d,n}}$
,
where
$z_{d,n}$
indexes
a
topic-specific
language
model
.
Each
of
the
K
language
models
$\theta_k$
is
drawn
from
a
symmetric
Dirichlet
prior
$\theta_0$
.
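The word-level part of the generative process can be sketched as a short sampling loop. This is an illustrative simulation under assumed inputs: the function name, the list-of-lists layout for the language models, and the toy parameter values are all hypothetical, and the keyphrase topic distribution and document topic distribution are taken as given.

```python
import random


def generate_document(eta_d, phi_d, theta, lam, num_words, rng=random):
    """Sample (c, z, w) for each word of one document.

    eta_d: keyphrase topic distribution; phi_d: document topic distribution;
    theta: list of K word distributions; lam: probability of choosing eta_d.
    """
    words, topics, sources = [], [], []
    for _ in range(num_words):
        c = 1 if rng.random() < lam else 0        # weighted coin flip
        prior = eta_d if c == 1 else phi_d        # which topic distribution to use
        z = rng.choices(range(len(theta)), weights=prior)[0]
        w = rng.choices(range(len(theta[z])), weights=theta[z])[0]
        sources.append(c)
        topics.append(z)
        words.append(w)
    return words, topics, sources


# Toy setting with 2 topics and a 3-word vocabulary.
theta = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
words, topics, sources = generate_document(
    [0.9, 0.1], [0.5, 0.5], theta, lam=0.8, num_words=20, rng=random.Random(0)
)
```

Setting `lam` close to one makes almost every word topic come from the keyphrase clusters, which corresponds to the restricted variant discussed later in the results.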
5
Posterior
Sampling
Ultimately
,
we
need
to
compute
the
model
's
posterior
distribution
given
the
training
data
.
Doing
so
analytically
is
intractable
due
to
the
complexity
of
the
model
,
but
sampling-based
techniques
can
be
used
to
estimate
the
posterior
.
We
employ
Gibbs
sampling
,
previously
used
in
NLP
by
Finkel
et
al.
(
2005
)
and
Goldwater
et
al.
(
2006
)
,
among
others
.
This
technique
repeatedly
samples
from
the
conditional
distributions
of
each
hidden
variable
,
eventually
converging
on
a
Markov
chain
whose
stationary
distribution
is
the
posterior
distribution
of
the
hidden
variables
in
the
model
(
Gelman
et
al.
,
2004
)
.
We
now
present
sampling
equations
for
each
of
the
hidden
variables
in
Figure
2
.
The
prior
over
keyphrase
clusters
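$\psi$
is
sampled
based
on
hyperprior
$\psi_0$
and
keyphrase
cluster
assignments
$x$
.
We
write
$p(\psi \mid \ldots)$
to
mean
the
probability
conditioned
on
all
the
other
variables
.
\begin{align*}
p(\psi \mid \ldots) &\propto p(\psi \mid \psi_0)\, p(x \mid \psi) \\
&= p(\psi \mid \psi_0) \prod_{\ell} p(x_\ell \mid \psi) \\
&= \mathrm{Dir}(\psi; \psi_0) \prod_{\ell} \mathrm{Mul}(x_\ell; \psi) \\
&= \mathrm{Dir}(\psi; \psi') ,
\end{align*}
where
$\psi'_i = \psi_0 + \mathrm{count}(x_\ell = i)$
.
This
update
rule
is
due
to
the
conjugacy
of
the
multinomial
to
the
Dirichlet
distribution
.
The
first
line
follows
from
Bayes
'
rule
,
and
the
second
line
from
the
conditional
independence
of
each
keyphrase
assignment
$x_\ell$
from
the
others
,
given
$\psi$
.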
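The conjugate update for the prior over keyphrase clusters amounts to adding assignment counts to the symmetric hyperprior and drawing from the resulting Dirichlet. The sketch below assumes a scalar symmetric hyperprior and draws the Dirichlet sample by normalizing Gamma variates, a standard construction; function names are illustrative.

```python
import random
from collections import Counter


def dirichlet_posterior_params(prior, assignments, num_clusters):
    """Posterior Dirichlet parameters: prior + count(assignment = i) per cluster."""
    counts = Counter(assignments)
    return [prior + counts[i] for i in range(num_clusters)]


def sample_dirichlet(params, rng=random):
    """Draw from Dirichlet(params) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(draws)
    return [g / total for g in draws]


# Three keyphrases assigned to clusters 0, 0 and 2, with hyperprior 0.5.
params = dirichlet_posterior_params(0.5, [0, 0, 2], num_clusters=3)
psi = sample_dirichlet(params, rng=random.Random(0))
```

The same two helpers would apply to the document topic models and the language models, with the counts restricted as described in the text.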
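$$p(x_\ell \mid \ldots) \propto \mathrm{Mul}(x_\ell; \psi) \left[ \prod_{\ell' \neq \ell} \mathrm{Beta}(s_{\ell,\ell'}; \alpha_{x_\ell, x_{\ell'}}) \right] \left[ \prod_{d} \prod_{n : c_{d,n} = 1} \mathrm{Mul}(z_{d,n}; \eta_d) \right]$$
Figure 3 : The resampling equation for the keyphrase cluster assignments .
By
the
same
conjugacy
argument
,
the
document
topic
models
and
the
language
models
are
resampled
from
$\mathrm{Dir}(\phi_d; \phi'_d)$
and
$\mathrm{Dir}(\theta_k; \theta'_k)$
,
where
$\phi'_{d,i} = \phi_0 + \mathrm{count}(z_{d,n} = i \wedge c_{d,n} = 0)$
and
$\theta'_{k,i} = \theta_0 + \sum_d \mathrm{count}(w_{d,n} = i \wedge z_{d,n} = k)$
.
In
building
the
counts
for
$\phi'_{d,i}$
,
we
consider
only
cases
in
which
$c_{d,n} = 0$
,
indicating
that
the
topic
$z_{d,n}$
is
indeed
drawn
from
the
document
topic
model
$\phi_d$
.
Similarly
,
when
building
the
counts
for
$\theta'_k$
,
we
consider
only
cases
in
which
the
word
$w_{d,n}$
is
drawn
from
topic
$k$
.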
To
resample
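$\lambda$
,
we
employ
the
conjugacy
of
the
Beta
prior
to
the
Bernoulli
observation
likelihoods
,
adding
counts
of
$c$
to
the
prior
$\lambda_0$
.
$$p(\lambda \mid \ldots) = \mathrm{Beta}(\lambda; \lambda') , \quad \text{where } \lambda' = \lambda_0 + \begin{bmatrix} \sum_d \mathrm{count}(c_{d,n} = 1) \\ \sum_d \mathrm{count}(c_{d,n} = 0) \end{bmatrix} .$$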
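The Beta-Bernoulli update for the coin-flip probability can be sketched as below. It assumes the Beta prior is given as a pair of pseudo-counts and that the auxiliary indicators from all documents are pooled into one list; both are illustrative choices, as is the function name.

```python
import random


def resample_lambda(prior, c_values, rng=random):
    """Draw the coin-flip probability from its Beta posterior.

    prior: (a0, b0) pseudo-counts (assumed representation of the Beta prior).
    c_values: all auxiliary indicators, 1 = topic drawn from the keyphrase
    distribution, 0 = topic drawn from the document topic distribution.
    """
    a = prior[0] + sum(1 for c in c_values if c == 1)
    b = prior[1] + sum(1 for c in c_values if c == 0)
    return rng.betavariate(a, b), (a, b)


# Three words used the keyphrase distribution, one used the document one.
lam, posterior = resample_lambda((1.0, 1.0), [1, 1, 0, 1], rng=random.Random(0))
```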
The
keyphrase
cluster
assignments
are
represented
by
$x$
,
whose
sampling
distribution
depends
on
$\psi$
,
$s$
,
and
$z$
,
via
$\eta$
.
The
equation
is
shown
in
Figure
3
.
The
first
term
is
the
prior
on
$x_\ell$
.
The
second
term
encodes
the
dependence
of
the
similarity
matrix
$s$
on
the
cluster
assignments
;
with
slight
abuse
of
notation
,
we
write
$\alpha_{x_\ell, x_{\ell'}}$
to
denote
$\alpha_{=}$
if
$x_\ell = x_{\ell'}$
,
and
$\alpha_{\neq}$
otherwise
.
The
third
term
is
the
dependence
of
the
word
topics
$z_{d,n}$
on
the
topic
distribution
$\eta_d$
.
We
compute
the
final
result
of
Figure
3
for
each
possible
setting
of
$x_\ell$
,
and
then
sample
from
the
normalized
multinomial
.
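The "score every setting, then normalize" step can be sketched as follows. For brevity this illustration includes only the first two terms of the resampling equation (the cluster prior and the similarity likelihood) and omits the word-topic term, so it is not the full update; scores are accumulated in log space to avoid underflow, and all names are hypothetical.

```python
import math


def normalize_log_weights(log_weights):
    """Turn unnormalized log scores into a probability vector."""
    m = max(log_weights)
    exps = [math.exp(lw - m) for lw in log_weights]
    total = sum(exps)
    return [e / total for e in exps]


def cluster_log_scores(ell, x, psi, s):
    """Log score of each cluster for keyphrase ell (prior + similarity terms only)."""
    scores = []
    for k in range(len(psi)):
        score = math.log(psi[k])
        for other, x_other in enumerate(x):
            if other == ell:
                continue
            sim = min(max(s[ell][other], 1e-6), 1 - 1e-6)  # keep logs finite
            # Beta(2,1) density is 2s; Beta(1,2) density is 2(1 - s).
            score += math.log(2 * sim) if x_other == k else math.log(2 * (1 - sim))
        scores.append(score)
    return scores


# Keyphrase 0 is very similar to keyphrase 1 (cluster 0), unlike keyphrase 2.
s = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.1], [0.1, 0.1, 1.0]]
probs = normalize_log_weights(cluster_log_scores(0, [0, 0, 1], [0.5, 0.5], s))
```

A new assignment would then be drawn from `probs`, for example with `random.choices(range(len(probs)), weights=probs)`.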
The
word
topics
z
are
sampled
according
to
keyphrase
topic
distribution
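$\eta_d$
,
document
topic
distribution
$\phi_d$
,
words
$w$
,
and
auxiliary
variables
$c$
:
$$p(z_{d,n} \mid \ldots) \propto \begin{cases} \mathrm{Mul}(z_{d,n}; \eta_d)\, \mathrm{Mul}(w_{d,n}; \theta_{z_{d,n}}) & \text{if } c_{d,n} = 1 , \\ \mathrm{Mul}(z_{d,n}; \phi_d)\, \mathrm{Mul}(w_{d,n}; \theta_{z_{d,n}}) & \text{otherwise} . \end{cases}$$
As
with
$x_\ell$
,
each
$z_{d,n}$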
is
sampled
by
computing
the
conditional
likelihood
of
each
possible
setting
within
a
constant
of
proportionality
,
and
then
sampling
from
the
normalized
multinomial
.
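A minimal sketch of the word-topic conditional: each candidate topic is scored by the relevant topic prior times the probability of the observed word under that topic's language model, and the scores are normalized. The data layout (lists indexed by topic and word id) and the toy numbers are assumptions for illustration.

```python
def topic_conditional(word, c, eta_d, phi_d, theta):
    """Normalized conditional over topics for one word.

    c selects the topic prior: the keyphrase distribution eta_d if c == 1,
    otherwise the document topic distribution phi_d.
    """
    prior = eta_d if c == 1 else phi_d
    weights = [prior[k] * theta[k][word] for k in range(len(theta))]
    total = sum(weights)
    return [w / total for w in weights]


theta = [[0.8, 0.2], [0.2, 0.8]]          # two topics, two-word vocabulary
from_keyphrases = topic_conditional(0, 1, [0.9, 0.1], [0.5, 0.5], theta)
from_document = topic_conditional(0, 0, [0.9, 0.1], [0.5, 0.5], theta)
```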
Finally
,
we
sample
each
auxiliary
variable
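$c_{d,n}$
,
which
indicates
whether
the
hidden
topic
$z_{d,n}$
is
drawn
from
$\eta_d$
or
$\phi_d$
.
The
conditional
probability
for
$c_{d,n}$
depends
on
its
prior
$\lambda$
and
the
hidden
topic
assignments
$z_{d,n}$
:
$$p(c_{d,n} \mid \ldots) \propto p(c_{d,n} \mid \lambda)\, p(z_{d,n} \mid \eta_d, \phi_d, c_{d,n}) = \begin{cases} \mathrm{Bern}(c_{d,n}; \lambda)\, \mathrm{Mul}(z_{d,n}; \eta_d) & \text{if } c_{d,n} = 1 , \\ \mathrm{Bern}(c_{d,n}; \lambda)\, \mathrm{Mul}(z_{d,n}; \phi_d) & \text{otherwise} . \end{cases}$$
We
compute
the
likelihood
of
$c_{d,n} = 0$
and
$c_{d,n} = 1$
within
a
constant
of
proportionality
,
and
then
sample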
from
the
normalized
Bernoulli
distribution
.
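The normalized Bernoulli for the auxiliary variable reduces to a two-way comparison, sketched below under the assumption that the coin-flip probability is the chance of the indicator being one. The function name and the toy values are illustrative.

```python
def prob_c_is_one(z, lam, eta_d, phi_d):
    """Probability that the topic z was drawn from the keyphrase distribution.

    The two unnormalized scores are lam * eta_d[z] (indicator = 1) and
    (1 - lam) * phi_d[z] (indicator = 0).
    """
    from_keyphrases = lam * eta_d[z]
    from_document = (1 - lam) * phi_d[z]
    return from_keyphrases / (from_keyphrases + from_document)


# Topic 0 is strongly represented by the document's keyphrases.
p = prob_c_is_one(0, 0.5, [0.9, 0.1], [0.5, 0.5])
```

A new indicator value is then drawn as 1 with probability `p` and 0 otherwise.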
6
Experimental
Setup
Data
Sets
We
evaluate
our
system
on
reviews
from
two
categories
,
restaurants
and
cell
phones
.
These
reviews
were
downloaded
from
the
popular
Epinions
website
.
Users
of
this
website
evaluate
products
by
providing
both
a
textual
description
of
their
opinion
,
as
well
as
concise
lists
of
keyphrases
(
pros
and
cons
)
summarizing
the
review
.
The
statistics
of
this
dataset
are
provided
in
Table
1
.
For
each
of
the
categories
,
we
randomly
selected
50
%
,
15
%
,
and
35
%
of
the
documents
as
training
,
development
,
and
test
sets
,
respectively
.
Manual
analysis
of
this
data
reveals
that
authors
often
omit
properties
mentioned
in
the
text
from
the
list
of
keyphrases
.
[ Table 1 rows : Avg . review length ; Avg . keyphrases / review ]
Table 1 : Statistics of the reviews dataset by category .
To
obtain
a
complete
gold
standard
,
we
hand-annotated
a
subset
of
the
reviews
from
the
restaurant
category
.
The
annotation
effort
focused
on
eight
commonly
mentioned
properties
,
such
as
those
underlying
the
keyphrases
"
pleasant
atmosphere
"
and
"
attentive
staff
.
"
Two
raters
annotated
160
reviews
,
30
of
which
were
annotated
by
both
.
Cohen
's
kappa
,
a
measure
of
interrater
agreement
ranging
from
zero
to
one
,
was
0.78
for
this
subset
,
indicating
high
agreement
(
Cohen
,
1960
)
.
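For reference, Cohen's kappa compares observed agreement with the agreement expected by chance from each rater's label frequencies. The sketch below is a generic implementation for two raters over the same items; the toy labels are invented for illustration and are not the paper's annotation data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Toy example: raters agree on 3 of 4 binary judgments.
kappa = cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```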
Each
review
was
annotated
with
2.56
properties
on
average
.
Each
manually-annotated
property
corresponded
to
an
average
of
19.1
keyphrases
in
the
restaurant
data
,
and
6.7
keyphrases
in
the
cell
phone
data
.
This
supports
our
intuition
that
a
single
semantic
property
may
be
expressed
using
a
variety
of
different
keyphrases
.
Training
Our
model
needs
to
be
provided
with
the
number
of
clusters
K.
We
set
K
large
enough
for
the
model
to
learn
effectively
on
the
development
set
.
For
the
restaurant
data
—
where
the
gold
standard
identified
eight
semantic
properties
—
we
set
K
to
20
,
allowing
the
model
to
account
for
keyphrases
not
included
in
the
eight
most
common
properties
.
For
the
cell
phones
category
,
we
set
K
to
30
.
To
improve
the
model
's
convergence
rate
,
we
perform
two
initialization
steps
for
the
Gibbs
sampler
.
First
,
sampling
is
done
only
on
the
keyphrase
clustering
component
of
the
model
,
ignoring
document
text
.
Second
,
we
fix
this
clustering
and
sample
the
remaining
model
parameters
.
These
two
steps
are
run
for
5,000
iterations
each
.
The
full
joint
model
is
then
sampled
for
100,000
iterations
.
Inspection
of
the
parameter
estimates
confirms
model
convergence
.
On
a
2GHz
dual-core
desktop
machine
,
a
multi-threaded
C++
implementation
of
model
training
takes
about
two
hours
for
each
dataset
.
Inference
The
final
point
estimate
used
for
testing
is
an
average
(
for
continuous
variables
)
or
a
mode
(
for
discrete
variables
)
over
the
last
1,000
Gibbs
sampling
iterations
.
Averaging
is
a
heuristic
that
is
applicable
in
our
case
because
our
sample
histograms
are
unimodal
and
exhibit
low
skew
.
The
model
usually
works
equally
well
using
single-sample
estimates
,
but
is
more
prone
to
estimation
noise
.
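The point-estimate rule (mean for continuous variables, mode for discrete ones, over the retained samples) can be sketched as follows. The sample values are made up, and in practice the inputs would be the values of one variable across the last retained Gibbs iterations.

```python
from collections import Counter
from statistics import mean


def point_estimate_continuous(samples):
    """Average of the retained samples of a continuous variable."""
    return mean(samples)


def point_estimate_discrete(samples):
    """Most frequent value among the retained samples of a discrete variable."""
    return Counter(samples).most_common(1)[0][0]


lam_hat = point_estimate_continuous([0.2, 0.4, 0.6])   # e.g. a probability
x_hat = point_estimate_discrete([3, 1, 3, 2, 3, 1])    # e.g. a cluster id
```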
As
previously
mentioned
,
we
convert
word
topic
assignments
to
document
properties
by
examining
the
proportion
of
words
supporting
each
property
.
A
threshold
for
this
proportion
is
set
for
each
property
via
the
development
set
.
Evaluation
Our
first
evaluation
examines
the
accuracy
of
our
model
and
the
baselines
by
comparing
their
output
against
the
keyphrases
provided
by
the
review
authors
.
More
specifically
,
the
model
first
predicts
the
properties
supported
by
a
given
review
.
We
then
test
whether
the
original
authors
'
keyphrases
are
contained
in
the
clusters
associated
with
these
properties
.
As
noted
above
,
the
authors
'
keyphrases
are
often
incomplete
.
To
perform
a
noise-free
comparison
,
we
based
our
second
evaluation
on
the
manually
constructed
gold
standard
for
the
restaurant
category
.
We
took
the
most
commonly
observed
keyphrase
from
each
of
the
eight
annotated
properties
,
and
tested
whether
they
are
supported
by
the
model
based
on
the
document
text
.
In
both
types
of
evaluation
,
we
measure
the
model
's
performance
using
precision
,
recall
,
and
F-score
.
These
are
computed
in
the
standard
manner
,
based
on
the
model
's
keyphrase
predictions
compared
against
the
corresponding
references
.
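Precision, recall, and F-score over predicted versus reference keyphrases can be computed as in the sketch below. It treats both sides as sets for a single document; how scores are aggregated across documents is not specified here and is left out. The example keyphrases are invented.

```python
def precision_recall_f1(predicted, reference):
    """Set-based precision, recall and F-score for one document."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


p, r, f = precision_recall_f1(
    {"healthy", "good service", "cheap"},
    {"healthy", "good service", "clean", "quiet"},
)
```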
The
sign
test
was
used
for
statistical
significance
testing
(
De
Groot
and
Schervish
,
2001
)
.
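A two-sided sign test can be sketched with an exact binomial tail. The pairing unit (for example, per-document comparisons between two systems) and the handling of ties are assumptions here; ties are simply expected to be dropped before counting wins and losses.

```python
from math import comb


def sign_test_p_value(wins, losses):
    """Two-sided exact sign test p-value under a fair-coin null hypothesis."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# System A beats system B on 9 of 10 untied comparisons.
p_value = sign_test_p_value(9, 1)
```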
Baselines
To
the
best
of
our
knowledge
,
this
task
has
not
been
previously
addressed
in
the
literature
.
We
therefore
consider
five
baselines
that
allow
us
to
explore
the
properties
of
this
task
and
our
model
.
Random
:
Each
keyphrase
is
supported
by
a
document
with
probability
of
one
half
.
This
baseline
's
results
are
computed
(
in
expectation
)
rather
than
actually
run
.
This
method
is
expected
to
have
a
recall
of
0.5
,
because
in
expectation
it
will
select
half
of
the
correct
keyphrases
.
Its
precision
is
the
proportion
of
supported
keyphrases
in
the
test
set
.
Phrase
in
text
:
A
keyphrase
is
supported
by
a
document
if
it
appears
verbatim
in
the
text
.
Because
of
this
narrow
requirement
,
precision
should
be
high
whereas
recall
will
be
low
.
[ Table 2 column groups : gold standard annotation ; free-text annotation . Rows : Phrase in text ; Cluster in text ; Phrase classifier ; Cluster classifier ; Our model ; Our model + gold clusters ]
Table 2 : Comparison of the property predictions made by our model and the baselines in the two categories as evaluated against the gold and free-text annotations . Results for our model using the fixed , manually-created gold clusterings are also shown . The methods against which our model has significantly better results on the sign test are indicated with a * for p ≤ 0.05 , and o for p ≤ 0.1 .
Cluster
in
text
:
A
keyphrase
is
supported
by
a
document
if
it
or
any
of
its
paraphrases
appears
in
the
text
.
Paraphrasing
is
based
on
our
model
's
clustering
of
the
keyphrases
.
The
use
of
paraphrasing
information
enhances
recall
at
the
potential
cost
of
precision
,
depending
on
the
quality
of
the
clustering
.
Phrase
classifier
:
Discriminative
classifiers
are
trained
for
each
keyphrase
.
Positive
examples
are
documents
that
are
labeled
with
the
keyphrase
;
all
other
documents
are
negative
examples
.
A
keyphrase
is
supported
by
a
document
if
that
keyphrase
's
classifier
returns
positive
.
Cluster
classifier
:
Discriminative
classifiers
are
trained
for
each
cluster
of
keyphrases
,
using
our
model
's
clustering
.
Positive
examples
are
documents
that
are
labeled
with
any
keyphrase
from
the
cluster
;
all
other
documents
are
negative
examples
.
All
keyphrases
of
a
cluster
are
supported
by
a
document
if
that
cluster
's
classifier
returns
positive
.
Phrase
classifier
and
cluster
classifier
employ
maximum
entropy
classifiers
,
trained
on
the
same
features
as
our
model
,
i.e.
,
word
counts
.
The
former
is
high-precision
/
low-recall
,
because
for
any
particular
keyphrase
,
its
synonymous
keyphrases
would
be
considered
negative
examples
.
The
latter
broadens
the
positive
examples
,
which
should
improve
recall
.
We
used
Zhang
Le
's
MaxEnt
toolkit
to
build
these
classifiers
.
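The two string-matching baselines can be sketched in a few lines. Case-insensitive substring matching and the `cluster_of` dictionary (keyphrase to cluster id) are assumptions made for illustration; the example clustering and review text are invented.

```python
def phrase_in_text(keyphrase, text):
    """Supported only if the keyphrase appears verbatim in the text."""
    return keyphrase.lower() in text.lower()


def cluster_in_text(keyphrase, text, cluster_of):
    """Supported if the keyphrase or any phrase in its cluster appears in the text."""
    cluster = cluster_of[keyphrase]
    return any(
        phrase_in_text(phrase, text)
        for phrase, c in cluster_of.items()
        if c == cluster
    )


cluster_of = {"healthy": 0, "great nutritional value": 0, "a bit pricey": 1}
review = "They have some really great low calorie dishes and it is healthy."
verbatim = phrase_in_text("great nutritional value", review)
via_cluster = cluster_in_text("great nutritional value", review, cluster_of)
```

Here the verbatim baseline misses the keyphrase while the cluster baseline recovers it through the paraphrase "healthy", which illustrates the recall gain attributed to paraphrasing.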
7
Results
Comparative
performance
Table
2
presents
the
results
of
the
evaluation
scenarios
described
above
.
Our
model
outperforms
every
baseline
by
a
wide
margin
in
all
evaluations
.
The
absolute
performance
of
the
automatic
methods
indicates
the
difficulty
of
the
task
.
For
instance
,
evaluation
against
gold
standard
annotations
shows
that
the
random
baseline
outperforms
all
of
the
other
baselines
.
We
observe
similar
disappointing
results
for
the
non-random
baselines
against
the
free-text
annotations
.
The
precision
and
recall
characteristics
of
the
baselines
match
our
previously
described
expectations
.
The
poor
performance
of
the
discriminative
models
seems
surprising
at
first
.
However
,
these
results
can
be
explained
by
the
degree
of
noise
in
the
training
data
,
specifically
,
the
aforementioned
sparsity
of
free-text
annotations
.
As
previously
described
,
our
technique
allows
document
text
topics
to
stochastically
derive
from
either
the
keyphrases
or
a
background
distribution
—
this
allows
our
model
to
learn
effectively
from
incomplete
annotations
.
In
fact
,
when
we
force
all
text
topics
to
derive
from
keyphrase
clusters
in
our
model
,
its
performance
degrades
to
the
level
of
the
classifiers
or
worse
,
with
an
F-score
of
0.390
in
the
restaurant
category
and
0.171
in
the
cell
phone
category
.
Impact
of
paraphrasing
As
previously
observed
in
entailment
research
(
Dagan
et
al.
,
2006
)
,
paraphrasing
information
contributes
greatly
to
improved
performance
on
semantic
inference
.
This
is
confirmed
by
the
dramatic
difference
in
results
between
the
cluster
in
text
and
phrase
in
text
baselines
.
[ Figure 4 keyphrases , as extracted : small size compact size great size good size tiny size nice size sleek battery life short battery life poor battery life low battery life bad battery life so so battery life battery life could be better terrible battery life style cute nice design nice looking looks cool looks great good looks styling functionality clear calls ]
Figure 4 : Sample keyphrase clusters that our model infers in the cell phone category .
Therefore
it
is
important
to
quantify
the
quality
of
automatically
computed
paraphrases
,
such
as
those
illustrated
in
Figure
4
.
[ Table 3 columns : Restaurants ; Cell Phones . Rows : Keyphrase similarity only ; Joint training ]
Table 3 : Rand Index scores of our model 's clusters , using only keyphrase similarity vs. using keyphrases and text jointly . Comparison of cluster quality is against the gold standard .
One
way
to
assess
clustering
quality
is
to
compare
it
against
a
"
gold
standard
"
clustering
,
as
constructed
in
Section
6
.
For
this
purpose
,
we
use
the
Rand
Index
(
Rand
,
1971
)
,
a
measure
of
cluster
similarity
.
This
measure
varies
from
zero
to
one
;
higher
scores
are
better
.
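The Rand Index counts the fraction of item pairs on which two clusterings agree, that is, pairs placed together in both or apart in both. The sketch below is a direct pairwise implementation with invented toy labels; it is quadratic in the number of items, which is fine for keyphrase sets of modest size.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two clusterings agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# The clusterings disagree only on whether items 2 and 3 belong together.
score = rand_index([0, 0, 1, 1], [0, 0, 1, 2])
```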
Table
3
shows
the
Rand
Indices
for
our
model
's
clustering
,
as
well
as
the
clustering
obtained
by
using
only
keyphrase
similarity
.
These
scores
confirm
that
joint
inference
produces
better
clusters
than
using
only
keyphrases
.
Another
way
of
assessing
cluster
quality
is
to
consider
the
impact
of
using
the
gold
standard
clustering
instead
of
our
model
's
clustering
.
As
shown
in
the
last
two
lines
of
Table
2
,
using
the
gold
clustering
yields
results
worse
than
using
the
model
clustering
.
This
indicates
that
for
the
purposes
of
our
task
,
the
model
clustering
is
of
sufficient
quality
.
8
Conclusions
and
Future
Work
In
this
paper
,
we
have
shown
how
free-text
annotations
provided
by
novice
users
can
be
leveraged
as
a
training
set
for
document-level
semantic
inference
.
The
resulting
hierarchical
Bayesian
model
overcomes
the
lack
of
consistency
in
such
annotations
by
inducing
a
hidden
structure
of
semantic
properties
,
which
correspond
both
to
clusters
of
keyphrases
and
hidden
topic
models
in
the
text
.
Our
system
successfully
extracts
semantic
properties
of
unannotated
restaurant
and
cell
phone
reviews
,
empirically
validating
our
approach
.
Our
present
model
makes
strong
assumptions
about
the
independence
of
similarity
scores
.
We
believe
this
could
be
avoided
by
modeling
the
generation
of
the
entire
similarity
matrix
jointly
.
We
have
also
assumed
that
the
properties
themselves
are
unstructured
,
but
they
are
in
fact
related
in
interesting
ways
.
For
example
,
it
would
be
desirable
to
model
antonyms
explicitly
,
e.g.
,
no
restaurant
review
should
be
simultaneously
labeled
as
having
good
and
bad
food
.
The
correlated
topic
model
(
Blei
and
Lafferty
,
2006
)
is
one
way
to
account
for
relationships
between
hidden
topics
;
more
structured
representations
,
such
as
hierarchies
,
may
also
be
considered
.
Finally
,
the
core
idea
of
using
free-text
as
a
source
of
training
labels
has
wide
applicability
,
and
has
the
potential
to
enable
sophisticated
content
search
and
analysis
.
For
example
,
online
blog
entries
are
often
tagged
with
short
keyphrases
.
Our
technique
could
be
used
to
standardize
these
tags
,
and
assign
keyphrases
to
untagged
blogs
.
The
notion
of
free-text
annotations
is
also
very
broad
—
we
are
currently
exploring
the
applicability
of
this
model
to
Wikipedia
articles
,
using
section
titles
as
keyphrases
,
to
build
standard
article
schemas
.
Acknowledgments
The
authors
acknowledge
the
support
of
the
NSF
,
Quanta
Computer
,
the
U.S.
Office
of
Naval
Research
,
and
DARPA
.
Thanks
to
Michael
Collins
,
Dina
Katabi
,
Kristian
Kersting
,
Terry
Koo
,
Brian
Milch
,
Tahira
Naseem
,
Dan
Roy
,
Benjamin
Snyder
,
Luke
Zettlemoyer
,
and
the
anonymous
reviewers
for
helpful
comments
and
suggestions
.
Any
opinions
,
findings
,
and
conclusions
or
recommendations
expressed
above
are
those
of
the
authors
and
do
not
necessarily
reflect
the
views
of
the
NSF
.
