We present V-measure, an external entropy-based cluster evaluation measure. V-measure provides an elegant solution to many problems that affect previously defined cluster evaluation measures, including 1) dependence on clustering algorithm or data set, 2) the "problem of matching", where the clustering of only a portion of data points is evaluated, and 3) accurate evaluation and combination of two desirable aspects of clustering, homogeneity and completeness. We compare V-measure to a number of popular cluster evaluation measures and demonstrate that it satisfies several desirable properties of clustering solutions, using simulated clustering results. Finally, we use V-measure to evaluate two clustering tasks: document clustering and pitch accent type clustering.
1 Introduction

Clustering techniques have been used successfully for many natural language processing tasks, such as document clustering (Willett, 1988; Zamir and Etzioni, 1998; Cutting et al., 1992; Vempala and Wang, 2005), word sense disambiguation (Shin and Choi, 2004), semantic role labeling (Baldewein et al., 2004), and pitch accent type disambiguation (Levow, 2006). They are particularly appealing for tasks in which there is an abundance of language data available, but manual annotation of this data is very resource-intensive. Unsupervised clustering can eliminate the need for (full) manual annotation of the data into desired classes, but often at the cost of making evaluation of success more difficult.
External evaluation measures for clustering can be applied when class labels for each data point in some evaluation set can be determined a priori. The clustering task is then to assign these data points to any number of clusters such that each cluster contains all and only those data points that are members of the same class. Given the ground truth class labels, it is trivial to determine whether this perfect clustering has been achieved. However, evaluating how far from perfect an incorrect clustering solution is remains a more difficult task (Oakes, 1998), and proposed approaches often lack rigor (Meila, 2007). In this paper, we describe a new entropy-based external cluster evaluation measure, V-measure¹, designed to address the problem of quantifying such imperfection. Like all external measures, V-measure compares a target clustering (e.g., a manually annotated representative subset of the available data) against an automatically generated clustering to determine how similar the two are. We introduce two complementary concepts, completeness and homogeneity, to capture desirable properties in clustering tasks.
In Section 2, we describe V-measure and how it is calculated in terms of homogeneity and completeness. We describe several popular external cluster evaluation measures and draw some comparisons to V-measure in Section 3. In Section 4, we discuss how some desirable properties for clustering are satisfied by V-measure vs. other measures. In Section 5, we present two applications of V-measure, on document clustering and on pitch accent type clustering.
2 V-Measure and Its Calculation

V-measure is an entropy-based measure which explicitly measures how successfully the criteria of homogeneity and completeness have been satisfied. V-measure is computed as the harmonic mean of distinct homogeneity and completeness scores, just as precision and recall are commonly combined into F-measure (Van Rijsbergen, 1979). As F-measure scores can be weighted, V-measure can be weighted to favor the contributions of homogeneity or completeness.

¹ The 'V' stands for "validity", a common term used to describe the goodness of a clustering solution.

Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 410-420, Prague, June 2007. © 2007 Association for Computational Linguistics
For the purposes of the following discussion, assume a data set comprising N data points and two partitions of these: a set of classes, C = {c_i | i = 1, ..., n}, and a set of clusters, K = {k_j | j = 1, ..., m}. Let A be the contingency table produced by the clustering algorithm representing the clustering solution, such that A = {a_ij}, where a_ij is the number of data points that are members of class c_i and elements of cluster k_j.

To discuss cluster evaluation measures, we introduce two criteria for a clustering solution: homogeneity and completeness. A clustering result satisfies homogeneity if all of its clusters contain only data points which are members of a single class. A clustering result satisfies completeness if all the data points that are members of a given class are elements of the same cluster. The homogeneity and completeness of a clustering solution run roughly in opposition: increasing the homogeneity of a clustering solution often results in decreasing its completeness.
Consider two degenerate clustering solutions. In one, assigning every data point to a single cluster guarantees perfect completeness: all of the data points that are members of the same class are trivially elements of the same cluster. However, this cluster is as far from homogeneous as possible, since all classes are included in this single cluster. In another solution, assigning each data point to a distinct cluster guarantees perfect homogeneity: each cluster trivially contains only members of a single class. However, in terms of completeness, this solution scores very poorly, unless indeed each class contains only a single member. We define the distance from a perfect clustering as the weighted harmonic mean of measures of homogeneity and completeness.
Homogeneity: In order to satisfy our homogeneity criterion, a clustering must assign only those data points that are members of a single class to a single cluster. That is, the class distribution within each cluster should be skewed to a single class, that is, zero entropy. We determine how close a given clustering is to this ideal by examining the conditional entropy of the class distribution given the proposed clustering. In the perfectly homogeneous case, this value, H(C|K), is 0. However, in an imperfect situation, the size of this value, in bits, is dependent on the size of the data set and the distribution of class sizes. Therefore, instead of taking the raw conditional entropy, we normalize this value by the maximum reduction in entropy the clustering information could provide, specifically, H(C). Note that H(C|K) is maximal (and equals H(C)) when the clustering provides no new information: the class distribution within each cluster is equal to the overall class distribution. H(C|K) is 0 when each cluster contains only members of a single class, a perfectly homogeneous clustering. In the degenerate case where H(C) = 0, when there is only a single class, we define homogeneity to be 1. For a perfectly homogeneous solution, the normalized value H(C|K) / H(C) equals 0. Thus, to adhere to the convention of 1 being desirable and 0 undesirable, we define homogeneity as:

    h = 1                    if H(C) = 0
    h = 1 - H(C|K) / H(C)    otherwise

where

    H(C|K) = - Σ_k Σ_c (a_ck / N) log ( a_ck / Σ_c' a_c'k )
    H(C)   = - Σ_c ( (Σ_k a_ck) / N ) log ( (Σ_k a_ck) / N )
Completeness: Completeness is symmetrical to homogeneity. In order to satisfy the completeness criterion, a clustering must assign all of those data points that are members of a single class to a single cluster. To evaluate completeness, we examine the distribution of cluster assignments within each class. In a perfectly complete clustering solution, each of these distributions will be completely skewed to a single cluster. We can evaluate this degree of skew by calculating the conditional entropy of the proposed cluster distribution given the class of the component data points, H(K|C). In the perfectly complete case, H(K|C) = 0. However, in the worst case scenario, where each class is represented by every cluster with a distribution equal to the distribution of cluster sizes, H(K|C) is maximal and equals H(K). Finally, in the degenerate case where H(K) = 0, when there is a single cluster, we define completeness to be 1. Therefore, symmetric to the calculation above, we define completeness as:

    c = 1                    if H(K) = 0
    c = 1 - H(K|C) / H(K)    otherwise
Based upon these calculations of homogeneity and completeness, we then calculate a clustering solution's V-measure by computing the weighted harmonic mean of homogeneity and completeness:

    V_β = ( (1 + β) * h * c ) / ( (β * h) + c )

Similarly to the familiar F-measure, if β is greater than 1, completeness is weighted more strongly in the calculation; if β is less than 1, homogeneity is weighted more strongly.
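Putting the definitions together, h, c and V_β can be computed directly from the contingency table A = {a_ij}. The following Python sketch is ours (not the authors' code); it uses natural logarithms, and any log base gives the same ratios:

```python
import math

def v_measure(table, beta=1.0):
    """Homogeneity, completeness and V-measure from a contingency
    table: table[i][j] = number of points in class i and cluster j."""
    n = sum(sum(row) for row in table)
    class_totals = [sum(row) for row in table]            # per-class sizes
    cluster_totals = [sum(col) for col in zip(*table)]    # per-cluster sizes

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts if c)

    h_c, h_k = entropy(class_totals), entropy(cluster_totals)

    # Conditional entropies H(C|K) and H(K|C) from the joint counts.
    h_c_given_k = -sum((a / n) * math.log(a / cluster_totals[j])
                       for row in table for j, a in enumerate(row) if a)
    h_k_given_c = -sum((a / n) * math.log(a / class_totals[i])
                       for i, row in enumerate(table) for a in row if a)

    h = 1.0 if h_c == 0 else 1.0 - h_c_given_k / h_c   # homogeneity
    c = 1.0 if h_k == 0 else 1.0 - h_k_given_c / h_k   # completeness
    if beta * h + c == 0:
        return h, c, 0.0                               # both criteria failed
    return h, c, (1 + beta) * h * c / (beta * h + c)
```

A perfect clustering (each class in exactly one cluster, each cluster holding exactly one class) scores h = c = V = 1; the degenerate single-cluster solution from the discussion above scores h = 0, c = 1, V = 0.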
Notice that the computations of homogeneity, completeness and V-measure are completely independent of the number of classes, the number of clusters, the size of the data set and the clustering algorithm used. Thus these measures can be applied to and compared across any clustering solution, regardless of the number of data points (n-invariance), the number of classes or the number of clusters. Moreover, by calculating homogeneity and completeness separately, a more precise evaluation of the performance of the clustering can be obtained.
3 Existing Evaluation Measures

Clustering algorithms divide an input data set into a number of partitions, or clusters. For tasks where some target partition can be defined for testing purposes, we define a "clustering solution" as a mapping from each data point to its cluster assignments in both the target and hypothesized clustering. In the context of this discussion, we will refer to the target partitions, or clusters, as CLASSES, referring only to hypothesized clusters as CLUSTERS.
Two commonly used external measures for assessing clustering success are Purity and Entropy (Zhao and Karypis, 2001), defined as:

    Purity  = (1 / n) Σ_{r=1..k} max_i ( n_r^i )
    Entropy = Σ_{r=1..k} (n_r / n) ( - (1 / log q) Σ_{i=1..q} (n_r^i / n_r) log (n_r^i / n_r) )

where q is the number of classes, k the number of clusters, n_r is the size of cluster r, and n_r^i is the number of data points in class i clustered in cluster r. Both these approaches represent plausible ways to evaluate the homogeneity of a clustering solution. However, our completeness criterion is not measured at all. That is, they do not address the question of whether all members of a given class are included in a single cluster. Therefore the Purity and Entropy measures are likely to improve (increased Purity, decreased Entropy) monotonically with the number of clusters in the result, up to a degenerate maximum where there are as many clusters as data points. However, clustering solutions rated high by either measure may still be far from ideal.
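This degenerate maximum is easy to see concretely. The following worked example is ours, not from the paper: a mixed two-cluster solution over ten points scores a Purity of 0.6, while splitting the same ten points into singleton clusters scores a perfect 1.0 despite being maximally incomplete:

```python
def purity(table):
    """Purity: fraction of data points belonging to the majority
    class of their cluster. table[i][j] = count of class i in cluster j."""
    n = sum(sum(row) for row in table)
    # For each cluster (column), credit only its largest single-class count.
    return sum(max(col) for col in zip(*table)) / n

# Two classes of 5 points in two mixed clusters: purity = (3 + 3) / 10 = 0.6.
mixed = [[3, 2], [2, 3]]
# The same 10 points, one singleton cluster each: purity = 10 / 10 = 1.0.
singletons = [[1] * 5 + [0] * 5, [0] * 5 + [1] * 5]
```

Because every singleton cluster is trivially pure, nothing in the measure penalizes the complete loss of completeness.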
Another frequently used external clustering evaluation measure is commonly referred to as "clustering accuracy". The calculation of this accuracy is inspired by the information retrieval metric of F-measure (Van Rijsbergen, 1979). The formula for this clustering F-measure as described in (Fung et al., 2003) is shown in Figure 1. Let N be the number of data points, C the set of classes, K the set of clusters and n_ij be the number of members of class c_i ∈ C that are elements of cluster k_j ∈ K.

Figure 1: Calculation of clustering F-measure
This measure has a significant advantage over Purity and Entropy, in that it does measure both the homogeneity and the completeness of a clustering solution. Recall is calculated as the portion of items from class i that are present in cluster j, thus measuring how complete cluster j is with respect to class i. Similarly, Precision is calculated as the portion of cluster j that is a member of class i, thus measuring how homogeneous cluster j is with respect to class i.

Figure 2: Examples of the Problem of Matching (Solutions A, B, C and D)
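Since the formula itself did not survive in this copy, the clustering F-measure can be reconstructed from the description above: each class is matched to the cluster maximizing its per-pair F score, and the per-class scores are combined weighted by class size. The following sketch is our reconstruction of the measure described in (Fung et al., 2003):

```python
def clustering_f_measure(table):
    """Clustering F-measure: match each class to its best cluster and
    average, weighted by class size. table[i][j] = count of class i in
    cluster j."""
    n = sum(sum(row) for row in table)
    cluster_sizes = [sum(col) for col in zip(*table)]
    total = 0.0
    for row in table:
        class_size = sum(row)
        best = 0.0
        for j, a in enumerate(row):
            if a == 0:
                continue
            recall = a / class_size          # completeness of cluster j w.r.t. this class
            precision = a / cluster_sizes[j]  # homogeneity of cluster j w.r.t. this class
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (class_size / n) * best
    return total
```

Applied to a contingency table in which each matched cluster holds 3 of the 5 members of its class (as in solutions A and B of Figure 2), this reconstruction yields 0.6.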
Like some other external cluster evaluation techniques (misclassification index (MI) (Zeng et al., 1999), D (van Dongen, 2000), micro-averaged precision and recall (Dhillon et al., 2003)), F-measure relies on a post-processing step in which each cluster is assigned to a class. These techniques share certain problems. First, they calculate the goodness not only of the given clustering solution, but also of the cluster-class matching. Therefore, in order for the goodness of two clustering solutions to be compared using one of these measures, an identical post-processing algorithm must be used. This problem can be trivially addressed by fixing the class-cluster matching function and including it in the definition of the measure, as in H. However, a second and more critical problem is the "problem of matching" (Meila, 2007).
In calculating the similarity between a hypothesized clustering and a 'true' clustering, these measures only consider the contributions from those clusters that are matched to a target class. This is a major problem, as two significantly different clusterings can result in identical scores. In Figure 2, we present some illustrative examples of the problem of matching. For the purposes of this discussion we will be using F-measure as the measure to describe the problem of matching; however, these problems affect any measure which requires a mapping from clusters to classes for evaluation. In the figures, the shaded regions represent clusters and the shapes represent classes. In a perfect clustering, each shaded region would contain all and only the same shapes. The problem of matching can manifest itself either by not evaluating the entire membership of a cluster, or by not evaluating every cluster.
The former situation is presented in solutions A and B in Figure 2. The F-measure of both of these clustering solutions is 0.6. (The precision and recall for each class is 3/5.) That is, for each class, the best or "matched" cluster contains 3 of 5 elements of the class (Recall) and 3 of 5 elements of the cluster are members of the class (Precision). The makeup of the clusters beyond the majority class is not evaluated by F-measure. Solution B is a better clustering solution than solution A, in terms of both homogeneity (crudely, "each cluster contains fewer² classes") and completeness ("each class is contained in fewer clusters"). Indeed, the V-measure of solution B (0.387) is greater than that of solution A (0.135).
Solutions C and D represent a case in which not every cluster is considered in the evaluation of F-measure. In this example, the F-measure of both solutions is 0.5 (the harmonic mean of the matched clusters' precision and recall). The small "unmatched" clusters are not measured at all in the calculation of F-measure. Solution D is a better clustering than solution C: there are no incorrect clusterings of different classes in the small clusters. V-measure reflects this; solution C has a V-measure of 0.30 while the V-measure of solution D is 0.41.
A second class of clustering evaluation techniques is based on a combinatorial approach which examines the number of pairs of data points that are clustered similarly in the target and hypothesized clustering. That is, each pair of points can either be 1) clustered together in both clusterings (N11), 2) clustered separately in both clusterings (N00), 3) clustered together in the hypothesized but not the target clustering (N01), or 4) clustered together in the target but not in the hypothesized clustering (N10).
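The four pair counts can be accumulated directly from the two labelings. The following sketch is ours, shown together with the Rand Index, one of the measures built on these counts:

```python
from itertools import combinations

def pair_counts(classes, clusters):
    """Count the four pair types over all pairs of data points.
    classes[p] / clusters[p]: target / hypothesized label of point p."""
    n11 = n00 = n01 = n10 = 0
    for p, q in combinations(range(len(classes)), 2):
        same_class = classes[p] == classes[q]
        same_cluster = clusters[p] == clusters[q]
        if same_class and same_cluster:
            n11 += 1       # together in both clusterings
        elif not same_class and not same_cluster:
            n00 += 1       # separate in both clusterings
        elif same_cluster:
            n01 += 1       # together only in the hypothesized clustering
        else:
            n10 += 1       # together only in the target clustering
    return n11, n00, n01, n10

def rand_index(classes, clusters):
    """Fraction of pairs treated consistently by the two clusterings."""
    n11, n00, n01, n10 = pair_counts(classes, clusters)
    return (n11 + n00) / (n11 + n00 + n01 + n10)
```

Identical partitions (up to cluster renaming) give a Rand Index of 1; disagreement on every within-class pair pulls the value down toward the proportion of consistently separated pairs.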
Based on these 4 values, a number of measures have been proposed, including the Rand Index (Rand, 1971), which can be interpreted as the probability that a pair of points is clustered similarly (together or separately) in C and K.

² Homogeneity is not measured by V-measure as a count of the number of classes contained by a cluster, but "fewer" is an acceptable way to conceptualize this criterion for the purposes of these examples.
Meila (2007) describes a number of potential problems with this class of measures posed by (Fowlkes and Mallows, 1983) and (Wallace, 1983). The most basic is that these measures tend not to vary over the interval of [0,1]. Transformations like those applied by the adjusted Rand Index and a minor adjustment to the Mirkin measure (see Section 4) can address this problem. However, pair matching measures also suffer from distributional problems. The baseline for Fowlkes-Mallows varies significantly, between 0.6 and 0, when the ratio of data points to clusters is greater than 3, thus including nearly all real-world clustering problems. Similarly, the Adjusted Rand Index, as demonstrated using Monte Carlo simulations in (Fowlkes and Mallows, 1983), varies from 0.5 to 0.95. This variance in the measure's baseline prompts Meila to ask whether the assumption of linearity following normalization can be maintained: if the behavior of the measure is so unstable before normalization, can users reasonably expect stable behavior following normalization?
A final class of cluster evaluation measures is based on information theory. These measures analyze the distribution of class and cluster membership in order to determine how successful a given clustering solution is, or how different two partitions of a data set are. We have already examined one member of this class of measures, Entropy. From a coding theory perspective, Entropy is the weighted average of the code lengths of each cluster. Our V-measure is a member of this class of clustering measures. One significant advantage that information theoretic evaluation measures have is that they provide an elegant solution to the "problem of matching". By examining the relative sizes of the classes and clusters being evaluated, these measures evaluate the entire membership of each cluster, not just a 'matched' portion.
Dom (2001) proposes a measure, Q0, which uses the conditional entropy, H(C|K), to calculate the goodness of a clustering solution. That is, given the hypothesized partition, what is the number of bits necessary to represent the true clustering? However, this term, like the Purity and Entropy measures, only evaluates the homogeneity of a solution. To measure the completeness of the hypothesized clustering, Dom includes a model cost term calculated using a coding theory argument. The overall clustering quality measure presented is the sum of the costs of representing the data (H(C|K)) and the model. The motivation for this approach is an appeal to parsimony: given identical conditional entropies, H(C|K), the clustering solution with the fewest clusters should be preferred.
Dom also presents a normalized version of this term, Q2, which has a range of (0,1], with greater scores representing more preferred clusterings. In its definition, C is the target partition, K is the hypothesized partition and h(k) is the size of cluster k. We believe that V-measure provides two significant advantages over Q0 that make it a more useful diagnostic tool.
First, Q0 does not explicitly calculate the degree of completeness of the clustering solution. The cost term captures some of this information, since a partition with fewer clusters is likely to be more complete than a clustering solution with more clusters. However, Q0 does not explicitly address the interaction between the conditional entropy and the cost of representing the model. While this is an application of the minimum description length (MDL) principle (Rissanen, 1978; Rissanen, 1989), it does not provide an intuitive manner for assessing our two competing criteria of homogeneity and completeness. That is, at what point does an increase in conditional entropy (homogeneity) justify a reduction in the number of clusters (completeness)?
Meila (2007) presents the Variation of Information (VI) as a distance measure for comparing partitions (or clusterings) of the same data. It therefore does not distinguish between hypothesized and target clusterings. VI has a number of useful properties. First, it satisfies the metric axioms. This quality allows users to intuitively understand how VI values combine and relate to one another. Secondly, it is "convexly additive". That is to say, if a cluster is split, the distance from the new cluster to the original is the distance induced by the split times the size of the cluster.
This property guarantees that all changes to the metric are "local": the impact of splitting or merging clusters is limited to only those clusters involved, and its size is relative to the size of these clusters. Third, VI is n-invariant: the number of data points in the cluster does not affect the value of the measure. VI depends on the relative sizes of the partitions of C and K, not on the number of points in these partitions.
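For reference, VI can be written as the sum of the two conditional entropies that V-measure normalizes separately, VI(C, K) = H(C|K) + H(K|C). A minimal sketch (ours) computing it from two label sequences over the same points:

```python
import math
from collections import Counter

def variation_of_information(classes, clusters):
    """VI(C, K) = H(C|K) + H(K|C), from two labelings of the same points."""
    n = len(classes)
    joint = Counter(zip(classes, clusters))   # joint class/cluster counts
    class_sizes = Counter(classes)
    cluster_sizes = Counter(clusters)
    h_c_given_k = -sum((a / n) * math.log(a / cluster_sizes[k])
                       for (c, k), a in joint.items())
    h_k_given_c = -sum((a / n) * math.log(a / class_sizes[c])
                       for (c, k), a in joint.items())
    return h_c_given_k + h_k_given_c
```

Identical partitions (even under renamed labels) are at distance 0; collapsing two equal classes into one cluster costs exactly log 2 nats, independent of how many points are involved, which illustrates the n-invariance just described.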
However, VI is bounded by the maximum number of clusters in C or K, k*. Without manual modification, however, k* = n, where each cluster contains only a single data point. Thus, while technically n-invariant, the possible values of VI are heavily dependent on the number of data points being clustered. It is therefore difficult to compare VI values across data sets and clustering algorithms without fixing k*, as VI will vary over different ranges. It is trivial to modify VI such that it varies over [0,1]: normalizing VI by log n or (1/2) log k* guarantees this range.
However, Meila (2007) raises two potential problems with this modification. The normalization should not be applied if data sets of different sizes are to be compared, as it negates the n-invariance of the measure. Additionally, if two authors apply the latter normalization and do not use the same value for k*, their results will not be comparable. While VI has a number of very useful distance properties when analyzing a single data set across a number of settings, it has limited utility as a general purpose clustering evaluation metric for use across disparate clusterings of disparate data sets.
Our homogeneity (h) and completeness (c) terms both range over [0,1] and are completely n-invariant and k*-invariant. Furthermore, measuring each as a ratio of bit lengths has greater intuitive appeal than a more opportunistic normalization. V-measure has another advantage as a clustering evaluation measure over VI and Q0. By evaluating homogeneity and completeness in a symmetrical, complementary manner, the calculation of V-measure makes their relationship clearly observable. Separate analyses of homogeneity and completeness are not possible with any other cluster evaluation measure. Moreover, by using the harmonic mean to combine homogeneity and completeness, V-measure is unique in that it can also prioritize one criterion over another, depending on the clustering task and goals.
4 Comparing Evaluation Measures

Dom (2001) describes a parametric technique for generating example clustering solutions. He then proceeds to define five "desirable properties" that clustering accuracy measures should display, based on the parameters used to generate the clustering solution. To compare V-measure more directly to alternative clustering measures, we evaluate V-measure and other measures against these and two additional desirable properties. The parameters used in generating a clustering solution are as follows.
• ε: error probability; ε = ε1 + ε2 + ε3.
• ε1: the error mass within "useful" class-cluster pairs
• ε2: the error mass within noise clusters
• ε3: the error mass within noise classes

The construction of a clustering solution begins with a matching of "useful" clusters to "useful" classes³. There are |Ku| = |K| − |Knoise| "useful" clusters and |Cu| = |C| − |Cnoise| "useful" classes. The claim is that useful classes and clusters are matched to each other and matched pairs contain more data points than unmatched pairs.
Probability mass of 1 − ε is evenly distributed across each match. Error mass of ε1 is evenly distributed across each pair of non-matching useful class/cluster pairs. Noise clusters are those that contain data points equally from each class. Error mass of ε2 is distributed across every noise-cluster/useful-class pair. We extend the parameterization technique described in (Dom, 2001) with |Cnoise| and ε3. Noise classes are those that contain data points equally from each cluster. Error mass of ε3 is distributed across every useful-cluster/noise-class pair.

³ The operation of this matching is omitted in the interest of space. Interested readers should see (Dom, 2001).
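Under the simplifying assumption of a one-to-one matching between equal numbers of useful classes and useful clusters (the full matching procedure is in (Dom, 2001)), the generation scheme above can be sketched as a class-by-cluster probability-mass table. The function name and layout are ours:

```python
def parametric_solution(n_useful, n_noise_clusters, n_noise_classes, e1, e2, e3):
    """Probability-mass table table[class][cluster] following the
    generation scheme: useful classes/clusters first, noise last."""
    n_classes = n_useful + n_noise_classes
    n_clusters = n_useful + n_noise_clusters
    table = [[0.0] * n_clusters for _ in range(n_classes)]
    e = e1 + e2 + e3
    # Mass 1 - e spread evenly over the matched (diagonal) useful pairs.
    for i in range(n_useful):
        table[i][i] += (1 - e) / n_useful
    # e1 over non-matching useful-class/useful-cluster pairs.
    for i in range(n_useful):
        for j in range(n_useful):
            if i != j:
                table[i][j] += e1 / (n_useful * (n_useful - 1))
    # e2 over every noise-cluster/useful-class pair.
    for i in range(n_useful):
        for j in range(n_useful, n_clusters):
            table[i][j] += e2 / (n_useful * n_noise_clusters)
    # e3 over every useful-cluster/noise-class pair.
    for i in range(n_useful, n_classes):
        for j in range(n_useful):
            table[i][j] += e3 / (n_useful * n_noise_classes)
    return table
```

All mass sums to 1, and noise-class/noise-cluster cells stay empty, since ε2 and ε3 are defined only against useful classes and clusters respectively.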
An example solution, along with its generating parameters, is given in Figure 3.

Figure 3: Sample parametric clustering solution

The desirable properties proposed by Dom are given as P1-P5 in Table 1. We include two additional properties (P6, P7) relating the examined measure value to the number of 'noise' classes and ε3.

Table 1: Desirable properties of a cluster evaluation measure M
To evaluate how different clustering measures satisfy each of these properties, we systematically varied each parameter, keeping |C| = 5 fixed. We evaluated the behavior of V-measure, Rand, Mirkin, Fowlkes-Mallows, Gamma, Jaccard, VI, Q0 and F-measure against the desirable properties P1-P7⁴. Based on the described systematic modification of each parameter, only V-measure, VI and Q0 empirically satisfy all of P1-P7 in all experimental conditions. Full results reporting how frequently each evaluated measure satisfied the properties based on these experiments can be found in Table 2.
All evaluated measures satisfy P4 and P7. However, Rand, Mirkin, Fowlkes-Mallows, Gamma, Jaccard and F-measure all fail to satisfy P3 and P6 in at least one experimental configuration. This indicates that the number of 'noise' classes or clusters can be increased without reducing any of these measures. This implies a computational obliviousness to potentially significant aspects of an evaluated clustering solution.
5 Applying V-measure

In this section, we present two clustering experiments. We describe a document clustering experiment and evaluate its results using V-measure, highlighting the interaction between homogeneity and completeness. Second, we present a pitch accent type clustering experiment. We present results from both of these experiments in order to show how V-measure can be used to draw comparisons across data sets.
5.1 Document Clustering

Clustering techniques have been used widely to sort documents into topic clusters. We reproduce such an experiment here to demonstrate the usefulness of V-measure. Using a subset of the TDT-4 corpus (Strassel and Glenn, 2003) (1884 English newswire and broadcast news documents manually labeled with one of 12 topics), we ran clustering experiments using k-means clustering (MacQueen, 1967) and evaluated the results using V-measure, VI and Q0, those measures that satisfied the desirable properties defined in Section 4.
The topics and relative distributions are as follows: Acts … Discovery (1.4%).

⁴ The inequalities in the desirable properties are inverted in the evaluation of VI, Q0 and Mirkin as they are defined as distance, as opposed to similarity, measures.

Table 2: Rates of satisfaction of desirable properties
We employed stemmed (Porter, 1980), tf*idf-weighted term vectors extracted for each document as the clustering space for these experiments, which yielded a very high dimension space. To reduce this dimensionality, we performed a simple feature selection procedure, including in the feature vector only those terms that represented the highest tf*idf value for at least one data point. This resulted in a feature vector containing 484 tf*idf values for each document.
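The selection step can be sketched as follows. This is a schematic reimplementation under our own data layout (one term-to-weight dict per document), not the original experimental code: a term survives only if it is the top-weighted term of at least one document, and every document is then projected onto the reduced vocabulary:

```python
def select_max_features(tfidf):
    """Keep only terms that carry the highest tf*idf weight for at
    least one document. tfidf: list of dicts, one per document,
    mapping term -> tf*idf weight."""
    keep = set()
    for doc in tfidf:
        if doc:
            keep.add(max(doc, key=doc.get))  # this document's top term
    vocab = sorted(keep)
    # Project every document onto the reduced vocabulary (0.0 if absent).
    vectors = [[doc.get(t, 0.0) for t in vocab] for doc in tfidf]
    return vectors, vocab
```

With N documents the reduced vocabulary has at most N terms, which is how a very high-dimensional term space collapses to a few hundred dimensions.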
Results from k-means clustering are shown in Figure 4.

Figure 4: Results of document clustering measured by V-measure, VI and Q2 (x-axis: number of clusters)

The first observation that can be drawn from these results is the degree to which VI is dependent on the number of clusters (k). This dependency severely limits the usefulness of VI: it is inappropriate for selecting an appropriate parameter for k, or for evaluating the distance between clustering solutions generated using different values of k. V-measure and Q2 demonstrate similar behavior in evaluating these experimental results. They both reach a maximal value with 35 clusters; however, Q2 shows a greater descent as the number of clusters increases. We will discuss this quality in greater detail in Section 5.2.
5.2 Pitch Accent Clustering

Pitch accent is how speakers of many languages make a word intonationally prominent. In most pitch accent languages, words can also be accented in different ways to convey different meanings (Hirschberg, 2002). In the ToBI labeling conventions for Standard American English (Silverman et al., 1992), for example, there are five different accent types (H*, L*, H+!H*, L+H*, L*+H)⁵ … (2.1%).
We extracted ten acoustic features from each accented word to serve as the clustering space for this experiment. Using Praat's (Boersma, 2001) Get Pitch (ac)... function, we calculated the mean F0 and ΔF0, as well as z-score speaker-normalized versions of the same. We included in the feature vector the relative location of the maximum pitch value in the word, as well as the distance between this maximum and the point of maximum intensity.

⁵ Pitch accents containing a high tone may also be downstepped, or spoken in a compressed pitch range. Here we collapsed all downstepped instances of each pitch accent with the corresponding non-downstepped instances.
Finally, we calculated the raw and speaker-normalized slope from the start of the word to the maximum pitch, and from the maximum pitch to the end of the word. Using this feature vector, we performed k-means clustering and evaluated how successfully these dimensions represent differences between pitch accent types. The resulting V-measure, VI and Q0 calculations are shown in Figure 5.

Figure 5: Results of pitch accent clustering measured by V-measure, VI and Q0
In evaluating the results from these experiments, Q2 and V-measure reveal considerably different behaviors. Q2 shows a maximum at k = 10, and descends as k increases. This is an artifact of the MDL principle. Q2 makes the claim that a clustering solution based on fewer clusters is preferable to one using more clusters, and that the balance between the number of clusters and the conditional entropy, H(C|K), should be measured in terms of coding length.
With V-measure, we present a different argument. We contend that a high value of k does not inherently reduce the goodness of a clustering solution. Using these results as an example, we find that at approximately 30 clusters an increase of clusters translates to an increase in V-measure. This is due to an increased homogeneity (1 − H(C|K)/H(C)) and a relatively stable completeness (1 − H(K|C)/H(K)). That is, inclusion of more clusters leads to clusters with a more skewed within-cluster distribution and an equivalent distribution of cluster memberships within classes. This is intuitively preferable, since one criterion is improved and the other is not reduced, despite requiring additional clusters. This is an instance in which the MDL principle limits the usefulness of Q2.
We again (see Section 5.1) observe the close dependency of VI on k. Moreover, in considering Figures 5 and 4 simultaneously, we see considerably higher values achieved by the document clustering experiments. Given the naive approaches taken in these experiments, this is expected, and even desired, given the previous work on these tasks: document clustering has been notably more successfully applied than pitch accent clustering. These examples allow us to observe how transparently V-measure can be used to compare behavior across distinct data sets.
6 Conclusion

We have presented a new external cluster evaluation measure, V-measure, and compared it with existing clustering evaluation measures. V-measure is based upon two criteria for clustering usefulness, homogeneity and completeness, which capture a clustering solution's success in including all and only data points from a given class in a given cluster. We have also demonstrated V-measure's usefulness in comparing clustering success across different domains by evaluating document and pitch accent clustering solutions. We believe that V-measure addresses some of the problems that affect other cluster measures. 1) It evaluates a clustering solution independent of the clustering algorithm, size of the data set, number of classes and number of clusters. 2) It does not require its user to map each cluster to a class. Therefore, it only evaluates the quality of the clustering, not a post-hoc class-cluster mapping. 3) It evaluates the clustering of every data point, avoiding the "problem of matching". 4) By evaluating the criteria of both homogeneity and completeness, V-measure is more comprehensive than those measures that evaluate only one. 5) Moreover, by evaluating these criteria separately and explicitly, V-measure can serve as an elegant diagnostic tool, providing greater insight into clustering behavior.
Acknowledgments

The authors thank Kapil Thadani, Martin Jansche and Sasha Blair-Goldensohn for their feedback. This work was funded in part by the DARPA GALE program under a subcontract to SRI International.
