In this paper, we propose a novel probabilistic generative model for documents with explicit multiple topics: the Parametric Dirichlet Mixture Model (PDMM). PDMM extends an existing probabilistic generative model, the Parametric Mixture Model (PMM), into a hierarchical Bayesian model. PMM models multiple-topic documents by mixing the model parameters of each single topic with equal mixture ratios. PDMM instead mixes the model parameters of each single topic with mixture ratios that follow a Dirichlet distribution. We evaluate PDMM against PMM by comparing F-measures on the MEDLINE corpus. The evaluation shows that PDMM is more effective than PMM.
1 Introduction
Documents, such as those seen on Wikipedia and folksonomies, increasingly tend to be explicitly assigned multiple topics. In this situation, it is important to analyze the linguistic relationship between documents and their assigned multiple topics. We attempt to model this relationship with a probabilistic generative model.
A probabilistic generative model for documents with multiple topics is a probability model of the process that generates documents together with their multiple topics. By modeling this generation process, we can extract specific properties of documents and of their assigned topics. The model can also be applied to a wide range of tasks, such as automatic categorization into multiple topics, keyword extraction, and measuring document similarity.
Probabilistic generative models for documents with multiple topics fall into the following two classes. One class treats a topic as a latent topic; we call such a model a latent-topic model. The other class treats a topic as an explicit topic; we call such a model an explicit-topic model.
In a latent-topic model, a latent topic is not a concrete topic but an underlying, implicit topic of documents. Such a model therefore uses an unsupervised learning algorithm. Representative examples of this kind of model are Latent Dirichlet Allocation (LDA) (D.M. Blei et al., 2001; D.M. Blei et al., 2003) and the Hierarchical Dirichlet Process (HDP) (Y.W. Teh et al., 2003).
In an explicit-topic model, an explicit topic is a concrete topic such as economy or sports. The learning algorithm for this model is supervised; that is, an explicit-topic model learns its model parameters from a training set of (document, topics) tuples. Representative examples of this class are the Parametric Mixture Models, PMM1 and PMM2. In the remainder of this paper, PMM indicates PMM1, because PMM1 is more effective than PMM2.
In this paper, we focus on the explicit-topic model. In particular, we propose a novel model that is based on PMM but fundamentally improves it.
The remainder of this paper is organized as follows. Section 2 explains the terminology used in the following sections. Section 3 explains PMM, which is most directly related to our work. Section 4 points out the problem of PMM and introduces our new model. Section 5 evaluates our new model. Section 6 summarizes our work.

Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 421-429, Prague, June 2007. © 2007 Association for Computational Linguistics
2 Terminology

This section explains the terminology used in this paper.
K is the number of explicit topics. V is the number of words in the vocabulary. 𝒱 = {1, 2, ..., V} is the set of vocabulary indices. 𝒴 = {1, 2, ..., K} is the set of topic indices. N is the number of words in a document. w = (w_1, w_2, ..., w_N) is a sequence of N words, where w_n denotes the nth word in the sequence; w is a document itself and is called the words vector. x = (x_1, x_2, ..., x_V) is a word-frequency vector, that is, the BOW (Bag Of Words) representation, where x_v denotes the frequency of word v. w_n^v takes a value of 1 (0) when w_n is (is not) word v ∈ 𝒱. y = (y_1, y_2, ..., y_K) is a topic vector into which a document w is categorized, where y_i takes a value of 1 (0) when the ith topic is (not) assigned to the document w. I_y ⊂ 𝒴 is the set of topic indices i for which y_i takes a value of 1 in y. Σ_{i∈I_y} and Π_{i∈I_y} denote the sum and the product over all i in I_y, respectively. Γ(x) is the Gamma function and Ψ(x) is the Psi function (Minka, 2002).

A probabilistic generative model for documents with multiple topics models the probability of generating a document w under multiple topics y using model parameters θ, i.e., it models P(w | y, θ). The multiple categorization problem is to estimate the multiple topics y* of a document w* whose topics are unknown. The model parameters are learned from documents D = {(w_d, y_d)}_{d=1}^{M}, where M is the number of documents.
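The BOW representation above is a simple frequency count. As a minimal sketch (the function name `bow_vector` is ours, not the paper's; indices are 0-based here rather than the paper's 1-based convention):

```python
# Sketch: building the word-frequency (BOW) vector x from a word sequence w.
# Each position v of x holds the frequency of vocabulary word v in the document.

def bow_vector(words, V):
    """Return x = (x_1, ..., x_V), where x_v is the frequency of word v."""
    x = [0] * V
    for w in words:
        x[w] += 1
    return x

# A document of N = 5 words over a vocabulary of V = 4 words:
w = [0, 2, 2, 3, 0]
x = bow_vector(w, 4)
print(x)  # [2, 0, 2, 1]
```

Note that the BOW vector discards word order, which is exactly the assumption PMM and PDMM make below.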
3 Parametric Mixture Model
In this section, we briefly explain the Parametric Mixture Model (PMM) (Ueda, N. and Saito, K., 2002a; Ueda, N. and Saito, K., 2002b). PMM models multiple-topic documents by mixing the model parameters of each single topic with equal mixture ratios, where model parameter θ_iv is the probability that word v is generated from topic i. This is because it is impractical to use model parameters corresponding to multiple topics directly, whose number is 2^K − 1 (all combinations of K topics). PMM achieved better results than machine learning methods such as Naive Bayes, SVM, k-NN and neural networks (Ueda, N. and Saito, K., 2002a; Ueda, N. and Saito, K., 2002b). PMM employs a BOW representation and is formulated as follows.
h_i(y) is the mixture ratio corresponding to topic i and is formulated as follows:

h_i(y) = y_i / Σ_{j=1}^{K} y_j,   so that Σ_{i=1}^{K} h_i(y) = 1.
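The equal mixture ratio h_i(y) can be sketched directly from its definition (a minimal illustration; the function name is ours):

```python
def mixture_ratio(y):
    """h_i(y) = y_i / sum_j y_j: equal weight for every assigned topic, 0 otherwise."""
    total = sum(y)
    return [yi / total for yi in y]

# Topic vector y with topics 1, 3 and 4 assigned (K = 4):
h = mixture_ratio([1, 0, 1, 1])
print(h)  # each assigned topic gets ratio 1/3, the unassigned topic gets 0
```

This makes concrete why PMM cannot prefer one assigned topic over another: every assigned topic receives exactly the same weight.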
3.3 Learning Algorithm of Model Parameter
The learning algorithm for model parameter θ in PMM is an iterative method similar to the EM algorithm. Model parameter θ is estimated by maximizing Π_{d=1}^{M} P(w_d | y_d, θ) over the training documents D = {(w_d, y_d)}_{d=1}^{M}. A function g corresponding to a document d is introduced as follows. The parameters are updated according to the following formula. x_dv is the frequency of word v in document d. C is the normalization term ensuring Σ_v θ_iv = 1. ξ is a smoothing parameter that yields Laplace smoothing when set to two. In this paper, ξ is set to two, following the original paper.
4 Proposed Model
In this section, we first describe the problem with PMM. Then we explain our solution to the problem by proposing a new model. PMM estimates model parameter θ by assuming that all mixture ratios of single topics are equal. It is our intuition that a document can sometimes be weighted more heavily toward some topics than toward the rest of its assigned topics. If the topic weightings are averaged over all biases in the whole document set, they could cancel out.
Therefore, model parameter θ learned by PMM can be reasonable over the whole set of documents. However, when computing the probability of generating an individual document, a document-specific bias on the mixture ratios should be considered. The proposed model takes this document-specific bias into account by assuming that the mixture ratio vector π follows a Dirichlet distribution. We choose the Dirichlet distribution because the elements of vector π sum to one and each element π_i is nonnegative. Namely, the proposed model represents the model parameter of multiple topics as a mixture of the model parameters of each single topic, with mixture ratios following a Dirichlet distribution. Concretely, given a document w and multiple topics y, it estimates the posterior probability distribution P(π | x, y) by Bayesian inference. For convenience, the proposed model is called PDMM (Parametric Dirichlet Mixture Model).
In Figure 1, the mixture ratio (bias) π = (π_1, π_2, π_3), Σ_{i=1}^{3} π_i = 1, π_i ≥ 0, of three topics is expressed in the 3-dimensional real space R^3. The mixture ratios (biases) π form a 2D simplex in R^3.
One point on the simplex indicates one mixture ratio π of the three topics; that is, the point indicates multiple topics with that mixture ratio. PMM generates documents assuming that all mixture ratios are equal; that is, PMM generates only documents with multiple topics corresponding to the center point of the 2D simplex in Figure 1. On the contrary, PDMM generates documents assuming that the mixture ratio π follows a Dirichlet distribution; that is, PDMM can generate documents with multiple topics whose weights are generated by the Dirichlet distribution.
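Sampling a point on the topic simplex from a Dirichlet distribution can be sketched with the standard normalized-Gamma construction (a minimal illustration using only the standard library; the function name is ours):

```python
import random

def sample_dirichlet(alpha):
    """Draw a mixture ratio pi ~ Dirichlet(alpha) by normalizing Gamma draws."""
    g = [random.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [gi / s for gi in g]

random.seed(0)
# alpha = (1, 1, 1) gives the uniform distribution over the 2D simplex:
pi = sample_dirichlet([1.0, 1.0, 1.0])
print(pi, sum(pi))
```

Each draw lands on a different point of the simplex, which is exactly the extra freedom PDMM has over PMM's single center point.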
Figure 1: Topic Simplex for Three Topics (points generated from PDMM; a single center point generated from PMM)
We consider the prior distribution of π whose index i is an element of I_y, i.e., i ∈ I_y. We use the Dirichlet distribution as the prior. α is the parameter vector of the Dirichlet distribution corresponding to π_i (i ∈ I_y). Namely, the formulation is as follows.
P(v | y, θ, π) is the probability that word v is generated from multiple topics y, and it is denoted as a linear sum of π_i (i ∈ I_y) and θ_iv (i ∈ I_y) as follows:

P(v | y, θ, π) = Σ_{i∈I_y} π_i θ_iv.
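The linear sum above is a weighted average of the per-topic word probabilities. A minimal sketch (names and the toy numbers are ours):

```python
def word_prob(v, Iy, pi, theta):
    """P(v | y, theta, pi) = sum over assigned topics i of pi_i * theta[i][v]."""
    return sum(pi[i] * theta[i][v] for i in Iy)

# Two assigned topics with different word distributions over a 2-word vocabulary:
theta = {0: [0.7, 0.3], 1: [0.2, 0.8]}
pi = {0: 0.5, 1: 0.5}          # an equal mixture ratio, as PMM would use
p = word_prob(0, [0, 1], pi, theta)
print(p)  # 0.45 = 0.5*0.7 + 0.5*0.2
```

Changing pi (e.g., to a Dirichlet draw) shifts this word probability toward the more heavily weighted topic, which is the document-specific bias PDMM models.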
4.3 Variational Bayes Method for Estimating Mixture Ratio
This section explains a method to estimate the posterior probability distribution P(π | w, y, α, θ) of the document-specific mixture ratio. In principle, P(π | w, y, α, θ) is obtained by Bayes' theorem using Eq. (4). However, this is computationally impractical because it requires a complicated integral computation. Therefore, we estimate an approximate distribution of P(π | w, y, α, θ) using the Variational Bayes Method (H. Attias, 1999). The concrete explanation is as follows.
Using Eqs. (4) and (7), the probability of generating a document is written as

P(w | y, α, θ) = ∫ P(π | α, y) Π_{n=1}^{N} Σ_{i∈I_y} P(y_in = 1 | π) P(w_n | y_in = 1, θ) dπ.

Transforming the document expression of the above equation by introducing topic assignments z = (z_1, ..., z_N), z_n ∈ I_y, yields Eq. (8); that is, Eq. (8) is Eq. (4) rewritten with z introduced. To approximate the posterior, a factorized distribution Q(π, z) = Q(π | γ) Π_{n=1}^{N} Q(z_n | φ) is introduced. Q(π | γ) is a Dirichlet distribution with parameter γ. Q(z_n | φ) is a multinomial distribution whose parameter φ_ni indicates the probability that the nth word of a document belongs to topic i (i ∈ I_y). KL(Q, P) is the Kullback-Leibler divergence, which is often employed as a distance between Q(π, z) and the true posterior P(π, z | w, y, α, θ) = P(π, z, w | y, α, θ) / P(w | y, α, θ). Hereafter, we explain the Variational Bayes Method for estimating the approximate distribution: the parameters γ and φ are estimated by maximizing the functional F[Q], using Eqs. (10) and (11).
F[Q] is known to be a function of γ_i and φ_ni from Eqs. (21) through (25). Then we only need to solve the maximization problem of the nonlinear function F[Q] with respect to γ_i and φ_ni. In this case, the maximization problem can be solved by the Lagrange multiplier method. First, regard F[Q] as a function of γ_i, which is denoted as F[γ_i]. γ_i has no constraints, so we only need to find the γ_i where ∂F[γ_i]/∂γ_i = 0. The resultant γ_i is expressed as follows. λ is a so-called Lagrange multiplier. We then find the φ_ni where ∂F[Q]/∂φ_ni = 0. C is a normalization term.
By Eqs. (26) and (28), we obtain the following updating formulas for γ_i and φ_ni. Using these updating formulas, we can estimate the parameters γ and φ, which are specific to a document w and topics y. Last of all, we show the pseudo code vb(w, y), which estimates γ and φ.
In addition, we regard α, the parameter of the prior distribution of π, as a vector whose elements are all one, because a Dirichlet distribution whose parameters are all one becomes the uniform distribution.
• Variational Bayes Method for PDMM ---
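The body of the pseudo code vb(w, y) is lost in this excerpt. As a hedged reconstruction, the following sketch assumes the standard mean-field updates for this family of models, which match the quantities the text describes (γ_i = α_i + Σ_n φ_ni, and φ_ni proportional to θ_{i,w_n} exp(Ψ(γ_i))); it is an illustration under those assumptions, not the authors' code:

```python
import math

def digamma(x):
    # Psi function, via the recurrence Psi(x) = Psi(x+1) - 1/x plus an
    # asymptotic series for large x.
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1/12.0 - f * (1/120.0 - f / 252.0))

def vb(w, Iy, theta, alpha=1.0, iters=50):
    """Estimate gamma (Dirichlet parameter) and phi (per-word topic posteriors)."""
    gamma = {i: alpha + len(w) / len(Iy) for i in Iy}
    phi = [{i: 1.0 / len(Iy) for i in Iy} for _ in w]
    for _ in range(iters):
        for n, v in enumerate(w):
            weights = {i: theta[i][v] * math.exp(digamma(gamma[i])) for i in Iy}
            z = sum(weights.values())
            phi[n] = {i: weights[i] / z for i in Iy}
        gamma = {i: alpha + sum(phi[n][i] for n in range(len(w))) for i in Iy}
    return gamma, phi

# Toy document of three words over a 2-word vocabulary, two assigned topics:
theta = {0: [0.8, 0.2], 1: [0.1, 0.9]}
gamma, phi = vb([0, 0, 1], [0, 1], theta)
print(gamma)
```

With α set to all ones as in the text, the prior over π is uniform, and γ simply accumulates the expected word-to-topic counts.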
4.4 Computing Probability of Generating Document
PMM computes the probability of generating a document w given topics y and a set of model parameters θ as follows:
4.5 Algorithm for Estimating Multiple Topics of Document
PDMM estimates the multiple topics y* that maximize the probability of generating a document w*, i.e., Eq. (35). This is a 0-1 integer problem (i.e., an NP-hard problem), so PDMM uses the same approximate estimation algorithm as PMM. It differs from PMM's estimation algorithm in that it estimates the mixture ratios of topics y by the Variational Bayes Method, as shown by vb(w, y) at step 6 in the following pseudo code of the estimation algorithm:
• Topics Estimation Algorithm ----
function prediction(w):
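The body of prediction(w) is also lost in this excerpt. Below is a sketch of the kind of greedy approximation the text describes: adding one topic at a time as long as the document probability improves, thereby avoiding the 2^K − 1 exhaustive search. For simplicity it scores candidates with PMM's equal mixture ratios; the function names and toy data are our assumptions:

```python
import math

def log_prob(w, Iy, theta):
    """Log probability of document w under an equally weighted mixture of topics Iy."""
    k = len(Iy)
    return sum(math.log(sum(theta[i][v] for i in Iy) / k) for v in w)

def prediction(w, topics, theta):
    """Greedy search: repeatedly add the topic that most improves the probability."""
    chosen, rest = [], list(topics)
    best = float("-inf")
    while rest:
        cand = max(rest, key=lambda i: log_prob(w, chosen + [i], theta))
        score = log_prob(w, chosen + [cand], theta)
        if score <= best:        # adding any further topic no longer helps
            break
        best = score
        chosen.append(cand)
        rest.remove(cand)
    return sorted(chosen)

# Three topics over a 3-word vocabulary; the document mixes words 0 and 1:
theta = {0: [0.8, 0.1, 0.1], 1: [0.1, 0.8, 0.1], 2: [0.1, 0.1, 0.8]}
print(prediction([0, 0, 1, 1], [0, 1, 2], theta))  # [0, 1]
```

In PDMM the scoring step would instead call vb(w, y) to estimate the mixture ratios before computing the document probability, as the text explains.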
5 Evaluation
We evaluate the proposed model using the F-measure on a multiple-topic categorization problem. We use MEDLINE1 as the dataset. In this experiment, we use five thousand abstracts written in English.
MEDLINE has a metadata set called MeSH Terms. For example, each abstract has MeSH Terms such as RNA Messenger and DNA-Binding Proteins. MeSH Terms are regarded as the multiple topics of an abstract.
However, we only use MeSH Terms of medium frequency (100-999). We did so because the experimental results could be overly affected by high-frequency terms that appear in almost every abstract and by low-frequency terms that appear in very few abstracts. In consequence, the number of topics is 88.
The size of the vocabulary is 46,075. The proportion of documents with multiple topics in the whole dataset is 69.8%; i.e., the proportion of documents with a single topic is 30.2%. The average number of topics per document is 3.4. Using TreeTagger2, we lemmatize every word. We eliminate stop words such as articles and be-verbs.
We compare the F-measure of PDMM with those of PMM and other models. The F-measure (F) is defined as follows:

F = 2PR / (P + R),   P = |N_r ∩ N_e| / |N_e|,   R = |N_r ∩ N_e| / |N_r|.

N_r is the set of relevant topics. N_e is the set of estimated topics. A higher F-measure indicates a better ability to discriminate topics.
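The per-document F-measure above can be sketched directly from its definition (function name and toy topic sets are ours):

```python
def f_measure(relevant, estimated):
    """F = 2PR/(P+R), with P = |Nr ∩ Ne|/|Ne| and R = |Nr ∩ Ne|/|Nr|."""
    Nr, Ne = set(relevant), set(estimated)
    hit = len(Nr & Ne)
    if hit == 0:
        return 0.0
    P, R = hit / len(Ne), hit / len(Nr)
    return 2 * P * R / (P + R)

# Three relevant topics, two of which were estimated: P = 1, R = 2/3, F = 0.8.
f = f_measure({"Female", "Male", "Biological Markers"}, {"Female", "Male"})
print(f)  # 0.8
```

The experiment then averages this per-document score over the whole test set.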
In our experiment, we compute the F-measure for each document and average the F-measures over the whole document set.
We consider several models that differ in how they learn model parameter θ. PDMM learns model parameter θ with the same learning algorithm as PMM. NBM learns model parameter θ with the Naive Bayes learning algorithm.
The parameters are updated according to the following formula: θ_iv = M_iv / C. M_iv is the number of training documents in which word v appears with topic i. C is the normalization term ensuring Σ_{v=1}^{V} θ_iv = 1.
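The NBM update is a simple normalized count. A minimal sketch of that estimate (the data layout M_counts[i][v] and the function name are ours):

```python
def nbm_theta(M_counts):
    """theta_iv = M_iv / C, with C chosen so that sum_v theta_iv = 1 per topic i."""
    theta = {}
    for i, counts in M_counts.items():
        C = sum(counts)                 # normalization term for topic i
        theta[i] = [m / C for m in counts]
    return theta

# M_counts[i][v]: number of training documents of topic i containing word v.
theta = nbm_theta({0: [3, 1, 0], 1: [1, 1, 2]})
print(theta[0])  # [0.75, 0.25, 0.0]
```

Unlike PMM's iterative update with smoothing parameter set to two, this closed-form count gives zero probability to unseen words, which is one reason NBM can underperform.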
1 http://www.nlm.nih.gov/pubs/factsheets/medline.html
2 http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
The comparison of these models with respect to F-measure is shown in Figure 2. The horizontal axis is the proportion of test data in the dataset (5,000 abstracts). For example, 2% indicates that the number of documents for learning the model is 4,900 and the number of documents for the test is 100.
The vertical axis is the F-measure. For each proportion, the F-measure is an average over five pairs of training and test documents randomly generated from the dataset. As shown in Figure 2, the F-measure of PDMM is higher than that of the other methods at every proportion. Therefore, PDMM is more effective than the other methods for multiple-topic categorization.
Figure 3 shows the comparison of the models with respect to F-measure while changing the proportion of multiple-topic documents in the whole dataset. The proportions of documents for learning and testing are 40% and 60%, respectively.
The horizontal axis is the proportion of multiple-topic documents in the whole dataset. For example, 30% indicates that multiple-topic documents make up 30% of the dataset while the remaining documents have a single topic; that is, such a dataset consists mostly of single-topic documents. At 30%, there is little difference in F-measure among the models.
As the proportion of multiple-topic documents approaches 90%, the differences in F-measure among the models become apparent. This result shows that PDMM is effective in modeling multiple-topic documents.
Figure 2: F-measure Results

Figure 3: F-measure Results Changing the Proportion of Multiple-Topic Documents in the Dataset

In the results of the experiment described in Section 5.2, PDMM is more effective than the other models in multiple-topic categorization.
If the topic weightings are averaged over all biases in the whole set of training documents, they could cancel out. This cancellation can lead to the result that model parameter θ learned by PMM is reasonable over the whole set of documents.
Moreover, PDMM computes the probability of generating a document using a mixture of model parameters while estimating the mixture ratios of the topics. We think this estimation of the mixture ratios is the key factor in achieving better results than the other models.
In addition, the estimation of the mixture ratios of topics can be effective for extracting features of a document with multiple topics. The mixture ratios of the topics assigned to a document are specific to that document.
Therefore, the estimation of the mixture ratios of topics can be regarded as a projection from a word-frequency space q^V, where q is the set of nonnegative integers, to a mixture-ratio space of topics [0,1]^K for a document. Since the size of the vocabulary is much larger than the number of topics, the estimation of the mixture ratios of topics acts as a dimension reduction and an extraction of document features. This can lead to analysis of similarity among documents with multiple topics.
For example, the estimated mixture ratios of the topics [Comparative Study], [Apoptosis] and [Models, Biological] in one MEDLINE abstract are 0.656, 0.176 and 0.168, respectively. This ratio can be a feature of this document.
Moreover, we can obtain other interesting results as follows. The estimation of the mixture ratios of topics uses the parameter γ of Section 4.3. We also obtain interesting results from the other parameter, φ, which is needed to estimate γ. φ_ni is specific to a document.
Table 1 (excerpt): biomarkers, Fusarium, non-Gaussian, Stachybotrys, Cladosporium, population, response, dampness
φ_ni indicates the probability that word w_n belongs to topic i in a document. Therefore, we can compute the entropy of w_n as follows:

entropy(w_n) = −Σ_{i∈I_y} φ_ni log(φ_ni).
We rank the words in a document by this entropy.
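This entropy-based ranking can be sketched in a few lines (the toy φ values and words are ours; lower entropy means the word is concentrated on one topic):

```python
import math

def entropy(phi_n):
    """entropy(w_n) = -sum_i phi_ni * log(phi_ni); lower = more topic-specific."""
    return -sum(p * math.log(p) for p in phi_n.values() if p > 0)

# Hypothetical per-word topic posteriors phi_ni over two assigned topics:
phi = {
    "IVGTT": {0: 0.98, 1: 0.02},   # nearly certain topic: low entropy, technical
    "use":   {0: 0.50, 1: 0.50},   # split across topics: high entropy, generic
}
ranked = sorted(phi, key=lambda w: entropy(phi[w]))
print(ranked)  # ['IVGTT', 'use']
```

Sorting in ascending order of entropy therefore puts topic-specific (technical) words at the top, matching the observation about Table 1 below.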
For example, a list of words in ascending order of entropy for document X is shown in Table 1. The value in parentheses is the rank of the word in descending order of TF-IDF (= tf · log(M/df), where tf is the term frequency in a test document, df is the document frequency, and M is the number of documents in the set of documents used for learning the model parameters) (Y. Yang and J. Pederson, 1997).
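The TF-IDF baseline in parentheses is computed exactly as the formula states; a minimal sketch (the toy counts are ours):

```python
import math

def tf_idf(tf, df, M):
    """TF-IDF = tf * log(M / df): tf from the test document, df and M from training."""
    return tf * math.log(M / df)

# A word occurring 3 times in the test document and in 10 of 1,000 training documents:
score = tf_idf(tf=3, df=10, M=1000)
print(round(score, 3))
```

Unlike the entropy ranking, this score does not depend on which topics are assigned to the document, which is the contrast drawn later in this section.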
The actually assigned topics are [Female], [Male] and [Biological Markers], with estimated mixture ratios of 0.499, 0.460 and 0.041, respectively. The top 10 words appear more technical than the bottom 10 words in Table 1.
When the entropy of a word is lower, the word is more topic-specific, i.e., more technical. In addition, this ranking of words depends on the topics assigned to a document. When we assign randomly chosen topics to the same document, generic terms may be ranked higher.
For example, when we randomly assign the topics [Rats], [Child] and [Incidence], generic terms such as "use" and "relate" are ranked higher, as shown in Table 2. The estimated mixture ratios of [Rats], [Child] and [Incidence] are 0.411, 0.352 and 0.237, respectively.
As another example, a list of words in ascending order of entropy for document Y is shown in Table 3. The actually assigned topics are [Female], [Animals], [Pregnancy] and [Glucose].
The estimated mixture ratios of [Female], [Animals], [Pregnancy] and [Glucose] are 0.442, 0.437, 0.066 and 0.055, respectively.

Table excerpt: exposure, distribution, evaluate, versicolor, Aspergillus, correlate, chrysogenum, positive, chartarum, herbarum
In this case, we consider assigning subsets of the actual topics to the same document Y. Table 4 shows a list of words in document Y when it is assigned only the sub-topics [Female] and [Animals].
The estimated mixture ratios of [Female] and [Animals] are 0.495 and 0.505, respectively; the estimated mixture ratios of the topics have changed.
It is interesting that [Female] has a higher mixture ratio than [Animals] among the actual topics, but a lower mixture ratio than [Animals] among the sub-topics [Female] and [Animals].
According to these different mixture ratios, the ranking of words in document Y changes.
Table 5 shows a list of words in document Y when it is assigned the sub-topics [Pregnancy] and [Glucose]. The estimated mixture ratios of [Pregnancy] and [Glucose] are 0.502 and 0.498, respectively.
It is interesting that with the actual topics, "glucose-insulin" and "IVGTT" rank high in document Y, but with the two subsets of the actual topics, "glucose-insulin" and "IVGTT" do not appear in the top 10 words.
The important observation from these examples is that this method of ranking words in a document can be associated with the topics assigned to the document.
φ depends on γ, as seen in Eq. (28). This is why the ranking of words depends on the assigned topics, or more concretely, on the mixture ratios of the assigned topics. TF-IDF computed from the whole document set cannot have this property.
Combined with existing keyword extraction methods, our model has the potential to extract document-specific keywords using the information of the assigned topics.
Table 3: Word List of Document Y whose Actual Topics are [Female], [Animals], [Pregnancy] and [Glucose] (excerpt: glucose-insulin, indicate)

Table 4: Word List of Document Y whose Topics are [Female] and [Animals] (excerpt: insulin-signaling, euthanasia, undernutrition, conclusion)
6 Concluding Remarks
We proposed and evaluated a novel probabilistic generative model, PDMM, for handling multiple-topic documents. We evaluated PDMM and other models by comparing F-measures on the MEDLINE corpus. The results showed that PDMM is more effective than PMM. Moreover, we indicated the potential of the proposed model to extract document-specific keywords using the information of the assigned topics.
Acknowledgement

This research was funded in part by a MEXT Grant-in-Aid for Scientific Research on Priority Areas ("i-explosion") in Japan.
Table 5: Word List of Document Y whose Topics are [Pregnancy] and [Glucose] (excerpt: metabolism, requirement, metabolic, intermediary, pregnant, prenatal, nutrition, gestation, nutrient, offspring, singleton)
References

Learning (Information Science and Statistics), p. 687. Springer-Verlag.

D.M. Blei, A.Y. Ng, and M.I. Jordan. 2001. Latent Dirichlet Allocation. Neural Information Processing Systems 14.

D.M. Blei, Andrew Y. Ng, and M.I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research, vol. 3, pp. 993-1022.

Minka. 2002. Estimating a Dirichlet distribution. Technical Report.

Y.W. Teh, M.I. Jordan, M.J. Beal, and D.M. Blei. 2003. Hierarchical Dirichlet processes. Technical Report 653, Department of Statistics, UC Berkeley.

Ueda, N. and Saito, K. 2002. Parametric mixture models for multi-topic text. Neural Information Processing Systems 15.

Ueda, N. and Saito, K. 2002. Single-shot detection of multi-category text using parametric mixture models. ACM SIGKDD Knowledge Discovery and Data Mining.

Y. Yang and J. Pederson. 1997. A comparative study on feature selection in text categorization. Proc. International Conference on Machine Learning.
