In this paper, we propose a novel probabilistic generative model for documents with explicit multiple topics: the Parametric Dirichlet Mixture Model (PDMM).
PDMM extends an existing probabilistic generative model, the Parametric Mixture Model (PMM), with a hierarchical Bayes model.
PMM models multiple-topic documents by mixing the model parameters of each single topic with equal mixture ratios.
PDMM models multiple-topic documents by mixing the model parameters of each single topic with mixture ratios that follow a Dirichlet distribution.
We evaluate PDMM against PMM by comparing F-measures on the MEDLINE corpus.
The evaluation shows that PDMM is more effective than PMM.
1 Introduction
Documents, such as those seen on Wikipedia and in folksonomies, are increasingly assigned explicit multiple topics.
In this situation, it is important to analyze the linguistic relationship between documents and their assigned multiple topics.
We attempt to model this relationship with a probabilistic generative model.
A probabilistic generative model for documents with multiple topics is a probability model of the process that generates such documents.
By focusing on modeling the generation process of documents and their assigned multiple topics, we can extract specific properties of both.
The model can also be applied to a wide range of applications, such as automatic categorization into multiple topics, keyword extraction, and measuring document similarity.
A probabilistic generative model for documents with multiple topics is categorized into the following two models.
One model treats a topic as a latent topic; we call this the latent-topic model.
The other treats a topic as an explicit topic; we call this the explicit-topic model.
In a latent-topic model, a latent topic indicates not a concrete topic but an underlying implicit topic of documents.
Obviously this model uses an unsupervised learning algorithm.
Representative examples of this kind of model are Latent Dirichlet Allocation (LDA) (D.M. Blei et al., 2001; D.M. Blei et al., 2003) and the Hierarchical Dirichlet Process (HDP) (Y.W. Teh et al., 2003).
In an explicit-topic model, an explicit topic indicates a concrete topic such as economy or sports, for example.
A learning algorithm for this model is a supervised learning algorithm.
That is, an explicit-topic model learns model parameters using a training data set of (document, topics) tuples.
Representative examples of this model are the Parametric Mixture Models, PMM1 and PMM2 (Ueda, N. and Saito, K., 2002a; Ueda, N. and Saito, K., 2002b).
In the remainder of this paper, PMM indicates PMM1, because PMM1 is more effective than PMM2.
In this paper, we focus on the explicit topic model.
In particular, we propose a novel model that is based on PMM but fundamentally improved.
The remaining part of this paper is organized as follows.
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 421-429, Prague, June 2007.
©2007 Association for Computational Linguistics
Section 2 explains terminology used in the following sections.
Section 3 explains PMM, which is the work most directly related to ours.
Section 4 points out the problem of PMM and introduces our new model.
Section 5 evaluates our new model.
Section 6 summarizes our work.
2 Terminology
This section explains terminology used in this paper.
K is the number of explicit topics.
V is the number of words in the vocabulary. v = {1, 2, ..., V} is the set of vocabulary indices. y = {1, 2, ..., K} is the set of topic indices.
N is the number of words in a document. w = (w_1, w_2, ..., w_N) is a sequence of N words, where w_n denotes the nth word in the sequence; w is the document itself and is called the words vector. x = (x_1, x_2, ..., x_V) is the word-frequency vector, i.e., the BOW (Bag Of Words) representation, where x_v denotes the frequency of word v. w_n^v takes the value 1 (0) when w_n is (is not) word v ∈ v. y = (y_1, y_2, ..., y_K) is the topic vector into which a document w is categorized, where y_i takes the value 1 (0) when the ith topic is (is not) assigned to the document w. I_y ⊂ y is the set of topic indices i for which y_i takes the value 1 in y. Σ_{i∈I_y} and Π_{i∈I_y} denote the sum and product over all i in I_y, respectively. Γ(x) is the Gamma function and Ψ is the Psi function (Minka, 2002).
A probabilistic generative model for documents with multiple topics models the probability of generating a document w given multiple topics y using the model parameter θ, i.e., it models P(w|y, θ).
The multiple-categorization problem is to estimate the multiple topics y* of a document w* whose topics are unknown.
The model parameters are learned from documents D = {(w_d, y_d)}_{d=1}^{M}, where M is the number of documents.
3 Parametric Mixture Model
In this section, we briefly explain the Parametric Mixture Model (PMM) (Ueda, N. and Saito, K., 2002a; Ueda, N. and Saito, K., 2002b).
PMM models multiple-topic documents by mixing the model parameters of each single topic with equal mixture ratios, where the model parameter θ_iv is the probability that word v is generated from topic i. This is because it is impractical to use model parameters corresponding to multiple topics, whose number is 2^K − 1 (all combinations of K topics).
PMM achieved more useful results than machine learning methods such as Naive Bayes, SVM, k-NN and Neural Networks (Ueda, N. and Saito, K., 2002a; Ueda, N. and Saito, K., 2002b).
PMM employs a BOW representation and is formulated as follows:

P(w|y, θ) = Π_{v=1}^{V} ( Σ_{i=1}^{K} h_i(y) θ_iv )^{x_v}.

h_i(y) is the mixture ratio corresponding to topic i and is formulated as follows:

h_i(y) = y_i / Σ_{j=1}^{K} y_j,

so that Σ_{i=1}^{K} h_i(y) = 1.
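As a concrete illustration of the PMM formulation (equal mixture ratios over the assigned topics), the likelihood can be sketched in Python. This is our minimal sketch, not the authors' code; the array shapes and names are our own assumptions:

```python
import numpy as np

def pmm_log_prob(x, y, theta):
    """Log-probability of a BOW vector x under PMM given topic vector y.

    x:     (V,) word-frequency vector
    y:     (K,) binary topic-assignment vector
    theta: (K, V) per-topic word distributions, rows sum to 1
    """
    h = y / y.sum()           # equal mixture ratios h_i(y) over assigned topics
    word_probs = h @ theta    # (V,) mixture distribution over the vocabulary
    return float(x @ np.log(word_probs))
```

The log form is used for numerical stability; exponentiating it recovers P(w|y, θ).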
3.3 Learning Algorithm of Model Parameter
The learning algorithm for the model parameter θ in PMM is an iterative method similar to the EM algorithm.
The model parameter θ is estimated by maximizing Π_{d=1}^{M} P(w_d|y_d, θ) over the training documents D = {(w_d, y_d)}_{d=1}^{M}. A function g corresponding to a document d is introduced as follows:

g_div(θ) = h_i(y_d) θ_iv / Σ_{j=1}^{K} h_j(y_d) θ_jv.

The parameters are updated according to the following formula:

θ_iv = ( Σ_{d=1}^{M} x_dv g_div(θ) + (ζ − 1) ) / C.

x_dv is the frequency of word v in document d. C is the normalization term ensuring Σ_{v=1}^{V} θ_iv = 1.
ζ is a smoothing parameter; setting ζ to two yields Laplace smoothing.
In this paper, ζ is set to two, following the original paper.
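The iterative update above can be sketched as follows. This is a sketch under our reading of the update: responsibilities g are computed from the current θ and the equal mixture ratios h, then θ is re-estimated with smoothing parameter ζ = 2. Names and shapes are our assumptions:

```python
import numpy as np

def pmm_fit(X, Y, n_iter=100, zeta=2.0):
    """Iterative estimation of theta for PMM.

    X: (M, V) word-frequency matrix, Y: (M, K) binary topic matrix.
    zeta=2 corresponds to Laplace smoothing.
    """
    M, V = X.shape
    K = Y.shape[1]
    H = Y / Y.sum(axis=1, keepdims=True)   # (M, K) equal mixture ratios h_i(y_d)
    theta = np.full((K, V), 1.0 / V)       # uniform initialization
    for _ in range(n_iter):
        mix = H @ theta                    # (M, V) mixture word probabilities
        # responsibilities g_{d,i,v} = H[d,i] * theta[i,v] / mix[d,v]
        g = H[:, :, None] * theta[None, :, :] / mix[:, None, :]
        theta = (X[:, None, :] * g).sum(axis=0) + (zeta - 1.0)
        theta /= theta.sum(axis=1, keepdims=True)  # normalization term C
    return theta
```

For documents carrying a single topic, the responsibilities collapse to that topic and the update reduces to smoothed relative word frequencies.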
4 Proposed Model
In this section, we first point out a problem with PMM.
Then we explain our solution to the problem by proposing a new model.
PMM estimates the model parameter θ assuming that all mixture ratios of the single topics are equal.
It is our intuition, however, that each document can sometimes be more heavily weighted toward some topics than toward the rest of its assigned topics.
If the topic weightings are averaged over all biases in the whole document set, they could cancel out.
Therefore, the model parameter θ learned by PMM can be reasonable over the whole document set.
However, when we compute the probability of generating an individual document, a document-specific topic-weight bias on the mixture ratio should be considered.
The proposed model takes this document-specific bias into account by assuming that the mixture ratio vector π follows a Dirichlet distribution.
This is a natural choice because the elements of π sum to one and each element π_i is nonnegative.
Namely, the proposed model assumes the model parameter of multiple topics to be a mixture of the model parameters of each single topic, with mixture ratios following a Dirichlet distribution.
Concretely, given a document w and multiple topics y, it estimates the posterior probability distribution P(π|w, y) by Bayesian inference.
For convenience, the proposed model is called PDMM (Parametric Dirichlet Mixture Model).
In Figure 1, the mixture ratio (bias) π = (π_1, π_2, π_3), with Σ_{i=1}^{3} π_i = 1 and π_i ≥ 0, of three topics is expressed in the 3-dimensional real space R^3.
The mixture ratios π form a 2D-simplex in R^3.
One point on the simplex indicates one mixture ratio π of the three topics.
That is, the point indicates multiple topics with that mixture ratio.
PMM generates documents assuming that each mixture ratio is equal.
That is, PMM generates only documents whose multiple topics correspond to the center point of the 2D-simplex in Figure 1.
In contrast, PDMM generates documents assuming that the mixture ratio π follows a Dirichlet distribution.
That is, PDMM can generate documents with multiple topics whose weights can be any point generated by the Dirichlet distribution.
Figure 1: Topic Simplex for Three Topics
We use a prior distribution over the mixture ratios π_i whose indices i are elements of I_y, i.e., i ∈ I_y.
We use the Dirichlet distribution as the prior. α is the parameter vector of the Dirichlet distribution corresponding to π_i (i ∈ I_y).
Namely, the formulation is as follows:

P(π|α, y) = ( Γ(Σ_{i∈I_y} α_i) / Π_{i∈I_y} Γ(α_i) ) Π_{i∈I_y} π_i^{α_i − 1}.

φ(v, y, θ, π) is the probability that word v is generated from multiple topics y and is denoted as a linear sum of π_i (i ∈ I_y) and θ_iv (i ∈ I_y) as follows:

φ(v, y, θ, π) = Σ_{i∈I_y} π_i θ_iv.
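The two modeling assumptions can be contrasted in a short sketch (our illustration, with hypothetical numbers): PMM fixes the mixture ratio at the center of the simplex, while PDMM draws π from a Dirichlet distribution and mixes the per-topic word distributions with it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-topic word distributions theta (K=3 assigned topics, V=4 words);
# the numbers are hypothetical.
theta = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.1, 0.7, 0.1, 0.1],
                  [0.1, 0.1, 0.7, 0.1]])

pmm_ratio = np.full(3, 1.0 / 3.0)        # PMM: center of the 2D-simplex
pdmm_ratio = rng.dirichlet(np.ones(3))   # PDMM: pi ~ Dirichlet(alpha), alpha = 1

# phi(v) = sum_i pi_i * theta_iv is itself a distribution over the vocabulary
word_dist = pdmm_ratio @ theta
```

With α set to all ones (as in this paper), the Dirichlet is uniform over the simplex, so PDMM allows any topic weighting a priori.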
4.3 Variational Bayes Method for Estimating Mixture Ratio
This section explains a method to estimate the posterior probability distribution P(π|w, y, α, θ) of the document-specific mixture ratio.
In principle, P(π|w, y, α, θ) is obtained by Bayes' theorem using Eq. (4).
However, that is computationally impractical because a complicated integral computation is needed.
Therefore we estimate an approximate distribution of P(π|w, y, α, θ) using the Variational Bayes Method (H. Attias, 1999).
The concrete explanation is as follows.
Using Eqs. (4) and (7), the document expression is transformed by introducing a latent topic variable z_n for each word w_n; that is, Eq. (8) is Eq. (4) rewritten in terms of the joint distribution P(π, z, w|y, α, θ).
We then approximate the true posterior P(π, z|w, y, α, θ) by a factorized variational distribution Q(π, z) = Q(π|γ) Π_{n=1}^{N} Q(z_n|φ).
Q(π|γ) is a Dirichlet distribution, where γ is its parameter. Q(z_n|φ) is a multinomial distribution, where φ is its parameter and φ_ni indicates the probability that the nth word of the document belongs to topic i (i ∈ I_y).
KL(Q, P) is the Kullback-Leibler divergence, which is often employed as a distance between probability distributions; minimizing KL(Q, P) is equivalent to maximizing the lower bound F[Q] of the log likelihood.
Hereafter, we explain the Variational Bayes Method for estimating an approximate distribution of P(π, z|w, y, α, θ); that is, we estimate the parameters γ and φ by maximizing F[Q].
Using Eqs. (10) and (11), F[Q] is known to be a function of γ_i and φ_ni from Eqs. (21) through (25).
Then we only need to solve the maximization problem of the nonlinear function F[Q] with respect to γ_i and φ_ni.
In this case, the maximization problem can be solved by the Lagrange multiplier method.
First, regard F[Q] as a function of γ_i, denoted F[γ_i].
Since γ_i has no constraints, we only need to find the γ_i where ∂F[γ_i]/∂γ_i = 0. The resulting γ_i is expressed as follows:

γ_i = α_i + Σ_{n=1}^{N} φ_ni.

Next, regard F[Q] as a function of φ_ni, subject to Σ_{i∈I_y} φ_ni = 1, and introduce λ, a so-called Lagrange multiplier. We find the φ_ni where ∂F[φ_ni]/∂φ_ni = 0:

φ_ni = θ_{i w_n} exp(Ψ(γ_i)) / C,

where C is a normalization term ensuring Σ_{i∈I_y} φ_ni = 1.
By Eqs. (26) through (28), we obtain the above updating formulas for γ_i and φ_ni.
Using these updating formulas, we can estimate the parameters γ and φ, which are specific to a document w and topics y. Last of all, we show a pseudo code vb(w, y) which estimates γ and φ.
In addition, we regard α, the parameter of the prior distribution of π, as a vector whose elements are all one.
That is because a Dirichlet distribution whose parameters are all one becomes the uniform distribution.
• Variational Bayes Method for PDMM---
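Since the pseudo code did not survive extraction, here is a sketch of vb(w, y) in Python under the standard variational updates of this form (γ_i = α_i + Σ_n φ_ni and φ_ni ∝ θ_{i,w_n} exp(Ψ(γ_i))); the digamma helper and all names are our own:

```python
import numpy as np

def digamma(x):
    """Psi (digamma) function via recurrence plus an asymptotic series."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + np.log(x) - 0.5 / x - f * (1.0 / 12 - f * (1.0 / 120 - f / 252))

def vb(word_ids, topic_ids, theta, n_iter=50):
    """Estimate document-specific gamma and phi for PDMM.

    word_ids:  indices of the N words of the document w
    topic_ids: indices of the assigned topics (the set I_y)
    theta:     (K_all, V) per-topic word distributions
    """
    K, N = len(topic_ids), len(word_ids)
    alpha = np.ones(K)                 # uniform Dirichlet prior, as in the paper
    gamma = alpha + float(N) / K
    for _ in range(n_iter):
        # phi_ni proportional to theta[i, w_n] * exp(Psi(gamma_i))
        weights = np.exp(np.array([digamma(g) for g in gamma]))
        phi = theta[np.ix_(topic_ids, word_ids)].T * weights   # (N, K)
        phi /= phi.sum(axis=1, keepdims=True)                  # normalization C
        gamma = alpha + phi.sum(axis=0)
    return gamma, phi
```

After convergence, gamma / gamma.sum() gives the estimated document-specific mixture ratio of the assigned topics.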
4.4 Computing Probability of Generating Document
PDMM computes the probability of generating a document w given topics y and the set of model parameters θ as follows:
4.5 Algorithm for Estimating Multiple Topics of Document
PDMM estimates the multiple topics y* that maximize the probability of generating a document w*, i.e., Eq. (35).
This is a 0-1 integer problem (i.e., an NP-hard problem), so PDMM uses the same approximate estimation algorithm as PMM does.
It differs from PMM's estimation algorithm, however, in that it estimates the mixture ratios of the topics y by the Variational Bayes Method, as shown by vb(w, y) at step 6 in the following pseudo code of the estimation algorithm:
• Topics Estimation Algorithm----
function prediction(w):
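The greedy scheme can be sketched as follows. This is our sketch of the shared PMM-style greedy search only; PDMM's actual algorithm additionally calls vb(w, y) at step 6 to weight the candidate topics, which we omit here. The scoring function and names are our assumptions:

```python
import numpy as np

def pmm_log_prob(x, y, theta):
    """Log-probability of BOW vector x under equal mixture ratios (PMM scoring)."""
    h = y / y.sum()
    return float(x @ np.log(h @ theta))

def predict_topics(x, theta):
    """Greedily add the topic that most improves the log-probability,
    stopping when no addition improves it."""
    K = theta.shape[0]
    y = np.zeros(K)
    best = -np.inf
    while True:
        scores = []
        for i in range(K):
            if y[i] == 1:
                scores.append(-np.inf)  # already assigned
                continue
            y_try = y.copy()
            y_try[i] = 1
            scores.append(pmm_log_prob(x, y_try, theta))
        i_best = int(np.argmax(scores))
        if scores[i_best] <= best:
            return y
        best = scores[i_best]
        y[i_best] = 1
```

The helper is repeated here so the sketch is self-contained; greedy search evaluates at most K candidate topics per step instead of all 2^K − 1 topic sets.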
5 Evaluation
We evaluate the proposed model by the F-measure on a multiple-topic categorization problem.
We use MEDLINE1 as a dataset.
In this experiment, we use five thousand abstracts written in English.
MEDLINE has a metadata set called MeSH Term.
For example, each abstract has MeSH Terms such as RNA Messenger and DNA-Binding Proteins.
MeSH Terms are regarded as multiple topics of an abstract.
In this regard, however, we use only MeSH Terms whose frequencies are medium (100-999).
We did this because the result of the experiment can be overly affected by high-frequency terms that appear in almost every abstract and by low-frequency terms that appear in very few abstracts.
In consequence, the number of topics is 88.
The size of vocabulary is 46,075.
The proportion of documents with multiple topics on the whole dataset is 69.8%, i.e., that of documents with single topic is 30.2%.
The average number of topics per document is 3.4.
Using TreeTagger2, we lemmatize every word.
We eliminate stop words such as articles and be-verbs.
We compare F-measure of PDMM with that of PMM and other models.
The F-measure (F) is defined as follows:

F = 2PR / (P + R),  P = |N_r ∩ N_e| / |N_e|,  R = |N_r ∩ N_e| / |N_r|.

N_r is the set of relevant topics.
N_e is the set of estimated topics.
A higher F-measure indicates a better ability to discriminate topics.
In our experiment, we compute F-measure in each document and average the F-measures throughout the whole document set.
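The per-document F-measure and its average over the document set can be sketched directly from the definition (a sketch; the set representation is our choice):

```python
def f_measure(relevant, estimated):
    """Per-document F-measure from sets of relevant and estimated topics."""
    inter = len(set(relevant) & set(estimated))
    if inter == 0:
        return 0.0
    p = inter / len(set(estimated))   # precision
    r = inter / len(set(relevant))    # recall
    return 2 * p * r / (p + r)

def macro_average_f(pairs):
    """Average the per-document F over (relevant, estimated) pairs."""
    return sum(f_measure(rel, est) for rel, est in pairs) / len(pairs)
```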
We consider several models that differ in how the model parameter θ is learned.
PDMM learns the model parameter θ by the same learning algorithm as PMM.
NBM learns the model parameter θ by the Naive Bayes learning algorithm; the parameters are updated according to the following formula:

θ_iv = M_iv / C.

M_iv is the number of training documents in which word v appears with topic i. C is a normalization term ensuring Σ_{v=1}^{V} θ_iv = 1.
1 http://www.nlm.nih.gov/pubs/factsheets/medline.html
2 http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
The comparison of these models with respect to F-measure is shown in Figure 2.
The horizontal axis is the proportion of the dataset (5,000 abstracts) used as test data.
For example, 2% indicates that 4,900 documents are used for learning the model and 100 documents are used for testing.
The vertical axis is F-measure.
In each proportion, F-measure is an average value computed from five pairs of training documents and test documents randomly generated from dataset.
The F-measure of PDMM is higher than that of the other methods at every proportion, as shown in Figure 2.
Therefore, PDMM is more effective than the other methods for multiple-topic categorization.
Figure 3 shows the comparison of the models with respect to F-measure when changing the proportion of multiple-topic documents in the whole dataset.
The proportions of documents used for learning and testing are 40% and 60%, respectively.
The horizontal axis is the proportion of multiple-topic documents in the whole dataset.
For example, 30% indicates that 30% of the dataset consists of multiple-topic documents while the remaining documents have a single topic; that is, the dataset consists mostly of single-topic documents.
At 30%, there is little difference in F-measure among the models.
As the proportion of multiple-topic documents approaches 90%, the differences in F-measure among the models become apparent.
This result shows that PDMM is effective in modeling multiple-topic documents.
Figure 2: F-measure Results
Figure 3: F-measure Results changing Proportion of Multiple Topic Document for Dataset

In the results of the experiment described in section 5.2, PDMM is more effective than the other models in multiple-topic categorization.
If the topic weightings are averaged over all biases in the whole training set, they could cancel out.
This cancellation can lead to a model parameter θ, learned by PMM, that is reasonable over the whole document set.
Moreover, PDMM computes the probability of generating a document using a mixture of model parameters, estimating the mixture ratios of the topics.
This estimation of the mixture ratios is, we think, the key factor in achieving better results than the other models.
In addition, the estimation of a mixture ratio of topics can be effective from the perspective of extracting features of a document with multiple topics.
A mixture ratio of topics assigned to a document is specific to the document.
Therefore, the estimation of the mixture ratios of topics can be regarded as a projection from the word-frequency space N^V, where N is the set of nonnegative integers, to the mixture-ratio space of topics [0,1]^K for a document.
Since the size of the vocabulary is much larger than the number of topics, the estimation of the mixture ratios of topics can be regarded as a dimension reduction and an extraction of document features.
This can lead to analysis of similarity among documents with multiple topics.
For example, the estimated mixture ratios of the topics [Comparative Study], [Apoptosis] and [Models,Biological] in one MEDLINE abstract are 0.656, 0.176 and 0.168, respectively.
This ratio can be a feature of this document.
Moreover, we can obtain other interesting results as follows.
The estimation of the mixture ratios of topics uses the parameter γ of section 4.3.
We obtain interesting results from the other parameter, φ, which is needed to estimate γ.
φ_ni is specific to a document.
[Table 1: Word List of Document X. Entries include: biomarkers, Fusarium, non-Gaussian, Stachybotrys, Cladosporium, population, response, dampness.]
φ_ni indicates the probability that word w_n belongs to topic i in a document.
Therefore we can compute the entropy of w_n as follows:

entropy(w_n) = − Σ_{i=1}^{K} φ_ni log(φ_ni).

We rank the words in a document by this entropy.
For example, a list of the words in document X in ascending order of entropy is shown in Table 1.
The value in parentheses is the word's rank in descending order of TF-IDF (= tf · log(M/df), where tf is the term frequency in a test document, df is the document frequency, and M is the number of documents in the set of documents used for learning the model parameters) (Y. Yang and J. Pederson, 1997).
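The entropy ranking can be sketched as follows, using the standard negative-sum form of entropy over each word's topic distribution φ (our sketch; names are assumptions):

```python
import numpy as np

def word_entropies(phi):
    """Entropy of each word's topic distribution; phi is (N, K).
    Lower entropy means the word is more topic-specific."""
    p = np.clip(phi, 1e-12, 1.0)   # guard log(0)
    return -(p * np.log(p)).sum(axis=1)

def rank_words(words, phi):
    """Words in ascending order of entropy (most topic-specific first)."""
    ents = word_entropies(np.asarray(phi))
    order = np.argsort(ents)
    return [words[i] for i in order]
```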
The actually assigned topics are [Female], [Male] and [Biological Markers], whose estimated mixture ratios are 0.499, 0.460 and 0.041, respectively.
The top 10 words appear more technical than the bottom 10 words in Table 1.
The lower the entropy of a word, the more topic-specific, i.e., the more technical, the word is.
In addition, this ranking of words depends on the topics assigned to the document.
When we assign randomly chosen topics to the same document, generic terms may be ranked higher.
For example, when we randomly assign the topics [Rats], [Child] and [Incidence], generic terms such as "use" and "relate" are ranked higher, as shown in Table 2.
The estimated mixture ratio of [Rats], [Child] and [Incidence] is 0.411, 0.352 and 0.237, respectively.
For another example, a list of the words in document Y in ascending order of entropy is shown in Table 3.
The actually assigned topics are [Female], [Animals], [Pregnancy] and [Glucose].
The estimated mixture ratios of [Female], [Animals], [Pregnancy] and [Glucose] are 0.442, 0.437, 0.066 and 0.055, respectively.
Next, we consider assigning subsets of the actual topics to the same document Y.

[Table 3 entries include: exposure, distribution, evaluate, versicolor, Aspergillus, correlate, chrysogenum, positive, chartarum, herbarum.]
Table 4 shows a list of the words in document Y when it is assigned the topic subset [Female] and [Animals].
The estimated mixture ratios of [Female] and [Animals] are 0.495 and 0.505, respectively; the estimated mixture ratios of the topics have changed.
It is interesting that [Female] has a higher mixture ratio than [Animals] under the actual topics, but a lower mixture ratio than [Animals] under the subset [Female] and [Animals].
According to these different mixture ratios, the ranking of the words in document Y changes.
Table 5 shows a list of the words in document Y when it is assigned the topic subset [Pregnancy] and [Glucose].
The estimated mixture ratios of [Pregnancy] and [Glucose] are 0.502 and 0.498, respectively.
It is interesting that under the actual topics the rankings of "glucose-insulin" and "IVGTT" are high in document Y, but under the two subsets of the actual topics, "glucose-insulin" and "IVGTT" are not found in the top 10 words.
The important observation from these examples is that this method of ranking the words in a document is associated with the topics assigned to the document.
This is because φ depends on γ, as seen in Eq. (28); that is, the ranking of words depends on the assigned topics, concretely, on the mixture ratios of the assigned topics.
TF-IDF, computed from the whole document set, does not have this property.
Combined with existing keyword-extraction methods, our model has the potential to extract document-specific keywords using the information of assigned topics.
Table 3: Word List of Document Y whose Actual Topics are [Female], [Animals], [Pregnancy] and [Glucose]
[Table entries include: glucose-insulin, indicate.]
Table 4: Word List of Document Y whose Topics are [Female] and [Animals]
[Table entries include: insulin-signaling, euthanasia, undernutrition, conclusion.]
6 Concluding Remarks
We proposed and evaluated a novel probabilistic generative model, PDMM, for dealing with multiple-topic documents.
We evaluated PDMM and other models by comparing F-measures on the MEDLINE corpus.
The results showed that PDMM is more effective than PMM.
Moreover, we indicated the potential of the proposed model to extract document-specific keywords using the information of assigned topics.
Acknowledgement This research was funded in part by MEXT Grant-in-Aid for Scientific Research on Priority Areas "i-explosion" in Japan.
Table 5: Word List of Document Y whose Topics are [Pregnancy] and [Glucose]
[Table entries include: metabolism, requirement, metabolic, intermediary, pregnant, prenatal, nutrition, gestation, nutrient, offspring, singleton.]
References

C.M. Bishop. 2006. Pattern Recognition and Machine Learning (Information Science and Statistics), p.687. Springer-Verlag.

D.M. Blei, A.Y. Ng, and M.I. Jordan. 2001. Latent Dirichlet Allocation. Neural Information Processing Systems 14.

D.M. Blei, A.Y. Ng, and M.I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research, vol.3, pp.993-1022.

T. Minka. 2002. Estimating a Dirichlet distribution. Technical Report.

Y.W. Teh, M.I. Jordan, M.J. Beal, and D.M. Blei. 2003. Hierarchical Dirichlet processes. Technical Report 653, Department of Statistics, UC Berkeley.

N. Ueda and K. Saito. 2002a. Parametric mixture models for multi-topic text. Neural Information Processing Systems 15.

N. Ueda and K. Saito. 2002b. Single-shot detection of multi-category text using parametric mixture models. ACM SIG Knowledge Discovery and Data Mining.

Y. Yang and J. Pederson. 1997. A comparative study on feature selection in text categorization. Proc. International Conference on Machine Learning.
