We develop latent Dirichlet allocation with WordNet (LDAWN), an unsupervised probabilistic topic model that includes word sense as a hidden variable.
We develop a probabilistic posterior inference algorithm for simultaneously disambiguating a corpus and learning the domains in which to consider each word.
Using the WordNet hierarchy, we embed the construction of Abney and Light (1999) in the topic model and show that automatically learned domains improve WSD accuracy compared to alternative contexts.
1 Introduction
Word sense disambiguation (WSD) is the task of determining the meaning of an ambiguous word in its context.
It is an important problem in natural language processing (NLP) because effective WSD can improve systems for tasks such as information retrieval, machine translation, and summarization.
In this paper, we develop latent Dirichlet allocation with WordNet (LDAWN), a generative probabilistic topic model for WSD where the sense of the word is a hidden random variable that is inferred from data.
There are two central advantages to this approach.
First, with LDAWN we automatically learn the context in which a word is disambiguated.
Rather than disambiguating at the sentence-level or the document-level, our model uses the other words that share the same hidden topic across many documents.
Second, LDAWN is a fully-fledged generative model.
Generative models are modular and can be easily combined and composed to form more complicated models.
(As a canonical example, the ubiquitous hidden Markov model is a series of mixture models chained together.)
Thus, developing a generative model for WSD gives other generative NLP algorithms a natural way to take advantage of the hidden senses of words.
In general, topic models are statistical models of text that posit a hidden space of topics in which the corpus is embedded (Blei et al., 2003).
Given a corpus, posterior inference in topic models amounts to automatically discovering the underlying themes that permeate the collection.
Topic models have recently been applied to information retrieval (Wei and Croft, 2006), text classification (Blei et al., 2003), and dialogue segmentation (Purver et al., 2006).
While topic models capture the polysemous use of words, they do not carry the explicit notion of sense that is necessary for WSD.
LDAWN extends the topic modeling framework to include a hidden meaning in the word generation process.
In this case, posterior inference discovers both the topics of the corpus and the meanings assigned to each of its words.
After introducing a disambiguation scheme based on probabilistic walks over the WordNet hierarchy (Section 2), we embed the WordNet-Walk in a topic model, where each topic is associated with walks that prefer different neighborhoods of WordNet (Section 2.1).
Then, we describe a Gibbs sampling algorithm for approximate posterior inference that learns the senses and topics that best explain a corpus (Section 3).
Finally, we evaluate our system on real-world WSD data, discuss the properties of the topics and disambiguation accuracy results, and draw connections to other WSD algorithms from the research literature.
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 1024-1033, Prague, June 2007.
©2007 Association for Computational Linguistics
Figure 1: The possible paths to reach the word "colt" in WordNet.
Dashed lines represent omitted links.
All words in the synset containing "revolver" are shown, but only one word from other synsets is shown.
Edge labels are probabilities of transitioning from synset i to synset j. Note how this favors frequent terms, such as "revolver," over ones like "six-shooter."
2 Topic models and WordNet
K	number of topics
β_{k,s}	multinomial probability vector over the successors of synset s in topic k
S	scalar that, when multiplied by α_s, gives the Dirichlet prior for β_{k,s}
α_s	normalized vector whose ith entry, when multiplied by S, gives the prior probability for going from s to i
θ_d	multinomial probability vector over the topics that generate document d
z_u	assignment of a word to a topic
λ_u	a path assignment through WordNet ending at a word
Table 1: A summary of the notation used in the paper.
Bold vectors correspond to collections of variables (i.e., z_u refers to the topic of a single word, but z_{1:D} are the topic assignments of all words in documents 1 through D).
The WordNet-Walk is a probabilistic process of word generation that is based on the hyponymy relationship in WordNet (Miller, 1990).
WordNet, a lexical resource designed by psychologists and lexicographers to mimic the semantic organization in the human mind, links "synsets" (short for synonym sets) with myriad connections.
The specific relation we are interested in, hyponymy, points from general concepts to more specific ones and is sometimes called the "is-a" relationship.
As first described by Abney and Light (1999), we imagine an agent who starts at the synset [entity], which points to every noun in WordNet 2.1 by some sequence of hyponymy relations, and then chooses the next node in its random walk from the hyponyms of its current position.
The agent repeats this process until it reaches a leaf node, which corresponds to a single word (each of a synset's words is a unique leaf of that synset in our construction).
For an example of all the paths that might generate the word "colt" see Figure 1.
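As a sketch, this walk can be simulated over a toy fragment of the hierarchy. The synsets, words, and probabilities below are invented for illustration, not taken from WordNet:

```python
import random

# Toy hyponymy fragment: each internal node maps to (child, probability)
# pairs. Nodes absent from the map are leaves, i.e., words.
walk_probs = {
    "entity": [("artifact", 0.7), ("animal", 0.3)],
    "artifact": [("firearm", 1.0)],
    "firearm": [("revolver-synset", 1.0)],
    "revolver-synset": [("revolver", 0.6), ("six-shooter", 0.2), ("colt", 0.2)],
    "animal": [("young-mammal", 1.0)],
    "young-mammal": [("foal-synset", 1.0)],
    "foal-synset": [("colt", 0.5), ("foal", 0.5)],
}

def wordnet_walk(node="entity"):
    """Follow hyponymy links from the root until a leaf (word) is reached,
    recording the path of synsets along the way."""
    path = [node]
    while node in walk_probs:
        children, weights = zip(*walk_probs[node])
        node = random.choices(children, weights=weights)[0]
        path.append(node)
    return path  # the last element is the generated word
```

Note that the ambiguous word "colt" can be reached by two different paths; which path generated it is exactly the hidden information the model recovers.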
The WordNet-Walk is parameterized by a set of distributions over children, β_s, one for each synset s in WordNet.
The WordNet-Walk has two important properties.
First, it describes a random process for word generation.
Thus, it is a distribution over words and thus can be integrated into any generative model of text, such as topic models.
Second, the synset that produces each word is a hidden random variable.
Given a word assumed to be generated by a WordNet-Walk, we can use posterior inference to predict which synset produced the word.
These properties allow us to develop LDAWN, a fusion of these WordNet-Walks and latent Dirichlet allocation (LDA) (Blei et al., 2003), a probabilistic model of documents that improves upon pLSI (Hofmann, 1999).
LDA assumes that there are K "topics," multinomial distributions over words, which describe a collection.
Each document exhibits multiple topics, and each word in each document is associated with one of them.
Although the term "topic" evokes a collection of ideas that share a common theme, and although the topics derived by LDA seem to possess semantic coherence, there is no reason to believe this would be true of the most likely multinomial distributions that could have created the corpus under the assumed generative model.
That semantically similar words are likely to occur together is a byproduct of how language is actually used.
In LDAWN, we replace the multinomial topic distributions with a WordNet-Walk, as described above.
LDAWN assumes a corpus is generated by the following process (for an overview of the notation used in this paper, see Table 1).
1. For each topic k = 1, ..., K:
(a) For each synset s, randomly choose transition probabilities β_{k,s} ~ Dir(S α_s).
2. For each document d = 1, ..., D:
(a) Select a topic distribution θ_d ~ Dir(τ).
(b) For each word n = 1, ..., N_d:
i. Select a topic z ~ Mult(θ_d).
ii. Create a path λ_{d,n} starting with λ_0 as the root node:
A. Choose the next node in the walk λ_{i+1} ~ Mult(β_{z,λ_i}).
B. If λ_{i+1} is a leaf node, generate the associated word. Otherwise, repeat.
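A minimal sketch of this generative process, assuming toy data structures (a hyponyms map, per-synset priors α, and hyperparameters S and τ); the names here are illustrative, not the authors' implementation:

```python
import random

def dirichlet(alpha):
    """Sample from a Dirichlet by normalizing independent Gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(hyponyms, alpha, S, tau, K, doc_lengths, root="entity"):
    """Sketch of the LDAWN generative process. hyponyms maps each synset to
    its list of children; alpha maps each synset to its normalized prior
    transition vector; leaves (nodes not in hyponyms) are words."""
    # One WordNet-walk per topic: per-synset transition distributions.
    beta = [{s: dirichlet([S * a for a in alpha[s]]) for s in hyponyms}
            for _ in range(K)]
    corpus = []
    for length in doc_lengths:
        theta = dirichlet([tau] * K)  # document's topic proportions
        doc = []
        for _ in range(length):
            z = random.choices(range(K), weights=theta)[0]
            node = root
            while node in hyponyms:   # walk until a leaf (word) is reached
                node = random.choices(hyponyms[node],
                                      weights=beta[z][node])[0]
            doc.append(node)
        corpus.append(doc)
    return corpus
```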
Every element of this process, including the synsets, is hidden except for the words of the documents.
Thus, given a collection of documents, our goal is to perform posterior inference, which is the task of determining the conditional distribution of the hidden variables given the observations.
In the case of LDAWN, the hidden variables are the parameters of the K WordNet-Walks, the topic assignments of each word in the collection, and the synset path of each word.
In a sense, posterior inference reverses the process described above.
Specifically, given a document collection w_{1:D}, the full posterior is

p(β_{1:K}, θ_{1:D}, z_{1:D}, λ_{1:D} | w_{1:D}) ∝ p(w_{1:D}, λ_{1:D}, z_{1:D}, θ_{1:D}, β_{1:K}),   (1)

where the constant of proportionality is the marginal likelihood of the observed data.
Note that by encoding the synset paths as a hidden variable, we have posed the WSD problem as a question of posterior probabilistic inference.
Further note that we have developed an unsupervised model.
No labeled data is needed to disambiguate a corpus.
Learning the posterior distribution amounts to simultaneously decomposing a corpus into topics and its words into their synsets.
The intuition behind LDAWN is that the words in a topic will have similar meanings and thus share paths within WordNet.
For example, WordNet has two senses for the word "colt;" one referring to a young male horse and the other to a type of handgun (see Figure 1).
Although we have no a priori way of knowing which of the two paths to favor for a document, we assume that similar concepts will also appear in the document.
Documents with unambiguous nouns such as "six-shooter" and "smoothbore" would make paths that pass through the synset [firearm, piece, small-arm] more likely than those going through [animal, animate being, beast, brute, creature, fauna] .
In practice, we hope to see a WordNet-Walk that looks like Figure 2, which points to the right sense of cancer for a medical context.
LDAWN is a Bayesian framework, as each variable has a prior distribution.
In particular, the Dirichlet prior for β_s, specified by a scaling factor S and a normalized vector α_s, fulfills two functions. First, as the overall strength of S increases, we place a greater emphasis on the prior. This is equivalent to the need for balancing noted by Abney and Light (1999).
The other function that the Dirichlet prior serves is to enable us to encode any information we have about how we suspect the transitions to children nodes will be distributed.
For instance, we might expect that the words associated with a synset will be produced in a way roughly similar to the token probability in a corpus.
For example, even though "meal" might refer either to ground cereals or to food eaten at a single sitting, while "repast" refers exclusively to the latter, the synset [meal, repast, food eaten at a single sitting] still prefers to transition to "meal" over "repast" given the overall corpus counts (see Figure 1, which shows prior transition probabilities for "revolver").
By setting α_{s,i}, the prior probability of transitioning from synset s to node i, proportional to the total number of observed tokens in the children of i, we introduce a probabilistic variation on information content (Resnik, 1995).
As in Resnik's definition, this value for non-word nodes is equal to the sum of all the frequencies of hyponym words.
Unlike Resnik, we do not divide frequency among all senses of a word; each sense of a word contributes its full frequency to a.
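This frequency-based prior can be sketched as a bottom-up aggregation of corpus counts over a hypothetical mini-hierarchy; following the text, each sense of a word contributes its full frequency:

```python
def subtree_counts(children, token_counts, node):
    """Total corpus token count in the subtree rooted at node.
    Leaves are words; non-word nodes sum their hyponyms' frequencies."""
    if node not in children:  # leaf: a word
        return token_counts.get(node, 0)
    return sum(subtree_counts(children, token_counts, c)
               for c in children[node])

def prior_alpha(children, token_counts, synset):
    """alpha_{s,i}: prior transition probabilities from synset s,
    proportional to the token counts under each child i."""
    totals = [subtree_counts(children, token_counts, c)
              for c in children[synset]]
    z = sum(totals)
    return [t / z for t in totals]
```

Under this prior, a synset like [meal, repast] would prefer to transition to "meal" if "meal" is far more frequent in the corpus than "repast".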
3 Posterior Inference with Gibbs Sampling
As described above, the problem of WSD corresponds to posterior inference: determining the probability distribution of the hidden variables given observed words and then selecting the synsets of the most likely paths as the correct sense.
Directly computing this posterior distribution, however, is not tractable because of the difficulty of calculating the normalizing constant in Equation 1.
To approximate the posterior, we use Gibbs sampling, which has proven to be a successful approximate inference technique for LDA (Griffiths and Steyvers, 2004).
In Gibbs sampling, like all Markov chain Monte Carlo methods, we repeatedly sample from a Markov chain whose stationary distribution is the posterior of interest (Robert and Casella, 2004).
Even though we don't know the full posterior, the samples can be used to form an empirical estimate of the target distribution.
In LDAWN, the samples contain a configuration of the latent semantic states of the system, revealing the hidden topics and paths that likely led to the observed data.
Gibbs sampling reproduces the posterior distribution by repeatedly sampling each hidden variable conditioned on the current state of the other hidden variables and observations.
More precisely, the state is given by a set of assignments where each word is assigned to a path through one of K WordNet-Walk topics: the uth word w_u has a topic assignment z_u and a path assignment λ_u. We use z_{-u} and λ_{-u} to represent the topic and path assignments of all words except for u, respectively.
Sampling a new topic for the word w_u requires us to consider all of the paths that w_u can take in each topic and the topics of the other words in the document containing u.
The probability of w_u taking on topic i is proportional to

p(z_u = i | z_{-u}) Σ_λ p(λ, w_u | β_i),   (2)

which is the probability of selecting topic i from θ_d times the probability of the ith WordNet-Walk generating w_u along some path.
The first term, the topic probability of the uth word, is based on the assignments to the K topics for words other than u in this document:

p(z_u = i | z_{-u}) = (n^{(d)}_{-u,i} + τ) / (Σ_j n^{(d)}_{-u,j} + Kτ),   (3)

where n^{(d)}_{-u,j} is the number of words other than u assigned to topic j in the document d that u appears in, and τ is the Dirichlet hyperparameter on θ_d.
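This first term can be sketched as a function of the other words' topic assignments in the document (here τ denotes the Dirichlet hyperparameter on θ_d, an assumed notation):

```python
def topic_probability(doc_topics, u, K, tau):
    """p(z_u = i | z_-u) for each topic i: the smoothed proportion of the
    other words in u's document currently assigned to topic i."""
    counts = [0] * K
    for pos, z in enumerate(doc_topics):
        if pos != u:  # exclude the word being resampled
            counts[z] += 1
    denom = sum(counts) + K * tau
    return [(c + tau) / denom for c in counts]
```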
The second term in Equation 2 is a sum over the probabilities of every path that could have generated the word wu.
In practice, this sum can be computed using a dynamic program over all nodes that have a unique parent (i.e., those that cannot be reached by more than one path).
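One way to sketch such a dynamic program is a memoized top-down pass that accumulates the probability of reaching each node; the structures below are illustrative, not the authors' code:

```python
from functools import lru_cache

def word_probability(children, trans, root, word):
    """Sum of the probabilities of all root-to-word paths. children maps
    each internal node to its child nodes; trans[(i, j)] is the transition
    probability from i to j. Works on a DAG via memoized recursion."""
    @lru_cache(maxsize=None)
    def reach(node):
        if node == root:
            return 1.0
        # Probability of reaching node: sum over all of its parents.
        return sum(reach(i) * trans[(i, node)]
                   for i in children if node in children[i])
    return reach(word)
```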
Although the probability of a path is specific to the topic, as the transition probabilities for a synset differ across topics, we omit the topic index in the equations that follow.
3.1 Transition Probabilities
Computing the probability of a path requires us to take a product over our estimates of the probability of transitioning from i to j for all nodes i and j in the path λ. The other path assignments within this topic, however, play an important role in shaping the transition probabilities.
From the perspective of a single node i, only paths that pass through that node affect the probability of u also passing through that node.
It is convenient to have an explicit count of all of the paths that transition from i to j in this topic's WordNet-Walk, so we use T_{-u,i→j} to represent the number of paths that go from i to j in this topic, excluding the path currently assigned to u. Given the assignment of all other words to paths, calculating the probability of transitioning from i to j for word u requires us to consider both the prior α and the observed counts T_{-u,i→j} in our estimate of the expected probability of transitioning from i to j:

E[p(i → j)] = (T_{-u,i→j} + S α_{i,j}) / (Σ_k T_{-u,i→k} + S).   (4)
As mentioned in Section 2.1, we parameterize the prior for synset i as a vector α_i, which sums to one, and a scale parameter S.
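This smoothed estimate can be sketched directly from the counts and the prior (the function and argument names are illustrative):

```python
def transition_probability(path_counts, alpha_i, S, j):
    """Expected probability of moving from synset i to its jth child:
    (T_{i->j} + S * alpha_{i,j}) / (sum_k T_{i->k} + S), where
    path_counts[k] counts the other paths transitioning from i to child k."""
    return (path_counts[j] + S * alpha_i[j]) / (sum(path_counts) + S)
```

With no observed paths the estimate falls back to the prior α_i, and as counts accumulate they dominate the prior, matching the role of S described above.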
The next step, once we've selected a topic, is to select a path within that topic.
This requires the computation of the path probabilities as specified in Equation 4 for all of the paths wu can take in the sampled topic and then sampling from the path probabilities.
The Gibbs sampler is essentially a randomized hill climbing algorithm on the posterior likelihood as a function of the configuration of hidden variables.
The numerator of Equation 1 is proportional to that posterior and thus allows us to track the sampler's progress.
We assess convergence to a local mode of the posterior by monitoring this quantity.
4 Experiments
In this section, we describe the properties of the topics induced by running the previously described Gibbs sampling method on corpora and how these topics improve WSD accuracy.
Of the two data sets used during the course of our evaluation, the primary dataset was SemCor (Miller et al., 1993), which is a subset of the Brown corpus with many nouns manually labeled with the correct WordNet sense.
The words in this dataset are lemmatized, and multi-word expressions that are present in WordNet are identified.
Only the words in SemCor were used in the Gibbs sampling procedure; the synset assignments were only used for assessing the accuracy of the final predictions.
We also used the British National Corpus, which is not lemmatized and which does not have multiword expressions.
The text was first run through a lemmatizer, and then sequences of words which matched a multi-word expression in WordNet were joined together into a single word.
We took nouns that appeared in SemCor twice or in the BNC at least 25 times and used the BNC to compute the information-content analog α for individual nouns (for example, the probabilities in Figure 1 correspond to α).
Like the topics created by models such as LDA, the topics in Table 2 coalesce around reasonable themes.
The word list was compiled by summing over all of the possible leaves that could have generated each of the words and sorting the words by decreasing probability.
In the vast majority of cases, a single synset's high probability is responsible for the words' positions on the list.
Reassuringly, many of the top senses for the present words correspond to the most frequent sense in SemCor.
For example, in Topic 4, the senses for "space" and "function" correspond to the top senses in SemCor, and while the top sense for "set" corresponds to "an abstract collection of numbers or symbols" rather than "a group of the same kind that belong together and are so used," it makes sense given the math-based words in the topic.
"Point," however, corresponds to the sense used in the phrase "I got to the point of boiling the water," which is neither the top S emCor sense nor a sense which makes sense given the other words in the topic.
While the topics presented in Table 2 resemble the topics one would obtain through models like LDA (Blei et al., 2003), they are not identical.
Because of the lengthy process of Gibbs sampling, we initially thought that using LDA assignments as an initial state would converge faster than a random initial assignment.
While this was the case, it converged to a state that was less probable than the randomly initialized state and no better at sense disambiguation (and sometimes worse).
The topics presented in Table 2 represent words that both co-occur in a corpus and co-occur on paths through WordNet.
Because topics created through LDA only have the first property, they usually do worse in terms of both total probability and disambiguation accuracy (see Figure 3).
Another interesting property of topics in LDAWN is that, with higher levels of smoothing, words that don't appear in a corpus (or appear rarely) but are in similar parts of WordNet might have relatively high probability in a topic.
For example, "maturity" in topic two in Table 2 is sandwiched between "foot" and "center," both of which occur about five times more than "maturity."
This might improve LDA-based information retrieval schemes (Wei and Croft, 2006).
Figure 2: The possible paths to reach the word "cancer" in WordNet along with transition probabilities from the medically-themed Topic 2 in Table 2, with the most probable path highlighted.
The dashed lines represent multiple links that have been consolidated, and synsets are represented by their offsets within WordNet 2.1.
Some words for immediate hypernyms have also been included to give context.
In all other topics, the person, animal, or constellation senses were preferred.
[Table 2: president, material, treatment, election, function, administration, official, requirement, polynomial, audience, yesterday, operator, component, production, maturity, communication, direction, petitioner, movement, interest, relationship]
Table 2: The most probable words from six randomly chosen WordNet-walks from a thirty-two topic model trained on the words in SemCor.
These are summed over all of the possible synsets that generate the words.
However, the vast majority of the contributions come from a single synset.
Figure 3: Topics seeded with LDA initially have a higher disambiguation accuracy, but are quickly matched by unseeded topics.
The probability for the seeded topics starts lower and remains lower.
Because the Dirichlet smoothing factor in part determines the topics, it also affects the disambiguation.
Figure 4 shows the modal disambiguation accuracy achieved for each setting of S ∈ {0.1, 1, 5, 10, 15, 20}.
Each line is one setting of K and each point on the line is a setting of S. Each data point is a run of the Gibbs sampler for 10,000 iterations.
The disambiguation, taken at the mode, improved with moderate settings of S, which suggests that the data are still sparse for many of the walks, although the improvement vanishes if S dominates with much larger values.
This makes sense: each walk has over 100,000 parameters, there are fewer than 100,000 words in SemCor, and each word only serves as evidence for at most 19 parameters (the length of the longest path in WordNet).
Figure 4: Each line represents experiments with a set number of topics and variable amounts of smoothing on the SemCor corpus. The random baseline is at the bottom of the graph, and adding topics improves accuracy. As smoothing increases, the prior (based on token frequency) becomes stronger. Accuracy is the percentage of correctly disambiguated polysemous words in SemCor at the mode.
Generally, a greater number of topics increased the accuracy of the mode, but after around sixteen topics, gains became much smaller.
The effect of S is also related to the number of topics: a value of S that overwhelms the observed data for a very large number of topics might be the perfect balance for a smaller number of topics.
For comparison, the method of using a WordNet-Walk applied to smaller contexts such as sentences or documents achieves an accuracy of between 26% and 30%, depending on the level of smoothing.
5 Error Analysis
This method works well in cases where the delineation can be readily determined from the overall topic of the document.
Words such as "kid," "may," "shear," "coach," "incident," "fence," "bee," and (previously used as an example) "colt" were all perfectly disambiguated by this method.
Figure 2 shows the WordNet-Walk corresponding to a medical topic that correctly disambiguates "cancer."
Problems arose, however, with highly frequent words, such as "man" and "time," that have many senses and can occur in many types of documents.
For example, "man" can be associated with many possible meanings: island, game equipment, servant, husband, a specific mammal, etc.
Although we know that the "adult male" sense should be preferred, the alternative meanings will also be likely if they can be assigned to a topic that shares common paths in WordNet; the documents contain many other places, jobs, and animals that offer reasonable explanations (to LDAWN) of how "man" was generated.
Unfortunately, "man" is such a ubiquitous term that topics, which are derived from the frequency of words within an entire document, are ultimately uninformative about its usage.
While mistakes on these highly frequent terms significantly hurt our accuracy, errors associated with less frequent terms reveal that WordNet's structure is not easily transformed into a probabilistic graph.
For instance, there are two senses of the word "quarterback," a player in American football: one is the position itself and the other is a person playing that position. While one would expect the two to co-occur in sentences such as "quarterback is an easy position, so our quarterback is happy," the paths to the two senses share only the root node, making it highly unlikely that a topic would cover both.
Because of WordNet's breadth, rare senses also impact disambiguation.
For example, the metonymical use of "door" to represent a whole building as in the phrase "girl next door" is under the same parent as sixty other synsets containing "bridge," "balcony," "body," "arch," "floor," and "corner."
Surrounded by such common terms that are also likely to co-occur with the more conventional meanings of door, this very rare sense becomes the preferred disambiguation of "door."
6 Related Work
Abney and Light's initial probabilistic WSD approach (1999) was further developed into a Bayesian network model by Ciaramita and Johnson (2000), who likewise used the appearance of monosemous terms close to ambiguous ones to "explain away" the usage of ambiguous terms in selectional restrictions.
We have adapted these approaches and put them into
the context of a topic model.
Recently, other approaches have created ad hoc connections between synsets in WordNet and then considered walks through the newly created graph.
Given the difficulties of using existing connections in WordNet, Mihalcea (2005) proposed creating links between adjacent synsets that might comprise a sentence, initially setting weights to be equal to the Lesk overlap between the pairs, and then using the PageRank algorithm to determine the stationary distribution over synsets.
Yarowsky was one of the first to contend that there is "one sense per discourse" (1992).
This has led to approaches like that of Magnini et al. (2001), which attempt to find the category of a text, select the most appropriate synset, and then assign the selected sense using domain annotations attached to WordNet.
LDAWN is different in that the categories are not an a priori concept that must be painstakingly annotated within WordNet; it requires no augmentation of WordNet.
This technique could indeed be used with any hierarchy.
Our concepts are the ones that best partition the space of documents and do the best job of describing the distinctions of diction that separate documents from different domains.
6.2 Similarity Measures
Our approach gives a probabilistic method of using information content (Resnik, 1995) as a starting point that can be adjusted to cluster words in a given topic together; this is similar to the Jiang-Conrath similarity measure (1997), which has been used in many applications in addition to disambiguation.
Patwardhan (2003) offers a broad evaluation of similarity measures for WSD.
Our technique for combining the cues of topics and distance in WordNet is adjusted in a way similar in spirit to Buitelaar and Sacaleanu (2001), but we consider the appearance of a single term to be evidence not just for that sense and its immediate neighbors in the hyponymy tree but for all of the sense's children and ancestors.
Like McCarthy (2004), our unsupervised system acquires a single predominant sense for a domain based on a synthesis of information derived from a textual corpus, topics, and WordNet-derived similarity, a probabilistic information content measure.
By adding syntactic information from a thesaurus derived from syntactic features (taken from Lin's automatically generated thesaurus (1998)), McCarthy achieved 48% accuracy in a similar evaluation on SemCor; LDAWN is thus substantially less effective in disambiguation compared to state-of-the-art methods.
This suggests, however, that other methods might be improved by adding topics and that our method might be improved by using more information than word counts.
7 Conclusion and Future Work
The LDAWN model presented here makes two contributions to research in automatic word sense disambiguation.
First, we demonstrate a method for automatically partitioning a document into topics that includes explicit semantic information.
Second, we show that, at least for one simple model of WSD, embedding a document in probabilistic latent structure, i.e., a "topic," can improve WSD.
There are two avenues of research with LDAWN that we will explore.
First, the statistical nature of this approach allows LDAWN to be used as a component in larger models for other language tasks.
Other probabilistic models of language could insert the ability to query synsets or paths of WordNet.
Similarly, any topic-based information retrieval scheme could employ topics that include semantically relevant (but perhaps unobserved) terms.
Incorporating this model in a larger syntactically-aware model, which could benefit from the local context as well as the document level context, is an important component of future research.
Second, the results presented here show a marked improvement in accuracy as more topics are added to the baseline model, although the final result is not comparable to state-of-the-art techniques.
As most errors were attributable to the hyponymy structure of WordNet, incorporating the novel use of topic modeling presented here with a more mature unsupervised WSD algorithm to replace the underlying WordNet-Walk could lead to advances in state-of-the-art unsupervised WSD accuracy.
