This paper presents a novel approach for exploiting the global context for the task of word sense disambiguation (WSD).
This is done by using topic features constructed using the latent Dirichlet allocation (LDA) algorithm on unlabeled data.
The features are incorporated into a modified naive Bayes network alongside other features such as part-of-speech of neighboring words, single words in the surrounding context, local collocations, and syntactic patterns.
In both the English all-words task and the English lexical sample task, the method achieved significant improvement over the simple naive Bayes classifier and higher accuracy than the best official scores on Senseval-3 for both tasks.
1 Introduction
Natural language tends to be ambiguous.
A word often has more than one meaning depending on the context.
Word sense disambiguation (WSD) is a natural language processing (NLP) task in which the correct meaning (sense) of a word in a given context is to be determined.
The supervised corpus-based approach has been the most successful in WSD to date.
In such an approach, a corpus in which ambiguous words have been annotated with correct senses is first collected.
Knowledge sources, or features, from the context of the annotated word are extracted to form the training data.
A learning algorithm, like the support vector machine (SVM) or naive Bayes, is then applied to the training data to learn a model.
Finally, in testing, the learnt model is applied to the test data to assign the correct sense to each ambiguous word.
The features used in these systems usually include local features, such as part-of-speech (POS) of neighboring words, local collocations, and syntactic patterns, and global features such as single words in the surrounding context (bag-of-words) (Lee and Ng, 2002).
However, due to the data scarcity problem, these features are usually very sparse in the training data.
There are, on average, 11 and 28 training cases per sense in the Senseval-2 and Senseval-3 lexical sample tasks respectively, and 6.5 training cases per sense in the SemCor corpus.
This problem is especially prominent for the bag-of-words feature; hundreds of bag-of-words features are usually extracted for each training instance, and each feature could be drawn from any English word.
A direct consequence is that the global context information, which the bag-of-words feature is supposed to capture, may be poorly represented.
Our approach tries to address this problem by clustering features to relieve the scarcity problem, specifically on the bag-of-words feature.
In the process, we construct topic features, trained using the latent Dirichlet allocation (LDA) algorithm.
We train the topic model (Blei et al., 2003) on unlabeled data, clustering the words occurring in the corpus to a predefined number of topics.
We then use the resulting topic model to tag the bag-of-words in the labeled corpus with topic distributions.
We incorporate the distributions, called the topic features, using a simple Bayesian network, modified from the naive Bayes model, alongside other features, and train the model on the labeled corpus.

Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 1015-1023, Prague, June 2007.
©2007 Association for Computational Linguistics
The approach gives good performance on both the lexical sample and all-words tasks on Senseval data.
This paper makes two main contributions.
First, we show that a feature that efficiently captures global context information using the LDA algorithm can significantly improve WSD accuracy.
Second, we obtain this feature from unlabeled data, which spares us from any manual labeling work.
We also showcase the potential strength of Bayesian networks in the WSD task, obtaining performance that rivals state-of-the-art methods.
2 Related Work
Many WSD systems try to tackle the data scarcity problem.
Unsupervised learning is introduced primarily to deal with the problem, but with limited success (Snyder and Palmer, 2004).
In another approach, the learning algorithm borrows training instances from other senses, effectively increasing the training data size.
In (Kohomban and Lee, 2005), the classifier is trained for verbs and nouns using senses grouped according to WordNet top-level synsets, thus effectively pooling training cases across senses within the same synset.
Similarly, (Ando, 2006) exploits data from related tasks, using all labeled examples irrespective of target words for learning each sense using the Alternating Structure Optimization (ASO) algorithm (Ando and Zhang, 2005a; Ando and Zhang, 2005b).
Parallel texts are proposed in (Resnik and Yarowsky, 1997) as potential training data, and (Chan and Ng, 2005) have shown that using automatically gathered parallel texts for nouns can significantly increase WSD accuracy when tested on the Senseval-2 English all-words task.
Our approach is somewhat similar to that of using generic language features such as POS tags; the words are tagged with semantic topics that may be trained from other corpora.
3 Feature Construction
We first present the latent dirichlet allocation algorithm and its inference procedures, adapted from the original paper (Blei et al., 2003).
3.1 Latent Dirichlet Allocation
LDA is a probabilistic model for collections of discrete data and has been used in document modeling and text classification.
It can be represented as a three level hierarchical Bayesian model, shown graphically in Figure 1.
Given a corpus consisting of M documents, LDA models each document using a mixture over K topics, which are in turn characterized as distributions over words.
Figure 1: Graphical Model for LDA
In the generative process of LDA, for each document d we first draw the mixing proportion over topics θ_d from a Dirichlet prior with parameter α. Next, for each of the N_d words w_dn in document d, a topic z_dn is drawn from a multinomial distribution with parameters θ_d.
Finally wdn is drawn from the topic specific distribution over words.
The probability of a word token w taking on value i given that topic z = j was chosen is parameterized using a matrix β with β_ij = p(w = i | z = j).
Integrating out the θ_d's and z_dn's, the probability p(D | α, β) of the corpus is thus:

p(D | α, β) = ∏_{d=1}^{M} ∫ p(θ_d | α) ( ∏_{n=1}^{N_d} Σ_{z_dn} p(z_dn | θ_d) p(w_dn | z_dn, β) ) dθ_d
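As an illustration, this generative process can be sketched as follows. This is a toy example with an invented vocabulary, topic count, and Dirichlet parameter, not the authors' code:

```python
import random

random.seed(0)

def sample_dirichlet(alpha):
    """Draw a probability vector from Dirichlet(alpha) via normalized Gamma samples."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_discrete(probs):
    """Draw an index from a discrete distribution given by probs."""
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

K = 3                       # number of topics (invented)
vocab = ["bank", "money", "river", "water", "loan"]   # invented vocabulary
alpha = [0.5] * K           # Dirichlet prior over topic proportions
# beta[j][i] = p(w = i | z = j): each row is a topic-specific word distribution
beta = [sample_dirichlet([1.0] * len(vocab)) for _ in range(K)]

def generate_document(n_words):
    theta = sample_dirichlet(alpha)          # topic proportions theta_d
    words = []
    for _ in range(n_words):
        z = sample_discrete(theta)           # topic z_dn ~ Multinomial(theta_d)
        w = sample_discrete(beta[z])         # word w_dn ~ Multinomial(beta_z)
        words.append(vocab[w])
    return words

doc = generate_document(8)
```

Each document thus mixes the K topics in its own proportions, while the topics themselves are shared across the corpus.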
Unfortunately, it is intractable to directly solve for the posterior distribution of the hidden variables given a document, namely p(θ, z | w, α, β).
However, (Blei et al., 2003) has shown that by introducing a set of variational parameters, γ and φ, a tight lower bound on the log likelihood can be found using the following optimization:

(γ*, φ*) = argmin_{γ, φ} D( q(θ, z | γ, φ) || p(θ, z | w, α, β) )
Here γ is the Dirichlet parameter for θ and the multinomial parameters (φ_1, ..., φ_N) are the free variational parameters.
Note that γ is document specific, unlike the corpus-level parameter α. Graphically, the variational distribution is represented in Figure 2.
The optimizing values of γ and φ can be found by minimizing the Kullback-Leibler (KL) divergence between the variational distribution and the true posterior.
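A minimal sketch of the resulting coordinate-ascent updates from Blei et al. (2003), φ_ni ∝ β_{i,w_n} exp(Ψ(γ_i)) and γ_i = α + Σ_n φ_ni, on a toy two-topic model (all numbers are invented; the digamma approximation is a standard asymptotic series, not part of the paper):

```python
import math

def digamma(x):
    """Asymptotic-series approximation of the digamma function Psi(x)."""
    result = 0.0
    while x < 6.0:              # recurrence Psi(x) = Psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    r = 1.0 / x
    result += math.log(x) - 0.5 * r
    r *= r
    result -= r * (1.0 / 12.0 - r * (1.0 / 120.0 - r / 252.0))
    return result

def variational_inference(doc, alpha, beta, iters=50):
    """Coordinate-ascent variational updates for a single document.

    doc   : list of word ids w_1..w_N
    alpha : scalar symmetric Dirichlet parameter
    beta  : beta[j][i] = p(w = i | z = j)
    Returns (gamma, phi): the document-specific Dirichlet parameters and
    the per-word topic responsibilities.
    """
    K, N = len(beta), len(doc)
    phi = [[1.0 / K] * K for _ in range(N)]
    gamma = [alpha + N / K] * K
    for _ in range(iters):
        for n, w in enumerate(doc):
            # phi_ni proportional to beta_{i, w_n} * exp(Psi(gamma_i))
            raw = [beta[i][w] * math.exp(digamma(gamma[i])) for i in range(K)]
            norm = sum(raw)
            phi[n] = [x / norm for x in raw]
        # gamma_i = alpha + sum_n phi_ni
        gamma = [alpha + sum(phi[n][i] for n in range(N)) for i in range(K)]
    return gamma, phi

# Toy run: 2 topics over a 3-word vocabulary (numbers invented).
beta = [[0.7, 0.2, 0.1],
        [0.1, 0.2, 0.7]]
gamma, phi = variational_inference([0, 0, 2], alpha=0.5, beta=beta)
```

Since two of the three words favor the first topic, γ ends up skewed toward it, which is exactly the document-level topic information exploited later.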
Figure 2: Graphical Model for Variational Inference

3.2 Baseline Features
For both the lexical sample and all-words tasks, we use the following standard baseline features for comparison.
POS Tags For each training or testing word, w, we include POS tags for P words prior to as well as after w within the same sentence boundary.
We also include the POS tag of w itself. If there are fewer than P words prior to or after w in the same sentence, we denote the corresponding feature as NIL.
Local Collocations Collocation C_i,j refers to the ordered sequence of tokens (words or punctuation) surrounding w. The starting and ending positions of the sequence are denoted i and j respectively, where a negative value refers to a token position prior to w. We adopt the same 11 collocation features as (Lee and Ng, 2002), namely C-1,-1, C1,1, C-2,-2, C2,2, C-2,-1, C-1,1, C1,2, C-3,-1, C-2,1, C-1,2 and C1,3.
Bag-of-Words For each training or testing word, w, we get G words prior to as well as after w, within the same document.
These features are position insensitive.
The words we extract are converted back to their morphological root forms.
Syntactic Relations We adopt the same syntactic relations as (Lee and Ng, 2002).
For easy reference, we summarize the features into Table 1.
Table 1: Syntactic Relations Features. For each POS of w (noun, verb or adjective), the features include the parent headword h, the POS of h and the relative position of h to w.
The exact values of P and G for each task are set according to cross-validation results.
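As an illustration, extracting the POS-window and collocation features might look like the sketch below. The sentence, tag set, and helper names are invented, and we assume the text is already tokenized and POS-tagged:

```python
def pos_window_features(tags, idx, P):
    """POS tags of the P words before and after position idx, plus the
    target's own tag; positions outside the sentence are marked NIL."""
    feats = {}
    for offset in range(-P, P + 1):
        j = idx + offset
        feats[f"POS_{offset}"] = tags[j] if 0 <= j < len(tags) else "NIL"
    return feats

def collocation(tokens, idx, i, j):
    """Ordered token sequence from relative position i to j around idx,
    excluding the target word itself (the C_i,j features in the text)."""
    span = []
    for offset in range(i, j + 1):
        if offset == 0:
            continue
        k = idx + offset
        span.append(tokens[k] if 0 <= k < len(tokens) else "NIL")
    return "_".join(span)

# Hypothetical tagged sentence with the target word "interest" at index 3.
tokens = ["he", "pays", "the", "interest", "on", "his", "loan"]
tags   = ["PRP", "VBZ", "DT", "NN", "IN", "PRP$", "NN"]
pos_feats = pos_window_features(tags, 3, P=3)
c_11 = collocation(tokens, 3, 1, 1)       # C_1,1: the word right after w
c_m2m1 = collocation(tokens, 3, -2, -1)   # C_-2,-1: the two words before w
```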
3.3 Topic Features

We first select an unlabeled corpus, such as 20 Newsgroups, and extract individual words from it (excluding stopwords).
We choose the number of topics, K, for the unlabeled corpus and apply the LDA algorithm to obtain the β parameters, where β_ij represents the probability of a word w_i given a topic z_j, p(w_i | z_j) = β_ij.
The model essentially clusters words that occurred in the unlabeled corpus according to K topics.
The conditional probability p(w_i | z_j) = β_ij is later used to tag the words in an unseen test example with the probability of each topic.
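As an illustration, tagging a word with a topic distribution via Bayes' rule might look like the following sketch. The uniform prior and all numbers are assumptions; Section 4 uses the document-specific γ-based prior of equation [1] instead:

```python
def topic_distribution_for_word(word_id, beta, prior=None):
    """p(z = j | w) by Bayes' rule from beta[j][word_id] = p(w | z = j).

    `prior` defaults to a uniform distribution over topics; a
    document-specific prior (equation [1]) can be passed in instead.
    """
    K = len(beta)
    if prior is None:
        prior = [1.0 / K] * K
    raw = [beta[j][word_id] * prior[j] for j in range(K)]
    norm = sum(raw)
    return [x / norm for x in raw]

# Toy beta over a 3-word vocabulary and 2 topics (numbers invented).
beta = [[0.7, 0.2, 0.1],
        [0.1, 0.2, 0.7]]
dist = topic_distribution_for_word(0, beta)
```

Word 0 is much more likely under topic 0 (0.7 vs. 0.1), so its topic tag concentrates there.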
For some variants of the classifiers that we construct, we also use the γ parameter, which is document specific.
For these classifiers, we may need to run the inference algorithm on the labeled corpus and possibly on the test documents.
The γ parameter provides an approximation to the probability of selecting topic i in the document:

p(z_i | γ) ≈ γ_i / Σ_{k=1}^{K} γ_k    [1]

4 Classifier Construction
We construct a variant of the naive Bayes network, as shown in Figure 3.
Here, w refers to the word and s refers to the sense of the word.
In training, s is observed while in testing, it is not.
The features f_1 to f_n are the baseline features mentioned in Section 3.2 (including bag-of-words), while z refers to the latent topic that we use for clustering the unlabeled corpus.
The bag-of-words b are extracted from the neighbours of w and there are L of them.
Note that L can be different from G, which is the number of bag-of-words in baseline features.
Both will be determined by the validation result.
Figure 3: Graphical Model with LDA feature
The log p(w) term is constant and thus can be ignored.
The first portion is the normal naive Bayes model; the second portion represents the additional LDA plate.
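The decomposition into a naive Bayes portion and an LDA plate can be sketched in code. The probability tables below are invented toy numbers, and the scoring function is our illustration of the model rather than the authors' implementation:

```python
import math

def score_sense(s, features, bags, K,
                p_s_given_w, p_f_given_s, p_b_given_z, p_z_given_s):
    """log p(s|w) + sum_i log p(f_i|s) + sum_j log sum_z p(b_j|z) p(z|s).

    The first two terms are the naive Bayes portion; the last term is the
    LDA plate, with the latent topic z summed out for each bag-of-word b_j.
    """
    total = math.log(p_s_given_w[s])
    for f in features:
        total += math.log(p_f_given_s[(f, s)])
    for b in bags:
        total += math.log(sum(p_b_given_z[(b, z)] * p_z_given_s[(z, s)]
                              for z in range(K)))
    return total

# Toy tables for a word with two senses and K = 2 topics (all invented).
p_s_given_w = {"s1": 0.6, "s2": 0.4}
p_f_given_s = {("f", "s1"): 0.5, ("f", "s2"): 0.5}
p_b_given_z = {("b", 0): 0.8, ("b", 1): 0.1}
p_z_given_s = {(0, "s1"): 0.9, (1, "s1"): 0.1,
               (0, "s2"): 0.2, (1, "s2"): 0.8}
best = max(["s1", "s2"],
           key=lambda s: score_sense(s, ["f"], ["b"], 2,
                                     p_s_given_w, p_f_given_s,
                                     p_b_given_z, p_z_given_s))
```

Here "b" co-occurs mostly with topic 0, which in turn is strongly associated with sense s1, so s1 wins.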
4.1 Bayesian Network Approach

We decouple the training process into three separate stages.
We first extract baseline features from the task training data, and estimate, using normal naive Bayes, p(s|w) and p(f |s) for all w, s and f. The parameters associated with p(b|z) are estimated using LDA from unlabeled data.
Finally we estimate the parameters associated with p(z|s).
We experimented with three different ways of doing the estimation and using the resulting model, and chose the one which performed best empirically.
4.1.1 Expectation Maximization Approach
For p(z|s), a reasonable estimation method is to use maximum likelihood estimation.
This can be done using the expectation maximization (EM) algorithm.
In classification, we simply choose the sense s* that maximizes the log-likelihood of the test instance:

s* = argmax_s [ log p(s|w) + Σ_i log p(f_i|s) + Σ_j log Σ_z p(b_j|z) p(z|s) ]
In this approach, γ is never used, which means the LDA inference procedure is not applied to any labeled data at all.
4.1.2 Soft Tagging Approach

Classification in this approach is done using the full Bayesian network, just as in the EM approach.
However we do the estimation of p(z|s) differently.
Essentially, we perform LDA inference on the training corpus in order to obtain γ for each document.
We then use the γ and β parameters to obtain p(z|b) for each word using

p(z_i | b) = p(b | z_i) p(z_i | γ) / Σ_j p(b | z_j) p(z_j | γ)

where equation [1] is used for the estimation of p(z_i | γ).
This effectively transforms b into a topic distribution, which we call a soft tag: a probability distribution (t_1, ..., t_K) over topics.
We then use this topical distribution for estimating p(z|s).
Let s_i be the observed sense of instance i and (t^i_j1, ..., t^i_jK) be the soft tag of the j-th bag-of-words feature of instance i. We estimate p(z|s) as

p(z = k | s) = ( Σ_{i: s_i = s} Σ_j t^i_jk + δ ) / ( Σ_{i: s_i = s} Σ_j Σ_k' t^i_jk' + Kδ )    [2]

where δ is a smoothing parameter.
This approach requires us to do LDA inference on the corpus formed by the labeled training data, but not on the testing data.
This is because we need γ to obtain the transformed topic distribution in order to learn p(z|s) during training.
In the testing, we only apply the learnt parameters to the model.
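A sketch of the soft tagging computation follows. The β and γ values are toy numbers, and the p(z|s) estimate is one plausible reading of equation [2], with additive smoothing δ:

```python
def soft_tag(word_id, beta, gamma):
    """p(z_i | b) proportional to p(b | z_i) * p(z_i | gamma), where
    p(z_i | gamma) is approximated by gamma_i / sum_k gamma_k (equation [1])."""
    prior_norm = sum(gamma)
    raw = [beta[i][word_id] * gamma[i] / prior_norm for i in range(len(beta))]
    norm = sum(raw)
    return [x / norm for x in raw]

def estimate_p_z_given_s(instances, K, delta=2.0):
    """Accumulate soft tags per sense to estimate p(z|s), with additive
    smoothing delta (Section 5 reports using the value 2)."""
    counts = {}
    for sense, tags in instances:
        acc = counts.setdefault(sense, [0.0] * K)
        for t in tags:
            for k in range(K):
                acc[k] += t[k]
    return {s: [(c + delta) / (sum(acc) + K * delta) for c in acc]
            for s, acc in counts.items()}

# Toy values (invented): 2 topics, document-specific gamma from LDA inference.
beta = [[0.7, 0.2, 0.1],
        [0.1, 0.2, 0.7]]
gamma = [3.0, 1.0]
tag = soft_tag(0, beta, gamma)               # soft tag (t_1, ..., t_K)
p_z = estimate_p_z_given_s([("s1", [tag])], K=2)
```

The document-level prior from γ sharpens the word's topic tag beyond what β alone would give.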
4.1.3 Hard Tagging Approach

The Hard Tagging approach no longer assumes that z is latent.
After p(z|b) is obtained using the same procedure as in Section 4.1.2, the topic z_j with the highest p(z_j|b) among all K topics is picked to represent z. In this way, b is transformed into the single most "prominent" topic.
This topic label is used in the same way as baseline features for both training and testing in a simple naive Bayes model.
This approach requires us to perform the transformation both on the training as well as testing data, since z becomes an observed variable.
LDA inference is done on two corpora, one formed by the training data and the other by the testing data, in order to obtain the respective values of γ.
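The hard tagging step itself reduces to an argmax over the soft tag; a minimal sketch (the example distribution is invented):

```python
def hard_tag(soft_tag):
    """Replace a topic distribution by its single most probable topic."""
    return max(range(len(soft_tag)), key=lambda i: soft_tag[i])

label = hard_tag([0.1, 0.7, 0.2])   # the "prominent" topic index
```

This discards everything but the mode of the distribution, which is the information loss discussed in Section 5.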
4.2 Support Vector Machine Approach
In the SVM (Vapnik, 1995) approach, we first form a training and a testing file using all standard features for each sense following (Lee and Ng, 2002) (one classifier per sense).
To incorporate LDA feature, we use the same approach as Section 4.1.2 to transform b into soft tags, p(z|b).
As SVM deals with only observed features, we need to transform b both in the training data and in the testing data.
Compared to (Lee and Ng, 2002), the only difference is that for each training and testing case, we have L * K additional LDA features, since there are L bag-of-words and each has a topic distribution represented by K values.
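Appending the topic features to an SVM feature vector can be sketched as follows. The feature values are invented, and in practice the vectors would be written out in SVMlight format:

```python
def augment_with_topic_features(base_features, soft_tags):
    """Append the soft tags of the L bag-of-words (each a K-dimensional
    topic distribution) to a base feature vector, yielding the extra
    L * K LDA features."""
    vec = list(base_features)
    for tag in soft_tags:
        vec.extend(tag)
    return vec

# Toy instance (invented): 4 binary base features, L = 2 bag-of-words, K = 3.
x = augment_with_topic_features([1, 0, 0, 1],
                                [[0.6, 0.3, 0.1], [0.2, 0.2, 0.6]])
```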
5 Experimental Setup
We describe here the experimental setup on the English lexical sample task and all-words task.
We use MXPOST tagger (Adwait, 1996) for POS tagging, Charniak parser (Charniak, 2000) for extracting syntactic relations, SVMlight1 for SVM classifier and David Blei's version of LDA2 for LDA training and inference.
All default parameters are used unless mentioned otherwise.
For all standard
baseline features, we use Laplace smoothing but for the soft tag (equation [2]), we use a smoothing parameter value of 2.
We use the Senseval-2 lexical sample task for preliminary investigation of different algorithms, datasets and other parameters.
As the dataset is used extensively for this purpose, only the Senseval-3 lexical sample task is used for evaluation.
Selecting Bayesian Network The best achievable results, using the three different Bayesian network approaches, when validating on the Senseval-2 test data are shown in Table 2.
The parameters used are P = 3 and G = 3.
Table 2: Results on the Senseval-2 English lexical sample task using the different Bayesian network approaches (EM, Hard Tagging, Soft Tagging).
From the results, it appears that both the EM and the Hard Tagging approaches did not yield as good results as the Soft Tagging approach did.
The EM approach ignores the LDA inference result, γ, which we use to obtain the topic prior.
This information is document specific and can be regarded as global context information.
The Hard Tagging approach also uses less information, as the original topic distribution is now represented only by the topic with the highest probability of occurring.
Therefore, both methods lose information and are disadvantaged against the Soft Tagging approach.
We use the Soft Tagging approach for the Senseval-3 lexical sample and the all-words tasks.
Unlabeled Corpus Selection The unlabeled corpora we choose for training LDA include 20 Newsgroups, Reuters, SemCor, the Senseval-2 lexical sample data and the Senseval-3 lexical sample data.
Although the last three are labeled corpora, we only need the words from them, so they can be regarded as unlabeled too.
For Senseval-2 and Senseval-3 data, we define the whole passage for each training and testing instance as one document.
The relative effect of using different corpora, and combinations of them, when validating on the Senseval-2 test data using the Soft Tagging approach, is shown in Table 3.
Table 3: Effect of using different corpora for LDA training; |w| represents the corpus size in terms of the number of words in the corpus.
The 20 Newsgroups corpus yields the best result if used individually.
It has a relatively larger corpus size at 1.7 million words in total and also a well balanced topic distribution among its documents, ranging across politics, finance, science, computing, etc. The Reuters corpus, on the other hand, focuses heavily on finance related articles and has a rather skewed topic distribution.
This probably contributed to its inferior result.
However, we found that the best result comes from combining all the corpora together with K = 60 and L = 40.
Results for Optimized Configuration As baseline for the Bayesian network approaches, we use naive Bayes with all baseline features.
For the baseline SVM approach, we choose P = 3 and include all the words occurring in the training and testing passage as bag-of-words feature.
The F-measure result we achieve on Senseval-2 test data is shown in Table 4.
Our four systems are listed as the top four entries in the table.
Soft Tag refers to the soft tagging Bayesian network approach.
Note that we used the Senseval-2 test data for optimizing the configuration (as is done in the ASO result).
Hence, the result should not be taken as a reliable measure of generalization.
Nevertheless, it is worth noting that the improvement of Bayesian network approach over its baseline is very significant (+5.5%).
On the other hand, SVM with topic features shows limited improvement over its baseline (+0.8%).
Table 4: Results (best configuration) compared to previous best systems on the Senseval-2 English lexical sample task (our four systems, Classifier Combination (Florian, 2002), and the Senseval-2 best system).
In the all-words task, no official training data is provided with Senseval.
We follow the common practice of using the SemCor corpus as our training data.
However, we did not use SVM approach in this task as there are too few training instances per sense for SVM to achieve a reasonably good accuracy.
As there are more training instances in SemCor (230,000 in total), we obtain the optimal configuration using 10-fold cross-validation on the SemCor training data.
With the optimal configuration, we test our system on both Senseval-2 and Senseval-3 official test data.
For baseline features, we set P = 3 and G = 1.
We choose an LDA training corpus comprising the 20 Newsgroups and SemCor data, with the number of topics K = 40 and the number of LDA bag-of-words L = 14.
6 Results
We now present the results on both English lexical sample task and all-words task.
6.1 English Lexical Sample Task

With the optimal configurations from Senseval-2, we tested the systems on the Senseval-3 data.
Table 5 shows our F-measure results compared to some of the best reported systems.
Although SVM with topic features shows limited success with only a 0.6% improvement, the Bayesian network approach has again demonstrated a good improvement of 3.8% over its baseline and is better than previously reported best systems except ASO (Ando, 2006).
Table 5: Results compared to previous best systems on Senseval-3 English lexical sample task.
6.2 English All-Words Task

The F-measure micro-averaged results for our systems as well as the previous best systems on the Senseval-2 and Senseval-3 all-words tasks are shown in Table 6 and Table 7 respectively.
The Bayesian network with soft tagging achieved a 2.6% improvement over its baseline on Senseval-2 and 1.7% on Senseval-3.
The results also rival the previous best systems, except for SMUaw (Mihalcea, 2002), which used additional labeled data.

Table 6: Results (Bayes (Soft Tag) vs. NB baseline) compared to previous best systems on the Senseval-2 English all-words task.

Table 7: Results compared to previous best systems (the Senseval-3 best system and 2nd best system, SenseLearner) on the Senseval-3 English all-words task.

6.3 Significance of Results

We perform the χ²-test, using the Bayesian network and its naive Bayes baseline (NB baseline) as pairs, to verify the significance of these results.
The p-values are reported in Table 8. The results are significant at the 90% confidence level, except for the Senseval-3 all-words task.

Table 8: p-values of the χ²-test on the Senseval-2 and Senseval-3 all-words and lexical sample tasks.
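A paired significance check of this kind can be sketched as follows; the counts and the Pearson χ² formulation are illustrative assumptions, not the authors' exact test:

```python
def chi2_2x2(table):
    """Pearson chi-squared statistic for a 2x2 contingency table
    [[a, b], [c, d]] of (correct, incorrect) counts for two classifiers."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = [a + b, c + d]
    cols = [a + c, b + d]
    stat = 0.0
    for i, r in enumerate(rows):
        for j, c2 in enumerate(cols):
            expected = r * c2 / n
            observed = table[i][j]
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical (correct, wrong) counts for the two paired classifiers.
stat = chi2_2x2([[1200, 800], [1100, 900]])
# Compare against the chi-squared critical value at the desired confidence
# level; with 1 degree of freedom, the 90% critical value is 2.706.
significant = stat > 2.706
```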
6.4 Discussion

The results on the lexical sample task show that SVM benefits less from the topic feature than the Bayesian approach.
One possible reason is that the SVM baseline is able to use all bag-of-words from the surrounding context, while the naive Bayes baseline can only use very few without decreasing its accuracy, due to the sparse representation.
In this sense, the SVM baseline already captures some of the topical information, leaving less room for improvement.
In fact, if we exclude the bag-of-words feature from the SVM baseline and add in the topic features, we are able to achieve almost the same accuracy as we did with both features included, as shown in Table 9.
This further shows that the topic feature is a better representation of global context than the bag-of-words feature.
Table 9: Results of the SVM baseline and SVM with topic features on the Senseval-3 English lexical sample task.
6.5 Results on Different Parts-of-Speech
We analyse the results obtained on the Senseval-3 English lexical sample task (using the Senseval-2 optimal configuration) according to the test instance's part-of-speech (noun, verb or adjective), compared to the naive Bayes baseline.
Table 10 shows the relative improvement on each part-of-speech.
The second column shows the number of testing instances belonging to the particular part-of-speech.
The third and fourth columns show the accuracy achieved by the naive Bayes baseline and the Bayesian network respectively.
Adjectives show no improvement, while verbs show a moderate +2.2% improvement.
Nouns clearly benefit from topical information much more than the other two parts-of-speech, obtaining a +5.7% increase over the baseline.

Table 10: Improvement with different POS on the Senseval-3 lexical sample task (NB baseline vs. Bayesian network).

6.6 Varying L and K

We tested on the Senseval-2 all-words task using different values of L and K. Figure 4 shows the result.

Figure 4: Accuracy with varying L and K on the all-words task.

6.7 SemEval-2007

We participated in the SemEval-1 English coarse-grained all-words task (task 7), the English fine-grained all-words task (task 17, subtask 3) and the English coarse-grained lexical sample task (task 17, subtask 1), using the method described in this paper.
For the all-words tasks, we use the Senseval-2 and Senseval-3 all-words task data as our validation set to fine-tune the parameters.
For the lexical sample task, we use the training data provided as the validation set.
We achieved 88.7%, 81.6% and 57.6% for the coarse-grained lexical sample task, coarse-grained all-words task and fine-grained all-words task respectively.
These results ranked first, second and fourth in the three tasks respectively.

7 Conclusion and Future Work

In this paper, we showed that by using the LDA algorithm on the bag-of-words feature, one can utilise more topical information and boost the classifier's accuracy on both the English lexical sample and all-words tasks.
Only unlabeled data is needed for this improvement.
It would be interesting to see how the feature can help WSD in other languages, and other natural language processing tasks such as named-entity recognition.
