In this paper, we analyze the effect of resampling techniques, including under-sampling and over-sampling used in active learning for word sense disambiguation (WSD).
Experimental results show that under-sampling causes negative effects on active learning, but over-sampling is a relatively good choice.
To alleviate the within-class imbalance problem of over-sampling, we propose a bootstrap-based over-sampling (BootOS) method that works better than ordinary over-sampling in active learning for WSD.
Finally, we investigate when to stop active learning, and adopt two strategies, max-confidence and min-error, as stopping conditions for active learning.
According to experimental results, we suggest a prediction solution by considering max-confidence as the upper bound and min-error as the lower bound for stopping conditions.
1 Introduction
Creating a large sense-tagged corpus is very expensive and time-consuming, because these data have to be annotated by human experts.
Among the techniques to solve the knowledge bottleneck problem, active learning is a promising way (Lewis and Gale, 1994; McCallum and Nigam, 1998).
The purpose of active learning is to minimize the amount of human labeling effort by having the system automatically select for human annotation the most informative unannotated case.
In real-world data, the distribution of the senses of a word is often very skewed.
Some studies reported that simply selecting the predominant sense provides superior performance when a highly skewed sense distribution and insufficient context exist (Hoste et al., 2001; McCarthy et al., 2004).
The data set is imbalanced when at least one of the senses is heavily underrepresented compared to the other senses.
In general, a WSD classifier is designed to optimize overall accuracy without taking into account the class imbalance distribution in a real-world data set.
The result is that the classifier induced from imbalanced data tends to over-fit the predominant class and to ignore small classes (Japkowicz and Stephen, 2002).
Recently, much work has been done in addressing the class imbalance problem, reporting that resampling methods such as over-sampling and under-sampling are useful in supervised learning with imbalanced data sets to induce more effective classifiers (Estabrooks et al., 2004; Zhou and Liu, 2006).
In the general framework of active learning, the learner (i.e., a supervised classifier) is trained using supervised learning algorithms.
To date, however, no one has studied the effects of over-sampling and under-sampling on active learning methods.

Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 783-790, Prague, June 2007. ©2007 Association for Computational Linguistics
In this paper, we study active learning with resampling methods addressing the class imbalance problem for WSD.
It is noteworthy that neither of these techniques needs to modify the architecture or the learning algorithm, making them very easy to use and to extend to other domains.
Another problem in active learning is knowing when to stop the process.
We address this problem in this paper, and discuss how to form the final classifier for use.
This is a problem of estimation of classifier effectiveness (Lewis and Gale, 1994).
Because it is difficult to know when the classifier reaches maximum effectiveness, previous work used a simple stopping condition: stop when the training set reaches a desirable size.
However, in fact it is almost impossible to predefine an appropriate size of desirable training data for inducing the most effective classifier.
To solve the problem, we recast the estimation of classifier effectiveness as the task of estimating classifier confidence.
This paper adopts two strategies: max-confidence and min-error, and suggests a prediction solution by considering max-confidence as the upper bound and min-error as the lower bound for the stopping conditions.
2 Related Work
The ability of the active learner can be referred to as selective sampling, of which two major schemes exist: uncertainty sampling and committee-based sampling.
The former method, for example proposed by Lewis and Gale (1994), is to use only one classifier to identify unlabeled examples on which the classifier is least confident.
The latter method (McCallum and Nigam, 1998) generates a committee of classifiers (always more than two classifiers) and selects the next unlabeled example by the principle of maximal disagreement among these classifiers.
With selective sampling, the size of the training data can be significantly reduced for text classification (Lewis and Gale, 1994; McCallum and Nigam, 1998) and word sense disambiguation (Chen et al., 2006).
A method similar to committee-based sampling is co-testing proposed by Muslea et al. (2000), which trains two learners individually on two compatible and uncorrelated views that should be able to reach the same classification accuracy.
In practice, however, these conditions of view selection are difficult to meet in real-world word sense disambiguation tasks.
Recently, much work has been done on the class imbalance problem.
The well-known approach is resampling, in which some training material is duplicated.
Two types of popular resampling methods exist for addressing the class imbalance problem: over-sampling and under-sampling.
The basic idea of resampling methods is to change the training data distribution and make the data more balanced.
It works well in supervised learning, but it has not been tested in active learning.
Previous work reports that cost-sensitive learning is a good solution to the class imbalance problem (Weiss, 2004).
In practice, for WSD, the costs of various senses of a disambiguated word are unequal and unknown, and they are difficult to evaluate in the process of learning.
In recent years, there have been attempts to apply active learning for word sense disambiguation (Chen et al., 2006).
However, to the best of our knowledge, there has been no attempt to consider the class imbalance problem in the process of active learning for WSD tasks.
3 Resampling Methods

3.1 Under-sampling
Under-sampling is a popular method for addressing the class imbalance problem: it changes the training data distribution by removing some exemplars of the majority class at random.
Some previous work reported that under-sampling is effective in learning on large imbalanced data sets (Japkowicz and Stephen, 2002).
However, as under-sampling removes some potentially useful training samples, it could cause negative effects on the classifier performance.
One-sided sampling is a method similar to under-sampling, in which redundant and borderline training examples are identified and removed from training data (Kubat and Matwin, 1997).
Kubat and Matwin reported that one-sided sampling is effective in learning with two-class large imbalanced data sets.
However, the relative computational cost of one-sided sampling in active learning is very high, because sampling computations must be implemented for each learning iteration.
Our preliminary experimental results show that, in the multi-class problem of WSD, one-sided sampling degrades the performance of active learning. Because of the high computational complexity of one-sided sampling, we use random under-sampling in our comparison experiments instead.
To control the degree of change of the training data distribution, the ratio between minority- and majority-class examples after removal from the majority class is called the removal rate (Jo and Japkowicz, 2004).
If the removal rate is 1.0, then under-sampling methods build data sets with complete class balance.
However, it was reported previously that perfect balance is not always the optimal rate (Estabrooks et al., 2004).
In our comparison experiments, we set the removal rate for under-sampling to 0.8, since 0.8 was reported as the optimal rate in some cases (Estabrooks et al., 2004).
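As a concrete illustration, random under-sampling with a removal rate can be sketched as follows. This is a minimal sketch, not the authors' code; interpreting the removal rate as the minority/majority ratio after removal (so 1.0 yields complete balance) is an assumption of this illustration.

```python
import random
from collections import Counter

def undersample(examples, removal_rate=0.8, seed=0):
    """Randomly remove majority-class examples so that the minority/majority
    ratio after removal approaches `removal_rate` (1.0 = complete balance)."""
    rng = random.Random(seed)
    by_sense = {}
    for x, sense in examples:
        by_sense.setdefault(sense, []).append((x, sense))
    counts = {s: len(v) for s, v in by_sense.items()}
    majority = max(counts, key=counts.get)
    minority_size = min(counts.values())
    # Target size for the majority class: minority_size / removal_rate.
    target = max(minority_size, int(round(minority_size / removal_rate)))
    kept = []
    for sense, items in by_sense.items():
        if sense == majority and len(items) > target:
            kept.extend(rng.sample(items, target))  # drop at random
        else:
            kept.extend(items)
    return kept

# Example: 10 majority vs. 4 minority examples; with rate 0.8 the
# majority class is cut down to round(4 / 0.8) = 5 examples.
data = [((i,), 'a') for i in range(10)] + [((i,), 'b') for i in range(4)]
balanced = undersample(data, removal_rate=0.8, seed=1)
counts = Counter(sense for _, sense in balanced)
```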
3.2 Over-sampling

Over-sampling is also a popular method for addressing the class imbalance problem: it resamples the small class until it contains as many examples as the large one.
In contrast to under-sampling, over-sampling is the process of adding examples to the minority class, and is accomplished by random sampling and duplication.
Because the process of over-sampling involves making exact copies of examples, it usually increases the training cost and may lead to overfitting.
There is a recent variant of over-sampling named SMOTE (Chawla et al., 2002) which is a synthetic minority over-sampling technique.
The authors reported that a combination of SMOTE and under-sampling can achieve better classifier performance in ROC space than only under-sampling the majority class.
In our comparison experiments, we use over-sampling, measured by a resampling rate called the addition rate (Jo and Japkowicz, 2004) that indicates the number of examples that should be added into the minority class.
The addition rate for over-sampling is also set to 0.8 in our experiments.
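Random over-sampling with an addition rate can be sketched similarly. This is again a hedged illustration, not the authors' code; interpreting the addition rate as the target fraction of the majority-class size is an assumption here.

```python
import random
from collections import Counter

def oversample(examples, addition_rate=0.8, seed=0):
    """Randomly duplicate minority-class examples until each class holds
    roughly `addition_rate` times as many examples as the majority class."""
    rng = random.Random(seed)
    by_sense = {}
    for ex in examples:
        by_sense.setdefault(ex[1], []).append(ex)
    majority_size = max(len(v) for v in by_sense.values())
    target = int(round(addition_rate * majority_size))
    out = []
    for items in by_sense.values():
        out.extend(items)
        deficit = max(0, target - len(items))
        # Exact copies chosen at random -- this is the source of the
        # within-class imbalance that BootOS (Section 3.3) addresses.
        out.extend(rng.choice(items) for _ in range(deficit))
    return out

# Example: 10 'a' vs. 4 'b'; with rate 0.8 the minority class is
# grown to round(0.8 * 10) = 8 examples by duplication.
data = [((i,), 'a') for i in range(10)] + [((i,), 'b') for i in range(4)]
grown = oversample(data, addition_rate=0.8, seed=1)
counts = Counter(sense for _, sense in grown)
```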
3.3 Bootstrap-based Over-sampling
While over-sampling decreases the between-class imbalance, it increases the within-class imbalance (Jo and Japkowicz, 2004) because of the increase of exact copies of examples at random.
To alleviate this within-class imbalance problem, we propose a bootstrap-based over-sampling method (BootOS) that uses a bootstrap resampling technique in the process of over-sampling.
Bootstrapping, explained below, is a resampling technique similar to jackknifing.
There are two reasons for choosing a bootstrap method as resampling technique in the process of over-sampling.
First, using a bootstrap set can avoid exactly copying samples in the minority class.
Second, the bootstrap method may smooth the distribution of the training samples (Hamamoto et al., 1997), which can alleviate the within-class imbalance problem caused by over-sampling.
To generate the bootstrap set, we use a well-known bootstrap technique proposed by Hamamoto et al. (1997) that does not select samples purely at random, but gives all samples in the minority class(es) an equal chance to be selected:

For each sample x_j in the minority class:
1. Find its k nearest neighbor samples x_j1, x_j2, ..., x_jk using similarity functions.
2. Compute a bootstrap sample x_Bi from x_j and its k nearest neighbors.

Figure 1. The BootOS algorithm
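The algorithm in Figure 1 can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation; replacing the exact combining formula of Hamamoto et al. (1997) with a simple mean of the sample and its k nearest neighbors is an assumption of this sketch.

```python
import math

def boot_os(minority, k=1):
    """BootOS sketch: for each minority-class vector, create a smoothed
    bootstrap sample by averaging it with its k nearest neighbors
    (Euclidean distance) instead of making an exact copy."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    bootstrap = []
    for i, x in enumerate(minority):
        others = [y for j, y in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda y: dist(x, y))[:k]
        group = [x] + neighbors
        # Bootstrap sample = mean of the sample and its k nearest neighbors,
        # which smooths the within-class distribution.
        mean = tuple(sum(col) / len(group) for col in zip(*group))
        bootstrap.append(mean)
    return bootstrap

# Example with 1-NN (as in the experiments below): each synthetic sample
# lies between an original sample and its nearest neighbor.
samples = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
synthetic = boot_os(samples, k=1)
```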
4 Active Learning with Resampling
In this work, we are interested in selective sampling for pool-based active learning, and focus on uncertainty sampling (Lewis and Gale, 1994).
The key point is how to measure the uncertainty of an unlabeled exemplar, and select a new exemplar with maximum uncertainty to augment the training data.
The maximum uncertainty implies that the current classifier has the least confidence in its classification of this exemplar.
The well-known entropy is a good uncertainty measurement widely used in active learning:

    U(x_i) = H(x_i) = - SUM_j p(s_j | w_i) log p(s_j | w_i)

where U is the uncertainty measurement function and H represents the entropy function. In the WSD task, p(s_j | w_i) is the predicted probability of sense s_j output by the current classifier, given a sample x_i containing the ambiguous word w_i.
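The entropy-based uncertainty measurement is straightforward to compute from the classifier's predicted sense distribution; a minimal sketch:

```python
import math

def entropy_uncertainty(sense_probs):
    """Entropy-based uncertainty: U(x) = H(x) = -sum_j p(s_j|w) log p(s_j|w).
    Higher values mean the current classifier is less confident; zero-
    probability senses contribute nothing to the sum."""
    return -sum(p * math.log(p) for p in sense_probs if p > 0.0)

# A uniform distribution over two senses is maximally uncertain,
# while a peaked distribution is nearly certain.
uncertain = entropy_uncertainty([0.5, 0.5])   # = log 2
confident = entropy_uncertainty([0.99, 0.01])
```

In uncertainty sampling, the pool is ranked by this value and the most uncertain exemplars are presented to the oracle.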
Algorithm Active-Learning-with-Resampling(L, U, m)
Input: L, an initial small training data set; U, the pool of unlabeled exemplars
Output: labeled training data set L
1. Resample L to generate a new training data set L*, using a resampling technique such as under-sampling, over-sampling, or BootOS, and then use L* to train the initial classifier.
2. Loop while adding new instances into L:
   a. Use the current classifier to probabilistically label all unlabeled exemplars in U.
   b. Based on active learning rules, present the m top-ranked exemplars to the oracle for labeling.
   c. Augment L with the m new exemplars, and remove them from U.
   d. Resample L to generate a new training data set L*, using a resampling technique such as under-sampling, over-sampling, or BootOS, and use L* to retrain the current classifier.
   Until the predefined stopping condition is met.

Figure 2. Active learning with resampling
In steps 1 and 2(d) in Fig. 2, if we do not generate L* and L is used directly to train the current classifier, we call it ordinary active learning.
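The procedure of Fig. 2 can be sketched in code. All callables here (`train`, `predict_proba`, `resample`, `oracle_label`, `uncertainty`, `should_stop`) are hypothetical stand-ins for the ME classifier, the resampling step, and the human annotator, not the authors' implementation.

```python
def active_learning_with_resampling(L, U, m, train, predict_proba,
                                    resample, oracle_label, uncertainty,
                                    should_stop):
    """Pool-based uncertainty sampling with resampling (sketch of Fig. 2)."""
    classifier = train(resample(L))      # step 1: train on resampled L*
    while U and not should_stop(classifier, U):
        # a. probabilistically label the pool and rank by uncertainty
        ranked = sorted(U, reverse=True,
                        key=lambda x: uncertainty(predict_proba(classifier, x)))
        # b. present the m most uncertain exemplars to the oracle
        queries = ranked[:m]
        # c. move the newly labeled exemplars from U into L
        for x in queries:
            L.append((x, oracle_label(x)))
            U.remove(x)
        # d. retrain on a freshly resampled training set L*
        classifier = train(resample(L))
    return L, classifier
```

With trivial stubs (e.g. `resample = lambda L: list(L)`), the loop simply consumes the pool m exemplars at a time; swapping in `undersample`, `oversample`, or `boot_os` changes only the `resample` argument.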
In the process of active learning, we used the entropy-based uncertainty measurement for all active learning frameworks in our comparison experiments.
Actually, our active learning with resampling is a heterogeneous approach, in which the classifier used to select new instances differs from the resulting classifier (Lewis and Catlett, 1994).
We utilize a maximum entropy (ME) model (Berger et al., 1996) to design the basic classifier used in active learning for WSD.
The advantage of the ME model is the ability to freely incorporate features from diverse sources into a single, well-grounded statistical model.
A publicly available ME toolkit (Zhang et al., 2004) was used in our experiments.
In order to extract the linguistic features necessary for the ME model, all sentences containing the target word were automatically part-of-speech (POS) tagged using the Brill POS tagger (Brill, 1992).
Three knowledge sources were used to capture contextual information: unordered single words in topical context, POS of neighboring words with position information, and local collocations.
These are the same as three of the four knowledge sources used in (Lee and Ng, 2002).
Their fourth knowledge source (named syntactic relations) was not used in our work.
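As a rough illustration of these three knowledge sources, consider the following sketch. The window sizes, feature-name formats, and collocation spans are assumptions of this illustration, not the exact settings of Lee and Ng (2002).

```python
def extract_features(tokens, pos_tags, target_index, window=3):
    """Sketch of the three knowledge sources used for the ME classifier:
    topical unigrams, POS of neighbors with position, local collocations."""
    feats = set()
    # 1. Unordered single words in the topical context (whole sentence).
    for i, tok in enumerate(tokens):
        if i != target_index:
            feats.add('word=' + tok.lower())
    # 2. POS tags of neighboring words, with position information.
    for offset in range(-window, window + 1):
        j = target_index + offset
        if offset != 0 and 0 <= j < len(tokens):
            feats.add('pos[%+d]=%s' % (offset, pos_tags[j]))
    # 3. Local collocations: a short window around the target word
    #    (marked <T>), kept as a single ordered feature.
    left = tokens[max(0, target_index - 2):target_index]
    right = tokens[target_index + 1:target_index + 3]
    feats.add('colloc=' + ' '.join(left + ['<T>'] + right).lower())
    return feats

# Example: disambiguating "bank" in a POS-tagged sentence.
tokens = ['The', 'bank', 'raised', 'rates']
pos = ['DT', 'NN', 'VBD', 'NNS']
features = extract_features(tokens, pos, target_index=1)
```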
5 Stopping Conditions
In an active learning algorithm, defining the stopping condition is a critical problem, because it is almost impossible for the human annotator to label all unlabeled samples.
This is a problem of estimation of classifier effectiveness (Lewis and Gale 1994).
In fact, it is difficult to know when the classifier reaches maximum effectiveness.
In previous work, some researchers used a simple stopping condition: stop when the training set reaches a predefined desired size.
It is almost impossible to predefine an appropriate size of desirable training data for inducing the most effective classifier.
To solve the problem, we consider the estimation of classifier effectiveness as the problem of estimating the classifier's confidence on the remaining unlabeled samples. Concretely, if the current classifier already has acceptably strong confidence in its classification of all remaining unlabeled data, we assume the current training data are sufficient to train a classifier with maximum effectiveness. In other words, if a classifier induced from the current training data has strong classification confidence on an unlabeled example, we can consider that example redundant.
Based on the above analyses, we adopt two stopping conditions for active learning:
• Max-confidence: This strategy is based on uncertainty measurement, considering whether the entropy of each selected unlabeled example is less than a very small predefined threshold close to zero, such as 0.001.
• Min-error: This strategy is based on feedback from the oracle when the active learner asks for true labels of selected unlabeled examples, considering whether the current trained classifier can correctly predict the labels, or whether the accuracy of its predictions on the selected unlabeled examples already exceeds a predefined accuracy threshold.
Once the max-confidence or min-error condition is met, the current classifier is assumed to have strong enough confidence in its classification of all remaining unlabeled data.
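The two stopping checks can be sketched as follows; this minimal illustration assumes the classifier's sense distributions and the oracle's labels are available as plain lists.

```python
import math

def entropy(probs):
    """Entropy of a predicted sense distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def max_confidence_met(pool_probs, threshold=0.001):
    """Max-confidence: stop when the entropy of every selected unlabeled
    example falls below a small predefined threshold close to zero."""
    return all(entropy(p) < threshold for p in pool_probs)

def min_error_met(predicted, oracle_labels, accuracy_threshold=0.9):
    """Min-error: stop when the classifier's predictions on the newly
    queried examples already match the oracle at or above the threshold."""
    correct = sum(1 for p, o in zip(predicted, oracle_labels) if p == o)
    return correct / len(predicted) >= accuracy_threshold
```

With the thresholds of the experiments below (entropy 0.001, accuracy 0.9), max-confidence fires only when the classifier is essentially certain about every remaining example, while min-error fires as soon as queried labels are mostly predicted correctly, which is why the latter tends to stop earlier.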
6 Evaluation
The data used for our comparison experiments were developed as part of the OntoNotes project (Hovy et al., 2006), which uses the WSJ part of the Penn Treebank (Marcus et al., 1993).
The senses of noun words occurring in OntoNotes are linked to the Omega ontology.
In OntoNotes, at least two humans manually annotate the coarse-grained senses of selected nouns and verbs in their natural sentence context.
To date, OntoNotes has annotated several tens of thousands of examples, covering several hundred nouns and verbs, with an inter-annotator agreement rate of at least 90%.
The 38 randomly chosen ambiguous nouns used in all following experiments are shown in Table 1.
It is apparent that the sense distributions of most nouns are very skewed.

Table 1. Sense distributions of the chosen nouns (entries include president, director, management, activity, building, and development)
In the following active learning comparison experiments, we tested five methods: random sampling (Random), ordinary uncertainty sampling (Ordinary), and uncertainty sampling with under-sampling, over-sampling, or BootOS.
The 1-NN technique was used for the bootstrap-based resampling of BootOS in our experiments.
A 5 by 5-fold cross-validation was performed on each noun's data.
We used 20% randomly chosen data for held-out evaluation and the other 80% as the pool of unlabeled data for each round of the active learning.
For all words, we started with a randomly chosen initial training set of 10 examples, and we made 10 queries after each learning iteration.
In the evaluation, average accuracy and recall are used as measures of performance for each active learning method. Note that macro-averaging is adopted for recall evaluation in each noun's WSD task.
The accuracy measure indicates the percentage of testing instances correctly identified by the system.
The macro-average recall measure indicates how well the system performs on each sense.
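Macro-averaged recall can be computed as follows; this minimal sketch shows why it reflects performance on infrequent senses: every sense contributes equally to the average, regardless of its frequency.

```python
def macro_average_recall(gold, predicted):
    """Macro-averaged recall: recall is computed per sense and then
    averaged with equal weight, so infrequent senses count as much as
    the predominant one."""
    senses = set(gold)
    recalls = []
    for s in senses:
        total = sum(1 for g in gold if g == s)
        correct = sum(1 for g, p in zip(gold, predicted) if g == s and p == s)
        recalls.append(correct / total)
    return sum(recalls) / len(recalls)

# Example: predicting the predominant sense everywhere gives 75%
# accuracy on this skewed data, but only 0.5 macro-averaged recall.
gold = ['a', 'a', 'a', 'b']
predicted = ['a', 'a', 'a', 'a']
score = macro_average_recall(gold, predicted)
```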
Experiment 1: Performance comparison experiments on active learning
Figure 3. Average accuracy performance comparison (active learning for WSD)

Figure 4. Average recall performance comparison (active learning for WSD)
As shown in Fig. 3 and Fig. 4, when the number of learned samples for each noun is smaller than 120, BootOS has the best performance, followed by over-sampling and the ordinary method. As the number of learned samples increases, the ordinary method, over-sampling, and BootOS show similar performance in accuracy and recall.
Our experiments also show that the random sampling method is the worst in both accuracy and recall.
Previous work (Estabrooks et al., 2004) proposed under-sampling of the majority class (predominant sense) as a good means of increasing the sensitivity of a classifier to the minority class (infrequent sense).
However, in our active learning experiments, under-sampling is clearly worse than the ordinary method, over-sampling, and our BootOS. The reason is that in highly imbalanced data, under-sampling discards too many useful training samples of the majority class, causing the performance of active learning to degrade.
Experiment 2: Effectiveness of learning instances for infrequent senses
It is important to enrich the corpora by learning more instances for infrequent senses using active learning with less human labeling.
This procedure not only makes the corpora 'richer', but also alleviates the domain dependence problem faced by corpus-based supervised approaches to WSD.
The objective of this experiment is to evaluate the performance of active learning in learning samples of infrequent senses from an unlabeled corpus.
Due to highly skewed word sense distributions in our data set, we consider all senses other than the predominant sense as infrequent senses in this experiment.
Figure 5. Comparison experiments on learning instances for infrequent senses (accuracy vs. number of learned samples)
Fig. 5 shows that random sampling is the worst in active learning for infrequent senses.
The reason is very obvious: the sense distribution of the learned sample set by random sampling is almost identical to that of the original data set.
Under-sampling is apparently worse than ordinary active learning, over-sampling and BootOS methods.
When the number of learned samples for each noun is smaller than 80, BootOS achieves slightly better performance than ordinary active learning and over-sampling.
When the number of learned samples is larger than 80 and smaller than 160, these three methods exhibit similar performance.
As the number of iterations increases, ordinary active learning is slightly better than over-sampling and BootOS.
In fact, after the 16th iteration (10 samples chosen in each iteration), results indicate that most instances for infrequent senses have been learned.
Experiment 3: Effectiveness of Stopping Conditions for active learning
To evaluate the effectiveness of the max-confidence and min-error strategies as stopping conditions for active learning, we first construct an ideal stopping condition: the point at which the classifier first reaches its highest accuracy during the active learning procedure.
When the ideal stopping condition is met, it means that the current classifier has reached maximum effectiveness.
In practice, it is impossible to exactly know when the ideal stopping condition is met before all unlabeled data are labeled by a human annotator.
We only use this ideal method in our comparison experiments to analyze the effectiveness of our two proposed stopping conditions.
For generality, we focus on ordinary active learning to design the basic system and evaluate the effectiveness of the three stopping conditions.
In the following experiments, the entropy threshold used in max-confidence strategy is set to 0.001, and the accuracy threshold used in min-error strategy is set to 0.9.
In Table 2, the column "Size" gives the size of the unlabeled data set of the corresponding noun used in active learning. There are two columns for each stopping condition: the left column ("num") gives the number of learned instances, and the right column ("%") gives its percentage of all data when the corresponding stopping condition is met.
Table 2. Effectiveness of the three stopping conditions (ideal, max-confidence, and min-error) for the nouns management, position, administration, development, strategy, president, director, activity, and building
As shown in Table 2, the min-error strategy, based on feedback from the human annotator, is very close to the ideal method. Therefore, compared with the ideal stopping condition, the min-error strategy is a good choice as a stopping condition for active learning.
It is important to note that the min-error method incurs no additional computational cost; it depends only on the feedback of the human annotator when labeling the chosen unlabeled samples.
From experimental results, we can see that max-confidence strategy is worse than min-error method.
However, we believe that the entropy of each unlabeled sample is a good signal to stop active learning.
So we suggest a prediction solution in which the min-error strategy is used as the lower bound of the stopping condition and the max-confidence strategy as the upper bound.
7 Discussion
As discussed above, finding more instances of infrequent senses in the earlier stages of active learning is very important for making the corpus richer with less human labeling effort.
In practice, another way to learn more instances of infrequent senses is to first build a training data set by active learning or by human effort, and then build a supervised classifier to find more instances of infrequent senses. However, it is interesting to know how much initial training data is enough for this task, and how much human labeling effort could be saved.
From the experimental results, we found that among the unlabeled instances chosen by the active learner, some are informative samples that help improve classification performance, while others are borderline samples, which are unreliable because even a small amount of noise can push them to the wrong side of the decision boundary.
The removal of these borderline samples might improve the performance of active learning.
The proposed prediction solution based on max-confidence and min-error strategies is a coarse framework.
To predict when to stop active learning procedure, it is logical to consider the changes of accuracy performance of the classifier as a signal to stop the learning iteration.
In other words, during the range predicted by the proposed solution, if the change of accuracy performance of the learner (classifier) is very small, we could assume that the current classifier has reached maximum effectiveness.
8 Conclusion and Future Work
In this paper, we consider the class imbalance problem in WSD tasks, and analyze the effect of resampling techniques including over-sampling and under-sampling in active learning.
Experimental results show that over-sampling is a relatively good choice in active learning for WSD in highly imbalanced data.
Under-sampling causes negative effects on active learning.
A new over-sampling method named BootOS based on bootstrap technique is proposed to alleviate the within-class imbalance problem of over-sampling, and works better than ordinary over-sampling in active learning for WSD.
It is noteworthy that none of these techniques requires modifying the architecture or the learning algorithm; therefore, they are very easy to use and to extend to other applications.
To predict when to stop active learning, we adopt two strategies including max-confidence and min-error as stopping conditions.
According to our experimental results, we suggest a prediction solution by considering max-confidence as the upper bound and min-error as the lower bound of stopping conditions for active learning.
In future work, we will study how to identify these borderline samples exactly, so that they are not selected first in the active learning procedure. The borderline samples have higher entropy values, meaning the current classifier is least confident about them. They can be detected using the concept of Tomek links (Tomek, 1976).
It is also worth studying cost-sensitive learning for active learning with imbalanced data, and using such techniques for WSD.
