This paper proposes a framework for semi-supervised structured output learning (SOL), specifically for sequence labeling, based on a hybrid generative and discriminative approach.
We define the objective function of our hybrid model, which is written in log-linear form, by discriminatively combining discriminative structured predictor(s) with generative model(s) that incorporate unlabeled data.
Then, unlabeled data is used in a generative manner to increase the sum of the discriminant functions for all outputs during the parameter estimation.
Experiments on named entity recognition (CoNLL-2003) and syntactic chunking (CoNLL-2000) data show that our hybrid model significantly outperforms the state-of-the-art performance obtained with supervised SOL methods, such as conditional random fields (CRFs).
1 Introduction
Structured output learning (SOL) methods, which attempt to optimize an interdependent output space globally, are important methodologies for certain natural language processing (NLP) tasks such as part-of-speech tagging, syntactic chunking (Chunking) and named entity recognition (NER), which are also referred to as sequence labeling tasks.
When we consider the nature of these sequence labeling tasks, a semi-supervised approach appears to be more natural and appropriate.
This is because the number of features and parameters typically become extremely large, and labeled examples can only sparsely cover the parameter space, even if thousands of labeled ex-
Scheffer, 2006).
With the generative approach, we can easily incorporate unlabeled data into probabilistic models with the help of expectation-maximization (EM) algorithms (Dempster et al., 1977).
For example, the Baum-Welch algorithm is a well-known algorithm for training a hidden Markov model (HMM) of sequence learning.
Generally, however, for sequence learning tasks such as NER and Chunking, we cannot expect generative approaches to obtain better performance than that obtained using discriminative approaches in supervised learning settings.
In contrast to the generative approach, with the discriminative approach, it is not obvious how unlabeled training data can be naturally incorporated into a discriminative training criterion.
For example, the effect of unlabeled data will be eliminated from the objective function if the unlabeled data is directly used in traditional i.i.d. conditional-probability models.
Nevertheless, several attempts have recently been made to incorporate unlabeled data in the discriminative approach.
An approach based on pairwise similarities, which encourage nearby data points to have the same class label, has been proposed as a way of incorporating unlabeled data discriminatively (Zhu et al., 2003; Altun et al.,
approach generally requires joint inference over the whole data set for prediction, which is not practical as regards the large data sets used for standard sequence labeling tasks in NLP.
Another discriminative approach to semi-supervised SOL involves the incorporation of an entropy regularizer (Grandvalet and Bengio, 2004).
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 791-800, Prague, June 2007.
©2007 Association for Computational Linguistics
Semi-supervised conditional random fields (CRFs) based on a minimum entropy regularizer (SS-CRF-MER) have been proposed in (Jiao et al., 2006).
With this approach, the parameter is estimated to maximize the likelihood of labeled data and the negative conditional entropy of unlabeled data.
Therefore, the structured predictor is trained to separate unlabeled data well under the entropy criterion by parameter estimation.
In contrast to these previous studies, this paper proposes a semi-supervised SOL framework based on a hybrid generative and discriminative approach.
A hybrid approach was first proposed in a supervised learning setting (Raina et al., 2003) for text classification.
Fujino et al. (2005) developed a semi-supervised approach by discriminatively combining a supervised classifier with generative models that incorporate unlabeled data.
We extend this framework to the structured output domain, specifically for sequence labeling tasks.
Moreover, we re-formalize the objective function to allow the incorporation of discriminative models (structured predictors) trained from labeled data, since the original framework only considers the combination of generative classifiers.
As a result, our hybrid model can significantly improve on the state-of-the-art performance obtained with supervised SOL methods, such as CRFs, even if a large amount of labeled data is available, as shown in our experiments on the CoNLL-2003 NER and CoNLL-2000 Chunking data.
In addition, compared with SS-CRF-MER, our hybrid model has several good characteristics, including a low calculation cost and robust optimization with respect to hyper-parameter sensitivity.
This is described in detail in Section 5.3.
2 Supervised SOL: CRFs
This paper focuses solely on sequence labeling tasks, such as named entity recognition (NER) and syntactic chunking (Chunking), as SOL problems.
Thus, let $x = (x_1, \ldots, x_S)$ be an input sequence, and $y = (y_0, \ldots, y_{S+1})$ be a particular output sequence, where $y_0$ and $y_{S+1}$ are special fixed labels that represent the beginning and end of a sequence.
As regards supervised sequence learning, CRFs are recently introduced methods that constitute flexible and powerful models for structured predictors based on undirected graphical models that have been
globally conditioned on a set of inputs (Lafferty et al., 2001).
Let $\lambda$ be a parameter vector and $f(y_{s-1}, y_s, x)$ be a (local) feature vector obtained from the corresponding position $s$ given $x$. CRFs define the conditional probability, $p(y|x)$, as being proportional to a product of potential functions on the cliques.
That is, $p(y|x)$ on a (linear-chain) CRF can be defined as follows:
$$p(y|x; \lambda) = \frac{1}{Z(x)} \prod_{s=1}^{S+1} \exp(\lambda \cdot f(y_{s-1}, y_s, x)).$$
$Z(x) = \sum_{y \in \mathcal{Y}} \prod_{s=1}^{S+1} \exp(\lambda \cdot f(y_{s-1}, y_s, x))$ is a normalization factor over all output values, $\mathcal{Y}$, and is also known as the partition function.
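For parameter estimation (training), given labeled data $D_l = \{(x^k, y^k)\}_{k=1}^{K}$, the Maximum a Posteriori (MAP) parameter estimation, namely maximizing $\log p(\lambda|D_l)$, is now the most widely used CRF training criterion.
Thus, we maximize the following objective function to obtain the optimal $\lambda$:
$$\mathcal{L}^{CRF}(\lambda|D_l) = \sum_k \log p(y^k|x^k; \lambda) + \log p(\lambda). \quad (1)$$
Its gradient with respect to $\lambda$ is
$$\nabla \mathcal{L}^{CRF}(\lambda|D_l) = \sum_k \sum_s f(y^k_{s-1}, y^k_s, x^k) - \sum_k E_{p(y|x^k;\lambda)}\Big[\sum_s f(y_{s-1}, y_s, x^k)\Big] + \nabla \log p(\lambda).$$
Calculating $E_{p(y|x^k;\lambda)}$ as well as the partition function $Z(x)$ is not always tractable.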
However, for linear-chain CRFs, a dynamic programming algorithm similar in nature to the forward-backward algorithm in HMMs has already been developed for an efficient calculation (Lafferty et al., 2001).
For prediction, the most probable output, that is, $\hat{y} = \arg\max_{y \in \mathcal{Y}} p(y|x; \lambda)$, can be efficiently obtained by using the Viterbi algorithm.
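The two dynamic programs mentioned above can be sketched on a toy chain. This is a minimal illustration, not the implementation used in the paper: the table layout `psi[s][a][b]` (the log-potential $\lambda \cdot f(y_{s-1}=a, y_s=b, x)$ at step $s$), the function names and all numbers are assumptions made for the example. The first table has a single source row for the fixed start label and the last table a single destination column for the fixed end label.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_partition(psi):
    """Forward recursion: log Z(x) for a chain of log-potential tables."""
    alpha = [0.0] * len(psi[0])  # log-score 0 for each source label of step 1
    for table in psi:
        alpha = [logsumexp([alpha[a] + table[a][b] for a in range(len(table))])
                 for b in range(len(table[0]))]
    return logsumexp(alpha)

def viterbi(psi):
    """Max-product recursion: best label path and its unnormalized log-score."""
    score, back = [0.0] * len(psi[0]), []
    for table in psi:
        pointers, new_score = [], []
        for b in range(len(table[0])):
            best_a = max(range(len(table)), key=lambda a: score[a] + table[a][b])
            pointers.append(best_a)
            new_score.append(score[best_a] + table[best_a][b])
        score = new_score
        back.append(pointers)
    label = max(range(len(score)), key=lambda k: score[k])
    best, path = score[label], [label]
    for pointers in reversed(back):  # follow back-pointers to the start label
        label = pointers[label]
        path.append(label)
    return path[::-1], best

# Toy chain: start label -> 3 positions with 2 labels -> end label (made-up numbers).
psi = [
    [[0.5, -0.2]],               # y_0 -> y_1
    [[1.0, -1.0], [0.3, 0.8]],   # y_1 -> y_2
    [[0.2, 0.4], [-0.5, 1.5]],   # y_2 -> y_3
    [[0.0], [0.1]],              # y_3 -> y_4
]
path, best = viterbi(psi)
log_z = log_partition(psi)
print(path, math.exp(best - log_z))  # best path and its probability p(y|x)
```

Working in log space keeps both recursions stable for long sequences; the cost is quadratic in the number of labels and linear in the sequence length.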
3 Hybrid Generative and Discriminative Approach to Semi-Supervised SOL
In this section, we describe our formulation of a hybrid approach to SOL and a parameter estimation method for sequence predictors.
We assume that, in addition to the labeled data $D_l$, we have a set of unlabeled data, $D_u = \{x^m\}_{m=1}^{M}$.
Let us assume that we have $I$ units of discriminative models, $p^D_i$, and $J$ units of generative models, $p^G_j$.
Our hybrid model for a structured predictor is designed by the discriminative combination of several joint probability densities of x and y, p(x, y).
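That is, the posterior probability of our hybrid model is defined by providing the log-values of $p(x, y)$ as the features of a log-linear model, such that:
$$\begin{aligned}
R(y|x; \Lambda, \Theta, \Gamma)
&= \frac{\exp\{\sum_i \gamma_i \log p^D_i(x, y; \lambda_i) + \sum_j \gamma_j \log p^G_j(x, y; \theta_j)\}}{\sum_{y'} \exp\{\sum_i \gamma_i \log p^D_i(x, y'; \lambda_i) + \sum_j \gamma_j \log p^G_j(x, y'; \theta_j)\}} \\
&= \frac{\prod_i p^D_i(x, y; \lambda_i)^{\gamma_i} \prod_j p^G_j(x, y; \theta_j)^{\gamma_j}}{\sum_{y'} \prod_i p^D_i(x, y'; \lambda_i)^{\gamma_i} \prod_j p^G_j(x, y'; \theta_j)^{\gamma_j}} \\
&= \frac{\prod_i p^D_i(y|x; \lambda_i)^{\gamma_i} \prod_j p^G_j(x, y; \theta_j)^{\gamma_j}}{\sum_{y'} \prod_i p^D_i(y'|x; \lambda_i)^{\gamma_i} \prod_j p^G_j(x, y'; \theta_j)^{\gamma_j}}. \quad (2)
\end{aligned}$$
Here, $\Gamma = \{\{\gamma_i\}_{i=1}^{I}, \{\gamma_j\}_{j=1}^{J}\}$ represents the discriminative combination weight of each model, where $\gamma_i, \gamma_j \in [0, 1]$.
Moreover, $\Lambda = \{\lambda_i\}_{i=1}^{I}$ and $\Theta = \{\theta_j\}_{j=1}^{J}$ represent model parameters of individual models estimated from labeled and unlabeled data, respectively.
Using $p^D(x, y) = p^D(y|x) p^D(x)$, we can derive the third line from the second line, where $p^D_i(x; \lambda_i)^{\gamma_i}$ for all $i$ are canceled out.
Thus, our hybrid model is constructed by combining discriminative models, $p^D_i(y|x; \lambda_i)$, with generative models, $p^G_j(x, y; \theta_j)$.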
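As a rough illustration of this log-linear combination, the sketch below evaluates the hybrid posterior over an explicit list of candidate outputs. It is a toy under stated assumptions: the function name `hybrid_posterior`, the argument layout and the numbers are hypothetical, and a real sequence model would use dynamic programming rather than enumerating candidates.

```python
import math

def hybrid_posterior(log_pd, log_pg, gamma_d, gamma_g):
    """R(y|x) over an explicit candidate list.

    log_pd[i][y] = log p_i^D(y|x) for discriminative model i,
    log_pg[j][y] = log p_j^G(x, y) for generative model j,
    gamma_d / gamma_g = combination weights in [0, 1].
    """
    n_candidates = len(log_pd[0]) if log_pd else len(log_pg[0])
    scores = []
    for y in range(n_candidates):
        s = sum(g * lp[y] for g, lp in zip(gamma_d, log_pd))
        s += sum(g * lp[y] for g, lp in zip(gamma_g, log_pg))
        scores.append(s)
    m = max(scores)  # subtract the max before exponentiating for stability
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return [math.exp(s - log_norm) for s in scores]

# One discriminative and one generative model, three candidate outputs (made-up numbers).
log_pd = [[math.log(0.7), math.log(0.2), math.log(0.1)]]
log_pg = [[-5.0, -4.0, -7.0]]
r = hybrid_posterior(log_pd, log_pg, [0.8], [0.5])
print(r)
```

Adding the same constant to every entry of `log_pd[i]`, which corresponds to multiplying by a factor that depends only on $x$, leaves the result unchanged; this is the cancellation used to pass from the second to the third line of Equation (2).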
Hereafter, let us assume that our hybrid model consists of CRFs for the discriminative models, $p^D_i$, and HMMs for the generative models, $p^G_j$, shown in Equation (2), since this paper focuses solely on sequence modeling.
For HMMs, we consider a first order HMM defined in the following equation:
$$p(x, y; \theta) = \prod_{s=1}^{S+1} \theta_{y_{s-1}, y_s} \theta_{y_s, x_s},$$
where $\theta_{y_{s-1}, y_s}$ and $\theta_{y_s, x_s}$ represent the transition probability between states $y_{s-1}$ and $y_s$ and the symbol emission probability of the $s$-th position of the corresponding input sequence, respectively, where $\theta_{y_{S+1}, x_{S+1}} = 1$.
It can be seen that the formalization in the log-linear combination of our hybrid model is very similar to that of LOP-CRFs (Smith et al., 2005).
In fact, if we only use a combination of discriminative models (CRFs), which is equivalent to $\gamma_j = 0$ for all $j$, we obtain essentially the same objective function as that of the LOP-CRFs.
Thus, our framework can also be seen as an extension of LOP-CRFs that enables us to incorporate unlabeled data.
3.1 Discriminative Combination
For estimating the parameter $\Gamma$, let us assume that we already have discriminatively trained models on labeled data, $p^D_i(y|x; \lambda_i)$.
We maximize the following objective function for estimating parameter $\Gamma$ under a fixed $\Theta$:
$$\mathcal{L}^{HySOL}(\Gamma|\Theta) = \sum_n \log R(y^n|x^n; \Lambda, \Theta, \Gamma) + \log p(\Gamma), \quad (3)$$
where $p(\Gamma)$ is a prior probability distribution of $\Gamma$.
The value of $\Gamma$ providing a global maximum of $\mathcal{L}^{HySOL}(\Gamma|\Theta)$ is guaranteed under an arbitrary fixed value in the $\Theta$ domain, since $\mathcal{L}^{HySOL}(\Gamma|\Theta)$ is a concave function of $\Gamma$. Thus, we can easily maximize Equation (3) by using a gradient-based optimization algorithm such as (bound constrained) L-BFGS (Liu and Nocedal, 1989).
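The following sketch shows the shape of this bound-constrained maximization on a toy problem. Several things here are assumptions for illustration only: projected gradient ascent stands in for bound-constrained L-BFGS, the prior term is left out (a flat prior), candidates are enumerated explicitly, and `feats[y][k]` is a hypothetical layout holding the log-score that model `k` assigns to candidate `y`.

```python
import math

def posterior(feats, gamma):
    """R(y|x) for one example; feats[y][k] is the log-score of model k for candidate y."""
    scores = [sum(g * f for g, f in zip(gamma, fy)) for fy in feats]
    m = max(scores)
    z = sum(math.exp(s - m) for s in scores)
    return [math.exp(s - m) / z for s in scores]

def log_likelihood(data, gamma):
    """Sum over examples of log R(y_gold|x); concave in gamma."""
    return sum(math.log(posterior(feats, gamma)[gold]) for feats, gold in data)

def fit_gamma(data, n_models, steps=500, lr=0.02):
    """Projected gradient ascent on the box [0, 1]^n_models."""
    gamma = [1.0 / n_models] * n_models
    for _ in range(steps):
        grad = [0.0] * n_models
        for feats, gold in data:
            r = posterior(feats, gamma)
            for k in range(n_models):
                expected = sum(r[y] * feats[y][k] for y in range(len(feats)))
                grad[k] += feats[gold][k] - expected  # observed minus expected score
        # Take a step, then clip back into [0, 1].
        gamma = [min(1.0, max(0.0, g + lr * d)) for g, d in zip(gamma, grad)]
    return gamma

# Three examples, three candidates each, two models (made-up log-scores).
data = [
    ([[-0.2, -1.0], [-1.8, -0.9], [-3.0, -1.5]], 0),
    ([[-2.0, -0.7], [-0.3, -1.2], [-2.5, -1.6]], 1),
    ([[-1.5, -0.6], [-2.2, -1.4], [-0.4, -1.9]], 2),
]
gamma = fit_gamma(data, n_models=2)
print(gamma, log_likelihood(data, gamma))
```

Because the objective is concave in the weights, a simple first-order method with a small step size already moves toward the constrained optimum; a quasi-Newton method would typically need fewer iterations.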
3.2 Incorporating Unlabeled Data
We cannot directly incorporate unlabeled data for discriminative training such as Equation (3) since the correct outputs y for unlabeled data are unknown.
On the other hand, generative approaches can easily deal with unlabeled data as incomplete data (data with missing variable y) by using a mixture model.
A well-known way to achieve this incorporation is to maximize the log likelihood of unlabeled data with respect to the marginal distribution of generative models as
$$\mathcal{L}(\Theta) = \sum_m \log \sum_y p(x^m, y; \Theta).$$
In fact, (Nigam et al., 2000) have reported that using unlabeled data with a mixture model can improve the text classification performance.
According to Bayes' rule, $p(y|x; \Theta) \propto p(x, y; \Theta)$, the discriminant functions of generative classifiers are provided by generative models $p(x, y; \Theta)$.
Therefore, we can regard $\mathcal{L}(\Theta)$ as the logarithm of the sum of discriminant functions for all missing variables $y$ of unlabeled data.
Following this view, we can directly incorporate unlabeled data into our hybrid model by maximizing the
discriminant functions g of our hybrid model in the same way as for a mixture model as explained above.
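Thus, we maximize the following objective function for estimating the model parameters $\Theta$ of the generative models from unlabeled data:
$$G(\Theta|\Gamma) = \sum_m \log \sum_y g(x^m, y; \Lambda, \Theta, \Gamma) + \log p(\Theta), \quad (4)$$
where $p(\Theta)$ is a prior probability distribution of $\Theta$.
Here, the discriminant function $g$ of output $y$ given input $x$ in our hybrid model can be obtained by the numerator on the third line of Equation (2), since the denominator does not affect the determination of $y$, that is,
$$g(x, y; \Lambda, \Theta, \Gamma) = \prod_i p^D_i(y|x; \lambda_i)^{\gamma_i} \prod_j p^G_j(x, y; \theta_j)^{\gamma_j}.$$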
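Under a fixed $\Gamma$, a local maximum of $G(\Theta|\Gamma)$ can be found by an iterative computation in the manner of the EM algorithm. Let $\Theta''$ and $\Theta'$ be the estimates of $\Theta$ at the next and current steps, respectively. Using Jensen's inequality, we obtain $G(\Theta''|\Gamma) - G(\Theta'|\Gamma) \geq Q(\Theta'', \Theta'; \Gamma) - Q(\Theta', \Theta'; \Gamma)$, such that
$$Q(\Theta'', \Theta'; \Gamma) = \sum_m \sum_y R(y|x^m; \Lambda, \Theta', \Gamma) \log g(x^m, y; \Lambda, \Theta'', \Gamma) + \log p(\Theta''). \quad (5)$$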
Since $Q(\Theta', \Theta'; \Gamma)$ is independent of $\Theta''$, we can improve the value of $G(\Theta|\Gamma)$ by computing $\Theta''$ to maximize $Q(\Theta'', \Theta'; \Gamma)$.
We can obtain a $\Theta$ estimate by iteratively performing this update while $G(\Theta|\Gamma)$ keeps increasing.
As shown in Equation (5), $R$ is used for estimating the parameter $\Theta$.
The intuitive effect of maximizing Equation (4) is similar to performing 'soft-clustering'.
That is, unlabeled data is clustered with respect to the $R$ distribution, which also includes information about labeled data, under the constraint of generative model structures.
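The 'soft-clustering' effect can be seen in a deliberately simplified sketch. The assumptions are substantial and chosen only to keep the example short: every unlabeled sample is a single symbol (so no forward-backward pass is needed), there is one fixed discriminative model `disc[x][y]` and one generative model with label probabilities `prior[y]` and emission probabilities `emit[y][x]`, and the prior on the generative parameters is flat. All names and numbers are hypothetical.

```python
import math

def responsibilities(x, disc, prior, emit, gamma_d, gamma_g):
    """Return R(y|x) and the sum of discriminant functions g(x, y) over y."""
    g = [disc[x][y] ** gamma_d * (prior[y] * emit[y][x]) ** gamma_g
         for y in range(len(prior))]
    total = sum(g)
    return [v / total for v in g], total

def objective(data, disc, prior, emit, gamma_d, gamma_g):
    """Sum over unlabeled samples of log sum_y g(x, y) (flat prior assumed)."""
    return sum(math.log(responsibilities(x, disc, prior, emit, gamma_d, gamma_g)[1])
               for x in data)

def em_step(data, disc, prior, emit, gamma_d, gamma_g):
    """One update: soft-assign samples with R, then re-estimate the generative model."""
    n_labels, n_symbols = len(prior), len(emit[0])
    label_count = [0.0] * n_labels
    pair_count = [[0.0] * n_symbols for _ in range(n_labels)]
    for x in data:
        r, _ = responsibilities(x, disc, prior, emit, gamma_d, gamma_g)
        for y in range(n_labels):
            label_count[y] += r[y]
            pair_count[y][x] += r[y]
    new_prior = [c / len(data) for c in label_count]
    new_emit = [[pair_count[y][v] / label_count[y] for v in range(n_symbols)]
                for y in range(n_labels)]
    return new_prior, new_emit

# Three symbols, two labels (made-up numbers).
disc = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]   # fixed discriminative posteriors
prior = [0.5, 0.5]
emit = [[0.4, 0.3, 0.3], [0.3, 0.3, 0.4]]
data = [0, 0, 1, 2, 2, 2, 1, 0]
for _ in range(5):
    prior, emit = em_step(data, disc, prior, emit, gamma_d=1.0, gamma_g=0.5)
print(prior, objective(data, disc, prior, emit, 1.0, 0.5))
```

The soft assignments come from the hybrid posterior rather than from the generative model alone, which is how information from the labeled data (through the discriminative model) shapes the clusters.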
3.3 Parameter Estimation Procedure
According to our definition, the $\Theta$ and $\Gamma$ estimations are mutually dependent.
That is, the parameters of the hybrid model, $\Gamma$, should be estimated using Equation (3) with a fixed $\Theta$, while the parameters of the generative models, $\Theta$, should be estimated using Equation (4) with a fixed $\Gamma$. As a solution to our parameter estimation, we search for the $\Theta$ and $\Gamma$ that maximize $\mathcal{L}^{HySOL}(\Gamma|\Theta)$ and $G(\Theta|\Gamma)$ simultaneously.
For this search, we compute $\Theta$ and $\Gamma$ by maximizing the objective functions shown in Equations (4) and (3) iteratively and alternately.
We summarize the algorithm for estimating these model parameters in Figure 1.
4. Perform the following until the convergence condition value computed from $\Theta^{(t)}$ falls below $\epsilon$.
4.1. Estimate $\Theta^{(t+1)}$ by maximizing Equation (4) under fixed $\Gamma^{(t)}$ and $\Lambda$ using $D_u$.
4.2. Estimate $\Gamma^{(t+1)}$ by maximizing Equation (3) under fixed $\Theta^{(t+1)}$ and $\Lambda$ using $D_l''$.
4.3. $t \leftarrow t + 1$.
5. Output a structured predictor $R(y|x, \Lambda, \Theta^{(t)}, \Gamma^{(t)})$.
Figure 1: Algorithm of learning model parameters used in our hybrid model.
Note that during the $\Gamma$ estimation (procedure 4.2 in Figure 1), $\Gamma$ can be over-fitted to the labeled training data if we use the same labeled training data as used for the $\Lambda$ estimation.
There are several possible ways to reduce this over-fitting.
In this paper, we select one of the simplest; we divide the labeled training data $D_l$ into two distinct sets $D_l'$ and $D_l''$.
Then, $D_l'$ and $D_l''$ are individually used for estimating $\Lambda$ and $\Gamma$, respectively.
In our experiments, we divide the labeled training data $D_l$ so that 4/5 is used for $D_l'$ and the remaining 1/5 for $D_l''$.
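The control flow of this procedure can be sketched as follows. The sketch is schematic: `update_theta` and `update_gamma` are hypothetical callables standing in for the maximizations of Equations (4) and (3), the relative-change stopping rule is an assumption rather than the exact condition of Figure 1, and the toy updates at the bottom only exist to make the loop runnable.

```python
def split_labeled(data, fraction=0.8):
    """Split labeled data into a part for the base models and a part for the weights."""
    cut = int(len(data) * fraction)
    return data[:cut], data[cut:]

def alternate(update_theta, update_gamma, theta, gamma, eps=1e-4, max_iter=50):
    """Alternate the two estimations until theta stops changing (assumed criterion)."""
    iterations = 0
    for _ in range(max_iter):
        new_theta = update_theta(gamma)   # procedure 4.1: fixed gamma, unlabeled data
        gamma = update_gamma(new_theta)   # procedure 4.2: fixed theta, held-out labeled data
        change = (sum(abs(a - b) for a, b in zip(new_theta, theta))
                  / max(sum(abs(b) for b in theta), 1e-12))
        theta = new_theta
        iterations += 1
        if change < eps:
            break
    return theta, gamma, iterations

# 4/5 of the labeled data for the base models, 1/5 for the combination weights.
base_part, weight_part = split_labeled(list(range(10)))

# Toy stand-ins for the two maximizations (not the real objectives).
theta, gamma, n_iter = alternate(
    update_theta=lambda g: [0.5 * g[0] + 0.5],
    update_gamma=lambda th: [0.5 * th[0]],
    theta=[1.0],
    gamma=[0.0],
)
print(theta, gamma, n_iter)
```

Passing the two estimation steps in as callables keeps the loop independent of how each maximization is carried out.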
3.4 Efficient Parameter Estimation Algorithm
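Let $N_R(x)$ represent the denominator of Equation (2), that is, the normalization factor of $R$. We can rearrange Equation (2) as follows:
$$R(y|x; \Lambda, \Theta, \Gamma) = \frac{1}{N_R(x)} \prod_i \Big[\frac{\prod_s V^D_{i,s}}{Z_i(x)}\Big]^{\gamma_i} \prod_j \Big[\prod_s V^G_{j,s}\Big]^{\gamma_j} = \frac{1}{N_R(x) \prod_i [Z_i(x)]^{\gamma_i}} \prod_s \prod_i [V^D_{i,s}]^{\gamma_i} \prod_j [V^G_{j,s}]^{\gamma_j}, \quad (6)$$
where $Z_i(x)$ is the partition function of the $i$-th CRF, $V^D_{i,s}$ represents the potential function of the $s$-th position of the sequence in the $i$-th CRF and $V^G_{j,s}$ represents the probability of the $s$-th position in the $j$-th HMM, that is, $V^D_{i,s} = \exp(\lambda_i \cdot f_s)$ and $V^G_{j,s} = \theta_{y_{s-1}, y_s} \theta_{y_s, x_s}$, respectively.
See the Appendix for the derivation of Equation (6) from Equation (2).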
To estimate $\Gamma^{(t+1)}$, namely procedure 4.2 in Figure 1, we employ the derivatives with respect to $\gamma_i$ and $\gamma_j$ shown in Equation (6), which are the parameters of the discriminative and generative models, respectively.
Thus, we obtain the following derivatives with respect to $\gamma_i$ (the derivative of the prior term $\log p(\Gamma)$ is not shown):
$$\frac{\partial \mathcal{L}^{HySOL}(\Gamma|\Theta)}{\partial \gamma_i} = \sum_n \log p^D_i(y^n|x^n; \lambda_i) + \sum_n \log Z_i(x^n) - \sum_n E_{R(y|x^n; \Lambda, \Theta, \Gamma)}\Big[\sum_s \log V^D_{i,s}\Big].$$
The first and second terms are constant during iterative procedure 4 in our optimization algorithm shown in Figure 1.
Thus, we only need to calculate these values once at the beginning of procedure 4.
Let $\alpha_s(y)$ and $\beta_s(y)$ represent the forward and backward state costs at position $s$ with output $y$ for corresponding input $x$. Let $V_s(y, y')$ represent the product of the total value of the transition cost between $s-1$ and $s$ with labels $y$ and $y'$ in the corresponding input sequence, that is, $V_s(y, y') = \prod_i [V^D_{i,s}(y, y')]^{\gamma_i} \prod_j [V^G_{j,s}(y, y')]^{\gamma_j}$.
The third term, which indicates the expectation of potential functions, can be rewritten in the form of a forward-backward algorithm, that is,
$$E_{R(y|x; \Lambda, \Theta, \Gamma)}\Big[\sum_s \log V^D_{i,s}\Big] = \frac{1}{Z_R(x)} \sum_s \sum_{y, y'} \alpha_{s-1}(y) V_s(y, y') \beta_s(y') \log V^D_{i,s}(y, y'), \quad (7)$$
where $Z_R(x)$ represents the partition function of our hybrid model, that is, $Z_R(x) = N_R(x) \prod_i [Z_i(x)]^{\gamma_i}$.
Hence, the calculation of derivatives with respect to $\gamma_i$ is tractable since we can incorporate the same forward-backward algorithm as that used in a standard CRF.
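The forward-backward form of this expectation can be sketched on a toy chain. The layout is an assumption made for the example: `V[s][a][b]` holds the combined potential for moving from label `a` to label `b` at step `s` (already raised to the combination weights and multiplied together), and `W[s][a][b]` holds the log-potential of the single model whose expectation is wanted. The first table has one source row for the fixed start label and the last table one destination column for the fixed end label; all numbers are made up.

```python
import math

def expected_log_potential(V, W):
    """E_R[sum_s W_s(y_{s-1}, y_s)] under the chain distribution defined by V."""
    # Forward pass: alphas[s][a] = total mass of prefixes arriving at label a before step s.
    alphas = [[1.0] * len(V[0])]
    for table in V:
        prev = alphas[-1]
        alphas.append([sum(prev[a] * table[a][b] for a in range(len(table)))
                       for b in range(len(table[0]))])
    # Backward pass: betas[s][a] = total mass of suffixes leaving label a from step s on.
    betas = [[1.0] * len(V[-1][0])]
    for table in reversed(V):
        nxt = betas[0]
        betas.insert(0, [sum(table[a][b] * nxt[b] for b in range(len(table[0])))
                         for a in range(len(table))])
    z = sum(alphas[-1])  # partition function of the combined model
    total = 0.0
    for s, table in enumerate(V):
        for a in range(len(table)):
            for b in range(len(table[0])):
                total += alphas[s][a] * table[a][b] * betas[s + 1][b] * W[s][a][b]
    return total / z

# Combined potentials and one model's log-potentials (made-up numbers).
V = [
    [[1.6, 0.8]],
    [[2.7, 0.4], [1.3, 2.2]],
    [[1.2, 1.5], [0.6, 4.5]],
    [[1.0], [1.1]],
]
W = [
    [[0.3, -0.1]],
    [[0.9, -0.8], [0.2, 0.5]],
    [[0.1, 0.3], [-0.4, 1.2]],
    [[0.0], [0.05]],
]
print(expected_log_potential(V, W))
```

A single forward pass and a single backward pass are shared by every model's expectation, since only the `W` tables change from model to model.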
Then, the derivatives with respect to $\gamma_j$, which are the parameters of generative models, can be written as follows:
$$\frac{\partial \mathcal{L}^{HySOL}(\Gamma|\Theta)}{\partial \gamma_j} = \sum_n \log p^G_j(x^n, y^n; \theta_j) - \sum_n E_{R(y|x^n; \Lambda, \Theta, \Gamma)}\Big[\sum_s \log V^G_{j,s}\Big].$$
Again, the second term, which indicates the expectation of transition probabilities and symbol emission probabilities, can be rewritten in the form of a forward-backward algorithm in the same manner as for $\gamma_i$, where the only difference is that $V^D_{i,s}$ is substituted by $V^G_{j,s}$ in Equation (7).
To estimate $\Theta^{(t+1)}$, which is procedure 4.1 in Figure 1, the same forward-backward algorithm as used in standard HMMs is available since the form of our Q-function shown in Equation (5) is the same as that of standard HMMs.
The only difference is that our method uses marginal probabilities given by $R$ instead of the $p(x, y; \theta)$ of standard HMMs.
Therefore, only a forward-backward algorithm is required for the efficient calculation of our parameter estimation process.
Note that even though our hybrid model supports the use of a combination of several generative and discriminative models, we only need to calculate the forward-backward algorithm once for each sample during optimization procedures 4.1 and 4.2.
This means that the required number of executions of the forward-backward algorithm for our parameter estimation is independent of the number of models used in the hybrid model.
In addition, after training, we can easily merge all the parameter values in a single parameter vector.
This means that we can simply employ the Viterbi algorithm for evaluating unseen samples, just as with standard CRFs, without any additional cost.
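One way to picture this merging step is with sparse weight dictionaries. The sketch below is an assumption-laden illustration: the dictionary representation, the feature names and the function names are all hypothetical, and the HMM log-probabilities are simply treated as weights of indicator features for transitions and emissions.

```python
def merge_parameters(crf_weights, hmm_log_probs, gamma_d, gamma_g):
    """Fold the combination weights into one weight table keyed by feature name."""
    merged = {}
    for g, weights in zip(gamma_d, crf_weights):
        for name, w in weights.items():
            merged[name] = merged.get(name, 0.0) + g * w
    for g, log_probs in zip(gamma_g, hmm_log_probs):
        for name, lp in log_probs.items():
            merged[name] = merged.get(name, 0.0) + g * lp
    return merged

def score(weights, active_features):
    """Local log-potential: sum of the weights of the active indicator features."""
    return sum(weights.get(name, 0.0) for name in active_features)

# Two CRFs and one HMM (made-up weights and log-probabilities).
crfs = [
    {"word=Tokyo|LOC": 1.2, "prev=O|LOC": 0.3},
    {"cap=yes|LOC": 0.7, "prev=O|LOC": -0.1},
]
hmms = [{"trans:O->LOC": -1.5, "emit:LOC->Tokyo": -2.0}]
gamma_d, gamma_g = [0.6, 0.4], [0.3]

merged = merge_parameters(crfs, hmms, gamma_d, gamma_g)
active = ["word=Tokyo|LOC", "prev=O|LOC", "cap=yes|LOC", "trans:O->LOC", "emit:LOC->Tokyo"]
print(score(merged, active))
```

Since the combined log-potential is linear in these indicator features, decoding with the merged table gives the same path scores as combining the individual models at every position.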
4 Experiments
We examined our hybrid model (HySOL) by applying it to two sequence labeling tasks, named entity recognition (NER) and syntactic chunking (Chunking).
We used the same Chunking and 'English' NER data as those used for the shared tasks of CoNLL-2000 (Tjong Kim Sang and Buchholz, 2000) and CoNLL-2003 (Tjong Kim Sang and Meulder, 2003), respectively.
For the baseline method, we trained a conditional random field (CRF) using exactly the same training procedure as described in (Sha and Pereira, 2003) with L-BFGS.
Moreover, LOP-CRF (Smith et al., 2005) is also compared with our hybrid model, since the formalism of our hybrid model can be seen as an extension of LOP-CRFs as described in Section 3.
For CRF, we used the Gaussian prior as the second term on the RHS in Equation (1), where $\delta^2$ represents the hyper-parameter in the Gaussian prior.
In contrast, for LOP-CRF and HySOL, we used the Dirichlet priors as the second term on the RHS in Equations (3) and (4), where $\xi$ and $\eta$ are the hyper-parameters in each Dirichlet prior.
Table 1: Features used in NER experiments
4.1 Named Entity Recognition Experiments
The English NER data consists of 203,621, 51,362 and 46,435 words from 14,987, 3,466 and 3,684 sentences in training, development and test data, respectively, with four named entity tags, PERSON, LOCATION, ORGANIZATION and MISC, plus the 'O' tag.
The unlabeled data consists of 17,003,926 words from 1,029,122 sentences.
These data sets are exactly the same as those provided for the shared task of CoNLL-2003.
We slightly extended the feature set of the supplied data by adding feature types such as 'word type', and word prefix and suffix.
Examples of 'word type' include whether the word is capitalized, contains a digit or contains punctuation, which basically follows the baseline features of (Sutton et al., 2006) without regular expressions.
Note that, unlike several previous studies, we did not employ additional information from external resources such as gazetteers.
All our features can be automatically extracted from the supplied data.
For LOP-CRF and HySOL, we used four base discriminative models trained by CRFs with different feature sets.
Table 1 shows the feature sets we used for training these models.
The design of these feature sets was derived from a suggestion in (Smith et al., 2005), which exhibited the best performance among several feature divisions.
Note that the CRF for the comparison method was trained by using all feature types, namely the same as $\lambda_4$.
Table 2: Features used in Chunking experiments
As we explained in Section 3.3, for training HySOL, the parameters of the four discriminative models, $\Lambda$, were trained from 4/5 of the labeled training data, and $\Gamma$ was trained from the remaining 1/5.
For the features of the generative models, we used all of the feature types shown in Table 1.
Note that one feature type corresponds to one HMM.
Thus, each HMM consists of a non-overlapping feature set, since each feature type only generates one symbol per state.
4.2 Syntactic Chunking Experiments
CoNLL-2000 Chunking data was obtained from the Wall Street Journal (WSJ) corpus: sections 15-18 as training data (8,936 sentences and 211,727 words), and section 20 as test data (2,012 sentences and 47,377 words), with 11 different chunk-tags, such as NP and VP plus the 'O' tag, which represents the region outside any target chunk.
For LOP-CRF and HySOL, we also used four base discriminative models trained by CRFs with different feature sets.
Table 2 shows the feature set we used in the Chunking experiments.
We used the feature set of the supplied data without any extension of additional feature types.
To train HySOL, we used the same unlabeled data as used for our NER experiments (17,003,926 words from the Reuters corpus).
Moreover, the division of the labeled training data and the feature set of the generative models were derived in the same manner as our NER experiments (see Section 4.1).
That is, we divided the labeled training data into 4/5 for estimating $\Lambda$ and 1/5 for estimating $\Gamma$; one feature type shown in Table 2 is assigned to one generative model.
5 Results and Discussion
We evaluated the performance in terms of the $F_{\beta=1}$ score, which is the evaluation measure used in CoNLL-2000 and 2003, and sentence accuracy, since all the methods in our experiments optimize sequence loss.
Tables 3 and 4 show the results of the NER and Chunking experiments, respectively.
The $F_{\beta=1}$ and 'Sent' columns show the performance evaluated using the $F_{\beta=1}$ score and sentence accuracy, respectively.
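For readers unfamiliar with the two measures, the sketch below computes both from sets of predicted and gold spans. It is an illustration under assumptions, not the official evaluation script: spans are represented as hypothetical `(start, end, type)` tuples, and a sentence counts as correct when its predicted span set equals the gold span set.

```python
def f_beta1(gold_spans, pred_spans):
    """Span-level F score with beta = 1 over a list of sentences."""
    true_pos = sum(len(g & p) for g, p in zip(gold_spans, pred_spans))
    n_gold = sum(len(g) for g in gold_spans)
    n_pred = sum(len(p) for p in pred_spans)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

def sentence_accuracy(gold_spans, pred_spans):
    """Fraction of sentences whose predicted spans all match the gold spans."""
    correct = sum(1 for g, p in zip(gold_spans, pred_spans) if g == p)
    return correct / len(gold_spans)

# Two sentences; the second predicted span of sentence 1 has the wrong type.
gold = [{(0, 1, "PER"), (3, 4, "LOC")}, {(2, 3, "ORG")}]
pred = [{(0, 1, "PER"), (3, 4, "ORG")}, {(2, 3, "ORG")}]
print(f_beta1(gold, pred), sentence_accuracy(gold, pred))
```

Sentence accuracy is the stricter of the two: a single wrong span makes the whole sentence count as an error, whereas the F score gives credit for the spans that are correct.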
$\delta^2$, $\xi$ and $\eta$, which are the hyper-parameters in the Gaussian or Dirichlet priors, are selected from a certain value set by using a development set (footnote 1), that is, $\delta^2 \in \{0.01, 0.1, 1, 10, 100, 1000\}$, $\xi - 1 = \xi' \in \{0.01, 0.1, 1, 10\}$ and $\eta - 1 = \eta' \in \{0.00001, 0.0001, 0.001, 0.01\}$.
The second rows of CRF in Tables 3 and 4 represent the performance of base discriminative models used in HySOL with all the features, which are trained with 4/5 of the labeled training data.
The third rows of HySOL show the performance obtained without using generative models (unlabeled data).
The model itself is essentially the same as LOP-CRFs.
However, the performance in the third HySOL rows was consistently lower than that of LOP-CRF, since the discriminative models in HySOL are trained with 4/5 of the labeled data.
1 Chunking (CoNLL-2000) data has no common development set.
Thus, in our preliminary examination, we used 4/5 of the labeled training data for training and the remaining 1/5 as development data to determine the hyper-parameter values.
Figure 2: Changes in the performance and the convergence condition value (procedure 4 in Figure 1) of HySOL.
HySOL significantly improved the performance of the supervised settings, CRF and LOP-CRF, in both the NER and Chunking experiments.
5.1 Impact of Incorporating Unlabeled Data
The contributions provided by incorporating unlabeled data in our hybrid model can be seen by comparing the performance of the first and third rows in HySOL, namely a 2.64 point F-score and a 2.96 point sentence accuracy gain in the NER experiments and a 0.46 point F-score and a 1.99 point sentence accuracy gain in the Chunking experiments.
We believe there are two key ideas that enable the unlabeled data in our approach to exhibit this improvement compared with the state-of-the-art performance provided by discriminative models in supervised settings.
First, unlabeled data is only used for optimizing Equation (4) to obtain a similar effect to 'soft-clustering', which can be calculated without information about the correct output.
Second, by using a combination of generative models, we can enhance the flexibility of the feature design for unlabeled data.
For example, we can handle arbitrary overlapping features, similar to those used in discriminative models, for unlabeled data by assigning one feature type for one generative model as in our experiments.
5.2 Impact of Iterative Parameter Estimation
Figure 2 shows the changes in the performance and the convergence condition value of HySOL during the parameter estimation iterations in our NER and Chunking experiments.
As shown in the figure, HySOL was able to reach the convergence condition in a small number of iterations in our experiments.
Moreover, the change in the performance remains quite stable during the iteration.
However, theoretically, our optimization procedure is not guaranteed to converge in the $\Gamma$ and $\Theta$ space, since the optimization of $\Theta$ has local maxima.
Even if we were unable to meet the convergence condition, we were easily able to obtain model parameters by performing a sufficient fixed number of iterations, and then select the parameters when Equation (4) obtained the maximum objective value.
5.3 Comparison with SS-CRF-MER
When we consider semi-supervised SOL methods, SS-CRF-MER (Jiao et al., 2006) is the most competitive with HySOL, since both methods are defined based on CRFs.
We planned to compare the performance with that of SS-CRF-MER in our NER and Chunking experiments.
Unfortunately, we failed to implement SS-CRF-MER since it requires the use of a slightly complicated algorithm, called the 'nested' forward-backward algorithm.
Although we cannot compare the performance, our hybrid approach has several good characteristics compared with SS-CRF-MER.
First, SS-CRF-MER requires a higher order algorithm, namely a 'nested' forward-backward algorithm, for the parameter estimation of unlabeled data, whose time complexity is $O(L^3 S^2)$ for each unlabeled sample, where $L$ and $S$ represent the output label size and unlabeled sample length, respectively.
Thus, our hybrid approach is more scalable with respect to the size of the unlabeled data, since HySOL only needs a standard forward-backward algorithm whose time complexity is $O(L^2 S)$.
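To make the gap between the two complexities concrete, the short calculation below compares the two operation counts for one unlabeled sample, ignoring constant factors. The values $L = 9$ and $S = 30$ are illustrative assumptions, not figures taken from the experiments.

```latex
\frac{L^3 S^2}{L^2 S} = L S, \qquad
L = 9,\ S = 30: \quad
L^3 S^2 = 729 \times 900 = 656{,}100, \quad
L^2 S = 81 \times 30 = 2{,}430, \quad
\frac{656{,}100}{2{,}430} = 270 = 9 \times 30 .
```

Under these assumed values the nested algorithm needs a few hundred times more operations per sample, and the factor grows with both the label set size and the sentence length.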
In fact, we still have a question as to whether SS-CRF-MER is really scalable in practical time for such a large amount of unlabeled data as used in our experiments, which is about 680 times larger than that of (Jiao et al., 2006).
Scalability for unlabeled data will become really important in the future, as it will be natural to use millions or billions of unlabeled data for further improvement.
Second, SS-CRF-MER has a sensitive hyper-parameter in the objective function, which controls the influence of the unlabeled data.
In contrast, our objective function only has a hyper-parameter of prior distribution, which is widely used for standard MAP estimation.
Moreover, the experimental results shown in Tables 3 and 4 indicate that HySOL is rather robust with respect to the hyper-parameter, since we can obtain fairly good performance without a prior distribution.
5.4 Comparison with Previous Top Systems
With respect to the performance of NER and Chunking tasks, the current best performance is reported in (Ando and Zhang, 2005), which we refer to as 'ASO-semi', as shown in Tables 5 and 6.
ASO-semi also incorporates unlabeled data solely for the additional information in the same way as our method.
Unfortunately, our results could not reach their level of performance, although the size and source of the unlabeled data are not the same for certain reasons.
First, (Ando and Zhang, 2005) does not describe the unlabeled data used in their NER experiments in detail, and second, we are not licensed to use the TREC corpus including WSJ unlabeled data that they used for their Chunking experiments (training and test data for Chunking is derived from WSJ).
Therefore, we simply used the supplied unlabeled data of the CoNLL-2003 shared task for both NER and Chunking.
If we consider the advantage of our approach, our hybrid model incorporating generative models seems rather intuitive, since it is sometimes difficult to design effective auxiliary problems for the target problem.
Table 7: The HySOL performance with the F-score optimization technique and some additional resources in NER (CoNLL-2003) experiments
Table 8: The HySOL performance with the F-score optimization technique on Chunking (CoNLL-2000) experiments
Interestingly, the additional information obtained from unlabeled data appears to differ between the two methods.
ASO-semi uses unlabeled data for constructing auxiliary problems to find the 'shared structures' of auxiliary problems that are expected to improve the performance of the main problem.
Moreover, it is possible to combine both methods, for example, by incorporating the features obtained with their method in our base discriminative models, and then construct a hybrid model using our method.
Therefore, there may be a possibility of further improving the performance by this simple combination.
In NER, most of the top systems other than ASO-semi boost performance by employing external hand-crafted resources such as large gazetteers.
This is why their results are superior to those obtained with HySOL.
In fact, if we simply add the gazetteers included in CoNLL-2003 supplied data as features, HySOL achieves 88.14.
5.5 Applying F-score Optimization Technique
In addition, we can simply apply the F-score optimization technique for sequence labeling tasks proposed in (Suzuki et al., 2006) to boost the HySOL performance, since the base discriminative models $p^D(y|x)$ and the discriminative combination, namely Equation (3), in our hybrid model basically use the same optimization procedure as CRFs.
Tables 7 and 8 show the F-score gain when we apply the F-score optimization technique.
As shown in the tables, the F-score optimization technique can easily improve the (F-score) performance without any additional resources or feature engineering.
In NER, we also examined HySOL with additional resources to observe the performance gain.
The third row represents the performance when we add approximately 10M words of unlabeled data (total 27M words; footnote 2) that are derived from the 1996/11/15-30 articles in the Reuters corpus.
Then, the fourth and fifth rows represent the performance when we add the supplied gazetteers in the CoNLL-2003 data as features, and when we add the development data as training data for $\Gamma$. In this case, HySOL achieved a performance comparable to that of the current best system, ASO-semi, in both the NER and Chunking experiments, even though the NER experiment is not a fair comparison, since we added additional resources (gazetteers and dev. set) that ASO-semi does not use in training.
6 Conclusion and Future Work
We proposed a framework for semi-supervised SOL based on a hybrid generative and discriminative approach.
Experimental results showed that incorporating unlabeled data in a generative manner can further improve on the state-of-the-art performance provided by supervised SOL methods such as CRFs, with the help of our hybrid approach, which discriminatively combines the generative models with discriminative models.
In future we intend to investigate more appropriate model and feature design for unlabeled data, which may further improve the performance achieved in our experiments.
Appendix
2 In order to keep the POS tags consistent, we reattached the POS tags of the supplied data set and the new 10M words of unlabeled data using a POS tagger trained on the WSJ corpus.
