We cannot use non-local features with current major methods of sequence labeling such as CRFs due to concerns about complexity.
We propose a new perceptron algorithm that can use non-local features.
Our algorithm allows the use of all types of non-local features whose values are determined from the sequence and the labels.
The weights of local and non-local features are learned together in the training process with guaranteed convergence.
We present experimental results from the CoNLL 2003 named entity recognition (NER) task to demonstrate the performance of the proposed algorithm.
1 Introduction
Many NLP tasks such as POS tagging and named entity recognition have recently been solved as sequence labeling.
Discriminative methods such as Conditional Random Fields (CRFs) (Lafferty et al., 2001), Semi-Markov Random Fields (Sarawagi and Cohen, 2004), and perceptrons (Collins, 2002a) have been popular approaches for sequence labeling because of their excellent performance, which is mainly due to their ability to incorporate many kinds of overlapping and non-independent features.
However, the common limitation of these methods is that the features are limited to "local" features, which only depend on a very small number of labels (usually two: the previous and the current).
Although this limitation makes training and inference tractable, it also excludes the use of possibly useful "non-local" features that are accessible after all labels are determined.
For example, non-local features such as "same phrases in a document do not have different entity classes" were shown to be useful in named entity recognition (Sutton and McCallum, 2004; Bunescu and Mooney, 2004; Finkel et al., 2005; Krishnan and Manning, 2006).
We propose a new perceptron algorithm in this paper that can use non-local features along with local features.
Although several methods have already been proposed to incorporate non-local features (Sutton and McCallum, 2004; Bunescu and Mooney, 2004; Finkel et al., 2005; Roth and Yih, 2005; Krishnan and Manning, 2006; Nakagawa and Matsumoto, 2006), they all constrain the types of non-local features that can be used.
For example, Finkel et al. (2005) enabled the use of non-local features by using Gibbs sampling.
However, it is unclear how to apply their method of determining the parameters of a non-local model to other types of non-local features, which they did not use.
Roth and Yih (2005) enabled the use of hard constraints on labels by using integer linear programming.
However, this is equivalent to only allowing non-local features whose weights are fixed to negative infinity.
Krishnan and Manning (2006) divided the model into two CRFs, where the second model uses the output of the first as a kind of non-local information.
However, it is not possible to use non-local features that depend on the labels of the very candidate to be scored.
Nakagawa and Matsumoto (2006) used a Boltzmann distribution to model the correlation of the POS of words having the same lexical form in a document.
However, their method can only be applied when there are convenient links such as the same lexical form.
Since non-local features have not yet been extensively investigated, it is possible for us to find new useful non-local features.
Therefore, our objective in this study was to establish a framework where all types of non-local features are allowed.
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 315-324, Prague, June 2007.
©2007 Association for Computational Linguistics
With non-local features, we cannot use efficient procedures such as forward-backward procedures and the Viterbi algorithm that are required in training CRFs (Lafferty et al., 2001) and perceptrons (Collins, 2002a).
Recently, several methods (Collins and Roark, 2004; Daume III and Marcu, 2005; McDonald and Pereira, 2006) have been proposed with similar motivation to ours.
These methods alleviate this problem by using some approximation in perceptron-type learning.
In this paper, we follow this line of research and try to solve the problem by extending Collins' perceptron algorithm (Collins, 2002a).
We exploited the not-so-familiar fact that we can design a perceptron algorithm with guaranteed convergence if we can find at least one wrong labeling candidate even if we cannot perform exact inference.
We first ran the A* search only using local features to generate n-best candidates (this can be efficiently performed), and then we only calculated the true score with non-local features for these candidates to find a wrong labeling candidate.
The second key idea was to update the weights of local features during training if this was necessary to generate sufficiently good candidates.
The proposed algorithm combined these ideas to achieve guaranteed convergence and effective learning with non-local features.
The remainder of the paper is organized as follows.
Section 2 introduces Collins' perceptron algorithm.
Although this algorithm is the starting point for our algorithm, its baseline performance is not outstanding.
Therefore, we present a margin extension to Collins' perceptron in Section 3.
This margin perceptron became the direct basis of our algorithm.
We then explain our algorithm for non-local features in Section 4.
We report the experimental results using the CoNLL 2003 shared task dataset in Section 6.
2 Perceptron Algorithm for Sequence Labeling
where $\cdot$ denotes the inner product.
The aim of the learning algorithm is to obtain an appropriate weight vector, $\alpha$, given training set $\{(x_1, y_1^*), \ldots, (x_L, y_L^*)\}$.
The learning algorithm, which is illustrated in Collins (2002a), proceeds as follows.
The weight vector is initialized to zero.
The algorithm passes over the training examples, and each sequence is decoded using the current weights.
If $y'$ is not the correct answer $y^*$, the weights are updated according to the following rule.
$\alpha^{new} = \alpha + \Phi(x, y^*) - \Phi(x, y')$.
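In the separable case, if the training data can be separated with margin $\delta$ and $\|\Phi(x_i, y_i^*) - \Phi(x_i, y)\| \le R$ holds for all $i$ and $y$, the number of updates is at most $R^2/\delta^2$ (Collins, 2002a).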
The perceptron algorithm only requires one candidate $y'$ for each sequence $x_i$, unlike the training of CRFs where all possible candidates need to be considered.
This inherent property is the key to training with non-local features.
However, note that the tractability of learning and inference relies on how efficiently y' can be found.
In practice, we can find $y'$ efficiently using a Viterbi-type algorithm only when the features are all local, i.e., $\Phi_s(x, y)$ can be written as the sum of (two-label) local features $\phi_s$ as $\Phi_s(x, y) = \sum_{i} \phi_s(x, y_{i-1}, y_i)$.
This locality constraint is also required to make the training of CRFs tractable (Lafferty et al., 2001).
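The procedure described above (decode with the current weights, then add the features of the correct answer and subtract those of the wrong candidate) can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the feature templates in `local_features`, the start symbol `"<s>"`, and the function names are all assumptions made for the example, and sequences are assumed to be non-empty.

```python
from collections import defaultdict

def local_features(x, i, prev, cur):
    """Hypothetical two-label local features phi(x, y_{i-1}, y_i)."""
    return [("word", x[i], cur), ("edge", prev, cur)]

def phi(x, y):
    """Feature vector Phi(x, y): the sum of local features over positions."""
    vec = defaultdict(float)
    prev = "<s>"  # assumed start-of-sequence symbol
    for i, cur in enumerate(y):
        for f in local_features(x, i, prev, cur):
            vec[f] += 1.0
        prev = cur
    return vec

def local_score(x, i, prev, cur, alpha):
    return sum(alpha.get(f, 0.0) for f in local_features(x, i, prev, cur))

def viterbi(x, labels, alpha):
    """Return argmax_y Phi(x, y) . alpha by dynamic programming."""
    # best[i][cur] = (score of the best prefix ending in cur, back-pointer)
    best = [{cur: (local_score(x, 0, "<s>", cur, alpha), None) for cur in labels}]
    for i in range(1, len(x)):
        row = {}
        for cur in labels:
            cands = [(best[i - 1][prev][0] + local_score(x, i, prev, cur, alpha), prev)
                     for prev in labels]
            row[cur] = max(cands, key=lambda t: t[0])
        best.append(row)
    cur = max(labels, key=lambda l: best[-1][l][0])
    y = [cur]
    for i in range(len(x) - 1, 0, -1):  # follow back-pointers
        cur = best[i][cur][1]
        y.append(cur)
    return list(reversed(y))

def train_perceptron(data, labels, epochs=10):
    """Perceptron training: alpha starts at zero, one update per mistake."""
    alpha = defaultdict(float)
    for _ in range(epochs):
        updates = 0
        for x, y_star in data:
            y_pred = viterbi(x, labels, alpha)
            if y_pred != y_star:
                # alpha_new = alpha + Phi(x, y*) - Phi(x, y')
                for f, v in phi(x, y_star).items():
                    alpha[f] += v
                for f, v in phi(x, y_pred).items():
                    alpha[f] -= v
                updates += 1
        if updates == 0:  # no mistakes left on the training data
            break
    return alpha
```

Because every feature touches at most two adjacent labels, the decoder needs only a table indexed by position and label, which is what makes each training step cheap.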
One problem with the perceptron algorithm described so far is that it offers no treatment for over-fitting.
Thus, Collins (2002a) also proposed an averaged perceptron, where the final weight vector is the average of all weight vectors during training.
However, we found in our experiments that the averaged perceptron performed poorly in our setting.
We therefore tried to make the perceptron algorithm more robust to overfitting.
We will describe our extension to the perceptron algorithm in the next section.
1 Collins (2002a) also provided a proof that guarantees "good" learning for the non-separable case.
However, we have only considered the separable case throughout the paper.
3 Margin Perceptron Algorithm for Sequence Labeling
We extended a perceptron with a margin (Krauth and Mezard, 1987) to sequence labeling in this study, as Collins (2002a) extended the perceptron algorithm to sequence labeling.
In the case of sequence labeling, the margin is defined as:
Assuming that the best candidate, $y'$, equals the correct answer, $y^*$, the margin can be re-written as:
where $y'' = \text{2nd-best}_{y}\, \Phi(x_i, y) \cdot \alpha$. Using this relation, the resulting algorithm becomes Algorithm 3.1.
Algorithm 3.1: Perceptron with margin for sequence labeling (parameters: C)
The algorithm tries to enlarge the margin as much as possible, as well as make the best scoring candidate equal the correct answer.
margin $\delta$ (at the cost of infinite training time), as $C \to \infty$.
Note that if the features are all local, the second-best candidate (generally n-best candidates) can also be found efficiently by using an A* search that uses the best scores calculated during a Viterbi search as the heuristic estimation (Soong and Huang, 1991).
There are other methods for improving robustness by enlarging the margin for the structured output problem.
Such methods include ALMA (Gentile, 2001) used in (Daume III and Marcu, 2005)2, MIRA (Crammer et al., 2006) used in (McDonald et al., 2005), and Max-Margin Markov Networks (Taskar et al., 2003).
However, to the best of our knowledge, there has been no prior work that has applied a perceptron with a margin (Krauth and Mezard, 1987) to structured output.3 Our method described in this section is one of the easiest to implement, while guaranteeing a large margin.
We found in the experiments that our method outperformed Collins' averaged perceptron by a large margin.
3 For re-ranking problems, Shen and Joshi (2004) proposed a perceptron algorithm that also uses margins.
The difference is that our algorithm trains the sequence labeler itself and is much simpler because it only aims at labeling.
4 Algorithm
4.1 Definition and Basic Idea
Having described the basic perceptron algorithms, we will now explain our algorithm that learns the weights of local and non-local features in a unified way.
Ideally, we want to determine the labels using the whole feature set as:
$y' = \mathrm{argmax}_{y \in \mathcal{Y}^{|x|}} \Phi^a(x, y) \cdot \alpha$.
Algorithm 4.1: Candidate algorithm (parameters:
However, if there are non-local features, it is impossible to find the highest scoring candidate efficiently, since we cannot use the Viterbi algorithm.
Thus, we cannot use the perceptron algorithms described in the previous sections.
The training of CRFs is also intractable for the same reason.
To deal with this problem, we first relaxed our objective.
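The modified objective was to find a good model from those with the form:
$\{y^n\} = \text{n-best}_{y}\, \Phi^l(x, y) \cdot \alpha, \quad y' = \mathrm{argmax}_{y \in \{y^n\}} \Phi^a(x, y) \cdot \alpha$. (1)
That is, we first generate n-best candidates $\{y^n\}$ under the local model, $\Phi^l(x, y) \cdot \alpha$. This can be done efficiently using the A* algorithm.
We then find the best scoring candidate under the total model, $\Phi^a(x, y) \cdot \alpha$, only from these n-best candidates.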
If n is moderately small, this can also be done in a practical amount of time.
This resembles the re-ranking approach (Collins and Duffy, 2002; Collins, 2002b).
However, unlike the re-ranking approach, the local model, $\Phi^l(x, y) \cdot \alpha$, and the total model, $\Phi^a(x, y) \cdot \alpha$, correlate since they share a part of the vector and are trained at the same time in our algorithm.
The re-ranking approach has the disadvantage that it is necessary to use different training corpora for the first model and for the second, or to use cross validation type training, to make the training for the second meaningful.
This reduces the effective size of training data or increases training time substantially.
On the other hand, our algorithm has no such disadvantage.
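The two-stage inference described above (n-best candidates under the local model, then the best of those under the total model) can be sketched as follows. This is an illustration only: the brute-force enumeration stands in for the A* n-best search and is feasible only for toy inputs, and `local_phi`, `nonlocal_phi`, and their feature names are hypothetical.

```python
import heapq
from itertools import product

def score(features, alpha):
    """Inner product of a sparse feature dict with the weight dict alpha."""
    return sum(alpha.get(f, 0.0) * v for f, v in features.items())

def local_phi(x, y):
    """Hypothetical local features: word/label pairs plus the first label."""
    feats = {("first", y[0]): 1.0}
    for w, l in zip(x, y):
        feats[("word", w, l)] = feats.get(("word", w, l), 0.0) + 1.0
    return feats

def nonlocal_phi(x, y):
    """Hypothetical non-local feature: fires if a repeated word gets two labels."""
    seen = {}
    inconsistent = 0.0
    for w, l in zip(x, y):
        if w in seen and seen[w] != l:
            inconsistent = 1.0
        seen.setdefault(w, l)
    return {("inconsistent",): inconsistent}

def nbest_then_rescore(x, labels, alpha, n):
    """n-best under the local model, then argmax under the total model."""
    # Stand-in for the A* n-best search: enumerate every labeling (toy sizes only).
    all_y = [list(y) for y in product(labels, repeat=len(x))]
    candidates = heapq.nlargest(n, all_y, key=lambda y: score(local_phi(x, y), alpha))

    def total(y):
        return score(local_phi(x, y), alpha) + score(nonlocal_phi(x, y), alpha)

    return max(candidates, key=total), candidates
```

With a small n, the correct labeling may be missing from the candidate list, in which case the re-scoring step cannot recover it; this is the failure mode that motivates updating the local model during training.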
However, we are no longer able to find the highest scoring candidate under $\Phi^a(x, y) \cdot \alpha$ exactly with this approach.
Thus, we cannot use the perceptron algorithms directly.
However, by examining the proofs in Collins (2002a), we can see that the essential condition for convergence is that the weights are always updated using some $y$ ($\neq y^*$) that satisfies:
That is, $y$ does not necessarily need to be the exact best candidate or the exact second-best candidate.
The algorithm also converges in a finite number of iterations even with Eq. (1) as long as Eq. (2) is satisfied.
4.2 Candidate Algorithm
The first algorithm we devised based on the above idea is Algorithm 4.1.
We first find the n-best candidates using the local model, $\Phi^l(x, y) \cdot \alpha$. At this point, we can determine the value of the non-local features, $\Phi^g(x, y)$, to form the whole feature vector, $\Phi^a(x, y)$, for the n-best candidates.
Next, we re-score and sort them using the total model, $\Phi^a(x, y) \cdot \alpha$, to find a candidate that violates the margin condition.
We call this algorithm the "candidate algorithm".
After the training has finished, $\Phi^a(x_i, y^*) \cdot \alpha - \Phi^a(x_i, y) \cdot \alpha > C$ is guaranteed for all $(x_i, y)$ where $y \in \{y^n\}$, $y \neq y^*$.
At first glance, this seems to be a sufficient condition for good models.
However, this is not true because if $y^* \notin \{y^n\}$, the inference defined by Eq. (1) is not guaranteed to find the correct answer, $y^*$.
In fact, this algorithm does not work well with non-local features as we found in the experiments.
Our idea for improving the above algorithm is that the local model, $\Phi^l(x, y) \cdot \alpha$, must at least be so good that $y^* \in \{y^n\}$.
To achieve this, we added a modification term that was intended to improve the local model when the local model was not good enough even when the total model was good enough.
The resulting final algorithm is Algorithm 4.2.
As can be seen, the part marked (B) has been added.
We call this algorithm the "proposed algorithm".
Note that the algorithm prioritizes the update of the total model, (A), over that of the local model, (B), although the opposite is also possible.
Also note that the update of the local model in (B) is "aggressive" since it updates the weights until the best candidate output by the local model becomes the correct answer and satisfies the margin condition.
A "conservative" updating, where we cease the update when the n-best candidates contain the correct answer, is also possible from our idea above.
We made these choices since they worked better than the other alternatives.
The tunable parameters are the local margin parameter, $C^l$, the total margin parameter, $C^a$, and n for the n-best search.
We used $C = C^l = C^a$ in this study to reduce the search space.
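One reading of the update scheme described above (total-model update (A) first, then the "aggressive" local-model update (B)) can be sketched as follows. This is an interpretation of the prose, not a transcription of Algorithm 4.2: the function names, the sparse-dict representation, and the toy features are assumptions, and a single margin parameter `C` is used for both models as in the setting $C = C^l = C^a$.

```python
def dot(feats, alpha):
    return sum(alpha.get(f, 0.0) * v for f, v in feats.items())

def add(alpha, feats, scale):
    for f, v in feats.items():
        alpha[f] = alpha.get(f, 0.0) + scale * v

def proposed_update(alpha, y_star, candidates, phi_l, phi_g, C):
    """One training step on one sequence; returns "A", "B", or None.

    candidates: n-best labelings under the local model, best first.
    phi_l, phi_g: hypothetical functions mapping a labeling to sparse
    dicts of local and non-local features, respectively.
    """
    def phi_a(y):  # total feature vector = local part + non-local part
        feats = dict(phi_l(y))
        for f, v in phi_g(y).items():
            feats[f] = feats.get(f, 0.0) + v
        return feats

    wrong = [y for y in candidates if y != y_star]
    if not wrong:
        return None
    # (A) total model: best wrong candidate among the n-best must trail y* by more than C.
    y_bad = max(wrong, key=lambda y: dot(phi_a(y), alpha))
    if dot(phi_a(y_star), alpha) - dot(phi_a(y_bad), alpha) <= C:
        add(alpha, phi_a(y_star), 1.0)
        add(alpha, phi_a(y_bad), -1.0)
        return "A"
    # (B) local model: best wrong candidate under the local model (first in the
    # sorted list) must also trail y* by more than C; only local weights change.
    y_bad = wrong[0]
    if dot(phi_l(y_star), alpha) - dot(phi_l(y_bad), alpha) <= C:
        add(alpha, phi_l(y_star), 1.0)
        add(alpha, phi_l(y_bad), -1.0)
        return "B"
    return None

# Toy feature functions for a two-token sequence (illustrative only).
def phi_l(y):
    return {("tok", i, l): 1.0 for i, l in enumerate(y)}

def phi_g(y):
    return {("inconsistent",): 1.0 if len(set(y)) > 1 else 0.0}
```

Checking (A) before (B) mirrors the stated priority of the total model over the local model; swapping the two checks would give the alternative ordering mentioned in the text.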
$\forall i, \forall y \in \mathcal{Y}^{|x_i|} \quad \|\Phi^a(x_i, y_i^*) - \Phi^a(x_i, y)\| \le R$.
See Appendix A for the proofs.
We also incorporated the idea behind Bayes point machines (BPMs) (Herbrich and Graepel, 2000) to improve the robustness of our method further.
BPMs try to cancel out overfitting caused by the order of examples, by training several models on shuffled training examples.4 However, it is very time consuming to run the complete training process several times.
We thus ran the training in only one pass over the shuffled examples several times, and used the averaged output weight vectors as a new initial weight vector, because we thought that the early part of training would be more seriously affected by the order of examples.
We call this "BPM initialization".
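BPM initialization as described above can be sketched as follows. The helper `one_pass` is hypothetical: it is assumed to train a model with a single pass over the given examples and return its weights as a dict. Whether each pass starts from zero weights is not stated in the text and is left to that helper.

```python
import random

def bpm_initialization(one_pass, data, runs, seed=0):
    """Average the weights of `runs` single-pass trainings over shuffled data."""
    rng = random.Random(seed)
    total = {}
    for _ in range(runs):
        shuffled = list(data)
        rng.shuffle(shuffled)        # a different example order for each run
        alpha = one_pass(shuffled)   # hypothetical single-pass trainer
        for f, w in alpha.items():
            total[f] = total.get(f, 0.0) + w
    # The averaged vector is then used as the initial weight vector for full training.
    return {f: w / runs for f, w in total.items()}
```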
5 Named Entity Recognition and Non-Local Features
We evaluated the performance of the proposed algorithm using the named entity recognition task.
We adopted IOB (IOB2) labeling (Ramshaw and Marcus, 1995), where the first word of an entity of class "C" is labeled "B-C", the remaining words in the entity are labeled "I-C", and other words are labeled "O".
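As a small illustration of the IOB2 scheme, the following sketch converts entity spans into labels. The span representation (start index, exclusive end index, class) is an assumption made for the example.

```python
def spans_to_iob2(num_tokens, spans):
    """spans: list of (start, end_exclusive, entity_class) tuples."""
    labels = ["O"] * num_tokens
    for start, end, cls in spans:
        labels[start] = "B-" + cls       # first word of the entity
        for i in range(start + 1, end):
            labels[i] = "I-" + cls       # remaining words of the entity
    return labels
```

For the tokens "New York Times reported" with one ORG span covering the first three tokens, the labels are B-ORG, I-ORG, I-ORG, O.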
We used non-local features based on Finkel et al. (2005).
These features are based on observations such as "same phrases in a document tend to have the same entity class" (phrase consistency) and "a sub-phrase of a phrase tends to have the same entity class as the phrase" (sub-phrase consistency).
We also implemented the "majority" version of these features as used in Krishnan and Manning (2006).
In addition, we used non-local features, which are based on the observation that "entities tend to have the same entity class if they are in the same conjunctive or disjunctive expression" as in "... in U.S., EU, and Japan" (conjunction consistency).
This type of non-local feature was not used by Finkel et al. (2005) or Krishnan and Manning (2006).
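To make the notion of a non-local feature concrete, the following sketch computes a simplified phrase-consistency signal: the number of phrases that receive more than one entity class in a document. It is only an illustration of why such a value depends on the labels of the whole sequence; it is not the feature definition of Finkel et al. (2005) or of this paper, and the function names are hypothetical.

```python
from collections import defaultdict

def entities(tokens, labels):
    """Decode IOB2 labels into (phrase, entity_class) pairs."""
    out, start, cls = [], None, None
    for i, lab in enumerate(list(labels) + ["O"]):  # sentinel flushes the last entity
        if start is not None and not lab.startswith("I-"):
            out.append((" ".join(tokens[start:i]), cls))
            start = None
        if lab.startswith("B-"):
            start, cls = i, lab[2:]
    return out

def phrase_inconsistency(tokens, labels):
    """Count phrases labeled with more than one entity class in the document."""
    classes = defaultdict(set)
    for phrase, cls in entities(tokens, labels):
        classes[phrase].add(cls)
    return sum(1 for c in classes.values() if len(c) > 1)
```

The value cannot be decomposed into terms over adjacent label pairs, so a Viterbi-type search cannot take it into account.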
6 Experiments
4 The results for the perceptron algorithms generally depend on the order of the training examples.
5 Note that we can prove that the perceptron algorithms converge even though the weight vector is not initialized as $\alpha = 0$.
and 3,684 sentences, respectively).
Automatically assigned POS tags and chunk tags are also provided.
The CoNLL 2003 dataset contains document boundary markers.
We concatenated the sentences in the same document according to these markers.6 This generated 964 documents for the training set, 216 documents for the development set, and 231 documents for the testing set.
The documents generated as above become the sequence, x, in the learning algorithms.
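The concatenation step can be sketched as follows, assuming the usual CoNLL 2003 file layout: one token per line, blank lines between sentences, and a `-DOCSTART-` line at each document boundary. The column positions (word first, named entity label last) are assumptions of this sketch.

```python
def read_documents(lines):
    """Group CoNLL-style lines into documents, concatenating their sentences."""
    docs, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # sentence boundary: keep concatenating
        fields = line.split()
        if fields[0] == "-DOCSTART-":     # document boundary marker
            if current:
                docs.append(current)
            current = []
            continue
        current.append((fields[0], fields[-1]))  # (word, NE label)
    if current:
        docs.append(current)
    return docs
```

Each returned document is one long sequence, so a non-local feature can relate mentions that occur in different sentences of the same document.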
We first evaluated the baseline performance of a CRF model, Collins' perceptron, and Collins' averaged perceptron, as well as the margin perceptron, with only local features.
We next evaluated the performance of our perceptron algorithm proposed for non-local features.
We used the local features summarized in Table 1, which are similar to those used in other studies on named entity recognition.
We omitted features whose surface part listed in Table 1 occurred less than twice in the training corpus.
We used Gaussian regularization (Chen and Rosenfeld, 2000) for CRF training to avoid overfitting.
The parameter of the Gaussian, $\sigma^2$, was tuned using the development set.
We also tuned the margin parameter, C, for the margin perceptron algorithm.9 The convergence of CRF training was determined by checking the log-likelihood of the model.
The convergence of perceptron algorithms was determined by checking the per-word labeling error, since the number of updates was not zero even after a large number of iterations in practice.
6 We used sentence concatenation even when only using local features, since we found it does not degrade accuracy (rather we observed a slight increase).
8 We also replaced the optimization module in the original package with that used in the Amis maximum entropy estimator (http://www-tsujii.is.s.u-tokyo.ac.jp/amis) since we encountered problems with the provided module in some cases.
9 For the Gaussian parameter, we tested {13, 25, 50, 100, 200, 400, 800} (the accuracy did not change drastically among these values and it seems that there is no accuracy hump even if we use smaller values).
We tested {500, 1000, 1414, 2000, 2828, 4000, 5657, 8000, 11313, 16000, 32000} for the margin parameters.
[Table 1, surviving row headings: Edge features; Bigram node features]
We stopped training when the relative change in these values became less than a pre-defined threshold (0.0001) for at least three iterations.
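The stopping rule can be sketched as follows. The text does not say what the change is measured relative to or whether the three iterations must be consecutive; this sketch assumes the change is relative to the previous value and that the last three iteration-to-iteration changes must all be below the threshold.

```python
def has_converged(history, threshold=1e-4, patience=3):
    """history: monitored value per iteration (log-likelihood or per-word error)."""
    if len(history) < patience + 1:
        return False
    recent = history[-(patience + 1):]
    for prev, cur in zip(recent, recent[1:]):
        if prev == 0.0:
            if cur != 0.0:       # change from exactly zero: treat as not converged
                return False
            continue
        if abs(cur - prev) / abs(prev) >= threshold:
            return False
    return True
```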
We used n = 20 (n of the n-best) for training, since a larger n would have slowed down training too much.
However, we could examine a larger n during testing, since the testing time did not dominate the time for the experiment.
We found an interesting property for n in our preliminary experiment.
We found that an even larger n in testing (written as n') achieved higher accuracy, although it is natural to assume that the same n that was used in training would also be appropriate for testing.
We thus used n' = 100 to evaluate performance during parameter tuning.
After finding the best C with n' = 100, we varied n' to investigate its effect.
Table 2: Summary of performance ($F_1$).
Table 3: Effect of n'.
Table 2 compares the results.
CRF outperformed the perceptron by a large margin.
Although the averaged perceptron outperformed the perceptron, the improvement was slight.
However, the margin perceptron greatly outperformed the averaged perceptron.
Yet, CRF still had the best baseline performance with only local features.
The proposed algorithm with non-local features improved the performance on the test set by 0.66 points over that of the margin perceptron without non-local features.
The row "Candidate" refers to the candidate algorithm (Algorithm 4.1).
From the results for the candidate algorithm, we can see that the modification part, (B), in Algorithm 4.2 was essential to make learning with non-local features effective.
We next examined the effect of n'.
As can be seen from Table 3, an n' larger than that for training yields higher performance.
The highest performance with the proposed algorithm was achieved when n' = 6400, where the improvement due to non-local features became 0.74 points.
The performance of the related work (Finkel et al., 2005; Krishnan and Manning, 2006) is listed in Table 4.
We can see that the final performance of our algorithm was worse than that of the related work.
We changed the experimental setting slightly to investigate our algorithm further.
Instead of the POS/chunk tags provided in the CoNLL 2003 dataset, we used the tags assigned by TagChunk (Daume III and Marcu, 2005)10 with the intention of using more accurate tags.
Table 4: The performance of the related work.
Table 5: Summary of performance with POS/chunk tags by TagChunk.
The results with this setting are summarized in Table 5.
Performance was better than that in the previous experiment for all algorithms.
We think this was due to the quality of the POS/chunk tags.
It is interesting that the effect of non-local features rose to 0.93 points with n' = 6400, even though the baseline performance was also improved.
The resulting performance of the proposed algorithm with non-local features is higher than that of Finkel et al. (2005) and comparable with that of Krishnan and Manning (2006).
This comparison, of course, is not fair because the setting was different.
However, we think the results demonstrate the potential of our new algorithm.
Table 6: Comparison with re-ranking approach.
Table 7: Comparison of training time (C = 5657).
91.17/86.08 (i.e., dropped for the evaluation set as expected), in the setting of the experiment of Table 5.
Since the effect of BPM initialization cannot be determined conclusively from these results alone, more experiments are needed.
6.3 Comparison with re-ranking approach
Finally, we compared our algorithm with the re-ranking approach (Collins and Duffy, 2002; Collins, 2002b), where we first generate the n-best candidates using a model with only local features (the first model) and then re-rank the candidates using a model with non-local features (the second model).
We implemented two re-ranking models, "re-ranking 1" and "re-ranking 2".
These models differ in how they incorporate the local information in the second model.
"re-ranking 1" uses the score of the first model as a feature in addition to the non-local features as in Collins (2002b).
"re-ranking 2" uses the same local features as the first model11 in addition to the non-local features.
The first models were trained using the margin perceptron algorithm in Algorithm 3.1.
The second models were trained using the algorithm, which is obtained by replacing $\{y^n\}$ with the n-best candidates generated by the first model.
The first model used to generate n-best candidates for the development set and the test set was trained using the whole training data.
However, CRFs or perceptrons generally have nearly zero error on the training data, although the first model should mis-label to some extent to make the training of the second model meaningful.
To avoid this problem, we adopted cross-validation training as used in Collins (2002b).
11 The weights were re-trained for the second model.
We split the training data into 5 sets.
We then trained five first models using 4/5 of the data, each of which was used to generate n-best candidates for the remaining 1/5 of the data.
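The cross-validation scheme can be sketched as follows. The helpers `train` and `nbest` are hypothetical placeholders for training a first model and generating n-best candidates with it, and the way the data are divided into folds (striding) is an assumption, since the text does not specify it.

```python
def cross_validated_candidates(data, k, train, nbest):
    """For each fold, train a first model on the other k-1 folds and
    generate n-best candidates for the held-out fold."""
    folds = [data[i::k] for i in range(k)]
    out = []
    for i, held_out in enumerate(folds):
        rest = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train(rest)                      # never sees the held-out fold
        out.extend((ex, nbest(model, ex)) for ex in held_out)
    return out
```

Each example therefore receives candidates from a model that was not trained on it, at the cost of training k first models.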
As in the previous experiments, we tuned C using the development set with n' = 100 and then tested other values for n'.
Table 6 shows the results.
As can be seen, re-ranking models were outperformed by our proposed algorithm, although they also outperformed the margin perceptron with only local features ("re-ranking 2" seems better than "re-ranking 1").
Table 7 shows the training time of each algorithm.12 Our algorithm is much faster than the re-ranking approach that uses cross-validation training, while achieving the same or higher level of performance.
7 Discussion
As we mentioned, there are some algorithms similar to ours (Collins and Roark, 2004; Daume III and Marcu, 2005; McDonald and Pereira, 2006; Liang et al., 2006).
Our algorithm differs from these algorithms as follows.
Daume III and Marcu (2005) presented a method called LaSO (Learning as Search Optimization), in which intractable exact inference is approximated by optimizing the behavior of the search process.
The method can access non-local features at each search point, if their values can be determined from the search decisions already made.
They provided robust training algorithms with guaranteed convergence for this framework.
However, one difference is that our method can use non-local features whose values depend on all labels throughout training, and it is unclear whether features whose values can only be determined at the end of the search (e.g., majority features) can be learned effectively in the incremental manner of LaSO.
The algorithm proposed by McDonald and Pereira (2006) is also similar to ours.
Their target was non-projective dependency parsing, where exact inference is intractable.
Instead of using an n-best/re-scoring approach as ours does, their method modifies the single best projective parse, which can be found efficiently, to find a candidate with a higher score under non-local features.
Liang et al. (2006) used n candidates of a beam search in Collins' perceptron algorithm for machine translation.
Collins and Roark (2004) proposed an approximate incremental method for parsing.
Their method can be used for sequence labeling as well.
These studies, however, did not explain the validity of their updating methods in terms of convergence.
(Crammer et al., 2006).
On the other hand, we employed the margin perceptron (Krauth and Mezard, 1987), extending it to sequence labeling.
We demonstrated that this greatly improved robustness.
With regard to the local update, (B), in Algorithm 4.2, "early updates" (Collins and Roark, 2004) and the "y-good" requirement of Daume III and Marcu (2005) resemble our local update in that they try to avoid the situation where the correct answer cannot be output.
Considering such commonality, the way of combining the local update and the non-local update might be one important key for further improvement.
It is still open whether these differences are advantages or disadvantages.
However, we think our algorithm contributes to the study of incorporating non-local features.
The convergence guarantee is important for the confidence in the training results, although it does not mean high performance directly.
Our algorithm at least improved the accuracy of NER with non-local features, and the results indicate that it is superior to the re-ranking approach in terms of accuracy and training cost.
However, the achieved accuracy was not better than that of related work (Finkel et al., 2005; Krishnan and Manning, 2006) based on CRFs.
Although this might indicate the limitation of perceptron-based methods, it has also been shown that there is still room for improvement in perceptron-based algorithms, as our margin perceptron algorithm demonstrated.
8 Conclusion
In this paper, we presented a new perceptron algorithm for learning with non-local features.
We think the proposed algorithm is an important step towards achieving our final objective.
We would like to investigate various types of new non-local features using the proposed algorithm in future work.
Appendix A: Convergence of Algorithm 4.2
to derive R2.
$\min \frac{\Phi^{a*} \cdot \alpha - \Phi^{a} \cdot \alpha}{\|\alpha\|} > C\delta/(2C + R^2)$.
13 We use the shorthand $\Phi^{a*} = \Phi^a(x_i, y^*)$, $\Phi^a = \Phi^a(x_i, y)$, $\Phi^{l*} = \Phi^l(x_i, y^*)$, and $\Phi^l = \Phi^l(x_i, y)$, where $y$ represents the candidate used to update ($y'$, $y''$, $y^1$, or $y^2$).
