We describe an incremental parser that was trained to minimize cost over sentences rather than over individual parsing actions.
This is an attempt to use the advantages of the two top-scoring systems in the CoNLL-X shared task.
In the evaluation, we present the performance of the parser in the Multilingual task, as well as an evaluation of the contribution of bidirectional parsing and beam search to the parsing performance.
1 Introduction
The two best-performing systems in the CoNLL-X shared task (Buchholz and Marsi, 2006) can be classified along two lines depending on the method they used to train the parsing models.
Although the parsers are quite different, their creators could report near-tie scores.
The approach of the top system (McDonald et al., 2006) was to fit the model to minimize cost over sentences, while the second-best system (Nivre et al., 2006) trained the model to maximize performance over individual decisions in an incremental algorithm.
This difference is a natural consequence of their respective parsing strategies: CKY-style maximization of link score and incremental parsing.
In this paper, we describe an attempt to unify the two approaches: an incremental parsing strategy that is trained to maximize performance over sentences rather than over individual parsing actions.
Nivre's parser uses a stack S and a list of input words W, and builds the parse tree incrementally using a set of parsing actions (see Table 1).
It can be shown that Nivre's parser creates projective and acyclic graphs and that every projective dependency graph can be produced by a sequence of parser actions.
In addition, the worst-case number of actions is linear with respect to the number of words in the sentence.
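The transition system can be sketched in a few lines of Python. This is an illustrative sketch of the standard arc-eager formulation of Nivre's parser, not the implementation used in the experiments. The `State` class, the function names, and the action codes `sh` (shift) and `re` (reduce) are assumptions introduced here; `la` and `ra` follow the abbreviations used in Section 2.3. Words are represented by their indices.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    stack: list[int]    # S: words that are partially processed
    inputs: list[int]   # I: words that remain to be read
    arcs: set[tuple[int, int]] = field(default_factory=set)  # A: (head, dependent)

    def has_head(self, n: int) -> bool:
        return any(dep == n for _, dep in self.arcs)


def apply_action(state: State, action: str) -> None:
    """Apply one transition in place; raise ValueError if its condition fails."""
    s, i = state.stack, state.inputs
    if action == "sh" and i:
        # Shift: push the next input word onto the stack.
        s.append(i.pop(0))
    elif action == "re" and s and state.has_head(s[-1]):
        # Reduce: pop a stack word that already has a head.
        s.pop()
    elif action == "la" and s and i and not state.has_head(s[-1]):
        # Left-arc: the next input word becomes the head of the stack top.
        state.arcs.add((i[0], s.pop()))
    elif action == "ra" and s and i and not state.has_head(i[0]):
        # Right-arc: the stack top becomes the head of the next input word,
        # which is then pushed onto the stack.
        state.arcs.add((s[-1], i[0]))
        s.append(i.pop(0))
    else:
        raise ValueError(f"action {action!r} is not allowed in this state")


def parse(n_words: int, actions: list[str]) -> set[tuple[int, int]]:
    """Run a sequence of actions on a sentence of n_words words."""
    state = State(stack=[], inputs=list(range(n_words)))
    for action in actions:
        apply_action(state, action)
    return state.arcs
```

Every action either pushes a word onto the stack or pops one, and each word is pushed once and popped at most once, so a sentence of n words needs at most 2n actions. This is where the linear bound comes from.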
2.2 Handling Nonprojective Parse Trees
While the parsing algorithm produces projective trees only, nonprojective arcs can be handled using a preprocessing step before training the model and a postprocessing step after parsing the sentences.
The projectivization algorithm (Nivre and Nilsson, 2005) iteratively moves each nonprojective arc upward in the tree until the whole tree is projective.
To be able to recover the nonprojective arcs after parsing, the projectivization operation replaces the labels of the arcs it modifies with traces indicating which links should be moved and where to attach them (the "Head+Path" encoding).
The model is trained with these new labels, which makes it possible to carry out the reverse operation and produce nonprojective structures.
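The lifting step can be illustrated with a short Python sketch. It makes several assumptions that are not taken from the text: the tree is stored as a list `heads` in which `heads[d]` is the head of word `d`, index 0 is an artificial root, and the shortest nonprojective arc is lifted first. The sketch only records the original head of each lifted word; it does not build the "Head+Path" labels.

```python
def is_descendant(heads, node, ancestor):
    """True if `ancestor` dominates `node`; assumes a well-formed tree rooted at 0."""
    while node != 0:
        node = heads[node]
        if node == ancestor:
            return True
    return False


def nonprojective_arcs(heads):
    """Return all (head, dependent) arcs that span a word not dominated by the head."""
    result = []
    for dep in range(1, len(heads)):
        head = heads[dep]
        lo, hi = min(head, dep), max(head, dep)
        if any(not is_descendant(heads, k, head) for k in range(lo + 1, hi)):
            result.append((head, dep))
    return result


def projectivize(heads):
    """Lift nonprojective arcs one level at a time until none are left."""
    heads = list(heads)
    original_head = {}  # dependent -> head before the first lift
    while True:
        arcs = nonprojective_arcs(heads)
        if not arcs:
            return heads, original_head
        # Assumption: lift the shortest nonprojective arc first.
        head, dep = min(arcs, key=lambda arc: abs(arc[0] - arc[1]))
        original_head.setdefault(dep, head)
        heads[dep] = heads[head]  # reattach the dependent to its head's head


# Word 4 depends on word 1, but words 2 and 3 lie outside word 1's subtree.
projective_heads, moved = projectivize([0, 2, 0, 2, 1])
```

Arcs that leave the artificial root are always projective, so each lift moves a dependent closer to the root and the loop terminates.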
2.3 Bidirectional Parsing
Shift-reduce is by construction a directional parser, typically applied from left to right.
To make better use of the training set, we applied the algorithm in both directions, as in Johansson and Nugues (2006) and Sagae and Lavie (2006), for all languages except Catalan and Hungarian.
This, we believe, also has the advantage of making the parser less sensitive to whether the language is head-initial or head-final.
We trained the model on projectivized graphs from left to right and right to left and used a voting strategy based on link scores.
Each link was assigned a score (simply by using the score of the la or ra actions for each link).
Table 1: Nivre's parser transitions, where W is the initial word list; I, the current input word list; A, the graph of dependencies; and S, the stack. (n', n) denotes a dependency relation between n' and n, where n' is the head and n the dependent.
[Column headers: Parser actions, Conditions. Surviving rows: Initialize, Terminate, Left-arc, Right-arc.]
To resolve the conflicts between the two parses in a manner that makes the tree projective, single-head, rooted, and cycle-free, we applied the Eisner algorithm (Eisner, 1996).
As in our previous parser (Johansson and Nugues, 2006), we used a beam-search extension to Nivre's original algorithm (which is greedy in its original formulation).
Each parsing action was assigned a score, and the beam search allows us to find a better overall score of the sequence of actions.
In this work, we used a beam width of 8 for Catalan, Chinese, Czech, and English and 16 for the other languages.
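The following Python sketch shows the general idea of beam search over action sequences. It is a generic illustration under stated assumptions rather than the parser's actual search: the function `beam_search` and its callback parameters (`legal_actions`, `step`, `score`, `is_final`) are hypothetical names, and hypotheses are ranked simply by the sum of their action scores.

```python
def beam_search(initial, legal_actions, step, score, is_final, width):
    """Return (total_score, final_state, actions) for the best hypothesis found.

    legal_actions(state) -> iterable of actions allowed in `state`
    step(state, action)  -> successor state (must not modify `state`)
    score(state, action) -> score of taking `action` in `state`
    is_final(state)      -> True when the hypothesis is complete
    """
    beam = [(0.0, initial, [])]
    finished = []
    while beam:
        candidates = []
        for total, state, history in beam:
            if is_final(state):
                finished.append((total, state, history))
                continue
            for action in legal_actions(state):
                candidates.append((total + score(state, action),
                                   step(state, action),
                                   history + [action]))
        # Keep only the `width` best partial hypotheses.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:width]
    return max(finished, key=lambda c: c[0])
```

With `width=1` this reduces to greedy search, which commits to the locally best action. A wider beam can recover a sequence whose first action looks worse but whose overall score is higher.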
We model the parsing problem for a sentence $x$ as finding the parse $\hat{y} = \arg\max_y F(x, y)$ that maximizes a discriminant function $F$. In this work, we consider linear discriminants of the following form:
$$F(x, y) = w \cdot \Psi(x, y)$$
where $\Psi(x, y)$ is a numeric feature representation of the pair $(x, y)$ and $w$ a vector of feature weights.
Learning F in this case comes down to assigning good weights in the vector w.
Machine learning research for similar problems has generally used margin-based formulations.
These include global batch methods such as SVMstruct (Tsochantaridis et al., 2005) as well as online methods such as the Online Passive-Aggressive Algorithm (OPA) (Crammer et al., 2006).
Although the batch methods are formulated very elegantly, they do not seem to scale well to the large training sets prevalent in NLP contexts; we briefly considered using SVMstruct, but training was too time-consuming.
The online methods, on the other hand, although less theoretically appealing, can handle realistically sized data sets and have successfully been applied in dependency parsing (McDonald et al., 2006).
Because of this, we used the OPA algorithm throughout this work.
3.2 Implementation
In the online learning framework, the weight vector is constructed incrementally.
At each step, the learner computes an update to the weight vector based on the current example.
The resulting weight vector often overfits to the most recent examples.
One way to reduce overfitting is to use the average of all successive weight vectors as the result of the training (Freund and Schapire, 1999).
Algorithm 1 shows the training procedure.
It uses an "aggressiveness" parameter C to reduce overfitting, analogous to the C parameter in SVMs.
The algorithm also needs a cost function $\rho$, which describes how much a parse tree deviates from the gold standard.
In this work, we defined $\rho$ as the sum of link costs, where the link cost was 0 for a correct dependency link with a correct label, 0.5 for a correct link with an incorrect label, and 1 for an incorrect link.
The number of iterations was 5 for all languages.
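The cost function and a single weight update can be sketched as follows. The cost follows the definition given above. The update step is an assumption: it uses the cost-sensitive PA-I rule of Crammer et al. (2006), with loss $F(x, \tilde{y}) - F(x, y_t) + \sqrt{\rho(y_t, \tilde{y})}$ for a predicted parse $\tilde{y}$, and may differ in detail from the procedure used in the experiments. The function names, the dictionary-based tree and feature representations, and the omission of weight averaging are choices made for this illustration.

```python
import math
from collections import Counter


def link_cost(gold, pred):
    """Cost of `pred` against `gold`; each tree maps dependent -> (head, label)."""
    cost = 0.0
    for dep, (gold_head, gold_label) in gold.items():
        pred_head, pred_label = pred[dep]
        if pred_head != gold_head:
            cost += 1.0   # incorrect link
        elif pred_label != gold_label:
            cost += 0.5   # correct link, incorrect label
    return cost


def pa_update(w, feats_gold, feats_pred, cost, C):
    """Apply one PA-I step to the weight dict `w` in place and return the step size.

    feats_gold / feats_pred map feature names to counts for the gold parse
    and for the parse proposed by the (approximate) search.
    """
    diff = Counter(feats_gold)
    diff.subtract(feats_pred)          # Psi(x, y_gold) - Psi(x, y_pred)
    norm_sq = sum(v * v for v in diff.values())
    if norm_sq == 0:
        return 0.0                     # identical feature vectors: nothing to do
    margin = sum(w.get(f, 0.0) * v for f, v in diff.items())
    loss = math.sqrt(cost) - margin    # cost-sensitive hinge loss
    tau = min(C, max(0.0, loss) / norm_sq)
    for f, v in diff.items():
        w[f] = w.get(f, 0.0) + tau * v
    return tau
```

The parameter `C` caps the step size, so a single noisy example cannot move the weights arbitrarily far.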
Algorithm 1 The Online PA Algorithm
input Training set $\mathcal{T} = \{(x_t, y_t)\}_{t=1}^{T}$
      Number of iterations $N$
      Regularization parameter $C$
      Cost function $\rho$
Initialize $w$ to zeros
repeat $N$ times
    for $(x_t, y_t)$ in $\mathcal{T}$
For a sentence $x$ and a parse tree $y$, we defined the feature representation by finding the sequence $((S_1, I_1), a_1), ((S_2, I_2), a_2), \ldots$ of states and their corresponding actions, and creating a feature vector for each state/action pair.
The discriminant function was thus written
$$F(x, y) = \sum_i w \cdot \psi((S_i, I_i), a_i)$$
where $\psi$ is a feature function that assigns a feature vector to a state $(S_i, I_i)$ and the action $a_i$ taken in that state.
Table 2 shows the feature sets used in $\psi$ for all languages.
In principle, a kernel could also be used, but that would degrade performance severely.
Instead, we formed a new vector by combining features pairwise; this is equivalent to using a quadratic kernel.
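As a minimal sketch of this idea, assuming binary features represented by their names, the expansion adds one conjunction feature for every unordered pair of active features. The function name and the feature strings are hypothetical. The expanded vector contains the same monomials that a quadratic kernel uses implicitly, apart from constant scaling factors.

```python
from itertools import combinations


def add_pairs(features):
    """Return the active features followed by one conjunction per unordered pair."""
    singles = sorted(set(features))
    pairs = [f"{a}&{b}" for a, b in combinations(singles, 2)]
    return singles + pairs


# Hypothetical feature names for one parser state.
expanded = add_pairs(["pos_top=NN", "word_list=the", "pos_list=DT"])
```

A state with k active features yields k + k(k-1)/2 features after expansion, so the explicit representation grows quadratically but scoring remains a plain dot product.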
Since the history-based feature set used in the parsing algorithm makes it impossible to use independence to factorize the scoring function, an exact search to find the best-scoring action sequence (the $\arg\max_y$ in Algorithm 1) is not possible.
However, the beam search allows us to find a reasonable approximation.
4 Results
Table 3 shows the results of our system in the Multilingual task.
4.1 Compared to SVM-based Local Classifiers
We compared the performance of the parser with a parser based on local SVM classifiers (Johansson and Nugues, 2006).
Table 4 shows the performance of both parsers on the Basque test set.
We see that what is gained by using a global method such as OPA is lost by sacrificing the excellent classification performance of the SVM.
Possibly, better performance could be achieved by using a large-margin batch method such as SVMstruct.
Table 2: Feature sets.
Fine POS: list
Features: top, list, list-1, list+1, list+2, top right, first left
Word: top, top-1, list, list-1, list+1, top left, top right, list left
Lemma: top, list, list-1
Relation: top, top left, top right, list right
POS: top left, top right, list left
Table 3: Summary of results. [Surviving labels: Languages, Unlabeled, Hungarian, Average result.]
Table 4: Accuracy by learning method. [Surviving column header: Learning Method.]
To investigate the influence of the beam width on the performance, we measured the accuracy of a left-to-right parser on a development set for Basque (15% of the training data) as a function of the width.
Table 5 shows the result.
We see clearly that widening the beam considerably improves accuracy, especially at small beam widths.
Table 5: Accuracy by beam width.
We also investigated the contribution of the bidirectional parsing.
Table 6 shows the result of this experiment on the Basque development set (the same 15% as in 4.2).
The beam width was 2 in this experiment.
Table 6: Accuracy by parsing direction. [Column headers: Direction, Accuracy. Surviving rows: Left to right, Right to left, Bidirectional.]
Time did not allow a full-scale experiment, but for all languages except Catalan and Hungarian, the bidirectional parsing method outperformed the unidirectional methods when trained on a 20,000-word subset.
However, the gain of using bidirectional parsing may be more obvious when the treebank is small.
For all languages except Czech, left-to-right outperformed right-to-left parsing.
5 Discussion
The paper describes an incremental parser that we trained to minimize the cost over sentences, rather than over parsing actions as is usually done.
It was trained using the Online Passive-Aggressive method, a cost-sensitive online margin-based learning method. It shows reasonable performance and received above-average scores for most languages.
The performance of the parser relative to the other teams was best for Basque and Turkish, which were two of the smallest treebanks.
Because we did not have time to tune the number of iterations for each language, we used 5, the value we found optimal for Basque (the smallest treebank), for all languages.
This may have had a detrimental effect for some languages.
We think that some of the figures could be improved slightly by optimizing the learning parameters and feature sets.
This work shows that it is possible to combine approaches used by Nivre's and McDonald's parsers in a single system.
While the parser is outperformed by a system based on local classifiers, we still hope that the parsing and training combination described here opens new ways in parser design and eventually leads to the improvement of parsing performance.
Acknowledgements
