Deterministic parsing has emerged as an effective alternative to complex parsing algorithms that search the entire search space for the most probable parse tree.
In this paper, we present an online large margin based training framework for deterministic parsing using Nivre's Shift-Reduce parsing algorithm.
Unlike the popular SVMs, online training allows the use of high-dimensional features without creating memory bottlenecks.
We participated in the CoNLL Shared Task-2007 and evaluated our system for ten languages.
We obtained an average multilingual labeled attachment score of 74.54% (the average over all participating systems being 65.50% and the highest 80.32%) and an average multilingual unlabeled attachment score of 80.30% (the average over all systems being 71.13% and the highest 86.55%).
1 Introduction
CoNLL-X featured a shared task on multilingual dependency parsing (Buchholz et al., 2006), providing treebanks for 13 languages in the same dependency format.
A look at the results of that contest shows that two systems with quite different approaches (one using deterministic parsing with SVMs, the other using MIRA with a nondeterministic, dynamic-programming-based MST approach) achieved good results (McDonald et al., 2006; Nivre et al., 2006).
More recently, deterministic parsing has generated a lot of interest because of its simplicity (Nivre, 2003).
One of the main advantages of deterministic parsing lies in the ability to use the subtree information in the features to decide the next step.
Parsing algorithms which search the entire space (Eisner, 1996; McDonald, 2006) are restricted in the features they use to score a relation.
They rely only on the context information and not the history information to score a relation.
Using history information makes the search intractable.
In contrast, since deterministic parsers are at worst $O(n^2)$ (Yamada and Matsumoto, 2003), and the algorithm of Nivre (2003) is only $O(2n)$ in the worst case, they can afford to use the crucial history information to make parsing decisions.
We therefore use Nivre's parsing algorithm to derive the dependency parse tree.
Popular learning algorithms for deterministic parsing like Support Vector Machines (SVM) run into memory issues for large data since they are batch learning algorithms.
Although deterministic parsing makes more information available in the form of subtree information, high-dimensional features cannot be used because of the long training times of SVMs.
This is where online methods come into the picture.
Unlike batch algorithms, online algorithms consider only one training instance at a time when optimizing parameters.
This restriction to single-instance optimization might be seen as a weakness, since the algorithm uses less information about the objective function and constraints than batch algorithms.
However, McDonald (2006) argues that this potential weakness is balanced by the simplicity of online learning, which allows for more streamlined training methods.
This work focuses purely on online learning for deterministic parsing.
In the remaining part of the paper, we introduce Nivre's parsing algorithm, propose a framework for online learning for deterministic parsing and present the results for all the languages with various feature models.
2 Parsing Algorithm
We used Nivre's top-down/bottom-up linear time parsing algorithm proposed in Nivre (2003).
A parser configuration is represented by a triple (S, I, E), where S is the stack (represented as a list), I is the list of (remaining) input tokens and E is the set of edges of the dependency graph D. S is a list of partially processed tokens whose subtrees are incomplete, i.e., tokens whose parent or children have not yet been established. We write top for the token on top of the stack S and next for the next token in the list I.
Nivre's algorithm consists of four elementary actions, Shift, Left, Right and Reduce, which build the dependency tree from the initial configuration (nil, W, ∅), where W is the input sentence.
Shift pushes next onto the stack S. Reduce pops the stack.
Right adds an arc from top to next and pushes next onto the stack S. Left adds an arc from next to top and pops the stack.
The parser terminates when it reaches a configuration (S, nil, E) (for any list S and set of edges E).
The label of each relation is determined after a new arc is formed (by a Left or Right action).
The parser always constructs a dependency graph that is acyclic and projective.
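The four actions described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the system's implementation: the `Config` class and the function names are our own, tokens are represented by their positions in the sentence, and arc labels are left out. The preconditions in `legal_actions` (Left only if top has no head yet, Reduce only if it already has one) come from Nivre (2003) and are not spelled out in the text above.

```python
from dataclasses import dataclass, field

@dataclass
class Config:
    stack: list                                # S: partially processed tokens
    buffer: list                               # I: remaining input tokens
    edges: set = field(default_factory=set)    # E: (head, dependent) pairs

def initial(n_tokens):
    """Initial configuration (nil, W, empty set) for a sentence of n_tokens."""
    return Config(stack=[], buffer=list(range(n_tokens)))

def has_head(cfg, token):
    return any(dep == token for _, dep in cfg.edges)

def shift(cfg):
    """Push next onto the stack."""
    cfg.stack.append(cfg.buffer.pop(0))

def reduce_(cfg):
    """Pop the stack."""
    cfg.stack.pop()

def right(cfg):
    """Add an arc from top to next, then push next onto the stack."""
    cfg.edges.add((cfg.stack[-1], cfg.buffer[0]))
    cfg.stack.append(cfg.buffer.pop(0))

def left(cfg):
    """Add an arc from next to top, then pop the stack."""
    cfg.edges.add((cfg.buffer[0], cfg.stack[-1]))
    cfg.stack.pop()

def is_terminal(cfg):
    """Terminal configurations have the form (S, nil, E)."""
    return not cfg.buffer

def legal_actions(cfg):
    """Preconditions assumed from Nivre (2003), not stated in the text above."""
    if is_terminal(cfg):
        return []
    actions = [shift]
    if cfg.stack:
        actions.append(right)
        actions.append(reduce_ if has_head(cfg, cfg.stack[-1]) else left)
    return actions

# "She eats fish" with "eats" (token 1) as head of tokens 0 and 2.
cfg = initial(3)
for action in (shift, left, shift, right):
    action(cfg)
```

After these four actions the buffer is empty, so the parser stops, and `cfg.edges` holds the two arcs (1, 0) and (1, 2).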
For non-projective parsing, we followed the pseudo-projective parsing approach proposed by Nivre and Nilsson (2005).
In this approach, the training data is projectivized by a minimal transformation, lifting non-projective arcs one step at a time, and extending the arc label of the lifted arcs using the encoding scheme called HEAD+PATH.
The non-projective arcs can be recovered by applying an inverse transformation to the output of the parser, using a left-to-right, top-down, breadth-first search, guided by the extended arc labels.
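To make the lifting step concrete, the sketch below projectivizes the head assignment of a single sentence. It is a simplification under stated assumptions: position 0 is an artificial root (our convention, with head -1), `heads[d]` gives the head of token `d`, the input is a well-formed tree, and the shortest non-projective arc is lifted first. The HEAD+PATH label encoding and the inverse transformation are deliberately omitted, and the function names are ours.

```python
def is_projective_arc(heads, dep):
    """An arc head -> dep is projective if every token strictly between
    them is a descendant of head. Assumes heads encodes a tree."""
    head = heads[dep]
    low, high = sorted((head, dep))
    for k in range(low + 1, high):
        ancestor = k
        while ancestor >= 0 and ancestor != head:
            ancestor = heads[ancestor]      # climb towards the root
        if ancestor != head:
            return False
    return True

def projectivize(heads):
    """Lift non-projective arcs one step at a time (dependent is
    re-attached to its head's head) until the tree is projective."""
    heads = list(heads)
    while True:
        bad = [d for d in range(1, len(heads))
               if not is_projective_arc(heads, d)]
        if not bad:
            return heads
        dep = min(bad, key=lambda d: abs(heads[d] - d))   # shortest arc first
        heads[dep] = heads[heads[dep]]                    # lift by one step

# Token 3 depends on token 1, but token 2 (attached to the root) lies between.
original = [-1, 2, 0, 1, 2]
lifted = projectivize(original)
```

In this toy tree only the arc from token 1 to token 3 is non-projective, and one lift re-attaches token 3 to token 2. In the full scheme the lifted arc would also receive an extended label so that the original head can be recovered after parsing.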
This method has been used for all the languages.
3 Online Learning
McDonald (2005) applied online learning by scoring edges in a connected graph and finding the Maximum Spanning Tree (MST) of the graph.
McDonald et al. (2005) used Edge Based Factorization, where the score of a dependency tree is factored as the sum of the scores of all edges in the tree.
Let $x = x_1 \cdots x_n$ represent a generic input sentence, and let $y$ represent a generic dependency tree for sentence $x$. $(i, j) \in y$ denotes the presence of a dependency relation in $y$ from word $x_i$ (parent) to word $x_j$ (child).
In Nivre's parsing algorithm, the dependency graph can be viewed as the result of a set of parsing decisions (in this case Shift, Reduce, Left and Right) made starting from the initial configuration (nil, W, ∅).
We define this sequence of parsing decisions as $d = d_1 \cdots d_m$. So, $d$ is the sequence of parsing decisions made by the parser to obtain a dependency tree $y$ from an input sentence $x$. Let us also define $c = c_1 \cdots c_m$ to be the configuration sequence from the initial configuration (nil, W, ∅) to the final configuration (S, nil, E).
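For training, the decision sequence for a gold-standard tree has to be read off the tree. The text does not describe how this is done, so the sketch below uses the standard static oracle for Nivre's algorithm as an assumption. It expects a projective tree given as `heads[i]` (the gold head of token `i`, or -1 for the root token), and the function name is ours.

```python
def oracle_decisions(heads):
    """Return the sequence of decisions that rebuilds the projective tree
    `heads` with the Shift/Reduce/Left/Right actions."""
    stack, nxt, n = [], 0, len(heads)
    attached = set()          # tokens that have already received their head
    decisions = []
    while nxt < n:            # stop at a configuration (S, nil, E)
        top = stack[-1] if stack else None
        if top is not None and heads[top] == nxt:
            decisions.append("Left")        # arc next -> top, pop
            attached.add(top)
            stack.pop()
        elif top is not None and heads[nxt] == top:
            decisions.append("Right")       # arc top -> next, push next
            attached.add(nxt)
            stack.append(nxt)
            nxt += 1
        elif (top is not None and top in attached and
              any(heads[k] == nxt or heads[nxt] == k for k in stack[:-1])):
            decisions.append("Reduce")      # top is complete, pop it
            stack.pop()
        else:
            decisions.append("Shift")
            stack.append(nxt)
            nxt += 1
    return decisions

# "She eats fish": token 1 is the root and heads tokens 0 and 2.
seq_a = oracle_decisions([1, -1, 1])
# A head followed by two right dependents needs a Reduce in between.
seq_b = oracle_decisions([-1, 0, 0])
```

Here `seq_a` is Shift, Left, Shift, Right, and `seq_b` is Shift, Right, Reduce, Right.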
We define the score of a parsing decision for a particular configuration to be the dot product between a high dimensional feature vector (based on both the decision and the configuration) and a weight vector.
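So,
$$ s(d_i, c_i) = \mathbf{w} \cdot \mathbf{f}(d_i, c_i) $$
where $c_i$ is the configuration at the $i$-th instance, $d_i$ is any one of the four actions {Shift, Reduce, Left, Right}, $\mathbf{f}$ is the feature vector and $\mathbf{w}$ is the weight vector.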
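As an illustration, the score of a decision can be computed over a sparse feature representation, and the deterministic parser then picks the highest-scoring action. The sketch below assumes that features are stored as a dictionary from feature names to values and that a hypothetical `featurize(decision, config)` function is supplied by the caller. Neither detail is taken from the system itself.

```python
ACTIONS = ("Shift", "Reduce", "Left", "Right")

def score(weights, feats):
    """Dot product between a sparse feature vector and the weight vector."""
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())

def best_decision(weights, featurize, config):
    """Greedy choice: the action with the highest score in this configuration."""
    return max(ACTIONS, key=lambda d: score(weights, featurize(d, config)))

# Toy example: the "configuration" is just the POS tag of top (hypothetical).
toy_weights = {"top.POS=NN|act=Left": 2.0, "top.POS=NN|act=Shift": 0.5}
toy_featurize = lambda d, c: {f"top.POS={c}|act={d}": 1.0}
chosen = best_decision(toy_weights, toy_featurize, "NN")
```

A full parser would also restrict the choice to actions that are legal in the current configuration.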
The Margin Infused Relaxed Algorithm (MIRA) proposed by Crammer et al. (2003) attempts to keep the norm of the change to the parameter vector as small as possible, subject to correctly classifying the instance under consideration with a margin at least as large as the loss of the incorrect classifications.
McDonald et al. (2005) define the loss of a dependency tree inferred by finding the Maximum Spanning Tree (MST) as the number of words that have an incorrect parent (i.e., the number of incorrect edges).
This satisfies the global constraint that the correct set of edges will have the highest weight.
However, in Nivre's algorithm there is no one-to-one correspondence between parsing decisions and graph edges, so the number of incorrect edges cannot be used as a loss function: it does not reflect the exact loss in the parsing decisions.
We nevertheless tried the edge-based loss: we first obtain the series of decisions through inference on the training data, then concatenate their feature vectors, and finally run the normal updates with the edge-based loss (since the resulting decisions produce a parse tree).
This method gave very poor results.
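So we use a factored MIRA for Nivre's algorithm, factoring the output by decisions to obtain the following constraints:
$$ s(d_i, c_i) - s(d'_i, c_i) \geq 1 \quad \forall\, d'_i \neq d_i $$
where $s(d, c)$ is the score of decision $d$ in configuration $c$, $d_i$ represents the correct decision and $d'_i$ ranges over all the other decisions for the same configuration $c_i$.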
This states that the score of the correct decision for a particular configuration and the score of every other decision must be separated by a margin of at least 1.
For every sentence in the training data, starting with the initial configuration (nil, W, ∅), the weights are adjusted to satisfy the above constraints before proceeding to the next correct configuration.
This process is repeated till we reach the final configuration (S, nil, E).
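The per-decision update can be sketched as follows. This is a simplified single-constraint variant rather than the full procedure described above: at each configuration it enforces the margin only against the highest-scoring incorrect decision, for which the minimal-norm update has a closed form. Enforcing the constraints against all other decisions at once would require a small quadratic program. The `featurize` function and the `(configuration, correct decision)` pairs are hypothetical inputs.

```python
ACTIONS = ("Shift", "Reduce", "Left", "Right")

def score(weights, feats):
    return sum(weights.get(f, 0.0) * v for f, v in feats.items())

def mira_update(weights, gold_feats, wrong_feats, margin=1.0):
    """Smallest change to `weights` (in L2 norm) such that the gold decision
    outscores the wrong one by at least `margin`. Returns the step size."""
    diff = dict(gold_feats)                     # f(gold) - f(wrong)
    for f, v in wrong_feats.items():
        diff[f] = diff.get(f, 0.0) - v
    norm_sq = sum(v * v for v in diff.values())
    violation = margin - score(weights, diff)
    if norm_sq == 0.0 or violation <= 0.0:
        return 0.0                              # constraint already satisfied
    tau = violation / norm_sq
    for f, v in diff.items():
        weights[f] = weights.get(f, 0.0) + tau * v
    return tau

def train_on_sentence(weights, steps, featurize):
    """steps: (configuration, correct decision) pairs along the correct path."""
    for config, gold in steps:
        rivals = [a for a in ACTIONS if a != gold]
        worst = max(rivals, key=lambda a: score(weights, featurize(a, config)))
        mira_update(weights, featurize(gold, config), featurize(worst, config))

# Toy check with one indicator feature per (decision, configuration) pair.
w = {}
toy_featurize = lambda a, c: {f"{a}|{c}": 1.0}
train_on_sentence(w, [("c0", "Shift")], toy_featurize)
```

After the update the correct decision outscores the chosen rival by exactly the margin, and repeating the same update leaves the weights unchanged.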
4 Features
The two central elements in any configuration are the token on top of the stack (t) and the next input token (n), i.e., the tokens that may be connected by a dependency relation in the current configuration.
We categorize our features into basic, context, history and in-between feature sets.
The basic feature set contains information about the two tokens t and n. This includes unigram and bigram combinations of the word forms (FORM), root words (LEMMA), morphological features (FEATS) and part-of-speech tags (both CPOS and POS) of these words.
The coarse POS tag (CPOS) is useful and helps solve data sparseness to some extent.
tags of these words are part of this context feature set.
We also included the second topmost element of the stack (st-1).
The third feature set, the history feature set, contains information about the subtree at a particular parser state.
One of the advantages of a deterministic parsing algorithm over a nondeterministic one is that history can be used as features.
History features contain information about the parent (par), left sibling (ls) and right sibling (rs) of t. Unigram and trigram combinations (with t and n) of the POS, CPOS and DEPREL tags of these words are included in the history features.
The features in the in-between feature set take the form of POS and CPOS trigrams: the POS/CPOS of t, that of the word in between and that of n.
All the features in these feature sets are conjoined with the distance between t and n and with the parsing decision.
We experimented with a combination of these feature sets in our training.
We define feature models f1, f2 and f3 for our experiments. f1 is a combination of the basic and context feature sets. f2 is a mixture of the basic, context and in-between feature sets, whereas f3 contains the basic, context and history feature sets.
The feature models f1, f2 and f3 are the same for all the languages.
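To illustrate how such features might be generated, the sketch below builds basic-set features for t and n and conjoins each with the distance and the parsing decision. The exact templates are not listed in the text, so the particular unigram and bigram combinations, the string format and the function name are assumptions made for this example. Tokens are assumed to be dictionaries keyed by the CoNLL column names.

```python
FIELDS = ("FORM", "LEMMA", "CPOS", "POS", "FEATS")

def basic_features(t, n, distance, decision):
    """Unigram and same-field bigram features over t and n, each conjoined
    with the t-n distance and the parsing decision."""
    raw = []
    for fld in FIELDS:
        raw.append(f"t.{fld}={t[fld]}")                      # unigram on t
        raw.append(f"n.{fld}={n[fld]}")                      # unigram on n
        raw.append(f"t.{fld}={t[fld]}|n.{fld}={n[fld]}")     # bigram on (t, n)
    return [f"{r}|dist={distance}|act={decision}" for r in raw]

# Hypothetical tokens for "eats" (t) and "fish" (n).
t_tok = {"FORM": "eats", "LEMMA": "eat", "CPOS": "V", "POS": "VBZ", "FEATS": "_"}
n_tok = {"FORM": "fish", "LEMMA": "fish", "CPOS": "N", "POS": "NN", "FEATS": "_"}
feats = basic_features(t_tok, n_tok, distance=1, decision="Right")
```

Because the decision is part of every feature string, the same configuration yields a different feature vector for each of the four actions, which is what allows a single weight vector to score them against each other.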
5 Results and Discussion
The system with online learning and Nivre's parsing algorithm was trained on the data released by the CoNLL Shared Task organizers for all ten languages (Hajic et al., 2004; Aduriz et al., 2003; Marti et al., 2007; Chen et al., 2003; Bohmova et al., 2003; Marcus et al., 1993; Johansson and Nugues, 2007; Prokopidis et al., 2005; Csendes et al., 2005; Montemagni et al., 2003; Oflazer et al., 2003).
We evaluated our system using the standard evaluation script provided by the organizers (Nivre et al., 2007).
The evaluation metrics are Unlabeled Attachment Score (UAS) and Labeled Attachment Score (LAS).
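For reference, the two metrics can be computed as in the sketch below: UAS is the percentage of tokens with the correct head, and LAS the percentage with both the correct head and the correct dependency label. This is not the organizers' evaluation script, whose details (such as token selection) are not reproduced here. The function name and the input format are ours.

```python
def attachment_scores(gold, predicted):
    """gold, predicted: one (head, label) pair per token, in the same order.
    Returns (UAS, LAS) as percentages."""
    assert len(gold) == len(predicted) and gold
    total = len(gold)
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    correct_both = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * correct_heads / total, 100.0 * correct_both / total

gold = [(2, "SBJ"), (0, "ROOT"), (2, "OBJ"), (2, "P")]
pred = [(2, "SBJ"), (0, "ROOT"), (2, "IOBJ"), (3, "P")]
uas, las = attachment_scores(gold, pred)
```

In this toy example three of the four heads are correct (UAS 75%), but only two tokens also carry the correct label (LAS 50%).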
The results of our system with various feature models are listed in Table 1 (results are not available for the models marked with '-').
The history information in f3 contributed to a marginal improvement in accuracy for Hungarian, Italian and Turkish.
Arabic, Catalan, Czech, English and Greek, in contrast, got their highest accuracies with feature model f2, which contains the basic, context and in-between feature sets.
The remaining languages, Basque and Chinese, achieved their highest accuracies with f1.
However, a careful look at the results table shows no significant difference in the accuracies of the system across the different feature models.
This is true for all the languages.
The feature models f2 and f3 did not show any significant difference in accuracy even though they contain more information.
Feature model f1, with only the basic and context feature sets, already achieves good accuracies.
5.1 K-Best Deterministic Parsing
The deterministic parsing algorithm does not handle ambiguity.
By choosing a single parser action at each opportunity, the input string is parsed deterministically and a single dependency tree is built during the parsing process from beginning to end (no other trees are even considered).
A simple extension to this idea is to eliminate determinism by allowing the parser to choose several actions at each opportunity, creating different action sequences that lead to different parse trees.
Since a score is assigned to every parser action, the score of a parse tree can be computed simply as the average of the scores of all actions that resulted in that parse tree (the derivation of the tree).
We performed a beam search by carrying out a K-best search through the set of possible sequences of actions as proposed by Johansson and Nugues (2006).
However, this did not increase the accuracy.
Moreover, with larger values of K, there was a decrease in the parsing accuracy.
The best-first search proposed by Sagae and Lavie (2006) was also tried, but it led to a similar drop in accuracy.
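The K-best extension can be sketched generically as a beam search over action sequences, ranking each hypothesis by the average score of its actions as described above. This is a sketch under assumptions, not the search of Johansson and Nugues (2006): the `expand` function (which yields scored successor states) and the `is_final` test are hypothetical callables supplied by the caller.

```python
def beam_search(start, expand, is_final, k):
    """Keep the k best partial derivations at every step; a derivation's
    score is the average of the scores of its actions."""
    beam = [(start, 0.0, 0)]              # (state, summed score, action count)
    finished = []
    while beam:
        candidates = []
        for state, total, steps in beam:
            for action_score, successor in expand(state):
                candidates.append((successor, total + action_score, steps + 1))
        candidates.sort(key=lambda h: h[1] / h[2], reverse=True)
        beam = []
        for hyp in candidates[:k]:
            (finished if is_final(hyp[0]) else beam).append(hyp)
    if not finished:
        return None
    return max(finished, key=lambda h: h[1] / h[2])[0]

# Toy search space: states are strings, and every derivation has two actions.
TOY_SCORES = {"": [(1.0, "a"), (0.9, "b")],
              "a": [(0.1, "aa"), (0.2, "ab")],
              "b": [(1.0, "ba"), (0.0, "bb")]}
toy_expand = lambda s: TOY_SCORES.get(s, [])
toy_final = lambda s: len(s) == 2
greedy = beam_search("", toy_expand, toy_final, k=1)
wider = beam_search("", toy_expand, toy_final, k=2)
```

With k = 1 the search reduces to the deterministic parser and commits to the locally best first action. With k = 2 it can recover a derivation whose first action scored slightly lower.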
6 Conclusion
The evaluation shows that labeled pseudo-projective deterministic parsing with online learning gives competitive parsing accuracy for most of the languages involved in the shared task.
The level of accuracy varies considerably between the languages.
Analyzing the results and the effects of various features with online learning will be an important research goal in the future.
