We present our system used in the CoNLL 2007 shared task on multilingual parsing.
The system is composed of three components: a k-best maximum spanning tree (MST) parser, a tree labeler, and a reranker that orders the k-best labeled trees.
We present two techniques for training the MST parser: tree-normalized and graph-normalized conditional training.
The tree-based reranking model allows us to explicitly model global syntactic phenomena.
We describe the reranker features which include non-projective edge attributes.
We provide an analysis of the errors made by our system and suggest changes to the models and features that might correct them.
1 Introduction
Reranking the output of a k-best parser has been shown to improve upon the best results of a state-of-the-art constituency parser (Charniak and Johnson, 2005).
This is primarily due to the ability to incorporate complex structural features that cannot be modeled under a CFG.
Recent work shows that k-best maximum spanning tree (MST) parsing and reranking is also viable (Hall, 2007).
In the current work, we explore the k-best MST parsing paradigm along with a tree-based reranker.
A system using the parsing techniques presented in this paper was entered in the CoNLL 2007 shared task competition (Nivre et al., 2007).
This task evaluated parsing performance on 10 languages: Arabic, Basque, Catalan, Chinese, Czech, English, Greek, Hungarian, Italian, and Turkish, using data originating from a wide variety of dependency treebanks and transformations of constituency-based treebanks (Hajič et al., 2004; Aduriz et al., 2003; Martí et al., 2007; Chen et al., 2003; Böhmová et al., 2003; Marcus et al., 1993; Johansson and Nugues, 2007; Prokopidis et al., 2005; Csendes et al., 2005; Montemagni et al., 2003; Oflazer et al., 2003).
We show that oracle parse accuracy¹ of the output of our k-best parser is generally higher than the best reported results.
We also present the results of a reranker based on a rich set of structural features, including features explicitly targeted at modeling non-projective configurations.
Labeling of the dependency edges is accomplished by an edge labeler based on the same feature set as used in training the k-best MST parser.
2 Parser Description
Our parser is composed of three components: a k-best MST parser, a tree-labeler, and a tree-reranker.
Log-linear models are used for each of the components independently.
In this section we give an overview of the models, the training techniques, and the decoders.
¹ The oracle accuracy for a set of hypotheses is the maximal accuracy for any of the hypotheses.
The connection between the maximum spanning tree problem and dependency parsing stems from the observation that a dependency parse is simply an oriented spanning tree on the graph of all possible dependency links (the fully connected dependency graph).
Unfortunately, by mapping the problem to a graph, we assume that the scores associated with edges are independent, and thus, are limited to edge-factored models.
Edge-factored models are severely limited in their capacity to predict structure.
In fact, they can only directly model parent-child links.
In order to alleviate this, we use a k-best MST parser to generate a set of candidate hypotheses.
Then, we rerank these trees using a model based on rich structural features that capture phenomena such as valency, subcategorization, ancestry relationships, and sibling interactions, as well as features capturing the global structure of dependency trees, aimed primarily at modeling language-specific non-projective configurations.
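The edge-factored assumption described above can be illustrated with a minimal sketch: the score of a tree is just the sum of independent parent-child edge scores, so nothing in the score depends on siblings or ancestors. The function name `tree_score`, the `parents` array encoding (index 0 is the artificial root) and the numeric scores are all hypothetical choices made for this illustration.

```python
def tree_score(parents, edge_score):
    """Score a dependency tree as the sum of its edge scores.

    parents[d] is the parent (governor) of node d; node 0 is the root
    and has no parent. edge_score[g][d] is the score of edge g -> d.
    """
    return sum(edge_score[g][d] for d, g in enumerate(parents) if d != 0)


# Hypothetical log-scores for a root (0) plus three words.
edge_score = [
    [0.0, -0.2, -1.5, -2.0],
    [0.0, 0.0, -0.3, -1.0],
    [0.0, -1.2, 0.0, -0.4],
    [0.0, -2.5, -0.9, 0.0],
]

chain = [None, 0, 1, 2]  # 0 -> 1 -> 2 -> 3
flat = [None, 0, 1, 1]   # 0 -> 1, 1 -> 2, 1 -> 3

chain_score = tree_score(chain, edge_score)
flat_score = tree_score(flat, edge_score)
```

The two trees differ only in the parent of node 3, so their scores differ only by the corresponding edge score; a reranker with structural features is free to treat them very differently.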
We assign dependency labels to entire trees, rather than predicting the labels during tree construction.
Given that we have a reranking process, we can label the k-best tree hypotheses output from our MST parser, and rerank the labeled trees.
We have explored both labeled and unlabeled reranking.
In the latter case, we simply label the maximal unlabeled tree.
McDonald et al. (2005) present a technique for training discriminative models for dependency parsing.
The edge-factored models we use for MST parsing are closely related to those described in the previous work, but allow for the efficient computation of normalization factors which are required for first and second-order (gradient-based) training techniques.
We consider two estimation procedures for parent-prediction models.
A parent-prediction model assigns a conditional score $s(g \mid d)$ for every parent-child pair (we denote the parent/governor $g$ and the child/dependent $d$), where $s(g \mid d) = s(g,d) / \sum_{g'} s(g',d)$.
In our work, we compute probabilities $p(g \mid d)$ based on conditional log-linear models.
This is an approximation to a generative model that predicts each node once (i.e., $\prod_d p(d \mid g)$).
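As a rough sketch of the parent-prediction model, the conditional distribution over candidate parents of a dependent can be computed as a softmax of linear feature scores. The feature function `toy_features`, its feature names, and the dictionary-based weight vector are assumptions made for illustration and are not the feature set used in the system.

```python
import math


def toy_features(g, d):
    """Hypothetical features: signed distance and branching direction."""
    return ["dist=%d" % (d - g), "dir=" + ("R" if d > g else "L")]


def parent_distribution(weights, features, d, n):
    """Return p(g | d) for every candidate parent g != d among n nodes."""
    candidates = [g for g in range(n) if g != d]
    scores = [sum(weights.get(f, 0.0) for f in features(g, d))
              for g in candidates]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)      # normalization over all candidate parents of d
    return {g: e / z for g, e in zip(candidates, exps)}


uniform = parent_distribution({}, toy_features, d=2, n=4)
peaked = parent_distribution({"dist=1": 2.0}, toy_features, d=2, n=4)
```

With an empty weight vector every candidate parent is equally likely; a positive weight on `dist=1` shifts probability mass toward the immediately preceding node.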
In the graph-normalized model, we assume that the conditional distributions are independent of one another.
In particular, we find the model parameters that maximize the likelihood of $p(g^* \mid d)$, where $g^*$ is the correct parent in the training data.
We perform the optimization over the entire training set, tying the feature parameters.
In particular, we perform maximum entropy (MaxEnt) estimation over the conditional distribution using second-order gradient descent optimization techniques.² An advantage of the parent-prediction model is that we can frame the estimation problem as that of minimum-error training with a zero-one loss term:
where $e \in \{0,1\}$ is the error term ($e$ is 1 for the correct parent and 0 for all other nodes) and $Z_d = \sum_j \exp\left(\sum_i \lambda_i f_i(e_j, g_j, d)\right)$ is the normalization constant for node $d$. Note that the normalization factor considers all graphs with in-degree zero for the root node and in-degree one for other nodes.
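One log-linear form consistent with the normalization constant $Z_d$ defined above is sketched below. It is offered as an assumption inferred from that definition; in particular, the left-hand-side notation $p_\lambda(e_j \mid g_j, d)$ is a hypothetical choice.

```latex
p_\lambda(e_j \mid g_j, d)
  = \frac{1}{Z_d} \exp\Big( \sum_i \lambda_i f_i(e_j, g_j, d) \Big),
\qquad
Z_d = \sum_j \exp\Big( \sum_i \lambda_i f_i(e_j, g_j, d) \Big)
```

Here $j$ ranges over the candidate parents of $d$ and $i$ over the features, matching the indices used in $Z_d$.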
At parsing time, of course, our parent predictions are constrained to produce a (non-projective) tree structure.
We can sum over all non-projective spanning trees by taking the determinant of the Kirchhoff matrix of the graph defined above, minus the row and column corresponding to the root node (Smith and Smith, 2007).
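The determinant identity mentioned above (the directed matrix-tree theorem) can be checked on a small graph. The sketch below is an illustration under assumptions: it uses non-negative edge weights `w[g][d]` with node 0 as the root, exact rational arithmetic, and a brute-force enumeration for comparison. The function names are hypothetical, and this is not the training code used in the system.

```python
from fractions import Fraction
from itertools import product


def determinant(matrix):
    """Determinant by Gaussian elimination over exact fractions."""
    m = [row[:] for row in matrix]
    det = Fraction(1)
    for col in range(len(m)):
        pivot = next((r for r in range(col, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, len(m)):
            factor = m[r][col] / m[col][col]
            for c in range(col, len(m)):
                m[r][c] -= factor * m[col][c]
    return det


def tree_partition_function(w):
    """Sum over spanning trees rooted at node 0 of the product of edge weights.

    Builds the Kirchhoff matrix with the root row and column removed:
    diagonal entries hold the total incoming weight of each node and
    off-diagonal entries hold the negated edge weights.
    """
    n = len(w)
    minor = [[Fraction(0)] * (n - 1) for _ in range(n - 1)]
    for d in range(1, n):
        for g in range(n):
            if g == d:
                continue
            minor[d - 1][d - 1] += Fraction(w[g][d])
            if g != 0:
                minor[g - 1][d - 1] -= Fraction(w[g][d])
    return determinant(minor)


def is_tree(parents):
    """True if every non-root node reaches the root without a cycle."""
    for d in range(1, len(parents)):
        seen, node = set(), d
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = parents[node]
    return True


def brute_force_sum(w):
    """Enumerate all parent assignments and keep only the trees."""
    n = len(w)
    choices = [[g for g in range(n) if g != d] for d in range(1, n)]
    total = Fraction(0)
    for assignment in product(*choices):
        parents = [None] + list(assignment)
        if is_tree(parents):
            term = Fraction(1)
            for d in range(1, n):
                term *= Fraction(w[parents[d]][d])
            total += term
    return total


weights = [[0, 2, 3, 1], [0, 0, 4, 2], [0, 5, 0, 3], [0, 1, 2, 0]]
ones = [[1] * 4 for _ in range(4)]
```

With all weights equal to one, the determinant counts the spanning trees rooted at node 0, which for four nodes is 16.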
Training graph-normalized and tree-normalized models under identical conditions, we find tree normalization wins by 0.5% to 1% absolute dependency accuracy.
Although tree normalization also shows a (smaller) advantage in k-best oracle accuracy, we do not believe it would have a large effect on our reranking results.
The reranker is based on a conditional log-linear model subject to the MaxEnt constraints using the same second-order optimization procedures as the graph-normalized MST models.
The primary difference here is that there is no single correct tree in the set of k candidate parse trees.
Instead, we have k trees that are generated by our k-best parser, each with a score assigned by the parser.
If we are performing labeled reranking, we label each of these hypotheses with l possible labelings, each with a score assigned by the labeler.
As with the parent-prediction, graph-normalized model, we perform minimum-error training.
² For the graph-normalized models, we use L-BFGS optimization provided through the TAO/PETSc optimization library (Benson et al., 2005; Balay et al., 2004).
The optimization is achieved by assuming the oracle-best parse(s) are correct and the remaining hypotheses are incorrect.
Furthermore, the feature values are scaled according to the relative difference between the oracle-best score and the score assigned to the non-oracle-best hypothesis.
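A minimal sketch of the reranking objective, assuming a log-linear distribution over the k candidates in which all oracle-best hypotheses count as correct, might look as follows. The function names are hypothetical, and the scaling of feature values by the distance to the oracle-best score is deliberately left out because its exact form is not specified here.

```python
import math


def oracle_best(accuracies):
    """Indices of the hypotheses that reach the best accuracy in the list."""
    best = max(accuracies)
    return [i for i, a in enumerate(accuracies) if a == best]


def rerank_loss(scores, accuracies):
    """Negative log of the probability mass on oracle-best hypotheses.

    scores[i] is the reranker's linear score for hypothesis i and
    accuracies[i] is its accuracy against the gold tree.
    """
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    good = sum(exps[i] for i in oracle_best(accuracies))
    return -math.log(good / sum(exps))


accuracies = [0.8, 0.9, 0.9, 0.7]
flat_loss = rerank_loss([0.0, 0.0, 0.0, 0.0], accuracies)
sharp_loss = rerank_loss([0.0, 5.0, 5.0, 0.0], accuracies)
```

The loss falls as the model moves score mass onto the two oracle-best candidates, which is the behavior a gradient-based optimizer would exploit.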
Note that any reranker could be used in place of our current model.
We have chosen to keep the reranker model closely related to the MST parsing model so that we can share feature representations and training procedures.
We used the same edge features to train a separate log-linear labeling model.
Each edge feature was conjoined with a potential label, and we then maximized the likelihood of the labeling in the training data.
Since this model is also edge-factored, we can store the labeler scores for each of the $n^2$ potential edges in the dependency tree.
In the submitted system, we simply extracted the Viterbi predictions of the labeler for the unlabeled trees selected by the reranker.
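Because the labeler is edge-factored, the best labeling of a fixed unlabeled tree decomposes into an independent choice per edge. The sketch below illustrates this under assumed data structures: `label_scores` maps each (parent, child) edge to a hypothetical dictionary of label scores, and the label names are invented for the example.

```python
def label_tree(parents, label_scores):
    """Choose the highest-scoring label for every edge of a fixed tree.

    parents[d] is the parent of node d (node 0 is the root and gets no
    label). label_scores[(g, d)] maps candidate labels to scores.
    """
    labels = [None]
    for d in range(1, len(parents)):
        candidates = label_scores[(parents[d], d)]
        labels.append(max(candidates, key=candidates.get))
    return labels


parents = [None, 2, 0, 2]
label_scores = {
    (2, 1): {"SBJ": 0.7, "OBJ": 0.2},
    (0, 2): {"ROOT": 0.9, "SBJ": 0.05},
    (2, 3): {"SBJ": 0.1, "OBJ": 0.6},
}
labels = label_tree(parents, label_scores)
```

Extending this to l-best labelings would require ranking combinations of per-edge choices rather than taking each maximum independently.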
We also (see below) ran experiments where each entry in the k-best lists input as training data to the reranker was augmented by its l-best labelings.
We hoped thereby to inject more diversity into the resulting structures.
Our MST models are based on the features described in (Hall, 2007); specifically, we use features based on a dependency node's form, lemma, coarse and fine part-of-speech tag, and morphological-string attributes.
Additionally, we use surface-string distance between the parent and child, buckets of features indicating if a particular form/lemma/tag occurred between or next to the parent and child, and a branching feature indicating whether the child is to the left or right of the parent.
Composite features, combining the above features, are also included (e.g., a single feature combining branching, parent & child form, parent & child tag).
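The kinds of edge features listed above can be sketched as a simple extraction function. Everything concrete here is an assumption for illustration: the feature-name strings, the distance cap of 5, the word representation as dictionaries with `form` and `tag` keys, and the particular composite feature.

```python
def edge_features(words, g, d):
    """Return string-valued features for the candidate edge g -> d."""
    parent, child = words[g], words[d]
    branching = "R" if d > g else "L"  # child right or left of parent
    distance = min(abs(g - d), 5)      # assumed cap on surface distance
    feats = [
        "dist=%d" % distance,
        "branch=" + branching,
        "p_form=" + parent["form"],
        "c_form=" + child["form"],
        "p_tag=" + parent["tag"],
        "c_tag=" + child["tag"],
    ]
    lo, hi = sorted((g, d))
    # Tags of the words strictly between parent and child.
    feats += ["between_tag=" + words[i]["tag"] for i in range(lo + 1, hi)]
    # One composite feature combining branching with both tags.
    feats.append("branch+tags=%s|%s|%s"
                 % (branching, parent["tag"], child["tag"]))
    return feats


words = [
    {"form": "<root>", "tag": "ROOT"},
    {"form": "the", "tag": "DT"},
    {"form": "dog", "tag": "NN"},
    {"form": "barks", "tag": "VB"},
]
feats = edge_features(words, 3, 1)
```

Each returned string would index one weight in the log-linear model; conjoining such strings with a candidate label gives the corresponding labeler features.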
The tree-based reranker includes the features described in (Hall, 2007) as well as features based on non-projective edge attributes explored in (Havelka, 2007a; Havelka, 2007b).
One set of features models relationships of nodes with their siblings, including valency and subcategorization.
A second set of features models global tree structure and includes features based on a node's ancestors and the depth and size of its subtree.
A third set of features models the interaction of word order and tree structure as manifested on individual edges, i.e., the features model language specific projective and non-projective configurations.
They include edge-based features corresponding to the global constraints of projectivity, planarity and well-nestedness, and for non-projective edges, they furthermore include level type, level signature and ancestor-in-gap features.
All features allow for an arbitrary degree of lexicalization; in the reported results, the first two sets of features use coarse and fine part-of-speech lexicalizations, while the features in the third set are used in their unlexicalized form due to time limitations.
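The edge-level projectivity property underlying the third feature set can be illustrated with a short sketch. It uses the standard definition that an edge is projective when every word strictly between the parent and the child is a descendant of the parent. The function names are hypothetical, and the richer attributes named above (level type, level signature, ancestor-in-gap) are not computed here.

```python
def is_descendant(parents, node, ancestor):
    """True if ancestor lies on the path from node up to the root."""
    while node is not None:
        if node == ancestor:
            return True
        node = parents[node]
    return False


def nonprojective_edges(parents):
    """Return the (parent, child) edges that are non-projective.

    parents[d] is the parent of node d; node 0 is the root with parent
    None. Nodes are numbered in surface word order.
    """
    edges = []
    for d in range(1, len(parents)):
        g = parents[d]
        lo, hi = min(g, d), max(g, d)
        if any(not is_descendant(parents, x, g) for x in range(lo + 1, hi)):
            edges.append((g, d))
    return edges


projective_tree = [None, 0, 1, 2]
crossing_tree = [None, 0, 4, 1, 1]  # edge 4 -> 2 spans word 3
```

In `crossing_tree`, word 3 lies between nodes 2 and 4 but is not dominated by node 4, so the edge from 4 to 2 is reported as non-projective.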
3 Results and Analysis
Hall (2007) shows that the oracle parsing accuracy of a k-best edge-factored MST parser is considerably higher than the one-best score of the same parser, even when k is small.
We have verified that this is true for the CoNLL shared-task data by evaluating the oracle rates on a randomly sampled development set for each language.
In order to select optimal model parameters for the MST parser, the labeler, and reranker, we sampled approximately 200 sentences from each training set to use as a development test set.
Training the reranker requires a jackknife n-fold training procedure where n − 1 partitions are used to train a model that parses the remaining partition.
This is done n times to generate k-best parses for the entire training set without using models trained on the data they are run on.
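The jackknife procedure described above can be sketched as follows. The callbacks `train` and `parse_kbest` stand in for the real parser training and k-best decoding and are hypothetical, as is the round-robin assignment of sentences to folds.

```python
def jackknife_kbest(sentences, n_folds, train, parse_kbest, k):
    """Produce k-best parses for every sentence with held-out models.

    Each fold is parsed by a model trained on the other n_folds - 1
    folds, so no sentence is parsed by a model that saw it in training.
    """
    folds = [sentences[i::n_folds] for i in range(n_folds)]
    output = {}
    for i, held_out in enumerate(folds):
        training = [s for j, fold in enumerate(folds) if j != i for s in fold]
        model = train(training)
        for sentence in held_out:
            output[sentence] = parse_kbest(model, sentence, k)
    return output


# Toy stand-ins: the "model" is just the set of sentence ids it was trained on.
def toy_train(training):
    return frozenset(training)


def toy_parse_kbest(model, sentence, k):
    assert sentence not in model  # the held-out guarantee
    return [(sentence, rank) for rank in range(k)]


kbest = jackknife_kbest(list(range(10)), 5, toy_train, toy_parse_kbest, 3)
```

The resulting k-best lists cover the whole training set and can then serve as training data for the reranker.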
For lack of space, we report only results on the CoNLL evaluation data set here, but note that the trends observed on the evaluation data are identical to those observed on our development sets.
In Table 1 we present results for labeled (and unlabeled) dependency accuracy on the CoNLL 2007 evaluation data set.
We report the oracle accuracy for different sized k-best hypothesis sets.
The columns are labeled by the number of trees output from the MST parser, k,³ and by the number of alternative labelings for each tree, l.
³ All results are reported for the graph-normalized training technique.
[Table 1 column headings: Language, Oracle Accuracy, Reranked, Reported]
Table 1: Labeled (unlabeled) attachment accuracy for k-best MST oracle results and reranked data on the evaluation set.
The 1-best results (k = 1, l = 1) represent the performance of the MST parser without reranking.
The New Reranked field shows recent unlabeled reranking results of 50-best trees using a modified feature set.
For Arabic, we only report unlabeled accuracy for different k and l.
When k = 1, the score is the best achievable by the edge-factored MST parser using our models.
As k increases, the oracle parsing accuracy increases.
The most extreme difference between the one-best accuracy and the 50-best oracle accuracy can be seen for Turkish where there is a difference of 9.64 points of accuracy (8.77 for the unlabeled trees).
This means that the reranker need only select the correct tree from a set of 50 to increase the score by 9.64 points.
As our reranking results show, this is not as simple as it may appear.
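The oracle accuracy used in this analysis can be computed with a short routine. The sketch below assumes unlabeled attachment scoring, trees encoded as parent arrays with the root at index 0, and token-level (micro-averaged) aggregation across sentences; the function name and the toy data are hypothetical.

```python
def oracle_accuracy(kbest_lists, golds, k):
    """Attachment accuracy when the best of the top k trees is always chosen."""
    correct = total = 0
    for hyps, gold in zip(kbest_lists, golds):
        # Number of correct parents in the best hypothesis among the top k.
        best = max(
            sum(1 for h, g in zip(hyp[1:], gold[1:]) if h == g)
            for hyp in hyps[:k]
        )
        correct += best
        total += len(gold) - 1  # the root at index 0 is not scored
    return correct / total


golds = [[None, 0, 1, 1]]
kbest_lists = [[[None, 0, 1, 2], [None, 0, 1, 1]]]
one_best = oracle_accuracy(kbest_lists, golds, k=1)
two_best = oracle_accuracy(kbest_lists, golds, k=2)
```

Because enlarging k can only add candidates, the oracle accuracy never decreases as k grows; the gap between it and the 1-best score is the headroom available to the reranker.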
We report the results for our CoNLL submission as well as recent results based on alternative parameter optimization on the development set.
We report the latest results only for unlabeled accuracy of reranking 50-best MST output.
4 Conclusion
Our submission to the CoNLL 2007 shared task on multilingual parsing supports the hypothesis that edge-factored MST parsing is viable given an effective reranker.
The reranker used in our submission was unable to achieve the oracle rates.
We believe this is primarily related to a relatively impoverished feature set.
Due to time constraints, we have not been able to train lexicalized reranking models.
The introduction of lexicalized features in the reranker should influence the selection of better trees, which we know exist in the k-best hypothesis sets.
